Extract links from webpage (BeautifulSoup)


Web scraping is the technique of extracting data from a website.

The BeautifulSoup module (distributed as the bs4 package) is designed for web scraping. It can parse both HTML and XML, and it provides simple methods for searching, navigating, and modifying the parse tree.
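
The example below is a minimal sketch of those three operations on an inline HTML snippet; the tag names and the intro id are made up for illustration:

from bs4 import BeautifulSoup

html = "<html><body><p id='intro'>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Searching: find a tag by its id attribute
p = soup.find(id="intro")
print(p.get_text())      # Hello world

# Navigating: step from the <b> tag up to its parent
print(p.b.parent.name)   # p

# Modifying: replace the text inside the <b> tag
p.b.string = "BeautifulSoup"
print(soup.p)            # <p id="intro">Hello <b>BeautifulSoup</b></p>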

Get links from website

The example below prints every absolute (http://) link found on a webpage:

from bs4 import BeautifulSoup
import urllib.request
import re

# Download the page and parse it into a searchable tree
html_page = urllib.request.urlopen("http://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")

# Print every <a> tag whose href starts with http://
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))

The script downloads the raw HTML with the line:

html_page = urllib.request.urlopen("http://arstechnica.com")
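
If you want the raw HTML as text instead, urlopen() returns a file-like response object whose bytes you can read and decode yourself (the sketch below assumes the page is UTF-8 encoded):

import urllib.request

# read() yields the raw bytes of the response;
# decode() turns them into a string
response = urllib.request.urlopen("http://arstechnica.com")
raw_html = response.read().decode("utf-8", errors="replace")
print(raw_html[:200])  # first 200 characters of the page source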

A BeautifulSoup object is then created from the downloaded page, and we use this object to find all matching links:

soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))
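
Note that the ^http:// pattern skips https:// and relative links. A variant sketch that captures those as well, by resolving every href against the page URL with urllib.parse.urljoin:

from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urljoin

base_url = "http://arstechnica.com"
soup = BeautifulSoup(urllib.request.urlopen(base_url), "html.parser")

# href=True matches every <a> tag that has an href attribute;
# urljoin() resolves relative paths against the base URL
for link in soup.find_all('a', href=True):
    print(urljoin(base_url, link['href']))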

Extract links from website into list

To store the links in a list you can use:

from bs4 import BeautifulSoup
import urllib.request
import re

html_page = urllib.request.urlopen("http://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")
links = []

# Collect every matching href into the list
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    links.append(link.get('href'))

print(links)
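
The same loop can be written as a list comprehension; the sketch below also drops duplicate URLs by wrapping the result in a set (which discards the original order):

from bs4 import BeautifulSoup
import urllib.request
import re

html_page = urllib.request.urlopen("http://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")

# List comprehension equivalent of the loop above
links = [link.get('href')
         for link in soup.find_all('a', attrs={'href': re.compile("^http://")})]
print(set(links))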

Function to extract links from webpage

If you repeatedly extract links, you can use the function below:

from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    # Download and parse the page, then collect matching hrefs
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []

    for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    return links

print(getLinks("http://arstechnica.com"))
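
A fetch can fail with a urllib.error.URLError, so it is worth wrapping calls in a try/except when looping over several pages. The usage sketch below assumes the getLinks function above; http://example.com is only a placeholder URL:

from urllib.error import URLError

for url in ["http://arstechnica.com", "http://example.com"]:
    try:
        for href in getLinks(url):
            print(href)
    except URLError as err:
        print(f"Could not fetch {url}: {err}")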