Extract links from a webpage (BeautifulSoup)


Web scraping is the technique of extracting data from a website.

The BeautifulSoup module is designed for web scraping and can parse both HTML and XML. It provides simple methods for searching, navigating and modifying the parse tree.
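
As a minimal, self-contained sketch of that API (the one-line HTML document here is made up purely for illustration):

from bs4 import BeautifulSoup

# A tiny, made-up HTML document to demonstrate the parse tree.
html = "<html><head><title>Example</title></head><body><a href='https://example.com'>link</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)       # navigate the tree: Example
print(soup.find('a')['href'])  # search it: https://example.com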

Get links from a website

The example below prints every absolute link (one whose href starts with http:// or https://) found on a webpage:

from bs4 import BeautifulSoup
import urllib.request
import re

# Download the page and parse it.
html_page = urllib.request.urlopen("http://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")

# Print the href of every absolute link.
for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    print(link.get('href'))

The script downloads the raw HTML with the line:

html_page = urllib.request.urlopen("http://arstechnica.com")
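
urlopen returns a file-like response object, so you can also inspect the raw bytes yourself before parsing (a small sketch, using the same example URL):

import urllib.request

response = urllib.request.urlopen("http://arstechnica.com")
print(response.status)      # e.g. 200 on success
raw_html = response.read()  # the raw HTML as bytes
print(len(raw_html))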

A BeautifulSoup object is then created from the response, and we use it to find all matching links:

soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    print(link.get('href'))
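
Each match is a bs4 Tag object, so you can read more than the href attribute. A self-contained sketch (the one-line HTML is made up for illustration):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://example.com">Example site</a>', "html.parser")
tag = soup.find('a')
print(tag.get('href'))           # https://example.com
print(tag.get_text(strip=True))  # Example site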

Extract links from a website into a list

To store the links in a list you can use:

from bs4 import BeautifulSoup
import urllib.request
import re

html_page = urllib.request.urlopen("http://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")
links = []

# Collect the href of every absolute link.
for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    links.append(link.get('href'))

print(links)
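
The same collection can be written more compactly as a list comprehension, a common Python idiom; a sketch equivalent to the loop above:

from bs4 import BeautifulSoup
import urllib.request
import re

html_page = urllib.request.urlopen("http://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")

# One-line equivalent of the append loop.
links = [link.get('href')
         for link in soup.find_all('a', attrs={'href': re.compile("^https?://")})]
print(links)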

Function to extract links from a webpage

If you repeatedly extract links, you can use the function below:

from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    # Return the href of every absolute link on the page at url.
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []

    for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
        links.append(link.get('href'))

    return links

print(getLinks("http://arstechnica.com"))
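
Network calls can fail, so in practice you may want to guard the download with error handling. A sketch, assuming a hypothetical get_links_safe variant that returns an empty list instead of raising:

from bs4 import BeautifulSoup
import urllib.error
import urllib.request
import re

def get_links_safe(url):
    # Hypothetical variant of getLinks(): returns [] if the URL
    # cannot be fetched instead of raising an exception.
    try:
        html_page = urllib.request.urlopen(url)
    except (urllib.error.URLError, ValueError):
        return []
    soup = BeautifulSoup(html_page, "html.parser")
    return [link.get('href')
            for link in soup.find_all('a', attrs={'href': re.compile("^https?://")})]

print(get_links_safe("http://arstechnica.com"))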