
Extract links from webpage (BeautifulSoup)

Web scraping is the technique of extracting data from a website.

The BeautifulSoup module is designed for web scraping. It can handle HTML and XML, and provides simple methods for searching, navigating and modifying the parse tree.
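As a rough illustration of what searching, navigating and modifying the parse tree can look like, here is a minimal sketch. It assumes the current Beautiful Soup 4 package (installed with pip install beautifulsoup4 and imported from bs4); the HTML snippet and the example.com URLs are made up for the example.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Made-up HTML document, used instead of a live website
html = """
<html><body>
<h1 id="title">Links</h1>
<p>See <a href="https://example.com/one">one</a> and
<a href="/two">two</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching: the first <a> tag, and the href of every <a> tag
first_link = soup.find("a")
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# Navigating: step from a tag up to its parent tag
parent_name = first_link.parent.name

# Modifying: change an attribute and replace the text of a tag
first_link["href"] = "https://example.com/changed"
soup.h1.string = "All links"

print(all_hrefs)
print(parent_name)
print(soup.h1)
```

Working on an inline string like this is a convenient way to try out the methods before pointing the code at a real page.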


Get links from website


The example below prints all links on a webpage:


from bs4 import BeautifulSoup  # Beautiful Soup 4: pip install beautifulsoup4
from urllib.request import urlopen
import re

# Download the page and parse it
html_page = urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")

# Print every absolute link (http:// or https://)
for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    print(link.get('href'))

The raw HTML of the page is downloaded with the line:


html_page = urlopen("https://arstechnica.com")

A BeautifulSoup object is then created, and we use this object to find all links whose href starts with http:// or https://:


soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    print(link.get('href'))
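
The regular expression passed in attrs decides which links are kept. The sketch below uses only the re module and a made-up list of href values to show how a pattern anchored with ^ filters them. It is an illustration of the pattern matching, not of BeautifulSoup itself.

```python
import re

# Hypothetical href values, as they might appear in <a> tags on a page
hrefs = [
    "http://example.com/a",
    "https://example.com/b",
    "/relative/path",
    "mailto:someone@example.com",
]

http_only = re.compile(r"^http://")
http_or_https = re.compile(r"^https?://")  # the ? makes the "s" optional

# Keep only the values in which the pattern is found
kept_http_only = [h for h in hrefs if http_only.search(h)]
kept_both = [h for h in hrefs if http_or_https.search(h)]

print(kept_http_only)
print(kept_both)
```

A pattern of ^http:// skips https links, so on a site served over HTTPS it will usually drop most of them. Relative links such as /relative/path are skipped by both patterns.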

Extract links from website into array


To store the links in an array (a Python list) instead of printing them, you can use:


from bs4 import BeautifulSoup  # Beautiful Soup 4: pip install beautifulsoup4
from urllib.request import urlopen
import re

html_page = urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")
links = []

# Collect every absolute link (http:// or https://)
for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    links.append(link.get('href'))

print(links)

Function to extract links from webpage


If you need to extract links repeatedly, you can use the function below:


from bs4 import BeautifulSoup  # Beautiful Soup 4: pip install beautifulsoup4
from urllib.request import urlopen
import re

def getLinks(url):
    """Return all absolute links (http:// or https://) found at url."""
    html_page = urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []

    for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
        links.append(link.get('href'))

    return links

print(getLinks("https://arstechnica.com"))
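
Because the filter only keeps hrefs that already start with a scheme, relative links such as /gadgets/ are dropped. One possible way to keep them, sketched here with the standard library only, is to resolve every href against the page URL. The helper name resolve_links and the sample hrefs are hypothetical; in practice the hrefs could come from link.get('href') for each a tag.

```python
from urllib.parse import urljoin, urlparse

def resolve_links(hrefs, base_url):
    """Return absolute http(s) URLs for the given hrefs, without duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # skip <a> tags that have no href
            continue
        absolute = urljoin(base_url, href)  # relative -> absolute
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

# Made-up href values for illustration
hrefs = ["/gadgets/", "https://example.com/x", "/gadgets/",
         "mailto:a@example.com", None]
links = resolve_links(hrefs, "https://arstechnica.com")
print(links)
```

The order of first appearance is preserved, which keeps the output easy to compare with the page.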


2 thoughts on “Extract links from webpage (BeautifulSoup)”


  1. Timothy Filippone
    - January 18, 2018

    The first code block won't work:

    File "html.py", line 8
    print link.get('href')

    1. Frank
      - January 20, 2018

      Change to:

      print(link.get('href'))
