


How to get all page URLs from a website

Web scraping is the technique of extracting data from a website.

The BeautifulSoup module (installed as beautifulsoup4 and imported as bs4) is designed for web scraping. It can parse both HTML and XML, and it provides simple methods for searching, navigating and modifying the parse tree.
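As a quick illustration, the sketch below parses a small, made-up HTML snippet and shows one example each of searching, navigating and modifying the tree:


from bs4 import BeautifulSoup

# A tiny, made-up HTML document for illustration
html = "<html><body><p id='intro'>Hello <a href='https://example.com'>world</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find('a').get('href'))  # searching: https://example.com
print(soup.a.parent['id'])         # navigating: intro (the parent <p> tag)
soup.a.string = "there"            # modifying the parse tree
print(soup.p.get_text())           # Hello there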

Related course:
Browser Automation with Python Selenium

Get links from a website


The example below prints all absolute links (those whose href starts with http:// or https://) found on a webpage:


from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

html_page = urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")

# Match absolute links, both http:// and https://
for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    print(link.get('href'))

It downloads the raw HTML with this line:


html_page = urlopen("https://arstechnica.com")
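
urlopen returns a response object rather than a string. If you want the raw HTML as text, you can read and decode it yourself (a small sketch; the UTF-8 encoding is an assumption):


from urllib.request import urlopen

html_page = urlopen("https://arstechnica.com")
raw_html = html_page.read().decode("utf-8")  # bytes -> str, assuming UTF-8
print(raw_html[:200])  # first 200 characters of the page source


BeautifulSoup accepts the response object directly, so this step is optional.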

A BeautifulSoup object is created, and we use it to find all links:


soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    print(link.get('href'))
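
Note that the regular expression only matches absolute links. Many pages also use relative links such as /about, which the example above skips. Below is a minimal sketch of resolving those with urllib.parse.urljoin; this step is our addition, not part of the original example:


from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

base_url = "https://arstechnica.com"
soup = BeautifulSoup(urlopen(base_url), "html.parser")

for link in soup.find_all('a', href=True):
    # urljoin resolves relative links against the base URL
    # and leaves absolute links unchanged
    print(urljoin(base_url, link['href']))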

Extract links from a website into a list


To store the links in a list you can use:


from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

html_page = urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")
links = []

for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    links.append(link.get('href'))

print(links)
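
A page often links to the same URL more than once. If you only want each link once, you can deduplicate the list with a set, continuing from the links list built above:


unique_links = sorted(set(links))  # drop duplicates; sort for readable output
print(unique_links)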

Function to extract links from a webpage


If you repeatedly need to extract links, you can wrap the code in the function below:


from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

def getLinks(url):
    html_page = urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []

    for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
        links.append(link.get('href'))

    return links

print(getLinks("https://arstechnica.com"))
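
When calling the function on several websites, network failures are common. Here is a hedged sketch that wraps each call in basic error handling (urlopen raises URLError, from urllib.error, when a site is unreachable):


from urllib.error import URLError

for url in ["https://arstechnica.com", "https://example.com"]:
    try:
        # Reuse the getLinks function defined above
        print(url, "->", len(getLinks(url)), "links")
    except URLError as exc:
        # Skip sites that are unreachable or refuse the connection
        print(url, "failed:", exc)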
