Tag: web scraping

how to get all page urls from a website

Web scraping is the technique to extract data from a website.

The module BeautifulSoup is designed for web scraping. The BeautifulSoup module can handle HTML and XML. It provides simple method for searching, navigating and modifying the parse tree.

Related course:
Browser Automation with Python Selenium

Get links from website

The example below prints all links on a webpage:


from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
    print link.get('href')

It downloads the raw html code with the line:


html_page = urllib2.urlopen("https://arstechnica.com")

A BeautifulSoup object is created and we use this object to find all links:


soup = BeautifulSoup(html_page)
for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
    print link.get('href')

Extract links from website into array

To store the links in an array you can use:


from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page)
links = []

for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
    links.append(link.get('href'))

print(links)

Function to extract links from webpage

If you repeatingly extract links you can use the function below:


from BeautifulSoup import BeautifulSoup
import urllib2
import re

def getLinks(url):
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page)
    links = []

    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    return links

print( getLinks("https://arstechnica.com") )

Related course:
Browser Automation with Python Selenium