Extract links from webpage (BeautifulSoup)
Web scraping is a technique for extracting data from websites.
The BeautifulSoup module is designed for web scraping: it can parse both HTML and XML, and it provides simple methods for searching, navigating and modifying the parse tree.
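As a small illustration of those three operations, the sketch below parses a tiny, made-up HTML snippet (the tags and text are invented for the example, not taken from any real page):

from bs4 import BeautifulSoup

html = '<html><body><p class="intro">Hello</p><a href="/about">About</a></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Searching: find the first <a> tag.
link = soup.find('a')
print(link['href'])        # /about

# Navigating: move from the <a> tag to its parent.
print(link.parent.name)    # body

# Modifying: change the link text in place.
link.string = "About us"
print(soup.find('a'))      # <a href="/about">About us</a>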
Get links from a website
The example below prints all absolute links (those whose href starts with http:// or https://) on a webpage:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

html_page = urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")

for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    print(link.get('href'))
It downloads the raw HTML with the line:

html_page = urlopen("https://arstechnica.com")

A BeautifulSoup object is then created from that HTML, and we use it to find all matching links:

soup = BeautifulSoup(html_page, "html.parser")

for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    print(link.get('href'))
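The regular expression only matches absolute links, so relative hrefs such as /about are skipped. If you also want those, one option is to resolve every href against the page URL with urllib.parse.urljoin. A minimal sketch (the base URL is just the example site from above):

from urllib.parse import urljoin

base_url = "https://arstechnica.com"
for link in soup.find_all('a', href=True):
    # urljoin leaves absolute URLs untouched and resolves
    # relative ones against base_url.
    print(urljoin(base_url, link['href']))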
Extract links from a website into a list
To store the links in a list you can use:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

html_page = urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")

links = []
for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    links.append(link.get('href'))

print(links)
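A page often contains the same link several times. If you only care about unique URLs, you could collect them in a set instead; a small variation on the loop above:

links = set()
for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    links.add(link.get('href'))

print(sorted(links))  # sorted just to get stable, readable output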
Function to extract links from a webpage
If you repeatedly extract links you can wrap the code in the function below:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

def getLinks(url):
    # Download the page and parse it.
    html_page = urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    # Collect every absolute link on the page.
    links = []
    for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
        links.append(link.get('href'))
    return links

print(getLinks("https://arstechnica.com"))
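Note that urlopen raises an exception when the site is unreachable or returns an error status, so for anything beyond a quick script you may want to guard the call. A minimal sketch, assuming a failed fetch should simply be treated as "no links" (getLinksSafe is a hypothetical wrapper name, not part of any library):

from urllib.error import URLError, HTTPError

def getLinksSafe(url):
    # Hypothetical wrapper around getLinks that swallows fetch errors.
    try:
        return getLinks(url)
    except (HTTPError, URLError) as e:
        print(f"Could not fetch {url}: {e}")
        return []

print(getLinksSafe("https://arstechnica.com"))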