python parse html
If you’ve ever been interested in scraping or analyzing web content, then understanding how to parse HTML is crucial. In this tutorial, we’ll delve into various Python libraries that make this process more accessible.
Beautiful Soup - A Comprehensive Python Library for Parsing HTML and XML
Beautiful Soup is revered among web scrapers for its simplicity and ability to handle ill-formed markup. Below, we illustrate how to extract specific content using Beautiful Soup:
Extracting Content from a div Element:
from urllib.request import urlopen
from bs4 import BeautifulSoup  # Beautiful Soup 4 on Python 3

# Download the page and parse it with Python's built-in HTML parser
response = urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()
parsed_html = BeautifulSoup(html, 'html.parser')
# Print the first div with class "toc", including its markup
print(parsed_html.body.find('div', attrs={'class': 'toc'}))
This fetches the HTML content within the ‘toc’ div of the Wikipedia article. If you only require the plain text:
print(parsed_html.body.find('div', attrs={'class': 'toc'}).text)
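To experiment without making a network request, the same lookups can be run against an inline HTML string. The following is a minimal sketch, assuming the bs4 package (Beautiful Soup 4) on Python 3; the sample markup and the extract_div helper are made up for illustration and are not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div class="toc"><ul><li>Intro</li><li>Usage</li></ul></div>
    <div class="content"><p>Hello</p></div>
  </body>
</html>
"""


def extract_div(html, css_class):
    """Return the first <div> with the given class, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("div", attrs={"class": css_class})


toc = extract_div(SAMPLE_HTML, "toc")
if toc is not None:  # find() returns None when nothing matches
    print(toc)                            # the element with its markup
    print(toc.get_text(" ", strip=True))  # plain text only
```

Checking for None before reading the text avoids an AttributeError on pages that do not contain the element you are looking for.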
Retrieving the Page Title:
print(parsed_html.head.find('title').text)
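Capturing Image URLs from a Website:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.arstechnica.com/'
data = urlopen(url).read()
soup = BeautifulSoup(data, 'html.parser')
# src=True keeps only img tags that actually have a src attribute
links = soup.find_all('img', src=True)
for link in links:
    print(link["src"])
Fetching All URLs from a Webpage: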
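from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.arstechnica.com/'
data = urlopen(url).read()
soup = BeautifulSoup(data, 'html.parser')
# href=True skips anchors without an href, which would raise KeyError below
links = soup.find_all('a', href=True)
for link in links:
    print(link["href"])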
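Two things tend to trip up link and image extraction: indexing a tag that lacks the attribute raises KeyError, and many pages use relative URLs. Here is a small sketch of one way to handle both, assuming Beautiful Soup 4 (the bs4 package) on Python 3. The sample markup, the example.com base URL, and the collect_urls helper are illustrative assumptions.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup; some tags deliberately lack href/src.
SAMPLE_HTML = """
<body>
  <a href="/about">About</a>
  <a name="top">No href here</a>
  <a href="https://example.com/contact">Contact</a>
  <img src="logo.png">
  <img alt="no source">
</body>
"""


def collect_urls(html, base_url):
    """Return (links, images) as lists of absolute URLs."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True / src=True filter out tags that lack the attribute.
    links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    images = [urljoin(base_url, img["src"]) for img in soup.find_all("img", src=True)]
    return links, images


links, images = collect_urls(SAMPLE_HTML, "https://example.com/")
print(links)
print(images)
```

urljoin leaves absolute URLs untouched and resolves relative ones against the base, so the output can be fed straight into further requests.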
PyQuery - Harnessing the Power of jQuery in Python
If you’re familiar with jQuery, PyQuery offers a comfortable transition. It effortlessly retrieves text or HTML content based on your requirements:
from pyquery import PyQuery
For fetching the page title using PyQuery:
tag = pq('title')  # pq is assumed to be a PyQuery object built from the page's HTML
HTMLParser - A Straightforward Tool for Parsing HTML and XHTML
HTMLParser demands a different approach. Here, logic is placed within the WebParser class, as showcased below:
from html.parser import HTMLParser  # Python 3 module name; Python 2 used "from HTMLParser import HTMLParser"
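The body of the WebParser class is not shown here, so the following is only a sketch of what such a class might look like, written for Python 3's html.parser module. The choice to collect the page title and link targets, and the attribute names in_title, title, and links, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class WebParser(HTMLParser):
    """Collects the page title and every link target while parsing."""

    def __init__(self):
        super().__init__()
        self.in_title = False  # True while inside <title>...</title>
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


parser = WebParser()
parser.feed('<html><head><title>Sample page</title></head>'
            '<body><a href="/docs">Docs</a><a name="x">x</a></body></html>')
print(parser.title)
print(parser.links)
```

Unlike Beautiful Soup or PyQuery, the parser builds no document tree: it calls your handler methods as it encounters tags and text, so any state you need has to be tracked on the instance.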
As you advance in your web scraping endeavors, remember to always respect the robots.txt file of websites and adhere to ethical scraping guidelines. Happy parsing!
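One way to check robots.txt rules from code is the standard library's urllib.robotparser module. The sketch below parses an inline rule set so it runs offline; the rules, the "my-scraper" user agent, and the is_allowed helper are hypothetical. Against a live site you would point the parser at the site's robots.txt with set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/data"))
```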