python parse html


If you want to scrape or analyze web content, you need to know how to parse HTML. This tutorial walks through three Python libraries that do the job: Beautiful Soup, PyQuery, and HTMLParser.

Beautiful Soup - A Comprehensive Python Library for Parsing HTML and XML

Beautiful Soup is popular among web scrapers for its simple API and its tolerance of malformed markup. The examples below show how to extract specific content with it:

  1. Extracting Content from a div Element:

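    from bs4 import BeautifulSoup          # Beautiful Soup 4: pip install beautifulsoup4
    from urllib.request import urlopen     # Python 3 replacement for urllib2

    response = urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
    html = response.read()
    # Name the parser explicitly so results do not depend on what is installed
    parsed_html = BeautifulSoup(html, 'html.parser')
    # find() returns None if the page has no div with class "toc"
    print(parsed_html.body.find('div', attrs={'class': 'toc'}))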

    This fetches the HTML content within the ‘toc’ div of the Wikipedia article. If you only require the plain text:

    print(parsed_html.body.find('div', attrs={'class': 'toc'}).text)
  2. Retrieving the Page Title:

    print(parsed_html.head.find('title').text)
  3. Capturing Image URLs from a Website:

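    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    url = 'https://www.arstechnica.com/'
    data = urlopen(url).read()
    soup = BeautifulSoup(data, 'html.parser')
    # src=True keeps only <img> tags that actually have a src attribute
    images = soup.find_all('img', src=True)
    for image in images:
        print(image["src"])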
  4. Fetching All URLs from a Webpage:

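    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    url = 'https://www.arstechnica.com/'
    data = urlopen(url).read()
    soup = BeautifulSoup(data, 'html.parser')
    # href=True skips <a> tags without an href, which would raise KeyError below
    links = soup.find_all('a', href=True)
    for link in links:
        print(link["href"])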
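
The src and href values printed in items 3 and 4 are often relative paths rather than full URLs. A minimal sketch of resolving them against the page URL with the standard library's urljoin follows. The page URL and the sample values are made up for illustration, and the helper name absolutize is hypothetical.

```python
from urllib.parse import urljoin


def absolutize(base_url, links):
    """Resolve each link against base_url; absolute links pass through unchanged."""
    return [urljoin(base_url, link) for link in links]


# Hypothetical values of the kind link["src"] or link["href"] may return
base_url = 'https://example.com/articles/index.html'
found = ['/images/logo.png', 'photo.jpg', 'https://cdn.example.com/a.png']

for absolute in absolutize(base_url, found):
    print(absolute)
```

Resolving links this way matters mostly when the collected URLs are fetched later, since a relative path on its own does not say which site it belongs to.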

PyQuery - Harnessing the Power of jQuery in Python

If you’re familiar with jQuery, PyQuery will feel natural: you select elements with CSS selectors, then read their text or HTML:

from pyquery import PyQuery
from urllib.request import urlopen     # Python 3 replacement for urllib2

response = urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()
pq = PyQuery(html)
tag = pq('div#toc')   # CSS selector: the div whose id is "toc"
print(tag.text())     # text content only
print(tag.html())     # inner HTML

For fetching the page title using PyQuery:

tag = pq('title')
print(tag.text())

HTMLParser - A Straightforward Tool for Parsing HTML and XHTML

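HTMLParser, part of the standard library, takes an event-driven approach: you subclass it and override handler methods, which the parser calls as it encounters each tag. Here the logic lives in the WebParser class: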

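from html.parser import HTMLParser     # Python 3 location of HTMLParser
from urllib.request import urlopen

class WebParser(HTMLParser):
    # Called once for every opening tag the parser encounters
    def handle_starttag(self, tag, attrs):
        print("Tag: " + tag)

response = urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
# feed() expects text, so decode the downloaded bytes first
html = response.read().decode('utf-8')
parser = WebParser()
parser.feed(html)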
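
Printing tag names is only a starting point. As a rough sketch of the same handler-based approach, the parser below collects the page title and every href from an inline HTML string instead of a live page, so it runs offline. The class name LinkCollector and the sample markup are hypothetical, and the import uses the Python 3 module path html.parser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the <title> text and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ''
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'a':
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Text between tags arrives here; keep it only while inside <title>
        if self._in_title:
            self.title += data


# Hypothetical sample markup, used in place of a downloaded page
sample_html = """
<html><head><title>Sample page</title></head>
<body>
  <a href="/about">About</a>
  <a name="anchor-without-href">Skip me</a>
  <a href="https://example.com/">Example</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.title)
print(collector.links)
```

Keeping results on the instance (collector.links, collector.title) rather than printing inside the handlers makes the parser reusable from other code.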

As you advance in your web scraping endeavors, remember to always respect the robots.txt file of websites and adhere to ethical scraping guidelines. Happy parsing!
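
One way to act on the robots.txt advice is the standard library's urllib.robotparser. The sketch below parses a made-up set of rules from a list of lines, so it runs without network access; against a real site you would call set_url() and read() instead. The user agent name 'my-scraper' and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, one rule per line
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

rules = RobotFileParser()
rules.parse(robots_lines)

# can_fetch() reports whether the given user agent may request the URL
allowed_public = rules.can_fetch('my-scraper', 'https://example.com/index.html')
allowed_private = rules.can_fetch('my-scraper', 'https://example.com/private/data.html')

print(allowed_public)
print(allowed_private)
```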







Leave a Reply:




Jeremy Lee Sat, 23 May 2015


from pyquery import PyQuery
import urllib2
response = urllib2.urlopen('http://en.wikipedia.org/wik...
html = response.read()
pq = PyQuery(html)
tag = pq('div#toc')
# print the text of the div
print tag.text()
# print the html of the div
print tag.hmtl() HERE : spelling error should be html

Frank Sun, 24 May 2015

Thanks Jeremy! I updated the page :-)