python parse html
In this article you will learn how to parse the HTML (HyperText Markup Language) of a website. Several Python libraries can do this, and we will demonstrate a few popular ones.
Beautiful Soup - a Python package for parsing HTML and XML
This library is very popular and can even work with malformed markup. To get the contents of a single div, you can use the code below:
from BeautifulSoup import BeautifulSoup
This outputs the HTML inside the div with class ‘toc’ (table of contents) of the Wikipedia article. If you want only the raw text, use:
print(parsed_html.body.find('div', attrs={'class':'toc'}).text)
If you want to get the page title, you need to get it from the head section:
print(parsed_html.head.find('title').text)
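The lookups above can be sketched end to end for Python 3. This is a minimal sketch, not the article's original listing: it assumes the `beautifulsoup4` package (imported as `bs4`, the current successor of the `BeautifulSoup` module) is installed, and it parses a small hypothetical page defined in the code instead of downloading the Wikipedia article.

```python
from bs4 import BeautifulSoup

# Hypothetical page standing in for the downloaded Wikipedia HTML.
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <div class="toc"><ul><li>Intro</li><li>History</li></ul></div>
    <p>Body text.</p>
  </body>
</html>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
parsed_html = BeautifulSoup(html, "html.parser")

# The div with class "toc", including its markup.
toc = parsed_html.body.find("div", attrs={"class": "toc"})
print(toc)

# Only the text inside the div, with whitespace collapsed.
toc_text = toc.get_text(" ", strip=True)
print(toc_text)

# The page title lives in the head section.
page_title = parsed_html.head.find("title").text
print(page_title)
```

If the div does not exist, `find()` returns `None`, so real code would want to check the result before reading `.text`.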
To grab all image URLs from a website, you can use this code:
from BeautifulSoup import BeautifulSoup
To grab all URLs from the webpage, use this:
from BeautifulSoup import BeautifulSoup
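Collecting image and link URLs follows the same pattern: find every tag of one kind and read one attribute. The sketch below assumes the `beautifulsoup4` package (`bs4`) on Python 3 and uses a hypothetical HTML snippet, so the tag contents and URLs are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; a real script would pass downloaded HTML instead.
html = """
<body>
  <a href="/about">About</a>
  <a name="anchor-without-href">Anchor</a>
  <img src="/static/logo.png" alt="logo">
  <img alt="image without src">
  <a href="https://example.com/docs">Docs</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# src=True / href=True keep only tags that actually carry the attribute,
# which avoids a KeyError on tags such as <a name="...">.
image_urls = [img["src"] for img in soup.find_all("img", src=True)]
link_urls = [a["href"] for a in soup.find_all("a", href=True)]

print(image_urls)
print(link_urls)
```

The URLs come back exactly as written in the page, so relative ones such as `/about` still need to be joined with the page address (for example with `urllib.parse.urljoin`) before they can be downloaded.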
PyQuery - a jQuery-like library for Python
To extract data from the tags we can use PyQuery. It can grab either the text contents or the HTML contents of a tag, depending on what you need. To grab a tag, call pq('tag').
from pyquery import PyQuery
To get the title simply use:
tag = pq('title')
HTMLParser - Simple HTML and XHTML parser
This library works very differently: you subclass HTMLParser (here as a class named WebParser) and put all your logic in that class. A basic example of usage is shown below:
from HTMLParser import HTMLParser
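To make the idea of putting all logic in a WebParser class concrete, here is a minimal sketch for Python 3, where the module is named `html.parser`. The choice to collect the page title and link targets, and the attribute names `title` and `links`, are assumptions made for illustration and not taken from the original listing.

```python
from html.parser import HTMLParser


class WebParser(HTMLParser):
    """Collects the page title and every link target while parsing."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text between tags arrives here; keep it only inside <title>.
        if self._in_title:
            self.title += data


parser = WebParser()
parser.feed('<html><head><title>Example page</title></head>'
            '<body><a href="/one">One</a> <a href="/two">Two</a></body></html>')
parser.close()

print(parser.title)
print(parser.links)
```

The parser is event driven: it never builds a tree, it simply calls the handler methods as it walks through the markup, which is why any state you need has to be kept on the class yourself.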
python download file from url
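The urllib2 module can be used to download data from the web (network resource access). This data can be a file, a website or whatever you want Python to download. The module supports HTTP, HTTPS, FTP and several other protocols. Note that urllib2 exists only in Python 2; in Python 3 its functionality was split into urllib.request and urllib.error.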
In this article you will learn how to download data from the web using Python.
Download text
To download a plain text file use this code:
import urllib2
We get a response object by calling urllib2.urlopen() with the link as its parameter. The whole file is then read with a single response.read() call, which leaves the file data in a Python variable of type string.
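The same two steps, urlopen() followed by read(), can be sketched for Python 3 with `urllib.request`. The function name `download_text` and the demo URL are assumptions for illustration; the demo uses a `data:` URL so that the snippet runs without network access, and an `http://` or `https://` link would be passed in the same way.

```python
import urllib.request


def download_text(url, encoding="utf-8"):
    """Fetch a URL and return its contents as text."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()      # bytes in Python 3
    return raw.decode(encoding)    # convert the bytes to a str


# A data: URL carries its own content, so no server is needed for the demo.
demo_url = "data:text/plain;charset=utf-8,Hello%20from%20Python"
text = download_text(demo_url)
print(text)
```

Unlike Python 2, where read() returns a string directly, Python 3 returns bytes, so the explicit decode step is needed to get text.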
Download HTML
This requests the HTML code of a website and prints everything to the screen.
import urllib2
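When downloading HTML with Python 3, the page encoding matters because read() returns bytes. The sketch below is one possible approach, assuming the server declares a charset in its Content-Type header and falling back to UTF-8 otherwise; `download_html` is a hypothetical helper name, and the `data:` URL stands in for a real website address.

```python
import urllib.request


def download_html(url, default_encoding="utf-8"):
    """Fetch a page and decode it using the charset the response declares."""
    with urllib.request.urlopen(url) as response:
        # The charset comes from the Content-Type header when it is present.
        charset = response.headers.get_content_charset() or default_encoding
        return response.read().decode(charset, errors="replace")


# Percent-encoded "<h1>Hello</h1>" served from a data: URL for the demo.
demo_url = "data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E"
html = download_html(demo_url)
print(html)
```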
Download file using Python
You can save the data to disk very easily after downloading the file:
import urllib2
The first part of the code downloads the file contents into the variable data. The second part stores it in a file (this file does not need to have the same filename as the one on the server).
The ‘w’ parameter creates the file (or overwrites it if it already exists).
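Both parts, downloading into a variable and then writing that variable to disk, can be sketched for Python 3 as follows. The names `download_to_file`, `demo_url` and `download.txt` are assumptions for illustration. Because read() returns bytes in Python 3, the file is opened with 'wb' (binary write) instead of 'w'.

```python
import os
import tempfile
import urllib.request


def download_to_file(url, path):
    """Download a URL and store the raw bytes at the given path."""
    # Part 1: download the file contents into the variable data.
    with urllib.request.urlopen(url) as response:
        data = response.read()

    # Part 2: store it in a file; 'wb' creates or overwrites the file.
    with open(path, "wb") as out_file:
        out_file.write(data)
    return len(data)


# A data: URL keeps the demo offline; the target name is chosen freely.
demo_url = "data:text/plain,saved%20to%20disk"
target = os.path.join(tempfile.mkdtemp(), "download.txt")
written = download_to_file(demo_url, target)
print(written, "bytes written to", target)
```

Writing in binary mode also keeps images, archives and other non-text downloads intact, since no encoding or newline translation is applied.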