HTTP – Parse HTML and XHTML
In this article you will learn how to parse the HTML (HyperText Markup Language) of a website. There are several Python libraries for this task; we will demonstrate a few popular ones.
Beautiful Soup – a Python package for parsing HTML and XML
This library is very popular and can even work with malformed markup. To get the contents of a single div, you can use the code below:
from BeautifulSoup import BeautifulSoup
import urllib2

# get the contents
response = urllib2.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()
parsed_html = BeautifulSoup(html)
print parsed_html.body.find('div', attrs={'class': 'toc'})
This will output the HTML code within the div called 'toc' (table of contents) of the Wikipedia article. If you want only the raw text, use:
print parsed_html.body.find('div', attrs={'class': 'toc'}).text
If you want to get the page title, you need to get it from the head section:
print parsed_html.head.find('title').text
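The same find() pattern works with any attribute. As a minimal sketch, assuming Wikipedia wraps the article text in a div with id 'bodyContent', this prints the first paragraph of the article:

from BeautifulSoup import BeautifulSoup
import urllib2

response = urllib2.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
parsed_html = BeautifulSoup(response.read())

# 'bodyContent' is an assumption about the Wikipedia page layout
content = parsed_html.body.find('div', attrs={'id': 'bodyContent'})
print content.find('p').text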
To grab all image URLs from a website, you can use this code:
from BeautifulSoup import BeautifulSoup
import urllib2

url = 'http://www.arstechnica.com/'
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)
links = soup.findAll('img', src=True)
for link in links:
    print(link["src"])
To grab all URLs from the webpage, use this:
from BeautifulSoup import BeautifulSoup
import urllib2

url = 'http://www.arstechnica.com/'
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)

# only match anchor tags that actually have an href attribute
links = soup.findAll('a', href=True)
for link in links:
    print(link["href"])
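Keep in mind that href values may be relative (e.g. /news/). If you need absolute URLs, one way is to resolve each link against the page address; a minimal sketch using urlparse from the standard library:

from BeautifulSoup import BeautifulSoup
import urllib2
import urlparse

url = 'http://www.arstechnica.com/'
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)

# resolve each (possibly relative) href against the page URL
for link in soup.findAll('a', href=True):
    print urlparse.urljoin(url, link['href'])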
PyQuery – a jQuery-like library for Python
To extract data from the tags we can use PyQuery. It can grab either the text contents or the HTML contents, depending on what you need. To grab a tag, you use the call pq('tag').
from pyquery import PyQuery
import urllib2

response = urllib2.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()
pq = PyQuery(html)
tag = pq('div#toc')

# print the text of the div
print tag.text()

# print the html of the div
print tag.html()
To get the title simply use:
tag = pq('title')
print tag.text()
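PyQuery accepts CSS selectors throughout, so iterating over elements follows the same pattern. A minimal sketch that prints the address of every link on the page:

from pyquery import PyQuery
import urllib2

html = urllib2.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)').read()
pq = PyQuery(html)

# items() yields each matched element as a PyQuery object
for link in pq('a').items():
    print link.attr('href')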
HTMLParser – Simple HTML and XHTML parser
The usage of this library is very different. With this library you put your parsing logic in a subclass of HTMLParser; the subclass (WebParser in the example below) overrides handler methods such as handle_starttag. A basic usage example:
from HTMLParser import HTMLParser
import urllib2

# create parser
class WebParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Tag: " + tag

# get the contents
response = urllib2.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()

# instantiate the parser and feed it some HTML
parser = WebParser()
parser.feed(html)
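Other handlers work the same way; handle_data, for example, receives the text between tags. A minimal sketch (the TitleParser class name is just for illustration) that combines handlers to extract the page title:

from HTMLParser import HTMLParser
import urllib2

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        # remember when we enter the title tag
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # collect text only while inside the title tag
        if self.in_title:
            self.title += data

html = urllib2.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)').read()
parser = TitleParser()
parser.feed(html)
print parser.title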
HTTP download file with Python
The urllib2 module can be used to download data from the web (network resource access). This data can be a file, a website or whatever you want Python to download. The module supports HTTP, HTTPS, FTP and several other protocols.
In this article you will learn how to download data from the web using Python.
Download text
To download a plain text file use this code:
import urllib2

response = urllib2.urlopen('https://wordpress.org/plugins/about/readme.txt')
data = response.read()
print(data)
We get a response object by calling urllib2.urlopen() with the link as parameter. All of the file contents are received with the response.read() method call. After this call, the file data is in a Python variable of type string.
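Note that read() loads the entire response into memory at once. For large downloads you can read the data in chunks instead; a minimal sketch:

import urllib2

response = urllib2.urlopen('https://wordpress.org/plugins/about/readme.txt')

# read 1024 bytes at a time instead of everything at once
while True:
    chunk = response.read(1024)
    if not chunk:
        break
    print(chunk)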
Download HTML
This requests the HTML code of a website and outputs everything to the screen:
import urllib2

response = urllib2.urlopen('http://en.wikipedia.org/')
html = response.read()
print html
Download file using Python
You can save the data to disk very easily after downloading the file:
import urllib2

response = urllib2.urlopen('https://wordpress.org/plugins/about/readme.txt')
data = response.read()

# Write data to file
filename = "test.txt"
file_ = open(filename, 'w')
file_.write(data)
file_.close()
The first part of the code downloads the file contents into the variable data:
import urllib2

response = urllib2.urlopen('https://wordpress.org/plugins/about/readme.txt')
data = response.read()
The second part stores it in a file (the local file does not need to have the same name as the remote one):
# Write data to file
filename = "test.txt"
file_ = open(filename, 'w')
file_.write(data)
file_.close()
The 'w' parameter creates the file for writing (or overwrites it if it already exists).
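For binary data such as images, open the file in binary mode ('wb') so the bytes are written unchanged. A minimal sketch; the URL is a placeholder, substitute any image address:

import urllib2

# placeholder URL; replace with a real image address
response = urllib2.urlopen('http://example.com/image.png')
data = response.read()

# 'wb' writes the raw bytes; the with-statement closes the file for us
with open('image.png', 'wb') as file_:
    file_.write(data)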
Requests: HTTP for Humans
If you want to request data from webservers, the traditional way to do that in Python is the urllib2 library. While this library is effective, you can easily create more complexity than needed when building something. Is there another way?
Requests is an Apache2 Licensed HTTP library, written in Python. It’s powered by httplib and urllib3, but it does all the hard work for you.
To install, type the commands below (or simply run pip install requests):
git clone https://github.com/kennethreitz/requests.git
cd requests
sudo python setup.py install
The Requests library is now installed. We will list some examples below:
Grabbing raw HTML using HTTP/HTTPS requests
We can now query a website as follows:
import requests

r = requests.get('http://pythonspot.com/')
print r.content
Save it and run with:
python website.py
It will output the raw HTML code.
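Some servers refuse requests that do not look like they come from a browser. With requests you can pass custom headers, such as a User-Agent, through the headers parameter; a minimal sketch:

import requests

# send a browser-like User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('http://pythonspot.com/', headers=headers)
print r.content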
Download binary image using Python
from PIL import Image
from StringIO import StringIO
import requests

# the image URL is truncated in the source; substitute any direct image link
r = requests.get('http://1.bp.blogspot.com/_r-MQun1PKUg/SlnHnaLcw6I/AAAAAAAAA_U')
i = Image.open(StringIO(r.content))
i.show()
An image retrieved using Python.
Website status code (is the website online?)
import requests

r = requests.get('http://pythonspot.com/')
print r.status_code
This returns 200 (OK). A list of status codes can be found here: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
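Instead of comparing status codes by hand, you can let requests raise an exception for 4xx and 5xx responses with raise_for_status(); a minimal sketch (the URL path is made up to force a 404):

import requests

r = requests.get('http://pythonspot.com/this-page-does-not-exist')
try:
    # raises requests.exceptions.HTTPError for 4xx/5xx responses
    r.raise_for_status()
except requests.exceptions.HTTPError as e:
    print e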
Retrieve JSON from a webserver
You can easily grab a JSON object from a webserver.
import requests

r = requests.get('https://api.github.com/events')
print r.json()
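r.json() decodes the response into plain Python objects; for this endpoint, a list of dicts. Assuming the GitHub events payload keeps its documented 'type' field, you can loop over it directly:

import requests

r = requests.get('https://api.github.com/events')
events = r.json()

# each event is a dict; print the event type of each one
for event in events:
    print event['type']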
HTTP Post requests using Python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)
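httpbin.org echoes the request back, so r.text shows the form data the server received. To send the payload as a JSON body instead of form data, recent versions of requests accept a json parameter; a minimal sketch:

import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# json= serializes the payload and sets the Content-Type header
r = requests.post("http://httpbin.org/post", json=payload)
print(r.text)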
SSL verification, verify certificates using Python
import requests

print requests.get('https://github.com', verify=True)
Extract data from the HTTP response header
With every request you send to an HTTP server, the server sends back some additional data. You can extract this data from the HTTP response using:
#!/usr/bin/env python
import requests

r = requests.get('http://pythonspot.com/')
print r.headers
This returns a dictionary-like object, so individual header fields can be accessed directly by name (header names are case-insensitive):
#!/usr/bin/env python
import requests

r = requests.get('http://pythonspot.com/')
headers = r.headers

# access individual response headers by name;
# use get() because not every server sends every header
print headers.get('server')
print headers.get('content-length')
print headers.get('content-encoding')
print headers.get('content-type')
print headers.get('date')
print headers.get('x-powered-by')
Extract data from HTML response
Once you get the data from a server, you can parse it with Python string functions or use a library; BeautifulSoup is often used. An example that gets the page title and all links:
from bs4 import BeautifulSoup
import requests

# get html data
r = requests.get('http://stackoverflow.com/')
html_doc = r.content

# create a beautifulsoup object
soup = BeautifulSoup(html_doc, 'html.parser')

# get title
print soup.title

# print all links
for link in soup.find_all('a'):
    print(link.get('href'))