python logo


Category: tutorials

python parse html

In this article you will learn how to parse the HTML (HyperText Mark-up Language) of a website. There are several Python libraries to achieve that. We will give a demonstration of a few popular ones.

Related course

Beautiful Soup - a python package for parsing HTML and XML
This library is very popular and can even work with malformed markup. To get the contents of a single div, you can use the code below:

from BeautifulSoup import BeautifulSoup
import urllib2


# get the contents
response = urllib2.urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()

parsed_html = BeautifulSoup(html)
print parsed_html.body.find('div', attrs={'class':'toc'})

This will output the HTML code of within the div called ‘toc’ (table of contents) of the wikipedia article. If you want only the raw text use:

print parsed_html.body.find('div', attrs={'class':'toc'}).text

If you want to get the page title, you need to get it from the head section:

print parsed_html.head.find('title').text

To grab all images URLs from a website, you can use this code:

from BeautifulSoup import BeautifulSoup
import urllib2

url = 'https://www.arstechnica.com/'
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)
links = soup.findAll('img', src=True)

for link in links:
print(link["src"])

To grab all URLs from the webpage, use this:

from BeautifulSoup import BeautifulSoup
import urllib2

url = 'https://www.arstechnica.com/'
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)
links = soup.findAll('a')

for link in links:
print(link["href"])

PyQuery - a jquery like library for Python
To extract data from the tags we can use PyQuery. It can grab the actual text contents and the html contents, depending on what you need. To grab a tag you use the call pq(‘tag’).

from pyquery import PyQuery    
import urllib2
response = urllib2.urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')

html = response.read()
pq = PyQuery(html)
tag = pq('div#toc')

# print the text of the div
print tag.text()

# print the html of the div
print tag.html()

To get the title simply use:

tag = pq('title')

HTMLParser - Simple HTML and XHTML parser
The usage of this library is very different. With this library you have to put all your logic in the WebParser class. A basic example of usage below:

from HTMLParser import HTMLParser
import urllib2

# create parse
class WebParser(HTMLParser):
def handle_starttag(self, tag, attrs):
print "Tag: " + tag

# get the contents
response = urllib2.urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()

# instantiate the parser and fed it some HTML
parser = WebParser()
parser.feed(html)

python ftp client

This article will show you how to use the File Transfer Protocol (FTP) with Python from a client side perspective. We use ftplib, a library that implements the FTP protocol. Using FTP we can create and access remote files through function calls.

Related course
Python Programming Bootcamp: Go from zero to hero

Directory listing

FTP is a protocol for transferring files between systems over a TCP network. It was first developed in 1971 and has since been widely adopted as an effective way to share large files over the internet. File Transfer Protocol (often abbreviated FTP) is an application- layer protocol.

We can list the root directory using this little snippet:

import ftplib

ftp = ftplib.FTP("ftp.nluug.nl")
ftp.login("anonymous", "ftplib-example-1")

data = []

ftp.dir(data.append)

ftp.quit()

for line in data:
print "-", line

This will output the directory contents. in a simple console style output. If you want to show a specific directory you must change the directory after connecting with the ftp.cwd(‘/‘) function where the parameter is the directory you want to change to.

import ftplib

ftp = ftplib.FTP("ftp.nluug.nl")
ftp.login("anonymous", "ftplib-example-1")

data = []

ftp.cwd('/pub/') # change directory to /pub/
ftp.dir(data.append)

ftp.quit()

for line in data:
print "-", line

Download file
To download a file we use the retrbinary() function. An example below:

import ftplib
import sys

def getFile(ftp, filename):
try:
ftp.retrbinary("RETR " + filename ,open(filename, 'wb').write)
except:
print "Error"


ftp = ftplib.FTP("ftp.nluug.nl")
ftp.login("anonymous", "ftplib-example-1")

ftp.cwd('/pub/') # change directory to /pub/
getFile(ftp,'README.nluug')

ftp.quit()

Uploading files
We can upload files using the storlines() command. This will upload the file README.nluug in the main directory. If you want to upload in another directory combine it with the cwd() function.

import ftplib
import os

def upload(ftp, file):
ext = os.path.splitext(file)[1]
if ext in (".txt", ".htm", ".html"):
ftp.storlines("STOR " + file, open(file))
else:
ftp.storbinary("STOR " + file, open(file, "rb"), 1024)

ftp = ftplib.FTP("127.0.0.1")
ftp.login("username", "password")

upload(ftp, "README.nluug")

Related course
Python Programming Bootcamp: Go from zero to hero

Other functions
For other functions please refer to the official library documentation.

irc bot