
Urllib Tutorial Python 3

Websites can be accessed with the urllib module. You can use it to fetch data from, post data to, and parse responses from any website.

If you want to do web scraping or data mining, you can use urllib, but it is not the only option. urllib only fetches the data; if you want to emulate a complete web browser, there are modules for that too, such as Selenium.

Related course:
Web Scraping in Python with BeautifulSoup & Scrapy Framework


Download website
We can download a webpage's HTML using three lines of code:


import urllib.request

html = urllib.request.urlopen('https://arstechnica.com').read()
print(html)

The variable html now contains the webpage data as raw bytes in HTML format. A web browser such as Google Chrome would render this data as the visible page.
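Note that read() returns bytes. If you want to work with the page as a string, decode it first. A minimal sketch, assuming the page is served as UTF-8:

import urllib.request

# download the page and decode the raw bytes to text
html = urllib.request.urlopen('https://arstechnica.com').read()
text = html.decode('utf-8')  # assumes the page is UTF-8 encoded
print(text[:200])  # print the first 200 characters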

Web browser
A web browser sends its name and version along with each request; this is known as the user agent. Python can mimic this using the code below. The User-Agent string contains the name of the web browser and its version number:

import urllib.request

headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"

req = urllib.request.Request('https://arstechnica.com', headers=headers)
html = urllib.request.urlopen(req).read()
print(html)
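Continuing the example above, the response object also exposes the HTTP status and headers, which can be useful when experimenting with different User-Agent strings:

req = urllib.request.Request('https://arstechnica.com', headers=headers)
response = urllib.request.urlopen(req)
print(response.status)  # HTTP status code, e.g. 200
print(response.getheader('Content-Type'))  # a response header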

Parsing data
Given the data of a webpage, we want to extract the interesting information. You can use the BeautifulSoup module to parse the returned HTML.

You can use the BeautifulSoup module to search, navigate, and modify the parse tree.

There are several modules that achieve the same goal as BeautifulSoup, such as PyQuery and HTMLParser; both are demonstrated later in this article.

Posting data
The code below posts data to a server:


import urllib.parse
import urllib.request

# encode the POST variables as application/x-www-form-urlencoded bytes
data = urllib.parse.urlencode({'s': 'Post variable'}).encode('utf-8')

# passing a data argument makes urlopen send a POST request
req = urllib.request.Request('https://server/webpage.php', data=data)
response = urllib.request.urlopen(req)
print(response.read())
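For a GET request, the parameters go into the URL itself. A minimal sketch, using the same hypothetical endpoint:

import urllib.parse
import urllib.request

# build a query string and append it to the URL (hypothetical endpoint)
params = urllib.parse.urlencode({'s': 'search term'})
url = 'https://server/webpage.php?' + params
response = urllib.request.urlopen(url)
print(response.read())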

Extract links from webpage (BeautifulSoup)

Web scraping is the technique to extract data from a website.

The BeautifulSoup module is designed for web scraping. It can handle both HTML and XML, and it provides simple methods for searching, navigating and modifying the parse tree.

Related course:
Browser Automation with Python Selenium

Get links from website


The example below prints all links on a webpage:


from bs4 import BeautifulSoup
import urllib.request
import re

html_page = urllib.request.urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))

It downloads the raw HTML code with the line:


html_page = urllib.request.urlopen("https://arstechnica.com")

A BeautifulSoup object is created and we use this object to find all links:


soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))
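Note that the pattern ^http:// only matches plain http links. If you also want secure https links, you could widen the pattern (a small tweak to the example above):

# match both http:// and https:// links
for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    print(link.get('href'))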

Extract links from website into array


To store the links in an array you can use:


from bs4 import BeautifulSoup
import urllib.request
import re

html_page = urllib.request.urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")
links = []

for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    links.append(link.get('href'))

print(links)

Function to extract links from webpage


If you repeatedly extract links, you can use the function below:


from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []

    for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    return links

print(getLinks("https://arstechnica.com"))
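Many pages use relative links (such as /about) that the pattern above skips. A minimal sketch of how you could resolve them to absolute URLs with urllib.parse.urljoin; getAllLinks is a hypothetical helper, not part of the original tutorial:

from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urljoin

def getAllLinks(url):
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    # resolve every href against the page URL, so relative links become absolute
    return [urljoin(url, link.get('href')) for link in soup.find_all('a', href=True)]

print(getAllLinks("https://arstechnica.com"))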


HTTP - Parse HTML and XHTML

In this article you will learn how to parse the HTML (HyperText Markup Language) of a website. There are several Python libraries for this task; we will demonstrate a few popular ones.


Beautiful Soup - a Python package for parsing HTML and XML
This library is very popular and can even handle malformed markup. To get the contents of a single div, you can use the code below:

from bs4 import BeautifulSoup
import urllib.request


# get the contents
response = urllib.request.urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()

parsed_html = BeautifulSoup(html, "html.parser")
print(parsed_html.body.find('div', attrs={'class': 'toc'}))

This will output the HTML code of the div called 'toc' (table of contents) of the Wikipedia article. If you want only the raw text, use:

print(parsed_html.body.find('div', attrs={'class': 'toc'}).text)

If you want to get the page title, you need to get it from the head section:

print(parsed_html.head.find('title').text)

To grab the URLs of all images on a website, you can use this code:

from bs4 import BeautifulSoup
import urllib.request

url = 'https://www.arstechnica.com/'
data = urllib.request.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
links = soup.find_all('img', src=True)

for link in links:
    print(link["src"])

To grab all URLs from the webpage, use this:

from bs4 import BeautifulSoup
import urllib.request

url = 'https://www.arstechnica.com/'
data = urllib.request.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
links = soup.find_all('a', href=True)  # only anchors that actually have a href

for link in links:
    print(link["href"])

PyQuery - a jQuery-like library for Python
To extract data from the tags we can use PyQuery. It can grab the actual text contents or the HTML contents, depending on what you need. To grab a tag you use the call pq('tag').

from pyquery import PyQuery
import urllib.request

response = urllib.request.urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')

html = response.read()
pq = PyQuery(html)
tag = pq('div#toc')

# print the text of the div
print(tag.text())

# print the html of the div
print(tag.html())

To get the title simply use:

tag = pq('title')
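You can then print its contents in the same way as before:

print(tag.text())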

HTMLParser - Simple HTML and XHTML parser
The usage of this library is quite different. With this library you put your logic in a subclass of HTMLParser (here called WebParser) and override its handler methods. A basic example of its usage is shown below:

from html.parser import HTMLParser
import urllib.request

# create parser
class WebParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Tag: " + tag)

# get the contents
response = urllib.request.urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read().decode('utf-8')

# instantiate the parser and feed it some HTML
parser = WebParser()
parser.feed(html)
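HTMLParser has separate callbacks for the different parts of a document. A minimal sketch showing how you could also collect the text between tags with handle_data (an addition, not part of the original example):

from html.parser import HTMLParser

class TextParser(HTMLParser):
    def handle_data(self, data):
        # called for the text between tags
        text = data.strip()
        if text:
            print("Text:", text)

parser = TextParser()
parser.feed("<html><body><p>Hello world</p></body></html>")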

HTTP download file with Python

The urllib.request module can be used to download data from the web (network resource access). This data can be a file, a webpage or anything else you want Python to download. The module supports HTTP, HTTPS, FTP and several other protocols.

In this article you will learn how to download data from the web using Python.


Download text


To download a plain text file use this code:

import urllib.request

response = urllib.request.urlopen('https://wordpress.org/plugins/about/readme.txt')
data = response.read()
print(data)

We get a response object using the urllib.request.urlopen() method, where the parameter is the link. The whole file content is received using the response.read() method call. After calling this, we have the file data in a Python variable of type bytes.

Download HTML


This requests the HTML code of a website and outputs everything to the screen.

import urllib.request

response = urllib.request.urlopen('https://en.wikipedia.org/')
html = response.read()
print(html)

Download file using Python


You can save the data to disk very easily after downloading the file:

import urllib.request

response = urllib.request.urlopen('https://wordpress.org/plugins/about/readme.txt')
data = response.read()

# Write data to file
filename = "test.txt"
file_ = open(filename, 'wb')  # 'wb' because the downloaded data is bytes
file_.write(data)
file_.close()

The first part of the code downloads the file contents into the variable data:


import urllib.request

response = urllib.request.urlopen('https://wordpress.org/plugins/about/readme.txt')
data = response.read()

The second part stores it in a file (the local file does not need to have the same name as the remote one):


# Write data to file
filename = "test.txt"
file_ = open(filename, 'wb')
file_.write(data)
file_.close()

The 'wb' mode creates the file (or overwrites it if it already exists). The b stands for binary mode, which we need because the downloaded data is bytes.
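As a shorthand, urllib.request also provides urlretrieve(), which downloads a resource straight to a local file in one call:

import urllib.request

# download the remote file directly into test.txt
urllib.request.urlretrieve('https://wordpress.org/plugins/about/readme.txt', 'test.txt')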


FTP client in Python

This article will show you how to use the File Transfer Protocol (FTP) with Python from a client side perspective. We use ftplib, a library that implements the FTP protocol. Using FTP we can create and access remote files through function calls.

Related course
Python Programming Bootcamp: Go from zero to hero

Directory listing

The File Transfer Protocol (FTP) is an application-layer protocol for transferring files between systems over a TCP network. It was first developed in 1971 and has since been widely adopted as an effective way to share large files over the internet.

We can list the root directory using this little snippet:

import ftplib

ftp = ftplib.FTP("ftp.nluug.nl")
ftp.login("anonymous", "ftplib-example-1")

data = []

ftp.dir(data.append)

ftp.quit()

for line in data:
    print("-", line)

This will output the directory contents in a simple console-style listing. If you want to list a specific directory, you must change the directory after connecting, using the ftp.cwd('/') function, where the parameter is the directory you want to change to.

import ftplib

ftp = ftplib.FTP("ftp.nluug.nl")
ftp.login("anonymous", "ftplib-example-1")

data = []

ftp.cwd('/pub/') # change directory to /pub/
ftp.dir(data.append)

ftp.quit()

for line in data:
    print("-", line)

Download file
To download a file we use the retrbinary() function. An example below:

import ftplib

def getFile(ftp, filename):
    try:
        # write the remote file to a local file with the same name
        with open(filename, 'wb') as f:
            ftp.retrbinary("RETR " + filename, f.write)
    except ftplib.all_errors as e:
        print("Error:", e)


ftp = ftplib.FTP("ftp.nluug.nl")
ftp.login("anonymous", "ftplib-example-1")

ftp.cwd('/pub/') # change directory to /pub/
getFile(ftp, 'README.nluug')

ftp.quit()

Uploading files
We can upload files using the storlines() function (for text files) or storbinary() (for binary files). The example below uploads the file README.nluug to the current directory. If you want to upload into another directory, combine it with the cwd() function.

import ftplib
import os

def upload(ftp, filename):
    ext = os.path.splitext(filename)[1]
    with open(filename, "rb") as f:
        if ext in (".txt", ".htm", ".html"):
            # text files are uploaded in line (ASCII) mode
            ftp.storlines("STOR " + filename, f)
        else:
            # everything else is uploaded in binary mode
            ftp.storbinary("STOR " + filename, f, 1024)

ftp = ftplib.FTP("127.0.0.1")
ftp.login("username", "password")

upload(ftp, "README.nluug")


Other functions
For other functions please refer to the official library documentation.
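A quick sketch of a few other commonly used ftplib calls (the file and directory names here are only placeholders):

import ftplib

ftp = ftplib.FTP("127.0.0.1")
ftp.login("username", "password")

ftp.mkd("backup")                 # create a remote directory
ftp.rename("old.txt", "new.txt")  # rename a remote file
ftp.delete("new.txt")             # delete a remote file
print(ftp.size("README.nluug"))   # size of a remote file in bytes

ftp.quit()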

Read Email, POP3

In this tutorial you will learn how to receive email using the poplib module. The mail server needs to support POP3, which most mail servers do. The Post Office Protocol (POP3) is for receiving mail only; for sending mail you will need the SMTP protocol.

(Figure: a simplified mail server setup)


Meta Data


Every email contains many header fields; these are the most important ones:



Feature     Description
message-id  a unique identifier of the message
from        the address the email came from
to          the address the email was sent to
date        the date the email was sent
subject     the subject line of the email

Reading Email Example


You can request messages directly from a mail server using the Post Office Protocol (POP3). You do not have to worry about the protocol internals, because the poplib module handles them for you.
Connect and authenticate with the server using:


# connect to server
server = poplib.POP3(SERVER)

# login
server.user(USER)
server.pass_(PASSWORD)
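Once logged in, you can ask the server how many messages it holds with stat() (a small addition to the snippet above):

# number of messages and total mailbox size in bytes
numMessages, totalSize = server.stat()
print(numMessages, totalSize)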




The program below retrieves the first 10 emails from the server, including the mail headers:


import poplib
from email import parser

def readMail():
    SERVER = "YOUR MAIL SERVER"
    USER = "YOUR USERNAME"  # often your full email address
    PASSWORD = "YOUR PASSWORD"

    # connect to server
    server = poplib.POP3(SERVER)

    # login
    server.user(USER)
    server.pass_(PASSWORD)

    # list items on server
    resp, items, octets = server.list()

    # read at most the first 10 messages
    for i in range(min(10, len(items))):
        msg_id, size = items[i].decode().split()
        resp, lines, octets = server.retr(msg_id)

        # join the response lines and parse them into a message object
        text = b"\n".join(lines).decode('utf-8', errors='replace')
        message = parser.Parser().parsestr(text)

        for k, v in message.items():
            print(k, "=", v)

    server.quit()

readMail()
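If you want to remove a message after reading it, you can mark it with dele() inside the loop; POP3 commits the deletions when quit() is called:

server.dele(msg_id)  # mark the current message for deletion
server.quit()        # commit deletions and disconnect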



