Urllib Tutorial Python 3

Websites can be accessed using the urllib module. You can use urllib to download data from a website, post data to it, and retrieve content for parsing.


Download website
We can download a webpage's HTML using three lines of code:

import urllib.request
 
html = urllib.request.urlopen('https://arstechnica.com').read()
print(html)

The variable html now contains the webpage data as raw HTML. A web browser such as Google Chrome would normally render this data visually.
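In Python 3 the read() call returns bytes. If you prefer to work with a regular string, decode it first; a minimal sketch (assuming the page is UTF-8 encoded, which a real page may not be):

import urllib.request

html_bytes = urllib.request.urlopen('https://arstechnica.com').read()
html_text = html_bytes.decode('utf-8')  # assumption: the page is UTF-8 encoded
print(html_text[:200])                  # first 200 characters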

Web browser
A web browser sends its name and version along with each request; this is known as the user agent. Python can mimic this using the code below. The User-Agent string contains the name and version number of the web browser:

import urllib.request
 
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"
 
req = urllib.request.Request('https://arstechnica.com', headers = headers)
html = urllib.request.urlopen(req).read()
print(html)

Parsing data
Given the data of a webpage, we usually want to extract specific pieces of information from it. You can use the BeautifulSoup module to parse the returned HTML data.

The BeautifulSoup module lets you search the parse tree, navigate it and modify it.

Several other modules aim at the same task as BeautifulSoup, among them PyQuery and HTMLParser; both are demonstrated later in this tutorial.
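For example, a minimal sketch of parsing the downloaded page with BeautifulSoup (the bs4 package must be installed; which tags the page contains is an assumption made for illustration):

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('https://arstechnica.com').read()
soup = BeautifulSoup(html, 'html.parser')

# print the page title and the text of every <h2> heading (assumed to exist)
print(soup.title.string)
for heading in soup.find_all('h2'):
    print(heading.get_text())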

Posting data
The code below posts data to a server:

import urllib.parse
import urllib.request

# encode the form data and convert it to bytes
data = urllib.parse.urlencode({'s': 'Post variable'}).encode('ascii')
headers = {"Content-type": "application/x-www-form-urlencoded", "Accept": "text/plain"}

req = urllib.request.Request('https://server/webpage.php', data=data, headers=headers)
response = urllib.request.urlopen(req)
print(response.read())

Extract links from webpage (BeautifulSoup)

Web scraping is the technique of extracting data from a website.

The BeautifulSoup module is designed for web scraping. It can handle HTML and XML, and it provides simple methods for searching, navigating and modifying the parse tree.
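As a small illustration of navigating and modifying the parse tree, here is a sketch that works on a hand-written HTML snippet rather than a live page:

from bs4 import BeautifulSoup

snippet = "<html><body><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(snippet, 'html.parser')

print(soup.p.b.string)      # navigate: <p> -> <b> -> its text ("world")
soup.p.b.string = "reader"  # modify the tree in place
print(soup.p)               # <p>Hello <b>reader</b></p>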


Get links from website

The example below prints all links on a webpage:

from bs4 import BeautifulSoup
import urllib.request
import re

html_page = urllib.request.urlopen("http://arstechnica.com")
soup = BeautifulSoup(html_page, 'html.parser')
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))

It downloads the raw html code with the line:

html_page = urllib.request.urlopen("http://arstechnica.com")

A BeautifulSoup object is created and we use this object to find all links:

soup = BeautifulSoup(html_page, 'html.parser')
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))

Extract links from website into array

To store the links in an array you can use:

from bs4 import BeautifulSoup
import urllib.request
import re

html_page = urllib.request.urlopen("http://arstechnica.com")
soup = BeautifulSoup(html_page, 'html.parser')
links = []

for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    links.append(link.get('href'))

print(links)

Function to extract links from webpage

If you repeatedly extract links you can use the function below:

from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, 'html.parser')
    links = []

    for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    return links

print(getLinks("http://arstechnica.com"))

HTTP – Parse HTML and XHTML

In this article you will learn how to parse the HTML (HyperText Markup Language) of a website. There are several Python libraries that can do this; we will demonstrate a few popular ones.

Beautiful Soup – a Python package for parsing HTML and XML
This library is very popular and can even work with malformed markup. To get the contents of a single div, you can use the code below:

from bs4 import BeautifulSoup
import urllib.request

# get the contents
response = urllib.request.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()

parsed_html = BeautifulSoup(html, 'html.parser')
print(parsed_html.body.find('div', attrs={'class': 'toc'}))

This will output the HTML code of the div called ‘toc’ (table of contents) of the Wikipedia article. If you want only the raw text, use:

print(parsed_html.body.find('div', attrs={'class': 'toc'}).text)

If you want to get the page title, you need to get it from the head section:

print(parsed_html.head.find('title').text)
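Beautiful Soup also copes with malformed markup, as mentioned above. A small sketch using a hand-written broken snippet (not fetched from any site):

from bs4 import BeautifulSoup

# note the unclosed <p> and <b> tags
broken_html = "<html><body><p>First paragraph<p>Second <b>bold"
soup = BeautifulSoup(broken_html, 'html.parser')

for p in soup.find_all('p'):
    print(p.get_text())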

To grab all image URLs from a website, you can use this code:

from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.arstechnica.com/'
data = urllib.request.urlopen(url).read()
soup = BeautifulSoup(data, 'html.parser')
links = soup.find_all('img', src=True)

for link in links:
    print(link["src"])

To grab all URLs  from the webpage, use this:

from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.arstechnica.com/'
data = urllib.request.urlopen(url).read()
soup = BeautifulSoup(data, 'html.parser')
links = soup.find_all('a', href=True)   # href=True skips anchors without a href attribute

for link in links:
    print(link["href"])

PyQuery – a jQuery-like library for Python
To extract data from the tags we can use PyQuery. It can grab the actual text contents or the HTML contents, depending on what you need. To grab a tag you use the call pq('tag').

from pyquery import PyQuery
import urllib.request

response = urllib.request.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()

pq = PyQuery(html)
tag = pq('div#toc')

# print the text of the div
print(tag.text())

# print the html of the div
print(tag.html())

To get the title simply use:

tag = pq('title')

HTMLParser – Simple HTML and XHTML parser
The usage of this library is quite different: you put your parsing logic in a subclass of HTMLParser (here called WebParser). A basic usage example is shown below:

from html.parser import HTMLParser
import urllib.request

# create the parser
class WebParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Tag: " + tag)

# get the contents
response = urllib.request.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read().decode('utf-8')   # feed() expects a string, not bytes

# instantiate the parser and feed it some HTML
parser = WebParser()
parser.feed(html)

HTTP download file with Python

The urllib.request module can be used to download data from the web (network resource access). This data can be a file, a webpage or anything else you want Python to download. The module supports HTTP, HTTPS, FTP and several other protocols.

In this article you will learn how to download data from the web using Python.
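Because ftp:// URLs are supported as well, urlopen can also fetch a file from an FTP server. A minimal sketch (it reuses the public ftp.nluug.nl server from the FTP article later in this tutorial, so the file path and its availability are assumptions):

import urllib.request

# download a file over FTP; urlopen understands ftp:// URLs too
response = urllib.request.urlopen('ftp://ftp.nluug.nl/pub/README.nluug')
print(response.read().decode('utf-8', errors='replace'))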


Download text

To download a plain text file use this code:

import urllib.request

response = urllib.request.urlopen('https://wordpress.org/plugins/about/readme.txt')
data = response.read()
print(data)

We get a response object using the urllib.request.urlopen() method, where the parameter is the link. All of the file contents are received with the response.read() method call. After calling it, we have the file data in a Python variable of type bytes.

Download HTML

This will request the HTML code of a website and output everything to the screen.

import urllib.request

response = urllib.request.urlopen('http://en.wikipedia.org/')
html = response.read()
print(html)

Download file using Python

You can save the data to disk very easily after downloading the file:

import urllib.request

response = urllib.request.urlopen('https://wordpress.org/plugins/about/readme.txt')
data = response.read()

# Write data to file
filename = "test.txt"
file_ = open(filename, 'wb')   # 'wb' because read() returned bytes
file_.write(data)
file_.close()

The first part of the code downloads the file contents into the variable data:

import urllib.request

response = urllib.request.urlopen('https://wordpress.org/plugins/about/readme.txt')
data = response.read()

The second part stores it in a file (the file on disk does not need to have the same name as the remote file):

# Write data to file
filename = "test.txt"
file_ = open(filename, 'wb')
file_.write(data)
file_.close()

The ‘wb’ mode creates the file (or overwrites it if it already exists) and writes the downloaded bytes to it unchanged.
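If you only want to save a URL straight to disk, the standard library also offers urllib.request.urlretrieve. A minimal sketch (the local filename is an arbitrary choice for this example):

import urllib.request

url = 'https://wordpress.org/plugins/about/readme.txt'
urllib.request.urlretrieve(url, 'readme_copy.txt')  # downloads and writes the file in one call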

Python network sockets programming tutorial

In this tutorial you will learn about network programming. You will learn about the client-server model that is in use for the World Wide Web, e-mail and many other applications.

Client-server model (with an e-mail protocol)
The client-server model is a model with n clients and one server. The clients send data requests to the server, and the server replies to the messages it receives. A client can be any device, such as your computer or tablet. Servers are generally dedicated computers that stay connected 24/7.

Socket server code

This code starts a simple TCP server using sockets. It waits for a connection, and once a client connects it prints the bytes it receives and echoes them back.

#!/usr/bin/env python

import socket

TCP_IP = '127.0.0.1'
TCP_PORT = 62
BUFFER_SIZE = 20  # Normally 1024, but we want fast response

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind((TCP_IP, TCP_PORT))
s.listen(1)

conn, addr = s.accept()
print('Connection address:', addr)
while True:
    data = conn.recv(BUFFER_SIZE)
    if not data:
        break
    print("received data:", data)
    conn.send(data)  # echo
conn.close()

Execute with:

$ sudo python server.py

This starts the server on port 62 (ports below 1024 require root privileges, hence sudo). In a second terminal, open a client with Telnet. If you use the same machine for the client and server, use:

$ telnet 127.0.0.1 62

If you use another machine as the client, connect to the IP address of the machine running the server (you can find it with ifconfig); note that the server must then bind to that address or to '0.0.0.0' instead of 127.0.0.1.

Everything you type in the client will arrive at the server, and the server sends each received message back (the echo).

Socket network client

The client script below sends a message to the server. The server must be running!

#!/usr/bin/env python

import socket

TCP_IP = '127.0.0.1'
TCP_PORT = 62              # must match the port the server listens on
BUFFER_SIZE = 1024
MESSAGE = b"Hello, World!"

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((TCP_IP, TCP_PORT))
s.send(MESSAGE)
data = s.recv(BUFFER_SIZE)
s.close()

print("received data:", data)

This client simply mimics what we did manually with Telnet.

Limitations of the server code
The server code above can only interact with one client. If you try to connect from a second terminal, the server simply won't reply to the new client. To let the server interact with multiple clients you need to use multi-threading. Below we rebuild the server script to accept multiple client connections:

#!/usr/bin/env python

import socket
from threading import Thread

class ClientThread(Thread):

    def __init__(self, ip, port, conn):
        Thread.__init__(self)
        self.ip = ip
        self.port = port
        self.conn = conn          # each thread keeps its own connection
        print("[+] New thread started for " + ip + ":" + str(port))

    def run(self):
        while True:
            data = self.conn.recv(2048)
            if not data:
                break
            print("received data:", data)
            self.conn.send(data)  # echo

TCP_IP = '0.0.0.0'
TCP_PORT = 62
BUFFER_SIZE = 20  # Normally 1024, but we want fast response

tcpsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcpsock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
tcpsock.bind((TCP_IP, TCP_PORT))
tcpsock.listen(4)
threads = []

while True:
    print("Waiting for incoming connections...")
    (conn, (ip, port)) = tcpsock.accept()
    newthread = ClientThread(ip, port, conn)
    newthread.start()
    threads.append(newthread)

Application protocol

So far we have simply sent messages back and forth. Every message can have a specific meaning in an application; this is known as the protocol. The meaning of these messages must be the same on the sender and the receiver side. The transport layer below (TCP) makes sure that messages are delivered. The internet layer is the IPv4 protocol. All we have to define is the application layer.

Below we modify the server to accept simple commands (we use the non-threading server for simplicity) and change the port to 64. Server code with a protocol:

#!/usr/bin/env python

import socket

TCP_IP = '127.0.0.1'
TCP_PORT = 64
BUFFER_SIZE = 20  # Normally 1024, but we want fast response

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind((TCP_IP, TCP_PORT))
s.listen(1)

conn, addr = s.accept()
print('Connection address:', addr)
while True:
    data = conn.recv(BUFFER_SIZE).decode()
    if not data:
        break
    print("received data:", data)

    if "/version" in data:
        conn.send("Demo version\n".encode())

    if "/echo" in data:
        data = data.replace("/echo", "")
        conn.send((data + "\n").encode())

conn.close()

Run the server with:

$ sudo python server.py

A client can then connect with telnet (make sure you pick the right IP):

$ telnet 127.0.0.1 64
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
message
/version
Demo version
/echo Repeat this
 Repeat this


FTP client in Python

This article will show you how to use the File Transfer Protocol (FTP) with Python from a client-side perspective. We use ftplib, a library that implements the FTP protocol. Using FTP we can create and access remote files through function calls.


Directory listing
We can list the root directory using this little snippet:

import ftplib

ftp = ftplib.FTP("ftp.nluug.nl")
ftp.login("anonymous", "ftplib-example-1")

data = []

ftp.dir(data.append)

ftp.quit()

for line in data:
    print("-", line)

This will output the directory contents in a simple console-style listing. If you want to show a specific directory, you must change into it after connecting with the ftp.cwd('/') function, where the parameter is the directory you want to change to.

import ftplib

ftp = ftplib.FTP("ftp.nluug.nl")
ftp.login("anonymous", "ftplib-example-1")

data = []

ftp.cwd('/pub/')         # change directory to /pub/
ftp.dir(data.append)

ftp.quit()

for line in data:
    print("-", line)

Download file
To download a file we use the retrbinary() function. An example below:

import ftplib

def getFile(ftp, filename):
    try:
        # write the remote file to a local file with the same name
        with open(filename, 'wb') as local_file:
            ftp.retrbinary("RETR " + filename, local_file.write)
    except ftplib.all_errors as error:
        print("Error downloading", filename, "-", error)

ftp = ftplib.FTP("ftp.nluug.nl")
ftp.login("anonymous", "ftplib-example-1")

ftp.cwd('/pub/')         # change directory to /pub/
getFile(ftp, 'README.nluug')

ftp.quit()

Uploading files
We can upload files using the storlines() method for text files and storbinary() for binary files. The example below uploads the file README.nluug to the current remote directory. If you want to upload into another directory, combine it with the cwd() function.

import ftplib
import os

def upload(ftp, filename):
    ext = os.path.splitext(filename)[1]
    if ext in (".txt", ".htm", ".html"):
        ftp.storlines("STOR " + filename, open(filename, 'rb'))
    else:
        ftp.storbinary("STOR " + filename, open(filename, 'rb'), 1024)

ftp = ftplib.FTP("127.0.0.1")
ftp.login("username", "password")

upload(ftp, "README.nluug")

Other functions
For other functions please refer to the official library documentation.
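For illustration, a sketch of a few other calls ftplib provides (the host and credentials are placeholders, and many servers will reject the write operations):

import ftplib

ftp = ftplib.FTP("127.0.0.1")          # placeholder host
ftp.login("username", "password")      # placeholder credentials

print(ftp.pwd())                       # current remote directory
print(ftp.nlst())                      # file names in that directory
print(ftp.size("README.nluug"))        # size of a remote file in bytes

ftp.mkd("backup")                      # create a remote directory
ftp.rename("old.txt", "new.txt")       # rename a remote file
ftp.delete("new.txt")                  # delete a remote file

ftp.quit()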
