Urllib Tutorial Python 3
Websites can be accessed using the urllib module. You can use it to interact with any website, whether you want to fetch data, post data, or parse data.
Download website
We can download a webpage's HTML using three lines of code:
import urllib.request

html = urllib.request.urlopen('https://arstechnica.com').read()
print(html)
The variable html now contains the webpage data in HTML format. A web browser such as Google Chrome renders this data into the page you normally see.
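Note that read() returns a bytes object in Python 3; to work with the page as text you can decode it. A minimal sketch (the literal below stands in for downloaded data, and UTF-8 is an assumption about the page's encoding):

```python
# read() returns bytes in Python 3; decode it to obtain a str
html = b'<title>Ars Technica</title>'  # stands in for urlopen(...).read()
text = html.decode('utf-8')            # assumes the page is UTF-8 encoded
print(text)  # -> <title>Ars Technica</title>
```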
Web browser
A web browser sends its name and version along with each request; this is known as the user agent. Python can mimic a browser using the code below. The User-Agent string contains the name and version number of the web browser:
import urllib.request

headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"
req = urllib.request.Request('https://arstechnica.com', headers=headers)
html = urllib.request.urlopen(req).read()
print(html)
Parsing data
Given the data of a webpage, we usually want to extract specific pieces of information. You can use the BeautifulSoup module to parse the returned HTML data. Several other modules aim to achieve the same as BeautifulSoup, such as PyQuery and HTMLParser; both are demonstrated later in this article.
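As a small taste of what such parsing looks like, here is a sketch using the bs4 package (installed separately with pip) to pull one tag out of an HTML snippet:

```python
from bs4 import BeautifulSoup

# parse a tiny HTML snippet and extract the text of the first <p> tag
html = "<html><body><p>Hello, parser!</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.find('p').text)  # -> Hello, parser!
```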
Posting data
The code below posts data to a server:
import http.client
import urllib.parse

data = urllib.parse.urlencode({'s': 'Post variable'})
headers = {"Content-type": "application/x-www-form-urlencoded", "Accept": "text/plain"}
h = http.client.HTTPConnection('server', 80)
h.request('POST', '/webpage.php', data, headers)
r = h.getresponse()
print(r.read())
Extract links from webpage (BeautifulSoup)
Web scraping is the technique of extracting data from a website.
The BeautifulSoup module is designed for web scraping: it can handle both HTML and XML, and it provides simple methods for searching, navigating and modifying the parse tree.
Get links from website
The example below prints all links on a webpage:
from bs4 import BeautifulSoup
import urllib.request
import re

html_page = urllib.request.urlopen("http://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))
It downloads the raw HTML code with the line:
html_page = urllib.request.urlopen("http://arstechnica.com")
A BeautifulSoup object is created and we use this object to find all links:
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))
Extract links from website into array
To store the links in an array you can use:
from bs4 import BeautifulSoup
import urllib.request
import re

html_page = urllib.request.urlopen("http://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")
links = []
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    links.append(link.get('href'))
print(links)
Function to extract links from webpage
If you repeatedly extract links, you can use the function below:
from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []
    for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

print(getLinks("http://arstechnica.com"))
HTTP – Parse HTML and XHTML
In this article you will learn how to parse the HTML (HyperText Markup Language) of a website. There are several Python libraries for this; we will demonstrate a few popular ones.
Beautiful Soup – a Python package for parsing HTML and XML
This library is very popular and can even work with malformed markup. To get the contents of a single div, you can use the code below:
from bs4 import BeautifulSoup
import urllib.request

# get the contents
response = urllib.request.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()
parsed_html = BeautifulSoup(html, "html.parser")
print(parsed_html.body.find('div', attrs={'class': 'toc'}))
This will output the HTML code of the div called 'toc' (table of contents) of the Wikipedia article. If you want only the raw text, use:
print(parsed_html.body.find('div', attrs={'class': 'toc'}).text)
If you want to get the page title, you need to get it from the head section:
print(parsed_html.head.find('title').text)
To grab all image URLs from a website, you can use this code:
from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.arstechnica.com/'
data = urllib.request.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
links = soup.find_all('img', src=True)
for link in links:
    print(link["src"])
To grab all URLs from the webpage, use this:
from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.arstechnica.com/'
data = urllib.request.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
links = soup.find_all('a', href=True)
for link in links:
    print(link["href"])
PyQuery – a jQuery-like library for Python
To extract data from the tags we can use PyQuery. It can grab the actual text contents or the HTML contents, depending on what you need. To grab a tag, use the call pq('tag').
from pyquery import PyQuery
import urllib.request

response = urllib.request.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()
pq = PyQuery(html)
tag = pq('div#toc')

# print the text of the div
print(tag.text())

# print the html of the div
print(tag.html())
To get the title simply use:
tag = pq('title')
HTMLParser – Simple HTML and XHTML parser
The usage of this library is quite different: you subclass HTMLParser and put your logic in its handler methods. A basic example of usage below:
from html.parser import HTMLParser
import urllib.request

# create parser
class WebParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Tag: " + tag)

# get the contents
response = urllib.request.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read().decode('utf-8')

# instantiate the parser and feed it some HTML
parser = WebParser()
parser.feed(html)
HTTP download file with Python
The urllib.request module can be used to download data from the web (network resource access). This data can be a file, a website or whatever else you want Python to download. The module supports HTTP, HTTPS, FTP and several other protocols.
In this article you will learn how to download data from the web using Python.
Download text
To download a plain text file use this code:
import urllib.request

response = urllib.request.urlopen('https://wordpress.org/plugins/about/readme.txt')
data = response.read()
print(data)
We get a response object using the urllib.request.urlopen() method, where the parameter is the link. All of the file contents is received with the response.read() method call. After calling this, the file data sits in a Python variable of type bytes; call data.decode('utf-8') if you need it as a string.
Download HTML
This will request the HTML code of a website and output everything to the screen.
import urllib.request

response = urllib.request.urlopen('http://en.wikipedia.org/')
html = response.read()
print(html)
Download file using Python
You can save the data to disk very easily after downloading the file:
import urllib.request

response = urllib.request.urlopen('https://wordpress.org/plugins/about/readme.txt')
data = response.read()

# Write data to file
filename = "test.txt"
file_ = open(filename, 'wb')
file_.write(data)
file_.close()
The first part of the code downloads the file contents into the variable data:
import urllib.request

response = urllib.request.urlopen('https://wordpress.org/plugins/about/readme.txt')
data = response.read()
The second part stores it in a file (the local file does not need to have the same filename):
# Write data to file
filename = "test.txt"
file_ = open(filename, 'wb')
file_.write(data)
file_.close()
The 'wb' mode creates the file (or overwrites it if it already exists) and writes the data as bytes, which is what response.read() returns.
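A slightly safer variant uses a with statement, which closes the file automatically even if an error occurs. A minimal sketch (the sample bytes stand in for downloaded data):

```python
# a context manager closes the file automatically, even on errors
data = b"sample downloaded data"        # stands in for response.read()
with open("test.txt", "wb") as out_file:  # 'wb' because read() returns bytes
    out_file.write(data)

# read the file back to check what was written
with open("test.txt", "rb") as in_file:
    print(in_file.read())  # -> b'sample downloaded data'
```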
Python network sockets programming tutorial
In this tutorial you will learn the basics of network programming. You will learn about the client-server model that is in use for the World Wide Web, e-mail and many other applications.
The client-server model is a model where there are n clients and one server. The clients make data requests to the server, and the server replies to the messages it receives. A client can be any device, such as your computer or tablet. Servers are generally dedicated computers that are connected 24/7.
socket server code
This code will start a simple server using sockets. It waits for a connection; once a connection is received, it outputs the bytes it receives.
#!/usr/bin/env python3
import socket

TCP_IP = '127.0.0.1'
TCP_PORT = 62
BUFFER_SIZE = 20  # Normally 1024, but we want fast response

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind((TCP_IP, TCP_PORT))
s.listen(1)

conn, addr = s.accept()
print('Connection address:', addr)
while True:
    data = conn.recv(BUFFER_SIZE)
    if not data:
        break
    print("received data:", data)
    conn.send(data)  # echo
conn.close()
Execute with:
$ sudo python3 server.py
This starts the server on port 62 (ports below 1024 require root privileges, hence sudo). In a second terminal, open a client with Telnet. If you use the same machine for the client and server, use:
$ telnet 127.0.0.1 62
If you use another machine as the client, type the IP address of the server machine instead. You can find it with ifconfig.
Everything you write in the client will arrive at the server, and the server sends the received messages back.
socket network client:
The client script below sends a message to the server. The server must be running!
#!/usr/bin/env python3
import socket

TCP_IP = '127.0.0.1'
TCP_PORT = 62  # must match the port the server listens on
BUFFER_SIZE = 1024
MESSAGE = b"Hello, World!"

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((TCP_IP, TCP_PORT))
s.send(MESSAGE)
data = s.recv(BUFFER_SIZE)
s.close()
print("received data:", data)
This client simply mimics the behavior we did in Telnet.
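One detail worth knowing: send() may transmit fewer bytes than you pass it, while sendall() keeps sending until the whole message is out. A minimal sketch using socketpair(), which gives us two already-connected sockets so no running server is needed:

```python
import socket

# socketpair() returns two connected sockets, so no server is required
a, b = socket.socketpair()
a.sendall(b"Hello, World!")  # sendall() retries until every byte is sent
data = b.recv(1024)
a.close()
b.close()
print(data)  # -> b'Hello, World!'
```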
Limitations of the server code
The server code above can only interact with one client: if you try to connect from a second terminal, the server simply won't reply to the new client. To let the server interact with multiple clients you need multi-threading. Below we rebuild the server script to accept multiple client connections:
#!/usr/bin/env python3
import socket
from threading import Thread

class ClientThread(Thread):
    def __init__(self, conn, ip, port):
        Thread.__init__(self)
        self.conn = conn  # each thread keeps its own connection
        self.ip = ip
        self.port = port
        print("[+] New thread started for " + ip + ":" + str(port))

    def run(self):
        while True:
            data = self.conn.recv(2048)
            if not data:
                break
            print("received data:", data)
            self.conn.send(data)  # echo
        self.conn.close()

TCP_IP = '0.0.0.0'
TCP_PORT = 62

tcpsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcpsock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
tcpsock.bind((TCP_IP, TCP_PORT))
tcpsock.listen(4)

threads = []
while True:
    print("Waiting for incoming connections...")
    (conn, (ip, port)) = tcpsock.accept()
    newthread = ClientThread(conn, ip, port)
    newthread.start()
    threads.append(newthread)
Application protocol
So far we have simply sent messages back and forth. Every message can have a specific meaning in an application; this is known as the protocol. The meaning of these messages must be the same on both the sender and receiver side. The transport layer below (TCP) makes sure that messages are received, and the internet layer is the IPv4 protocol. All we have to define is the application layer.
Below we modify the server to accept simple commands (we use the non-threading server for simplicity) and change the port to 64. Server code with a protocol:
#!/usr/bin/env python3
import socket

TCP_IP = '127.0.0.1'
TCP_PORT = 64
BUFFER_SIZE = 20  # Normally 1024, but we want fast response

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind((TCP_IP, TCP_PORT))
s.listen(1)

conn, addr = s.accept()
print('Connection address:', addr)
while True:
    data = conn.recv(BUFFER_SIZE).decode()
    if not data:
        break
    print("received data:", data)
    if "/version" in data:
        conn.send(b"Demo version\n")
    if "/echo" in data:
        data = data.replace("/echo", "")
        conn.send((data + "\n").encode())
conn.close()
Run the server with:
sudo python3 server.py
A client can then connect with telnet (make sure you pick the right IP):
$ telnet 127.0.0.1 64
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
message
/version
Demo version
/echo Repeat this
 Repeat this
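Instead of telnet, the protocol can also be exercised from Python. The sketch below starts a one-shot stand-in for the server on an unprivileged port (50064 is an arbitrary choice, so no root is needed) in a background thread, then connects a client and sends the /version command:

```python
import socket
import threading

HOST, PORT = '127.0.0.1', 50064  # arbitrary unprivileged port
ready = threading.Event()

def serve_once():
    # minimal one-connection stand-in for the protocol server above
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((HOST, PORT))
    srv.listen(1)
    ready.set()  # signal that the server is listening
    conn, addr = srv.accept()
    data = conn.recv(1024).decode()
    if "/version" in data:
        conn.send(b"Demo version\n")
    conn.close()
    srv.close()

threading.Thread(target=serve_once, daemon=True).start()
ready.wait()  # don't connect before the server is listening

# the client side: send a protocol command and read the reply
c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c.connect((HOST, PORT))
c.send(b"/version\n")
reply = c.recv(1024)
c.close()
print(reply)  # -> b'Demo version\n'
```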
FTP client in Python
This article will show you how to use the File Transfer Protocol (FTP) with Python from a client-side perspective. We use ftplib, a library that implements the FTP protocol. Using FTP we can create and access remote files through function calls.
Directory listing
We can list the root directory using this little snippet:
import ftplib

ftp = ftplib.FTP("ftp.nluug.nl")
ftp.login("anonymous", "ftplib-example-1")

data = []
ftp.dir(data.append)
ftp.quit()

for line in data:
    print("-", line)
This will output the directory contents in a simple console-style listing. If you want to show a specific directory, change the directory after connecting with the ftp.cwd('/') function, where the parameter is the directory you want to change to:
import ftplib

ftp = ftplib.FTP("ftp.nluug.nl")
ftp.login("anonymous", "ftplib-example-1")

data = []
ftp.cwd('/pub/')  # change directory to /pub/
ftp.dir(data.append)
ftp.quit()

for line in data:
    print("-", line)
Download file
To download a file we use the retrbinary() function. An example below:
import ftplib

def getFile(ftp, filename):
    try:
        ftp.retrbinary("RETR " + filename, open(filename, 'wb').write)
    except ftplib.all_errors as e:
        print("Error:", e)

ftp = ftplib.FTP("ftp.nluug.nl")
ftp.login("anonymous", "ftplib-example-1")
ftp.cwd('/pub/')  # change directory to /pub/
getFile(ftp, 'README.nluug')
ftp.quit()
Uploading files
We can upload files using the storlines() and storbinary() commands. The example below uploads the file README.nluug to the current remote directory; if you want to upload to another directory, combine it with the cwd() function.
import ftplib
import os

def upload(ftp, filename):
    ext = os.path.splitext(filename)[1]
    if ext in (".txt", ".htm", ".html"):
        with open(filename, "rb") as f:   # storlines needs a binary file in Python 3
            ftp.storlines("STOR " + filename, f)
    else:
        with open(filename, "rb") as f:
            ftp.storbinary("STOR " + filename, f, 1024)

ftp = ftplib.FTP("127.0.0.1")
ftp.login("username", "password")
upload(ftp, "README.nluug")
Other functions
For other functions please refer to the official library documentation.