
# Python urllib

Interacting with websites programmatically is a foundational aspect of modern data analysis, and Python’s urllib module is a pivotal tool for this. Whether you aim to retrieve, post, or parse data from the web, urllib has got you covered.

However, if you’re venturing into the realms of web scraping or data mining, remember: while urllib is an excellent tool, it’s just part of a broader arsenal. While it fetches the data effectively, emulating more intricate web browser behaviors may require supplementary tools.

📘 Recommended Course:

Dive Deep into Web Scraping: BeautifulSoup & Scrapy Framework in Python

# Fetching Web Content: A Primer with urllib

Downloading an entire webpage’s HTML content can be distilled into a few lines:

```python
import urllib.request

html = urllib.request.urlopen('https://arstechnica.com').read()
print(html)
```

The resulting `html` variable now holds the webpage's content as raw HTML. Browsers like Firefox or Chrome render this raw markup into the visually engaging pages we interact with daily.
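
Note that `read()` returns raw bytes rather than a string. A minimal sketch of turning it into text, assuming the page is UTF-8 encoded:

```python
import urllib.request

# read() returns bytes; decode them to work with the markup as text.
# UTF-8 is an assumption here; check the Content-Type header if unsure.
raw = urllib.request.urlopen('https://arstechnica.com').read()
text = raw.decode('utf-8')
print(text[:200])  # the first 200 characters of the HTML
```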

# Emulating Web Browsers: Sending Requests the Smart Way

Every time web browsers communicate with web servers, they transmit a unique "User-Agent" string. This string, essentially a browser fingerprint, helps servers tailor their responses. Python can replicate this behavior:

```python
import urllib.request

headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"
req = urllib.request.Request('https://arstechnica.com', headers=headers)
html = urllib.request.urlopen(req).read()
print(html)
```
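
To confirm the header actually went out, you can point the same request at an echo service; the sketch below uses httpbin.org, which simply reports back the User-Agent it received:

```python
import urllib.request

headers = {'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"}
req = urllib.request.Request('https://httpbin.org/user-agent', headers=headers)

# httpbin echoes the received User-Agent back as a small JSON document
print(urllib.request.urlopen(req).read().decode('utf-8'))
```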

# Beyond Raw Data: Parsing with Precision

Armed with the raw data from a webpage, the logical next step is to extract meaningful information from it. Enter BeautifulSoup, a module tailored for dissecting HTML.

Using BeautifulSoup, you can search, navigate, and modify the parse tree with simple methods.

Several other modules, such as PyQuery and HTMLParser, offer functionality akin to BeautifulSoup and are worth exploring in their own right.
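
As a first taste of BeautifulSoup, here is a minimal sketch, assuming the `beautifulsoup4` package is installed (`pip install beautifulsoup4`):

```python
from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen('https://arstechnica.com').read()
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)      # the text of the <title> tag
print(soup.find('h1'))        # the first <h1> element, if any
print(soup.get_text()[:200])  # page text with all markup stripped
```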

# Posting Data: A Brief Guide

To send data to a server via POST, you can utilize the following approach:

```python
import urllib.parse
import urllib.request

# Encode the form fields as application/x-www-form-urlencoded bytes
data = urllib.parse.urlencode({'s': 'Post variable'}).encode('utf-8')
headers = {"Content-type": "application/x-www-form-urlencoded", "Accept": "text/plain"}

# Passing data makes urlopen issue a POST request ('server' is a placeholder host)
req = urllib.request.Request('https://server/webpage.php', data=data, headers=headers)
response = urllib.request.urlopen(req)
print(response.read())
```
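
The same `urlencode` helper also builds query strings for GET requests; a quick sketch (the host and parameter names are placeholders):

```python
import urllib.parse

# For GET, append the encoded parameters to the URL instead of the body
params = urllib.parse.urlencode({'s': 'search term', 'page': 2})
url = 'https://server/webpage.php?' + params
print(url)  # https://server/webpage.php?s=search+term&page=2
```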

Continue your Python web exploration journey:

# How to Get All Page URLs from a Website

Web scraping is a technique for extracting data from websites.

The BeautifulSoup module is designed for web scraping. It can handle both HTML and XML, and it provides simple methods for searching, navigating, and modifying the parse tree.
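
A small illustration of all three operations on an inline snippet (a sketch, assuming the `beautifulsoup4` package is installed):

```python
from bs4 import BeautifulSoup

doc = "<html><body><p class='intro'>Hello</p><p>World</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# Searching: find a tag by name and attribute
intro = soup.find('p', attrs={'class': 'intro'})
print(intro.text)                          # Hello

# Navigating: move to the next sibling in the parse tree
print(intro.find_next_sibling('p').text)   # World

# Modifying: replace the tag's contents in place
intro.string = "Hi"
print(soup.p.text)                         # Hi
```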

Related course:
Browser Automation with Python Selenium

# Get links from a website


The example below prints all links on a webpage:


```python
from bs4 import BeautifulSoup
import urllib.request
import re

html_page = urllib.request.urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))
```

It downloads the raw HTML code with the line:

```python
html_page = urllib.request.urlopen("https://arstechnica.com")
```

A BeautifulSoup object is created, and we use this object to find all links:

```python
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))
```
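
Note that the regular expression only matches absolute links that start with `http://`; many pages use `https://` or relative URLs. A sketch of resolving relative links with `urllib.parse.urljoin` (not part of the original example):

```python
from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urljoin

base = "https://arstechnica.com"
soup = BeautifulSoup(urllib.request.urlopen(base), "html.parser")

for link in soup.find_all('a', href=True):
    # urljoin resolves relative hrefs such as "/news" against the base URL
    print(urljoin(base, link['href']))
```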

# Extract links from a website into a list


To store the links in a list you can use:


```python
from bs4 import BeautifulSoup
import urllib.request
import re

html_page = urllib.request.urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")
links = []

for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    links.append(link.get('href'))

print(links)
```

# Function to extract links from a webpage


If you repeatedly extract links, you can use the function below:


```python
from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []

    for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    return links

print(getLinks("https://arstechnica.com"))
```


Related articles:

- python parse html
- python download file from url
- python socket
- python ftp client
- pop3
- irc bot