Pandas

Python Pandas is a Data Analysis Library (high-performance). It contains data structures to make working with structured data and time series easy. Key features are:

  • A DataFrame object: easy data manipulation
  • Read/Write data from various sources: Microsoft Excel, CSV, SQL databases, HDF5
  • Reshaping, slicing, indexing and much more..
Data Analysis with Python Pandas

Articles
Articles related to Data Analysis with Pandas:

Common Data Structures
Common data structures:

  • Tabular data (SQL table, Excel spreadsheet)
  • Time series data
  • Matrix data
  • Statistical data sets

Pandas Data Structures

Pandas, a data analysis library, supports two data structures:

  • Series
  • DataFrames

Related courses
Data Analysis with Python Pandas

Series
A series can be seen as a one-dimensional array.

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
 
d = { 'Netherlands': 17, 'US': 318, 'Canada': 35, 'France': 66,'UK': 64}
population = pd.Series(d)
print(population)

To get a single value use:

print(population['US'])

To get a subset:

print(population[['US','Canada','UK']])

You can also use operators on the series:

print(population[population > 60])

Data Frames
The data frame datastructure is similar to a table.

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
 
d = { 'name': ['Bob','Bart','Bobby'],
      'occupation': ['Lawyer','Programmer','Teacher']}
 
frame = pd.DataFrame(d, columns=['name','occupation'])
print(frame)

which matches this table:

Example Dataframe
Example Dataframe

Data may be defined manually or loaded from:

Pandas Filter

Filtering rows of a DataFrame is an almost mandatory task for Data Analysis with Python. Given a Data Frame, we may not be interested in the entire dataset but only in specific rows.

Related course:
Data Analysis with Python Pandas

Filter using query
A data frames columns can be queried with a boolean expression. Every frame has the module query() as one of its objects members.

We start by importing pandas, numpy and creating a dataframe:

import pandas as pd
import numpy as np
 
data = {'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
        'year': [2017, 2017, 2017, 2017, 2017],
        'salary': [40000, 24000, 31000, 20000, 30000]}
 
df = pd.DataFrame(data, index = ['Acme', 'Acme', 'Bilbao', 'Bilbao', 'Bilbao'])
 
print(df)

This will create the data frame containing:

dataframe

After creation of the Data Frame, we call the query method with a boolean expression. This expression is based on the column names that we defined as ‘ABCD’. The query method will return a new filtered data frame.

df_filtered = df.query('salary>30000')
print(df_filtered)

This will return:

filter

Total code of data frame creation and filter using boolean expression:

import pandas as pd
import numpy as np
 
data = {'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
        'year': [2017, 2017, 2017, 2017, 2017],
        'salary': [40000, 24000, 31000, 20000, 30000]}
 
df = pd.DataFrame(data, index = ['Acme', 'Acme', 'Bilbao', 'Bilbao', 'Bilbao'])
 
print(df)
print('----------')
 
df_filtered = df.query('salary>30000')
print(df_filtered)

Filter by indexing, chain methods
Instead of queries, we can use in-dices.
We do that by using an array index with boolean expressions:

df_filtered = df[(df.salary >= 30000) & (df.year == 2017)]
print(df_filtered)

This will return:

filter-index

Pandas groupby

DataFrames can be summarized using the groupby method. In this article we’ll give you an example of how to use the groupby method.

Related course:
Data Analysis with Python Pandas

Pandas groupby
Start by importing pandas, numpy and creating a data frame. Our data frame contains simple tabular data:

pandas dataframe

In code the same table is:

import pandas as pd
import numpy as np
 
df1 = pd.DataFrame( { 
    "Name" : ["Alice", "Ada", "Mallory", "Mallory", "Billy" , "Mallory"] , 
    "City" : ["Sydney", "Sydney", "Paris", "Sydney", "Sydney", "Paris"]} )

We can then summarize the data using the groupby method:

print df1.groupby(["City"])[['Name']].count()

This will count the frequency of each city and return a new data frame:

pandas groupby

The total code being:

import pandas as pd
import numpy as np
 
df1 = pd.DataFrame( { 
    "Name" : ["Alice", "Ada", "Mallory", "Mallory", "Billy" , "Mallory"] , 
    "City" : ["Sydney", "Sydney", "Paris", "Sydney", "Sydney", "Paris"]} )
 
 
df2 = df1.groupby(["City"])[['Name']].count()
print(df2)

Read xls with Pandas

Pandas, a data analysis library, has native support for loading excel data (xls and xlsx).
The method read_excel loads xls data into a Pandas dataframe:

read_excel(filename)

If you have a large excel file you may want to specify the sheet:

df = pd.read_excel(file, sheetname='Elected presidents')

Related course
Data Analysis with Python Pandas

Read excel with Pandas
The code below reads excel data into a Python dataset (the dataset can be saved below).

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
 
file = r'data/Presidents.xls'
df = pd.read_excel(file)
print(df['Occupation'])

The dataframe can be used, as shown in the example below:

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
 
file = r'data/Presidents.xls'
df = pd.read_excel(file)
 
# remove messy data
df = df[df['Years in office'] != 'n/a']
 
# show data
print('Min: ', df['Years in office'].min())
print('Max: ', df['Years in office'].max())
print('Sum: ', df['Years in office'].sum())

Dataset
For purpose of demonstration, you can use the dataset from: depaul.edu.

xls dataset
A large dataset stored in XLS format

SQLite Database with Pandas

An SQLite database can be read directly into Python Pandas (a data analysis library). In this article we’ll demonstrate loading data from an SQLite database table into a Python Pandas Data Frame. We’ll also briefly cover the creation of the sqlite database table using Python.

Related course
Data Analysis with Python Pandas

SQLite dataset

We create a simple dataset using this code:

import sqlite3 as lite
import sys
 
con = lite.connect('population.db')
 
with con:
    cur = con.cursor()    
    cur.execute("CREATE TABLE Population(id INTEGER PRIMARY KEY, country TEXT, population INT)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Germany',81197537)")
    cur.execute("INSERT INTO Population VALUES(NULL,'France', 66415161)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Spain', 46439864)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Italy', 60795612)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Spain', 46439864)")

It creates the SQLite database containing one table with dummy data.

sqlite-dataset
SQLite dataset created from script

Sqlite to Python Panda Dataframe

An SQL query result can directly be stored in a panda dataframe:

import pandas as pd
import sqlite3
 
conn = sqlite3.connect('population.db')
query = "SELECT country FROM Population WHERE population > 50000000;"
 
df = pd.read_sql_query(query,conn)
 
for country in df['country']:
    print(country)

We connect to the SQLite database using the line:

conn = sqlite3.connect('population.db')

The line that converts SQLite data to a Panda data frame is:

df = pd.read_sql_query(query,conn)

where query is a traditional SQL query.

The dataframe (df) will contain the actual data.

Posts navigation

1 2