
Data Analysis with Pandas (Guide)

Pandas is a high-performance data analysis library for Python. It contains data structures that make working with structured data and time series easy. Key features are:

  • A DataFrame object: easy data manipulation
  • Read/Write data from various sources: Microsoft Excel, CSV, SQL databases, HDF5
  • Reshaping, slicing, indexing and much more.

Related course: Data Analysis with Python Pandas

Common Data Structures
Pandas is a good fit for several common kinds of data (a minimal sketch of the first two follows the list):
  • Tabular data (SQL table, Excel spreadsheet)
  • Time series data
  • Matrix data
  • Statistical data sets
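
As a rough illustration, the first two map naturally onto pandas objects: tabular data becomes a DataFrame, a time series becomes a Series with a DatetimeIndex. A minimal sketch (the values are made up for illustration):


import pandas as pd

# tabular data: a DataFrame, comparable to a SQL table or spreadsheet
table = pd.DataFrame({'product': ['pen', 'book'], 'price': [1.5, 12.0]})

# time series data: a Series indexed by dates
dates = pd.date_range('2017-01-01', periods=4, freq='D')
series = pd.Series([10, 12, 9, 14], index=dates)

print(table)
print(series)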


Pandas Filter

Filtering rows of a DataFrame is an almost mandatory task for Data Analysis with Python. Given a Data Frame, we may not be interested in the entire dataset but only in specific rows.

Related course:
Data Analysis with Python Pandas

Filter using query
A DataFrame's columns can be queried with a boolean expression. Every DataFrame has the query() method as one of its members.


We start by importing pandas, numpy and creating a dataframe:


import pandas as pd
import numpy as np

data = {'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
        'year': [2017, 2017, 2017, 2017, 2017],
        'salary': [40000, 24000, 31000, 20000, 30000]}

df = pd.DataFrame(data, index=['Acme', 'Acme', 'Bilbao', 'Bilbao', 'Bilbao'])

print(df)

This will create the data frame containing:

           name  year  salary
Acme      Alice  2017   40000
Acme        Bob  2017   24000
Bilbao  Charles  2017   31000
Bilbao    David  2017   20000
Bilbao     Eric  2017   30000

After creating the data frame, we call the query() method with a boolean expression. The expression refers to the column names we defined (name, year, salary). The query method returns a new, filtered data frame.


df_filtered = df.query('salary>30000')
print(df_filtered)

This will return:

           name  year  salary
Acme      Alice  2017   40000
Bilbao  Charles  2017   31000

The complete code for creating the data frame and filtering it with a boolean expression:


import pandas as pd
import numpy as np

data = {'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
        'year': [2017, 2017, 2017, 2017, 2017],
        'salary': [40000, 24000, 31000, 20000, 30000]}

df = pd.DataFrame(data, index=['Acme', 'Acme', 'Bilbao', 'Bilbao', 'Bilbao'])

print(df)
print('----------')

df_filtered = df.query('salary>30000')
print(df_filtered)

Filter by indexing, chain methods
Instead of queries, we can filter with indices.
We do that by indexing the data frame with boolean expressions:


df_filtered = df[(df.salary >= 30000) & (df.year == 2017)]
print(df_filtered)

This will return:

           name  year  salary
Acme      Alice  2017   40000
Bilbao  Charles  2017   31000
Bilbao     Eric  2017   30000

Related course:
Data Analysis with Python Pandas

Pandas GroupBy: Group Data in Python

DataFrame data can be summarized using the groupby() method. In this article we'll show you how to use it with a few examples.

This tutorial assumes you have some basic experience with Python pandas, including data frames, series and so on. If you are new to Pandas, I recommend taking the course below.

Related course:
Data Analysis with Python and Pandas: Go from zero to hero

What does groupby do?

The idea of groupby() is pretty simple: create groups of categories and apply a function to them. Groupby has a process of splitting, applying and combining data.

  • splitting: the data is split into groups
  • applying: a function is applied to each group
  • combining: the results of the groups are combined into a new data structure (see the sketch below)
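
The three steps can also be made explicit in code. A small sketch with made-up data (get_group() only inspects one group; the count performs the apply and combine steps):


import pandas as pd

df = pd.DataFrame({'City': ['Paris', 'Sydney', 'Paris'],
                   'Name': ['Mallory', 'Alice', 'Eve']})

grouped = df.groupby('City')        # split: one group per city
print(grouped.get_group('Paris'))   # look at a single group
print(grouped['Name'].count())      # apply + combine: count per group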

If you programmed databases (SQL) before, you may be familiar with a query like this:

SELECT city, count(name)
FROM df
GROUP BY city
ORDER BY city;

Pandas groupby does a similar thing. In pandas you would write:

print(df1.groupby(["City"])[['Name']].count())

Installing pandas

If you don’t have the pandas data analysis module installed, you can run the commands:

$ python -m venv pandas-example
$ source ./pandas-example/bin/activate
$ python -m pip install pandas

This sets up a virtual environment and installs the pandas module inside it.
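
To confirm the installation worked, you can print the installed version (a quick check, not part of the original setup):


import pandas as pd
print(pd.__version__)   # prints the installed pandas version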

Pandas groupby example

Start by importing pandas:

import pandas as pd

Our data frame contains simple tabular data:

      Name    City
0    Alice  Sydney
1      Ada  Sydney
2  Mallory   Paris
3  Mallory  Sydney
4    Billy  Sydney
5  Mallory   Paris

In code the same table is:


import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "Name": ["Alice", "Ada", "Mallory", "Mallory", "Billy", "Mallory"],
    "City": ["Sydney", "Sydney", "Paris", "Sydney", "Sydney", "Paris"]})

You can then summarize the data using the groupby method. In our example there are two columns: Name and City.

The groupby() function takes as parameter the column you want to group on.
Then you select the column(s) on which you want to do the aggregation.

print(df1.groupby(["City"])[['Name']].count())

This will count the frequency of each city and return a new data frame:

        Name
City
Paris      2
Sydney     4

The total code being:


import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "Name": ["Alice", "Ada", "Mallory", "Mallory", "Billy", "Mallory"],
    "City": ["Sydney", "Sydney", "Paris", "Sydney", "Sydney", "Paris"]})

df2 = df1.groupby(["City"])[['Name']].count()
print(df2)

Example 2

The groupby() operation can be applied to any pandas data frame.
Let's do some quick examples.

The data frame below defines a list of animals and their speed measurements.

>>> df = pd.DataFrame({'Animal': ['Elephant','Cat','Cat','Horse','Horse','Cheetah','Cheetah'],
...                    'Speed': [20,30,27,50,45,70,66]})
>>> df
     Animal  Speed
0  Elephant     20
1       Cat     30
2       Cat     27
3     Horse     50
4     Horse     45
5   Cheetah     70
6   Cheetah     66
>>>

You can group by animal and take the average speed.

>>> df.groupby(['Animal']).mean()
          Speed
Animal
Cat        28.5
Cheetah    68.0
Elephant   20.0
Horse      47.5
>>>

If you have multiple columns in your table like so:

>>> df = pd.DataFrame({'Animal': ['Elephant','Cat','Cat','Horse','Horse','Cheetah', 'Cheetah'], 
... 'Speed': [20,30,26,50,45,70,60],
... 'Length': [8,0.5,0.6,2,2.1,1.8,1.7]})

Then you can select the column you want to aggregate like this:

>>> df.groupby("Animal")["Speed"].mean()
>>> df.groupby("Animal")["Length"].mean()

Example 3: Real World Data

The Iris flower data set contains data on several flower species and their measurements. You can load the whole data set from a CSV file like this:

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

You can read any CSV file with the read_csv() function this way, even directly from the web.

To give you some insight into the dataset:

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

You can retrieve the same with:

>>> iris.head()

You can easily retrieve the minimum and maximum of a column

>>> iris['sepal_width'].min()
2.0
>>> iris['sepal_width'].max()
4.4
>>>

You’ve seen the basic groupby before. You can apply groupby while finding the average sepal width.

This then returns the average sepal width for each species.

>>> iris.groupby("species")["sepal_width"].mean()
species
setosa 3.428
versicolor 2.770
virginica 2.974
Name: sepal_width, dtype: float64
>>>

If you want the minimum sepal width for each species, you'd use:

>>> iris.groupby("species")["sepal_width"].min()
species
setosa 2.3
versicolor 2.0
virginica 2.2
Name: sepal_width, dtype: float64
>>>

Conclusion

We’ve covered the groupby() function extensively. You can now apply the function to any data frame, regardless of whether it's a toy dataset or a real-world dataset.

You’ve learned:

  • how to load a real world data set in Pandas (from the web)
  • how to apply the groupby function to that real world data.

If you are interested in learning more about Pandas, check out this course:
Data Analysis with Python and Pandas: Go from zero to hero

Read xls with Pandas

Pandas, a data analysis library, has native support for loading Excel data (xls and xlsx).
The method read_excel loads Excel data into a Pandas dataframe:


pd.read_excel(filename)

If your Excel file contains multiple sheets, you may want to specify the sheet to load:


df = pd.read_excel(file, sheet_name='Elected presidents')
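
If you don't know the sheet names in advance, you can list them first with pd.ExcelFile. A small sketch (the file path is the one used elsewhere in this article):


import pandas as pd

xls = pd.ExcelFile('data/Presidents.xls')
print(xls.sheet_names)                  # every sheet in the workbook
df = pd.read_excel(xls, sheet_name=xls.sheet_names[0])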

Related course
Data Analysis with Python Pandas

Read excel with Pandas
The code below reads Excel data into a Pandas data frame (the dataset itself can be downloaded below).


from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd

file = r'data/Presidents.xls'
df = pd.read_excel(file)
print(df['Occupation'])

The dataframe can be used, as shown in the example below:


from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd

file = r'data/Presidents.xls'
df = pd.read_excel(file)

# remove messy data
df = df[df['Years in office'] != 'n/a']

# show data
print('Min: ', df['Years in office'].min())
print('Max: ', df['Years in office'].max())
print('Sum: ', df['Years in office'].sum())

Dataset
For demonstration purposes, you can use the dataset from: depaul.edu (a large dataset stored in XLS format).

SQLite Database with Pandas

An SQLite database can be read directly into Python Pandas (a data analysis library). In this article we’ll demonstrate loading data from an SQLite database table into a Python Pandas Data Frame. We’ll also briefly cover the creation of the sqlite database table using Python.

Related course
Data Analysis with Python Pandas

SQLite dataset


We create a simple dataset using this code:


import sqlite3 as lite

con = lite.connect('population.db')

with con:
    cur = con.cursor()
    cur.execute("CREATE TABLE Population(id INTEGER PRIMARY KEY, country TEXT, population INT)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Germany',81197537)")
    cur.execute("INSERT INTO Population VALUES(NULL,'France', 66415161)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Spain', 46439864)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Italy', 60795612)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Spain', 46439864)")

It creates the SQLite database containing one table with dummy data.


SQLite to Pandas DataFrame


An SQL query result can be stored directly in a pandas dataframe:


import pandas as pd
import sqlite3

conn = sqlite3.connect('population.db')
query = "SELECT country FROM Population WHERE population > 50000000;"

df = pd.read_sql_query(query, conn)

for country in df['country']:
    print(country)

We connect to the SQLite database using the line:


conn = sqlite3.connect('population.db')

The line that converts SQLite data to a pandas data frame is:


df = pd.read_sql_query(query,conn)

where query is a traditional SQL query.

The dataframe (df) will contain the actual data.
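
The reverse direction works as well: a data frame can be written back to SQLite with to_sql(). A minimal sketch (the table name population_copy is made up for illustration):


import sqlite3
import pandas as pd

conn = sqlite3.connect('population.db')
df = pd.read_sql_query("SELECT * FROM Population", conn)

# write the frame to a new table; replace it if it already exists
df.to_sql('population_copy', conn, if_exists='replace', index=False)
conn.close()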

Visualize data with Pandas

Get the xls data for this tutorial from: depaul.edu. This dataset contains a list of US presidents, associated parties, profession and more.


Related course
Data Analysis with Python Pandas

Beautiful Plots with Pandas
We can plot data of this large excel file with a few lines of code. We select the column “Occupation” for this demonstration using:


df['Occupation']

Full code:


from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd

file = r'data/Presidents.xls'
df = pd.read_excel(file)

# plot data
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral','red','green','blue','orange','white','brown']
df['Occupation'].value_counts().plot(kind='pie',title='Occupation by President',colors=colors)
plt.show()

[pie chart: Occupation by President]

Data cleaning and plotting
We can also plot the popularity by president. Before we do that, we need to clean some data:


Some of the cells do not contain numerical values; we can either remove those rows or replace the values:


#df = df.replace('NA()',0)
df = df[df['% popular'] != 'NA()']
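
Another option is to convert the column to numbers and drop whatever cannot be parsed. A sketch using pd.to_numeric on the same df (errors='coerce' turns entries such as 'NA()' into NaN):


# convert '% popular' to numeric; non-parsable entries become NaN
df['% popular'] = pd.to_numeric(df['% popular'], errors='coerce')
df = df.dropna(subset=['% popular'])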

To plot the popularity:


from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd

file = r'data/Presidents.xls'
df = pd.read_excel(file)

# plot data
#df = df.replace('NA()',0)
df = df[df['% popular'] != 'NA()']

print( df['% popular'] )
df['% popular'].plot(kind='hist', bins=8, title='Popularity by President', facecolor='blue', alpha=0.5, density=True)
plt.show()

[histogram: Popularity by President]

Pandas read CSV

Pandas is a data analysis module. It provides you with high-performance, easy-to-use data structures and data analysis tools.

In this article you will learn how to read a csv file with Pandas.

Related course
Data Analysis with Python Pandas

Read CSV with Python Pandas
We create a comma-separated values (CSV) file:


Names,Highscore,
Mel,8,
Jack,5,
David,3,
Peter,6,
Maria,5,
Ryan,9,

Imported into Excel, it looks like this:


The data can be read using:


from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd

file = r'highscore.csv'
df = pd.read_csv(file)
print(df)

The first lines import the Pandas module. The read_csv method loads the data into a Pandas dataframe that we named df.

Dataframes
A dataframe can be manipulated using methods; the minimum and maximum can easily be extracted:


from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd

file = r'highscore.csv'
df = pd.read_csv(file)
print('Max', df['Highscore'].max())
print('Min', df['Highscore'].min())

This prints:

Max 9
Min 3

The dataset in this example is very small, but a dataset can easily contain thousands or millions of records.
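
For files that are too large to load in one go, read_csv can process the data in chunks. A sketch (here 'highscore.csv' stands in for a much bigger file, and the chunk size is arbitrary):


import pandas as pd

total = 0
for chunk in pd.read_csv('highscore.csv', chunksize=100000):
    total += chunk['Highscore'].sum()   # aggregate each chunk separately
print('Sum of all highscores:', total)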

Read Excel with Pandas

Excel files can be read using the Python module Pandas. In this article we will show how to read them into a data frame.

Related course:
Data Analysis with Python Pandas

Read Excel column names
We import the pandas module, including ExcelFile. The method read_excel() reads the data into a Pandas Data Frame, where the first parameter is the filename and the second parameter is the sheet.

The list of columns is available as df.columns.


import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

df = pd.read_excel('File.xlsx', sheet_name='Sheet1')

print("Column headings:")
print(df.columns)

Using the data frame, we can get all the values of an entire column. To do so, simply index with the column header:


print(df['Sepal width'])

Read Excel data
We start with a simple Excel file, a subset of the Iris dataset.

[screenshot: Excel sheet with the Iris subset]

To iterate over the rows we can use a loop:


for i in df.index:
    print(df['Sepal width'][i])

We can save an entire column into a variable (a pandas Series):


listSepalWidth = df['Sepal width']
print(listSepalWidth[0])
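
If you need a plain Python list instead of a Series, call tolist() on the column (a small sketch using the same column):


listSepalWidth = df['Sepal width'].tolist()   # plain Python list of values
print(listSepalWidth[:5])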

We can simply take entire columns from an excel sheet:


import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

df = pd.read_excel('File.xlsx', sheet_name='Sheet1')

sepalWidth = df['Sepal width']
sepalLength = df['Sepal length']
petalLength = df['Petal length']

Related course:
Data Analysis with Python Pandas

Write Excel with Pandas

Excel files can be created in Python using the module Pandas. In this article we will show how to create an excel file using Python.

Related course:
Data Analysis with Python Pandas

Write Excel
We start by importing the module pandas. From the module we import ExcelWriter and ExcelFile.
The next step is to create a data frame. In the data frame we put lists, using the column name as the key:


df = pd.DataFrame({'a': [1,3,5,7,4,5,6,4,7,8,9],
                   'b': [3,5,6,2,4,6,7,8,7,8,9]})

In this short code snippet, 'a' and 'b' are the column names. The lists are the data these columns will contain.

Then we write the data frame to an Excel file using:


writer = ExcelWriter('Pandas-Example2.xlsx')
df.to_excel(writer, sheet_name='Sheet1', index=False)
writer.close()   # saves and closes the file

The complete code:


import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import numpy as np

df = pd.DataFrame({'a': [1,3,5,7,4,5,6,4,7,8,9],
                   'b': [3,5,6,2,4,6,7,8,7,8,9]})

writer = ExcelWriter('Pandas-Example2.xlsx')
df.to_excel(writer, sheet_name='Sheet1', index=False)
writer.close()   # saves and closes the file

The output of this code is the Excel file Pandas-Example2.xlsx, containing the two columns of data.
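
To write more than one sheet into the same workbook, you can keep the writer open and call to_excel() once per sheet. A sketch with a made-up second data frame (the context manager saves and closes the file automatically):


import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [4, 5, 6]})

with pd.ExcelWriter('Pandas-Example3.xlsx') as writer:
    df1.to_excel(writer, sheet_name='First', index=False)
    df2.to_excel(writer, sheet_name='Second', index=False)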

