Pandas Data Structures

Pandas, a data analysis library, supports two data structures:

  • Series
  • DataFrames

Related courses
Data Analysis in Python with Pandas

Series
A series can be seen as a one-dimensional array.

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
 
d = { 'Netherlands': 17, 'US': 318, 'Canada': 35, 'France': 66,'UK': 64}
population = pd.Series(d)
print(population)

To get a single value use:

print(population['US'])

To get a subset:

print(population[['US','Canada','UK']])

You can also use operators on the series:

print(population[population > 60])

Data Frames
The data frame datastructure is similar to a table.

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
 
d = { 'name': ['Bob','Bart','Bobby'],
      'occupation': ['Lawyer','Programmer','Teacher']}
 
frame = pd.DataFrame(d, columns=['name','occupation'])
print(frame)

which matches this table:

Example Dataframe
Example Dataframe

Data may be defined manually or loaded from:

Pandas Filter

Filtering rows of a DataFrame is an almost mandatory task for Data Analysis with Python. Given a Data Frame, we may not be interested in the entire dataset but only in specific rows.

Related course:
Data Analysis in Python with Pandas

Filter using query
A data frames columns can be queried with a boolean expression. Every frame has the module query() as one of its objects members.

We start by importing pandas, numpy and creating a dataframe:

import pandas as pd
import numpy as np
 
data = {'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
        'year': [2017, 2017, 2017, 2017, 2017],
        'salary': [40000, 24000, 31000, 20000, 30000]}
 
df = pd.DataFrame(data, index = ['Acme', 'Acme', 'Bilbao', 'Bilbao', 'Bilbao'])
 
print(df)

This will create the data frame containing:

dataframe

After creation of the Data Frame, we call the query method with a boolean expression. This expression is based on the column names that we defined as ‘ABCD’. The query method will return a new filtered data frame.

df_filtered = df.query('salary>30000')
print(df_filtered)

This will return:

filter

Total code of data frame creation and filter using boolean expression:

import pandas as pd
import numpy as np
 
data = {'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
        'year': [2017, 2017, 2017, 2017, 2017],
        'salary': [40000, 24000, 31000, 20000, 30000]}
 
df = pd.DataFrame(data, index = ['Acme', 'Acme', 'Bilbao', 'Bilbao', 'Bilbao'])
 
print(df)
print('----------')
 
df_filtered = df.query('salary>30000')
print(df_filtered)

Filter by indexing, chain methods
Instead of queries, we can use in-dices.
We do that by using an array index with boolean expressions:

df_filtered = df[(df.salary >= 30000) & (df.year == 2017)]
print(df_filtered)

This will return:

filter-index

Pandas groupby

DataFrames can be summarized using the groupby method. In this article we’ll give you an example of how to use the groupby method.

Related course:
Data Analysis in Python with Pandas

Pandas groupby
Start by importing pandas, numpy and creating a data frame. Our data frame contains simple tabular data:

pandas dataframe

In code the same table is:

import pandas as pd
import numpy as np
 
df1 = pd.DataFrame( { 
    "Name" : ["Alice", "Ada", "Mallory", "Mallory", "Billy" , "Mallory"] , 
    "City" : ["Sydney", "Sydney", "Paris", "Sydney", "Sydney", "Paris"]} )

We can then summarize the data using the groupby method:

print df1.groupby(["City"])[['Name']].count()

This will count the frequency of each city and return a new data frame:

pandas groupby

The total code being:

import pandas as pd
import numpy as np
 
df1 = pd.DataFrame( { 
    "Name" : ["Alice", "Ada", "Mallory", "Mallory", "Billy" , "Mallory"] , 
    "City" : ["Sydney", "Sydney", "Paris", "Sydney", "Sydney", "Paris"]} )
 
 
df2 = df1.groupby(["City"])[['Name']].count()
print(df2)

Read xls with Pandas

Pandas, a data analysis library, has native support for loading excel data (xls and xlsx).
The method read_excel loads xls data into a Pandas dataframe:

read_excel(filename)

If you have a large excel file you may want to specify the sheet:

df = pd.read_excel(file, sheetname='Elected presidents')

Related course
Data Analysis in Python with Pandas

Read excel with Pandas
The code below reads excel data into a Python dataset (the dataset can be saved below).

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
 
file = r'data/Presidents.xls'
df = pd.read_excel(file)
print(df['Occupation'])

The dataframe can be used, as shown in the example below:

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
 
file = r'data/Presidents.xls'
df = pd.read_excel(file)
 
# remove messy data
df = df[df['Years in office'] != 'n/a']
 
# show data
print('Min: ', df['Years in office'].min())
print('Max: ', df['Years in office'].max())
print('Sum: ', df['Years in office'].sum())

Dataset
For purpose of demonstration, you can use the dataset from: depaul.edu.

xls dataset
A large dataset stored in XLS format

Visualize data with Pandas

Get the xls data for this tutorial from: depaul.edu. This dataset contains a list of US presidents, associated parties,profession and more.

pandas-data
Python Pandas Dataset

Related course
Data Analysis in Python with Pandas

Beautiful Plots with Pandas
We can plot data of this large excel file with a few lines of code. We select the column “Occupation” for this demonstration using:

df['Occupation'].

Full code:

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
 
file = r'data/Presidents.xls'
df = pd.read_excel(file)
 
# plot data
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral','red','green','blue','orange','white','brown']
df['Occupation'].value_counts().plot(kind='pie',title='Occupation by President',colors=colors)
plt.show()

occupation by president

Data cleaning and plotting
We can also plot the popularity by president. Before we do that, we need to clean some data:

data-cleaning
Data cleaning with Pandas

Some of the cells do not contain numerical values, we can either remove them or replace them:

#df = df.replace('NA()',0)
df = df[df['% popular'] != 'NA()']

To plot the popularity:

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
 
file = r'data/Presidents.xls'
df = pd.read_excel(file)
 
# plot data
#df = df.replace('NA()',0)
df = df[df['% popular'] != 'NA()']
 
print( df['% popular'] )
df['% popular'].plot(kind='hist', bins=8, title='Popularity by President', facecolor='blue', alpha=0.5, normed=1)
plt.show()
popularity-by-president
Popularity by President

Pandas read CSV

Pandas is a data analaysis module. It provides you with high-performance, easy-to-use data structures and data analysis tools.

In this article you will learn how to read a csv file with Pandas.

Related course
Data Analysis in Python with Pandas

Read CSV with Python Pandas
We create a comma seperated value (csv) file:

Names,Highscore,
Mel,8,
Jack,5,
David,3,
Peter,6,
Maria,5,
Ryan,9,

Imported in excel that will look like this:

pandas-dataset
Python Pandas example dataset

The data can be read using:

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
 
file = r'highscore.csv'
df = pd.read_csv(file)
print(df)

The first lines import the Pandas module. The read_csv method loads the data in a a Pandas dataframe that we named df.

Dataframes
A dataframe can be manipulated using methods, the minimum and maximum can easily be extracted:

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
 
file = r'highscore.csv'
df = pd.read_csv(file)
print('Max', df['Highscore'].max())
print('Min', df['Highscore'].min())
pandas-shell
Pandas on a dataset

The dataset in this example is very small, but a dataset can easily contain thousands or millions of records.