Pandas Filter


Filtering the rows of a DataFrame is one of the most common tasks in data analysis with Python. Given a DataFrame, we are often interested not in the entire dataset but only in specific rows.


Filter using query
A DataFrame's columns can be queried with a boolean expression. Every DataFrame has a query() method for this purpose.

We start by importing pandas and creating a DataFrame:

import pandas as pd
 
data = {'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
        'year': [2017, 2017, 2017, 2017, 2017],
        'salary': [40000, 24000, 31000, 20000, 30000]}
 
df = pd.DataFrame(data, index = ['Acme', 'Acme', 'Bilbao', 'Bilbao', 'Bilbao'])
 
print(df)

This will create the data frame containing:

[Image: the printed data frame]

After creating the DataFrame, we call the query method with a boolean expression. The expression refers to the column names we defined ('name', 'year' and 'salary'). The query method returns a new, filtered DataFrame and leaves the original unchanged.

df_filtered = df.query('salary>30000')
print(df_filtered)

This will return:

[Image: the filtered data frame]
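
The query string can also combine several conditions and refer to Python variables. The following is a minimal sketch using the same data; the variable name min_salary is only an illustrative choice, not part of the original example.

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
        'year': [2017, 2017, 2017, 2017, 2017],
        'salary': [40000, 24000, 31000, 20000, 30000]}
df = pd.DataFrame(data, index=['Acme', 'Acme', 'Bilbao', 'Bilbao', 'Bilbao'])

# Combine two conditions with "and" inside the query string.
both = df.query('salary >= 30000 and year == 2017')

# Refer to a Python variable with the @ prefix
# (min_salary is a hypothetical name chosen for this sketch).
min_salary = 30000
above = df.query('salary > @min_salary')

print(both)
print(above)
```

Keeping the threshold in a variable avoids building the query string by hand when the value changes.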

Complete code for creating the data frame and filtering it with a boolean expression:

import pandas as pd
 
data = {'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
        'year': [2017, 2017, 2017, 2017, 2017],
        'salary': [40000, 24000, 31000, 20000, 30000]}
 
df = pd.DataFrame(data, index = ['Acme', 'Acme', 'Bilbao', 'Bilbao', 'Bilbao'])
 
print(df)
print('----------')
 
df_filtered = df.query('salary>30000')
print(df_filtered)

Filter by indexing, chain methods
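Instead of queries, we can use boolean indexing.
We put a boolean expression built from the columns inside the square brackets. Each condition goes in its own parentheses, and conditions are combined with &: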

df_filtered = df[(df.salary >= 30000) & (df.year == 2017)]
print(df_filtered)

This will return:

[Image: the data frame filtered by boolean indexing]
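
The same kind of filter can be written with .loc and chained with other methods. This is a rough sketch on the same data; the variable names mask, names_and_salaries and chained are assumptions made for illustration.

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
        'year': [2017, 2017, 2017, 2017, 2017],
        'salary': [40000, 24000, 31000, 20000, 30000]}
df = pd.DataFrame(data, index=['Acme', 'Acme', 'Bilbao', 'Bilbao', 'Bilbao'])

# Build the boolean mask separately. Each comparison needs its own
# parentheses because & binds more tightly than >= and ==.
mask = (df['salary'] >= 30000) & (df['year'] == 2017)

# .loc selects the matching rows and, optionally, a subset of columns.
names_and_salaries = df.loc[mask, ['name', 'salary']]

# Chained form: filter with a callable, then sort the filtered rows.
chained = (
    df.loc[lambda d: d['salary'] >= 30000]
      .sort_values('salary', ascending=False)
)

print(names_and_salaries)
print(chained)
```

Passing a callable to .loc lets each step in the chain work on the result of the previous step without naming an intermediate variable.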

