- read

Introduction to Pandas for data analysis with Python

Hashim Puthiyakath 78

This is a introduction to pandas library to help you to get started and try the basic data analysis procedures.

Getting started

If you have already installed Pandas in the computer, you can use the following line to import it to your Python project. If you don’t have it already installed, install it with pip first.

import pandas as pd

pandas is generally imported as pd. It’s the general convention. So, it is better to follow it.

Series

Series refers to 1 dimensional data. It’s similar to array or list.

source: geeksforgeeks

How to create a series

Sometimes, you may not have a dataset ready. So, you may need to create one manually. Here is how to create a pandas series.

import pandas as pd
import numpy as np
# Creating empty series
ser = pd.Series()
print(ser)
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)

Pandas is based on numpy library. So, you may also need to import numpy library while working with pandas. numpy is usually imported as np. In the above code, you first create a numpy array and then make it a pandas series.

Dataframe

Pandas dataframe is two-dimensional dataset. The data is arranged rows and columns. It’s like a table or a spreadsheet in Excel.

source: geeksforgeeks

How to create dataframe manually

people = {
"first": ["Corey", 'Jane', 'John'],
"last": ["Schafer", 'Doe', 'Doe'],
"email": ["[email protected]", '[email protected]', '[email protected]']
}
df = pd.DataFrame(people)

How to import a CSV file as pandas dataframe

You may need to install CSV file as pandas dataframe for data analysis. Here is how to do it.

import pandas as pd
df = pd.read_csv('filepath.csv')

Usually, csv is imported and saved to a variable called df.

Explore dataframe

Here is a list of pandas functions that help you to explore and understand the dataframe.

Get the number of rows and columns in the panda dataframe

df.shape

Get column names and data types of each column

df.info()

Display all columns names in pandas output

When you use the df.info() method, sometimes, you will not be able to see the all column titles. Pandas limit the number to 20 for easy viewing. This default setting can be changed with the following setting.

pd.set_option('display.max_columns', 85)

The number 85 in the above code says to pandas to display 85 column titles instead of 20. You can change this value according to you dataset.

You can also use this for rows.

pd.set_option('display.max_rows', 285)

This is not generally used because there will be thousands of rows in a dataframe and scrolling through that is not much useful.

View only the top rows of dataframe

pd.head()

This will output the first 5 rows of the dataframe. This is useful to get an idea of what the dataset look like. You can give a number as the parameter for head() function to control the number of rows displayed.

View only the bottom rows of dataframe

pd.tail()

Display values from a particular column

df['email']
df.email
df[['last', 'email']]

Get the title of the columns

df.columns

Get specific values form the dataframe

For example, you may want to slice a particular part of the data set. You can to it with the iloc method

#get a row 
df.iloc[0]
#get a dataframe
df.iloc[[0, 1], 2]

The 0,1 refers to the rows and 2 refers to the columns

Instead of the using the column index numbers, you can also use the column names.

Instead of iloc, you can also use loc to get the rows. loc allows us to use index labels as the arguments.

df.loc[[0, 1], ['email', 'last']]

Count values

Count frequency of unique values in a column

df['date'].value_counts().plot()

another way to count unique dates if the date is not already parsed.

pd.to_datetime(df['date']).value_counts().plot()

Set index while importing dataframe

df = pd.read_csv('data/survey_results_public.csv', index_col='Respondent')

The above code helps us to set the Respondent column as the row identifier.

Change the number of maximum rows and columns displayed

pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 85)

Filter rows based on column

filt = (df['last'] == 'Doe')
df[filt]
# another variation
dfen = df[df['language'] == 'en']
# another method
filt = (df['last'] == 'Doe')
df.loc[filt]

In the second code above, all rows with value Doe for column ‘last’ is filteredl

In the second code above, all rows with en as value for language column is filtered and saved into dfen variable.

The third method gives us more control. It will allow us to get only particular columns.

# another method
filt = (df['last'] == 'Doe')
df.loc[filt, 'email']

The above code will output only emails, instead of all columns

Multiple conditions in filters

And operator

filt = (df['last'] == 'Doe') & (df['first'] == 'John')
df.loc[filt, 'email']

Or operator

filt = (df['last'] == 'Doe') | (df['first'] == 'Jane')
df.loc[filt, 'email']

Negate the filter — Opposite filter

Adding the ~ character before the filter will help us to reverse the filter operation.

filt = (df['last'] == 'Doe') | (df['first'] == 'Jane')
df.loc[~filt, 'email']

the above code will give the opposite of filt.

Filter for numeric values

For numeric data, you may want to filter cases based on if the numeric value is greater than, equal to, or less than a particular value.

For example, find cases with salary higher than 8,000

high_salary = (df['salary'] > 8000) 
df.loc[high_salary]

You can also limit the columns in the output

high_salary = (df['salary'] > 8000) 
df.loc[high_salary, ['country', 'salary', 'age']]

Filter rows if a column contains any one of the given values

If you have dataset with respondents form all over the world, but you want to filter those who are form US, UK, India, and Canada. We can filter this using or operator. But there is another easy way.

countries = ['US', 'UK', 'India', 'Canada']
filt = df['country'].isin(countries)
df.loc[filt]

Filtering with string methods

Sometimes we may want to filter cases based on if a column contains a particular word. This is useful for multiple choice questions.

filt = df['LanguagesWorkedWith'].str.contains['Python']
df.loc.[filt, ['LanguagesWorkedWith', 'Country', 'Age']]

If there is a missing value or NaN in the dataset, this may throw an error. To fix this, you can tell pandas to ignore NaN with the following arguement.

filt = df['LanguagesWorkedWith'].str.contains('Python', na=False)
df.loc.[filt, ['LanguagesWorkedWith', 'Country', 'Age']]
filt = df['LanguageWorkedWith'].str.contains('Python', na=False)

Time Series

Parse date while importing csv as pandas dataframe

d_parser = lambda x: pd.datetime.strptime(x, '%d/%m/%y')
df = pd.read_csv('data/TwitterSample.csv', parse_dates=['date'], date_parser=d_parser)

Filter based on date

filt = (df['Date'] >= pd.to_datetime('2019-01-01')) & (df['Date'] < pd.to_datetime('2020-01-01'))
df.loc[filt]
df.set_index('Date', inplace=True)df['2019']df['2020-01':'2020-02']

Analysis based on date

df['2020-01':'2020-02']['Close'].mean()df['2020-01-01']['High'].max()highs = df['High'].resample('D').max()df.resample('W').mean()df.resample('W').agg({'Close': 'mean', 'High': 'max', 'Low': 'min', 'Volume': 'sum'})

The .agg() helps us to set different kinds of analysis for different columns