pandas

Gotchas of pandas

Remarks

A gotcha is, in general, a construct that is documented but not intuitive. Gotchas produce output that you would not normally expect, because their behaviour is counter-intuitive.

The pandas package has several gotchas that can confuse anyone who is not aware of them. Some of them are presented on this documentation page.

Detecting missing values with np.nan

If you try to detect missing values with

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, np.nan]})
df == np.nan

you will get the following result:

     col
0  False
1  False

This is because NaN never compares equal to anything, not even to itself, so an == comparison with a missing value always results in False. Instead, you should use

df = pd.DataFrame({'col': [1, np.nan]})
df.isnull()

which results in:

     col
0  False
1   True
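The behaviour described above can be checked with a small self-contained sketch. The variable names (eq_mask, na_mask, missing_per_column) are illustrative only, and isna() is used as the alias of isnull() available in recent pandas versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, np.nan]})

# NaN is not equal to anything, including itself.
nan_equals_itself = np.nan == np.nan

# Element-wise equality therefore never flags a missing value.
eq_mask = df == np.nan

# isna() (alias of isnull()) is the reliable check.
na_mask = df.isna()

# Summing the boolean mask counts the missing values per column.
missing_per_column = na_mask.sum()

print(nan_equals_itself)
print(eq_mask['col'].tolist())
print(na_mask['col'].tolist())
print(int(missing_per_column['col']))
```

Summing the mask is a convenient way to get a quick overview of how many values are missing in each column.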

Integer and NA

pandas does not support missing values in columns of the default (NumPy) integer dtype. For example, if the grade column contains missing values:

df = pd.read_csv("data.csv", dtype={'grade': int})
error: Integer column has NA values

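In this case you should use float instead of int, or set the dtype to object. Since pandas 0.24 there is also the nullable integer dtype 'Int64' (with a capital I), which can hold missing values.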
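A minimal sketch of the workaround, assuming a small in-memory CSV in place of data.csv (the name column and the sample rows are made up for illustration) and pandas 0.24 or newer for the 'Int64' conversion:

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv: the grade of the second row is missing.
csv_text = "name,grade\na,1\nb,\nc,3\n"

# Requesting a plain integer dtype fails because of the missing value.
error_message = None
try:
    pd.read_csv(io.StringIO(csv_text), dtype={'grade': int})
except ValueError as exc:
    error_message = str(exc)

# Reading the column as float works; the missing value becomes NaN.
df = pd.read_csv(io.StringIO(csv_text), dtype={'grade': float})

# Optional: convert to the nullable integer dtype, which keeps integers
# and represents the missing entry as <NA>.
nullable_grade = df['grade'].astype('Int64')

print(error_message)
print(df['grade'].dtype)
print(nullable_grade.dtype)
```

Whether float or a nullable integer is the better choice depends on what the column feeds into: float is understood everywhere, while 'Int64' avoids turning whole numbers into values such as 1.0.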

Automatic Data Alignment (index-aware behaviour)

If you assign a Series with the values [1,2] to a column of the DataFrame df, you will get NaNs:

import pandas as pd

series=pd.Series([1,2])
df=pd.DataFrame(index=[3,4])
df['col']=series
df

   col
3    NaN
4    NaN

This happens because setting a new column automatically aligns the data by index. The values 1 and 2 carry the index labels 0 and 1, not 3 and 4 as in your data frame, so no label matches. When the labels overlap only partially, only the matching rows are filled:

df=pd.DataFrame(index=[1,2])
df['col']=series
df

   col
1    2.0
2    NaN

If you want to ignore the index, assign the underlying .values instead. With the original data frame indexed by [3,4]:

df=pd.DataFrame(index=[3,4])
df['col']=series.values

   col
3    1
4    2
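The two assignment styles can be compared side by side. The sketch below assumes pandas 0.24 or newer for to_numpy(), and the column names aligned, by_position and relabelled are chosen only for illustration:

```python
import pandas as pd

series = pd.Series([1, 2])          # index labels 0 and 1
df = pd.DataFrame(index=[3, 4])

# Assignment aligns on labels: 3 and 4 do not exist in `series`,
# so both rows become NaN.
df['aligned'] = series

# Dropping the index assigns by position instead (lengths must match).
df['by_position'] = series.to_numpy()

# Alternative: rebuild the Series with the frame's index before assigning.
df['relabelled'] = pd.Series(series.to_numpy(), index=df.index)

print(df)
```

Positional assignment silently relies on the row order being the one you expect, so it is worth checking that the Series and the data frame have the same length and ordering before dropping the index.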

This modified text is an extract of the original Stack Overflow Documentation created by the contributors and released under CC BY-SA 3.0. This website is not affiliated with Stack Overflow.