Gotchas of pandas
Remarks
A gotcha is a construct that, although documented, is not intuitive. Gotchas produce output that most users do not expect because it runs counter to intuition.
pandas has several gotchas that can confuse anyone who is not aware of them. Some of them are presented on this documentation page.
Detecting missing values with np.nan
To detect missing values, it is tempting to compare against np.nan:
import numpy as np
import pandas as pd
df=pd.DataFrame({'col':[1,np.nan]})
df==np.nan
You will get the following result:
col
0 False
1 False
This is because comparing NaN with anything using ==, including NaN itself, yields False. Use isnull() instead:
df=pd.DataFrame({'col':[1,np.nan]})
df.isnull()
which results in:
col
0 False
1 True
Integer and NA
pandas does not support missing values in columns of the default integer dtype. For example, if the grade column contains missing values:
df=pd.read_csv("data.csv", dtype={'grade': int})
error: Integer column has NA values
In this case, use float instead of int, or set the dtype to object.
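The workaround above can be sketched as follows. This is a minimal illustration, not the only approach: the CSV text and the name column are hypothetical stand-ins for data.csv, and the last variant assumes a reasonably recent pandas version that provides the nullable "Int64" extension dtype, which the original text does not mention.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv: the second grade is missing.
csv_text = "name,grade\nalice,90\nbob,\n"

# Asking for a plain int fails because NaN cannot be stored in an int column.
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"grade": int})
    plain_int_failed = False
except ValueError:
    plain_int_failed = True

# float works, since NaN is itself a floating-point value.
as_float = pd.read_csv(io.StringIO(csv_text), dtype={"grade": float})

# Assumption: a pandas version with the nullable "Int64" dtype (capital I),
# which keeps whole numbers and marks the missing entry as pd.NA.
as_nullable = pd.read_csv(io.StringIO(csv_text), dtype={"grade": "Int64"})

print(as_float.dtypes)
print(as_nullable.dtypes)
```

Reading as float is the most portable choice, at the cost of turning 90 into 90.0; the nullable dtype avoids that conversion where it is available.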
Automatic Data Alignment (index-aware behaviour)
If you assign a Series with the values [1,2] as a new column of a DataFrame df whose index has different labels, you will get NaNs:
import pandas as pd
series=pd.Series([1,2])
df=pd.DataFrame(index=[3,4])
df['col']=series
df
col
3 NaN
4 NaN
This happens because setting a new column automatically aligns the data by index: the values 1 and 2 carry the index labels 0 and 1, not 3 and 4 as in the data frame. When the labels overlap only partially, only the matching labels are filled:
df=pd.DataFrame(index=[1,2])
df['col']=series
df
col
1 2.0
2 NaN
If you want to ignore the index, assign the underlying .values instead:
df=pd.DataFrame(index=[3,4])
df['col']=series.values
col
3 1
4 2
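The three behaviours discussed in this section can be compared side by side in one short sketch. The column names aligned, by_position and relabeled are made up for illustration; the last variant, which rebuilds the Series on the data frame's index, is an additional option not covered in the text above.

```python
import pandas as pd

series = pd.Series([1, 2])        # default index labels: 0 and 1
df = pd.DataFrame(index=[3, 4])   # no label in common with the series

# Assignment aligns on labels; nothing matches, so every row is NaN.
df["aligned"] = series

# A plain NumPy array has no index, so values are assigned by position.
df["by_position"] = series.values

# Alternative: rebuild the Series on the frame's index before assigning.
df["relabeled"] = pd.Series(series.values, index=df.index)

print(df)
```

Assigning by position only works when the number of values equals the number of rows, so it is worth checking the lengths first when the data comes from elsewhere.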