Using pandas extended types
Historically, pandas has been tied to NumPy arrays and their limitations:
* integer and bool types cannot store missing data, because np.nan is a float
* non-numeric types such as categorical or timezone-aware datetime are not natively supported.
This is because, internally, pandas relies on NumPy arrays to store the data efficiently and perform operations on it.
With recent versions of pandas, it is possible to define custom types. To avoid having to extensively update the internal pandas code for each new extension, pandas exposes an interface made of two parts:
* an extension type, which describes the data type and can be registered so that it can be created from a string (e.g. ...astype("example"))
* an extension array, which is a class that handles the data of that type. There is no real restriction on how it is built, though it must be convertible to a NumPy array so that it works with the functions implemented in pandas. It is limited to one dimension.
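As a quick check of these two parts, the sketch below inspects the built-in nullable integer extension rather than defining a new one. It assumes pandas 1.0 or later (for `to_numpy` with `na_value`), and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype

# Create the extension type from its registered string name
s = pd.Series([1, None, 2]).astype("Int64")

dtype = s.dtype   # the extension type (describes the data)
arr = s.array     # the extension array (holds the data)

is_ext_dtype = isinstance(dtype, ExtensionDtype)
is_ext_array = isinstance(arr, ExtensionArray)

# The array is one-dimensional and can be converted to a NumPy array
as_numpy = s.to_numpy(dtype="float64", na_value=np.nan)

print(dtype, is_ext_dtype, is_ext_array, arr.ndim)
print(as_numpy)
```

A custom extension would subclass `ExtensionDtype` and `ExtensionArray` in the same way that the built-in one does.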
Some extensions are already implemented in pandas. For example, the nullable integer type (IntegerArray) is a pandas extension that can handle missing integer data without casting to float (and losing precision).
To use it, specify the dtype as "Int64", with a capital "I":
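import numpy as np
import pandas as pd

# Default behaviour: the missing value forces a cast to float64
s = pd.Series([1, np.nan, 2])
print(s)
# Nullable integer extension: the values stay integers
s_ext = pd.Series([1, np.nan, 2], dtype="Int64")
print(s_ext)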
0    1.0
1    NaN
2    2.0
dtype: float64
0       1
1    <NA>
2       2
dtype: Int64
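To illustrate the precision point, here is a minimal sketch, assuming pandas 1.0 or later. The value `2**53 + 1` is chosen because it is the smallest positive integer that float64 cannot represent exactly; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

big = 2**53 + 1  # not exactly representable as float64

# With float64, the integer is rounded to a neighbouring value
s_float = pd.Series([big, np.nan], dtype="float64")

# With Int64, the integer is stored exactly next to the missing value.
# The missing value is passed as None (not np.nan) so that the input
# list is never converted to float on the way in.
s_int = pd.Series([big, None], dtype="Int64")

print(int(s_float.iloc[0]) == big)
print(int(s_int.iloc[0]) == big)
```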
This extension can also be used to reduce memory usage. For instance, data with missing values can be stored using the "UInt16" type, which needs 2 bytes per value plus a 1-byte missing-value mask, instead of the 8 bytes per value required after a cast to float64.
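The difference can be measured with `Series.memory_usage`. The sketch below is only an illustration: the series length and the values are arbitrary assumptions, and it assumes pandas 1.0 or later.

```python
import numpy as np
import pandas as pd

n = 1_000_000
values = np.arange(n) % 60_000  # small non-negative integers, fit in 16 bits

# Same data stored as float64 and as nullable UInt16
s_float = pd.Series(values, dtype="float64")
s_uint16 = pd.Series(values, dtype="UInt16")

# Add a missing value to each series
s_float.iloc[0] = np.nan
s_uint16.iloc[0] = pd.NA

# Memory used by the values only (the index is excluded)
float_bytes = s_float.memory_usage(index=False)
uint16_bytes = s_uint16.memory_usage(index=False)

print(float_bytes, uint16_bytes)
```

The nullable type keeps a separate boolean mask for missing values, so the saving is smaller than the raw 8-to-2 byte ratio would suggest.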
Other uses might be handling special data types such as:
* IP addresses (see cyberpandas)
* GPS locations
* etc.
Sources:
* https://pandas.pydata.org/pandas-docs/stable/extending.html
* PyData LA 2018, "Extending Pandas with Custom Types", Will Ayd