BUG: pd.NA doesn't pickle/unpickle faithfully #31847

Closed
tsoernes opened this issue Feb 10, 2020 · 12 comments · Fixed by #32104
Labels: Bug; IO Pickle (read_pickle, to_pickle); NA - MaskedArrays (Related to pd.NA and nullable extension arrays)

@tsoernes

Code Sample, a copy-pastable example if possible

In [5]: df['Gold Categories'].count()
Out[5]: 135218

In [6]: df['Gold Categories'].isna().sum()
Out[6]: 0

In [7]: df['Gold Categories'].iloc[256]
Out[7]: <NA>

In [8]: pd.isna(df['Gold Categories'].iloc[256])
Out[8]: False

In [9]: type(df['Gold Categories'].iloc[256])
Out[9]: pandas._libs.missing.NAType

In [10]: pd.__version__
Out[10]: '1.0.1'

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.16-200.fc30.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : nb_NO.UTF-8
LOCALE : nb_NO.UTF-8

pandas : 1.0.1
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : 0.29.13
pytest : 5.2.2
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 2.2.3
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.2.2
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.5.2
tabulate : 0.8.5
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.2
numba : 0.46.0

@MarcoGorelli
Member

MarcoGorelli commented Feb 10, 2020

Could you include a reproducible example? I could not reproduce this on master:

>>> pd.DataFrame({'Gold categories': [pd.NA]})['Gold categories'].iloc[0]                                                                                                                                   
<NA>

>>> pd.isna(pd.DataFrame({'Gold categories': [pd.NA]})['Gold categories'].iloc[0])                                                                                                                          
True

@MarcoGorelli added the Needs Info (Clarification about behavior needed to assess issue) label Feb 10, 2020
@tsoernes
Author

tsoernes commented Feb 11, 2020

@MarcoGorelli I can upload a sample, if that will suffice, but I do not know how to reproduce it from scratch.

@MarcoGorelli
Member

@tsoernes if you run the two lines I posted above, do you get the same output?

@tsoernes
Author

@MarcoGorelli Yes

@tsoernes
Author

tsoernes commented Feb 11, 2020

Here is a sample of that column with 2 rows. It is a zipped pickle file (GitHub only allows zips).


In [167]: na_test = read_pickle('/tmp/na_test.pickle')
Loaded 2 entries (a Series) from /tmp/na_test.pickle (2020-02-11 21:51)

In [168]: na_test.isna().sum()
Out[173]: 0

In [174]: na_test

Out[176]: 
268    [Fintech]
269         <NA>
Name: Gold Categories, dtype: object

@MarcoGorelli
Member

MarcoGorelli commented Feb 11, 2020

@tsoernes I'm afraid we can't accept raw pickle files in bug reports, as they could be unsafe. Please remove the attachment from your message :)

Could you please paste the output of na_test.to_dict(), so we can copy-and-paste it and reproduce the issue?

@tsoernes
Author

na_test.to_dict()
Out[195]: {268: ['Fintech'], 269: <NA>}

@MarcoGorelli
Member

Thanks @tsoernes

TBH I still can't reproduce the issue though

>>> pd.Series({268: ['Fintech'], 269: pd.NA}).isna()                                                        
268    False
269     True
dtype: bool

@tsoernes
Author

tsoernes commented Feb 11, 2020 via email

@jorisvandenbossche
Member

When pickling/unpickling, I can reproduce this:

In [40]: s = pd.Series({268: ['Fintech'], 269: pd.NA})                                                                                                                                                             

In [41]: s.isna()                                                                                                                                                                                                  
Out[41]: 
268    False
269     True
dtype: bool

In [42]: s.to_pickle('test_na_pickle.pkl')                                                                                                                                                                         

In [43]: s2 = pd.read_pickle('test_na_pickle.pkl')                                                                                                                                                                 

In [44]: s2.isna()                                                                                                                                                                                                 
Out[44]: 
268    False
269    False
dtype: bool

In [45]: type(s2.values[1])                                                                                                                                                                                        
Out[45]: pandas._libs.missing.NAType

In [46]: s2.values[1] is pd.NA                                                                                                                                                                                     
Out[46]: False

So apparently, when unpickling, it doesn't return the same singleton.
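
The effect can be sketched without pandas. The snippet below is a minimal illustration, assuming a plain Python class with one module-level instance that other code compares against by identity; the names `NAType`, `NA`, and `is_na` here are hypothetical stand-ins, not pandas' actual implementation.

```python
import pickle


class NAType:
    """Hypothetical stand-in for a missing-value marker (not pandas' NAType)."""

    def __repr__(self):
        return "<NA>"


# Module-level instance that the rest of the code treats as *the* singleton.
NA = NAType()


def is_na(value):
    # Identity check: only the module-level instance counts as missing.
    return value is NA


# Default pickling records the class and builds a fresh instance on load.
restored = pickle.loads(pickle.dumps(NA))

print(repr(restored))            # <NA>  (prints the same as the original)
print(type(restored) is NAType)  # True  (same type)
print(restored is NA)            # False (but a different object)
print(is_na(restored))           # False (so identity-based checks miss it)
```

This mirrors the session above: the restored value has the right type and repr, yet any check written as `value is NA` no longer matches it.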

@jorisvandenbossche added the Bug, NA - MaskedArrays (Related to pd.NA and nullable extension arrays), and IO Pickle (read_pickle, to_pickle) labels and removed the Needs Info (Clarification about behavior needed to assess issue) label Feb 11, 2020
@jorisvandenbossche added this to the Contributions Welcome milestone Feb 11, 2020
@jorisvandenbossche
Member

So we should probably explicitly implement methods for pickling/unpickling on the NA class

@jorisvandenbossche modified the milestones: Contributions Welcome, 1.0.2 Feb 11, 2020
@jorisvandenbossche changed the title from "pd.NA not inluded in isna()" to "BUG: pd.NA doesn't pickle/unpickle faithfully" Feb 11, 2020
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Feb 19, 2020
According to
https://docs.python.org/3/library/pickle.html#object.__reduce__,

> If a string is returned, the string should be interpreted as the name
> of a global variable. It should be the object’s local name relative to
> its module; the pickle module searches the module namespace to determine
> the object’s module. This behaviour is typically useful for singletons.

Closes pandas-dev#31847
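
The behaviour quoted from the pickle documentation can be illustrated with a small sketch. It assumes a hypothetical stand-in class and a module-level global named `NA`; it shows the general technique of returning a string from `__reduce__`, not the code of the referenced commit.

```python
import pickle


class NAType:
    """Hypothetical stand-in singleton (not pandas' actual implementation)."""

    def __reduce__(self):
        # Returning a string tells pickle to store a reference to the
        # module-level global with this name, not the object's state.
        return "NA"


# The global that the string returned by __reduce__ refers to.
NA = NAType()

restored = pickle.loads(pickle.dumps(NA))
print(restored is NA)  # True: unpickling looks up the existing global
```

Because unpickling resolves the name back to the existing global, the round trip yields the very same object, and identity checks such as `value is NA` keep working.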
@mephph

mephph commented Feb 19, 2020

A simple example of the problem:

In [1]: import pandas as pd

In [2]: pd.DataFrame([[pd.NA]]).to_pickle('na_problem.pkl')

In [3]: df = pd.read_pickle('na_problem.pkl')

In [4]: df.isna()

Out[4]: 
       0
0  False

In [5]: id(df.loc[0, 0]), id(pd.NA)

Out[5]: (140393643089760, 140393944655632)

This can also cause exceptions when working with dtypes other than object.

In [1]: import pandas as pd

In [2]: pd.DataFrame([[pd.NA]], dtype='string').to_pickle('na_problem.pkl')

In [3]: pd.read_pickle('na_problem.pkl').head()
Out[3]: ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
... removed for brevity
/home/mephph/.local/lib/python3.7/site-packages/pandas/core/arrays/string_.py in _validate(self)
    168         """Validate that we only store NA or strings."""
    169         if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
--> 170             raise ValueError("StringArray requires a sequence of strings or pandas.NA")
    171         if self._ndarray.dtype != "object":
    172             raise ValueError(

ValueError: StringArray requires a sequence of strings or pandas.NA
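
One plausible reading of this traceback, not confirmed in this thread, is that the validation recognizes the missing-value marker by identity, so a second instance of the same type is rejected. The sketch below illustrates that idea with hypothetical names (`NAType`, `NA`, `validate_strings`); it is not pandas' validation code.

```python
class NAType:
    """Hypothetical stand-in for a missing-value marker."""


# The one instance the validator accepts.
NA = NAType()


def validate_strings(values):
    """Accept only str items or the NA singleton (checked by identity)."""
    for value in values:
        if not (isinstance(value, str) or value is NA):
            raise ValueError("requires a sequence of strings or NA")
    return values


validate_strings(["a", NA])  # passes: strings plus the real singleton

# A second instance of the same type, like the one a faulty unpickle produces.
impostor = NAType()
try:
    validate_strings(["a", impostor])
    rejected = False
except ValueError:
    rejected = True

print(rejected)  # True: the look-alike instance fails validation
```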

The following function replaces the incorrect NA values in place. It operates one column at a time to preserve dtypes. flake8 complains about comparing types rather than using isinstance, but I find this easier to read.

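import pandas as pd

def fix_wrong_na(df):
    # Swap every NAType instance for the real pd.NA singleton, column by column.
    for column in df.columns:
        isna_mask = df[column].apply(type) == type(pd.NA)
        # Use .loc: chained assignment (df[column][mask] = ...) may write to a copy.
        df.loc[isna_mask, column] = pd.NA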

jorisvandenbossche pushed a commit that referenced this issue Mar 2, 2020
Closes #31847