BUG: pd.NA doesn't pickle/unpickle faithfully #31847

tsoernes · 2020-02-10T11:29:03Z

Code Sample, a copy-pastable example if possible

In [5]: df['Gold Categories'].count()
Out[5]: 135218

In [6]: df['Gold Categories'].isna().sum()
Out[6]: 0

In [7]: df['Gold Categories'].iloc[256]
Out[7]: <NA>

In [8]: pd.isna(df['Gold Categories'].iloc[256])
Out[8]: False

In [9]: type(df['Gold Categories'].iloc[256])
Out[9]: pandas._libs.missing.NAType

In [10]: pd.__version__
Out[10]: '1.0.1'

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.16-200.fc30.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : nb_NO.UTF-8
LOCALE : nb_NO.UTF-8

pandas : 1.0.1
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : 0.29.13
pytest : 5.2.2
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 2.2.3
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.2.2
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.5.2
tabulate : 0.8.5
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.2
numba : 0.46.0

MarcoGorelli · 2020-02-10T15:05:49Z

Could you include a reproducible example? I could not reproduce this on master:

>>> pd.DataFrame({'Gold categories': [pd.NA]})['Gold categories'].iloc[0]                                                                                                                                   
<NA>

>>> pd.isna(pd.DataFrame({'Gold categories': [pd.NA]})['Gold categories'].iloc[0])                                                                                                                          
True

tsoernes · 2020-02-11T19:41:47Z

@MarcoGorelli I can upload a sample, if that will suffice, but I do not know how to reproduce it.

MarcoGorelli · 2020-02-11T19:49:03Z

@tsoernes if you run the two lines I posted above, do you get the same output?

tsoernes · 2020-02-11T19:50:05Z

@MarcoGorelli Yes

tsoernes · 2020-02-11T19:53:56Z

Here is a sample of that column with 2 rows. It is a zipped pickle file (GitHub only allows zips).


In [167]: na_test = read_pickle('/tmp/na_test.pickle')
Loaded 2 entries (a Series) from /tmp/na_test.pickle (2020-02-11 21:51)

In [168]: na_test.isna().sum()
Out[173]: 0

In [174]: na_test

Out[176]: 
268    [Fintech]
269         <NA>
Name: Gold Categories, dtype: object

MarcoGorelli · 2020-02-11T20:52:25Z

@tsoernes I'm afraid we can't accept raw pickle files in bug reports, as they could be unsafe. Please remove the attachment from your message :)

Could you please paste the output of na_test.to_dict(), so we can copy-and-paste it and reproduce the issue?

tsoernes · 2020-02-11T21:47:44Z

na_test.to_dict()
Out[195]: {268: ['Fintech'], 269: <NA>}

MarcoGorelli · 2020-02-11T21:57:43Z

Thanks @tsoernes

TBH I still can't reproduce the issue though

>>> pd.Series({268: ['Fintech'], 269: pd.NA}).isna()                                                        
268    False
269     True
dtype: bool

tsoernes · 2020-02-11T23:07:15Z

I can't either, when going via a dictionary.


jorisvandenbossche · 2020-02-11T23:48:19Z

When pickling/unpickling, I can reproduce this:

In [40]: s = pd.Series({268: ['Fintech'], 269: pd.NA})                                                                                                                                                             

In [41]: s.isna()                                                                                                                                                                                                  
Out[41]: 
268    False
269     True
dtype: bool

In [42]: s.to_pickle('test_na_pickle.pkl')                                                                                                                                                                         

In [43]: s2 = pd.read_pickle('test_na_pickle.pkl')                                                                                                                                                                 

In [44]: s2.isna()                                                                                                                                                                                                 
Out[44]: 
268    False
269    False
dtype: bool

In [45]: type(s2.values[1])                                                                                                                                                                                        
Out[45]: pandas._libs.missing.NAType

In [46]: s2.values[1] is pd.NA                                                                                                                                                                                     
Out[46]: False

So apparently, when unpickling, it doesn't return the same singleton.
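
The observation above can be illustrated without pandas. This is a minimal sketch, not the pandas implementation: `DemoNAType`, `DEMO_NA`, `is_missing` and the module name `na_demo` are made-up names, and the identity-based check is an assumption about how such a sentinel is typically tested for. A sentinel class that relies on default pickling comes back from a round trip as a second instance of the same type, so the identity comparison no longer matches.

```python
import pickle
import sys
import types

# "na_demo" is a made-up module name; registering it lets pickle resolve the
# class by module path however this snippet is run.
mod = types.ModuleType("na_demo")
sys.modules[mod.__name__] = mod


class DemoNAType:
    """Hypothetical sentinel type that relies on default pickling."""

    def __repr__(self):
        return "<NA>"


DemoNAType.__module__ = mod.__name__
DemoNAType.__qualname__ = "DemoNAType"
mod.DemoNAType = DemoNAType

DEMO_NA = DemoNAType()  # the intended singleton


def is_missing(value):
    # Assumption: the missing-value check compares by identity.
    return value is DEMO_NA


restored = pickle.loads(pickle.dumps(DEMO_NA))

print(type(restored) is DemoNAType)  # True: same type
print(restored is DEMO_NA)           # False: a second instance was created
print(is_missing(restored))          # False: the identity check misses it
```

This mirrors the session above: the type matches, but the restored value is not the original object.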

jorisvandenbossche · 2020-02-11T23:49:55Z

So we should probably explicitly implement methods for pickling/unpickling on the NA class

According to https://docs.python.org/3/library/pickle.html#object.__reduce__,

> If a string is returned, the string should be interpreted as the name of a global variable. It should be the object’s local name relative to its module; the pickle module searches the module namespace to determine the object’s module. This behaviour is typically useful for singletons.

Closes pandas-dev#31847
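
The `__reduce__` behaviour quoted from the pickle documentation can be sketched in plain Python. This is an illustration under stated assumptions, not necessarily what the linked fix does: `DemoNAType`, `DEMO_NA` and the module name `na_fix_demo` are hypothetical. Returning a string from `__reduce__` makes pickle store a reference to the module-level global of that name, so unpickling hands back the existing object instead of building a new one.

```python
import pickle
import sys
import types

# "na_fix_demo" is a made-up module name; registering it lets pickle look up
# globals by module path however this snippet is run.
mod = types.ModuleType("na_fix_demo")
sys.modules[mod.__name__] = mod


class DemoNAType:
    """Hypothetical sentinel type that pickles as a reference to a global."""

    def __repr__(self):
        return "<NA>"

    def __reduce__(self):
        # A string return value is treated as the name of a global variable
        # in this object's module; no instance state is serialized.
        return "DEMO_NA"


DemoNAType.__module__ = mod.__name__
DemoNAType.__qualname__ = "DemoNAType"
mod.DemoNAType = DemoNAType

# The global named in __reduce__ must exist in the module and be this object.
DEMO_NA = mod.DEMO_NA = DemoNAType()

restored = pickle.loads(pickle.dumps(DEMO_NA))
container = pickle.loads(pickle.dumps({"a": [DEMO_NA, "x"]}))

print(restored is DEMO_NA)           # True: the same singleton comes back
print(container["a"][0] is DEMO_NA)  # True: also when nested in containers
```

The design relies on the name returned by `__reduce__` resolving to the very same object at pickling time; pickle verifies this and raises an error otherwise.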

mephph · 2020-02-19T22:15:31Z

A simple example of the problem:

In [1]: import pandas as pd

In [2]: pd.DataFrame([[pd.NA]]).to_pickle('na_problem.pkl')

In [3]: df = pd.read_pickle('na_problem.pkl')

In [4]: df.isna()

Out[4]: 
       0
0  False

In [5]: id(df.loc[0, 0]), id(pd.NA)

Out[5]: (140393643089760, 140393944655632)

This can also cause exceptions when working with dtypes other than object.

In [1]: import pandas as pd

In [2]: pd.DataFrame([[pd.NA]], dtype='string').to_pickle('na_problem.pkl')

In [3]: pd.read_pickle('na_problem.pkl').head()
Out[3]: ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
... removed for brevity
/home/mephph/.local/lib/python3.7/site-packages/pandas/core/arrays/string_.py in _validate(self)
    168         """Validate that we only store NA or strings."""
    169         if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
--> 170             raise ValueError("StringArray requires a sequence of strings or pandas.NA")
    171         if self._ndarray.dtype != "object":
    172             raise ValueError(

ValueError: StringArray requires a sequence of strings or pandas.NA

The following function replaces the incorrect NA values in place. It operates one column at a time to preserve dtypes. flake8 complains about comparing types rather than using isinstance, but I find this easier to read.

import pandas as pd

def fix_wrong_na(df):
    for column in df.columns:
        isna_mask = df[column].apply(type) == type(pd.NA)
        # Assign via .loc to avoid chained assignment, which may write to a copy.
        df.loc[isna_mask, column] = pd.NA

MarcoGorelli added the Needs Info (Clarification about behavior needed to assess issue) label Feb 10, 2020

jorisvandenbossche added the Bug, NA - MaskedArrays (Related to pd.NA and nullable extension arrays), and IO Pickle (read_pickle, to_pickle) labels and removed the Needs Info (Clarification about behavior needed to assess issue) label Feb 11, 2020

jorisvandenbossche added this to the Contributions Welcome milestone Feb 11, 2020

jorisvandenbossche modified the milestones: Contributions Welcome, 1.0.2 Feb 11, 2020

jorisvandenbossche changed the title ~~pd.NA not inluded in isna()~~ BUG: pd.NA doesn't pickle/unpickle faithfully Feb 11, 2020

MarcoGorelli mentioned this issue Feb 19, 2020

Unpickled StringArray with pd.NA raises ValueError #32095

Closed

TomAugspurger mentioned this issue Feb 19, 2020

BUG: Pickle NA objects #32104

Merged

jorisvandenbossche closed this as completed in #32104 Mar 2, 2020

astrojuanlu mentioned this issue Nov 12, 2021

Attractor identity changes when pickling & unpickling poliastro/poliastro#1395

Closed
