Cannot unpickle data frame made with 0.19.2 after upgrade to 0.20.1 #16474

Closed
mhooreman opened this issue May 24, 2017 · 14 comments
Labels
IO Data (IO issues that don't fit into a more specific label), Usage Question

Comments

@mhooreman

mhooreman commented May 24, 2017

Hello,

Problem description

When we create a data frame with pandas ≤ 0.19.2 and pickle it (using pickle.dump), it is not possible to unpickle it using pandas 0.20.1.

# Using pandas 0.19.2
import pandas as pd
import pickle as pkl
data = pd.DataFrame({'x': [1, 2]})
pkl.dump(data, open("data_pd_0.19.2.pkl", "wb"))
# After upgrade to pandas 0.20.1
import pandas as pd
import pickle as pkl
pkl.load(open("data_pd_0.19.2.pkl", "rb"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas.indexes'

First analysis

  • It seems that pandas.indexes has been refactored to pandas.core.indexes.
  • I don't know if there are other such incompatible changes
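The failure mode described in this analysis can be reproduced without pandas. The following is a minimal sketch, assuming made-up module names (`demo_indexes`, `demo_core_indexes`) that stand in for the old and new import paths. It shows that a pickle stores the class's import path by name, so loading fails once that path no longer exists.

```python
import pickle
import sys
import types

# Hypothetical stand-ins for a library module before and after a refactor.
OLD_PATH = "demo_indexes"
NEW_PATH = "demo_core_indexes"

# Build a class that claims to live in the old module and register that module.
Index = type("Index", (), {"__module__": OLD_PATH})
old_module = types.ModuleType(OLD_PATH)
old_module.Index = Index
sys.modules[OLD_PATH] = old_module

obj = Index()
obj.values = [1, 2]
payload = pickle.dumps(obj)

# The import path is embedded in the byte stream as plain text.
path_is_embedded = OLD_PATH.encode() in payload

# Simulate the refactor: the class is now reachable only under the new path.
del sys.modules[OLD_PATH]
new_module = types.ModuleType(NEW_PATH)
new_module.Index = Index
sys.modules[NEW_PATH] = new_module

try:
    pickle.loads(payload)
    error = None
except ModuleNotFoundError as exc:
    # The unpickler tried to import the old path and could not find it.
    error = str(exc)

print(path_is_embedded, error)
```

The class itself is unchanged in this sketch; only the location under which it can be imported has moved, which is enough to break a plain `pickle.loads`.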

Proposal

It would be great to have:

  • A deprecation warning when unpickling an old data frame
  • Support for loading old data frames, with automatic conversion to the new format, so that we can upgrade by re-pickling the unpickled data frames

Thanks a lot for your help,
Best regards.

@jreback
Contributor

jreback commented May 24, 2017

The big red box makes it clear that pd.read_pickle is the pickle reader and keeps things backward compatible. Furthermore, the whatsnew notes have quite a large section on what changed here.

Sure, a direct call to pickle.loads may work, but this is not guaranteed across versions.

@jreback closed this as completed May 24, 2017
@jreback added the IO Data and Usage Question labels May 24, 2017
@jreback added this to the No action milestone May 24, 2017
@matjazk

matjazk commented Jun 1, 2017

Going from pandas 0.18.1 to 0.20.1, I encountered the same problem when loading with joblib. joblib.load fails with exactly the same error:
ImportError: No module named 'pandas.indexes'

Once you fix this (see the first workaround below), a second error appears:
AttributeError: module 'pandas.core.base' has no attribute 'FrozenNDArray'

After workaround 2, files load. It seems that in my case this is more of a question for joblib devs.

Two (ugly) workarounds:

import sys
# 1: register the new module under its old import path
import pandas.core.indexes
sys.modules['pandas.indexes'] = pandas.core.indexes
# 2: expose FrozenNDArray at its old location in pandas.core.base
import pandas.core.base, pandas.core.indexes.frozen
setattr(sys.modules['pandas.core.base'], 'FrozenNDArray', pandas.core.indexes.frozen.FrozenNDArray)
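For plain pickle streams, a less invasive variant of workaround 1 is to remap module paths inside a custom unpickler instead of patching `sys.modules` globally. The sketch below uses only the standard library's documented `pickle.Unpickler.find_class` hook. The names `translate_module`, `RenamingUnpickler`, `loads_with_renames`, and the `demo_*` modules are hypothetical, introduced for illustration. It handles moved modules only; a class that moved between modules (as `FrozenNDArray` did in workaround 2) would need an additional name-level mapping, and whether this can be combined with `joblib.load` is not addressed here.

```python
import io
import pickle
import sys
import types


def translate_module(module, renames):
    """Map an old module path (or one of its submodules) to its new path."""
    for old, new in renames.items():
        if module == old or module.startswith(old + "."):
            return new + module[len(old):]
    return module


class RenamingUnpickler(pickle.Unpickler):
    """Unpickler that rewrites module paths before classes are imported."""

    def __init__(self, file, renames):
        super().__init__(file)
        self._renames = renames

    def find_class(self, module, name):
        return super().find_class(translate_module(module, self._renames), name)


def loads_with_renames(data, renames):
    """Like pickle.loads, but with the rename table applied."""
    return RenamingUnpickler(io.BytesIO(data), renames).load()


# Demonstration with made-up modules: pickle under the old path ...
Point = type("Point", (), {"__module__": "demo_old_location"})
sys.modules["demo_old_location"] = types.ModuleType("demo_old_location")
sys.modules["demo_old_location"].Point = Point
original = Point()
original.x = 1
data = pickle.dumps(original)

# ... then simulate the move and load through the renaming unpickler.
del sys.modules["demo_old_location"]
sys.modules["demo_new_location"] = types.ModuleType("demo_new_location")
sys.modules["demo_new_location"].Point = Point

restored = loads_with_renames(data, {"demo_old_location": "demo_new_location"})
print(restored.x)
```

With the rename from this thread, `translate_module("pandas.indexes.base", {"pandas.indexes": "pandas.core.indexes"})` yields `"pandas.core.indexes.base"`. The mapping stays local to the unpickler, so the rest of the process never sees the old module path.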

@jreback
Contributor

jreback commented Jun 1, 2017

see the above and simply use pd.read_pickle

@matjazk

matjazk commented Jun 1, 2017

I would if I could. But... I have a complex class (consisting of numpy objects, pandas series and dataframes, dictionaries ...), stored in a compressed joblib archive, so pd.read_pickle is of no use to me. As I said, this might be useful for joblib developers, since for now it is impossible to load any joblib archive created with pandas < 0.20. I first had to downgrade pandas and now I'm using the above workarounds.

@jorisvandenbossche
Member

@matjazk Would you like to open an issue at joblib for this?

@matjazk

matjazk commented Jun 1, 2017

Already did, and passed along @jreback's suggestion.

@mhooreman
Author

Thanks. pd.read_pickle works but, for your information, it is extremely slow (see the benchmark).
I've made a script that runs pd.read_pickle and then pd.to_pickle on every file.
[benchmark image]
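The script itself is not included in the thread. A minimal sketch of such a conversion pass might look like the following, assuming the old pickles live in one directory and share a `.pkl` suffix. The function name `repickle_directory` and the directory layout are hypothetical; the converted files are written to a separate directory so the originals stay untouched.

```python
from pathlib import Path

import pandas as pd


def repickle_directory(src_dir, dst_dir, pattern="*.pkl"):
    """Read every matching pickle with pd.read_pickle and write it back out.

    Hypothetical helper: run it under the new pandas version so that the
    rewritten files use the new internal module paths.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for src in sorted(src_dir.glob(pattern)):
        obj = pd.read_pickle(str(src))   # backward-compatible reader
        dst = dst_dir / src.name
        pd.to_pickle(obj, str(dst))      # re-save in the current format
        converted.append(dst)
    return converted
```

Since each file is converted once, the slower compatibility read is a one-time cost; later loads of the rewritten files no longer need the fallback.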

@jorisvandenbossche
Member

@mhooreman the timings of "reading old" look suspiciously consistent with "writing". Are you sure you timed the correct thing?

@mhooreman
Author

mhooreman commented Jun 8, 2017 via email

@jreback
Contributor

jreback commented Jun 8, 2017

@mhooreman Of course it's slower. It's falling back to the Python-based unpickler, which is much more flexible. So you can have either speed or correctness; you get to choose.

@TheodoreZhao

TheodoreZhao commented Jul 2, 2017

I got the same problem when unpickling the data in pandas 0.20.2. I have used df.to_pickle() to pickle my dataframe in pandas 0.19.2 but failed to unpickle it using pandas.read_pickle() in pandas 0.20.2. I got the error message

ImportError: No module named 'pandas.indexes'

pandas.read_pickle() and pickle.load() both generate this error message.

@jorisvandenbossche
Member

@TheodoreZhao If you have this error with read_pickle as well, please open a new issue with a reproducible example.

@seandickert

@jreback similar to @matjazk, pd.read_pickle doesn't work if you're using pickle.loads to load a string (retrieved from some store other than the filesystem). Can pd.read_pickle be updated to handle a file-like object rather than just a path?

@jreback
Contributor

jreback commented Feb 21, 2018

It's an open issue: #5924

If you want to submit a PR to do this, it's not difficult.
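Until pd.read_pickle accepts file-like objects, one possible stopgap for pickled bytes that come from a non-filesystem store is to spill them to a temporary file and hand the path to pd.read_pickle. This is only a sketch under that assumption; the helper name `read_pickle_bytes` is hypothetical and not part of pandas.

```python
import os
import tempfile

import pandas as pd


def read_pickle_bytes(data):
    """Load pickled bytes through pd.read_pickle via a temporary file.

    Hypothetical helper: pd.read_pickle is given a path, as in this thread,
    so the bytes are written to disk first and the file is removed afterwards.
    """
    fd, path = tempfile.mkstemp(suffix=".pkl")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # The file is closed before reading, which also works on Windows.
        return pd.read_pickle(path)
    finally:
        os.remove(path)
```

The extra disk round trip costs some time, but it keeps the backward-compatible reader in the loop instead of calling pickle.loads directly.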

6 participants