BUG: v14.0 Error when unpickling DF with non-unique column multiindex #7329

Closed
code-of-kpp opened this issue Jun 4, 2014 · 26 comments · Fixed by #7370
Labels
Compat pandas objects compatability with Numpy or Python functions
Milestone

Comments

@code-of-kpp

>>> import pickle
>>> import pandas
>>> d = pandas.Series({('1ab','2'): 3, ('1ab',3):4})
>>> d = pandas.concat([d,d])
>>> d = pandas.concat([d,d], axis=1)
>>> pickle.loads(pickle.dumps(d))
       0  1
1ab 3  4  4
    2  3  3
    3  4  4
    2  3  3
>>> pickle.loads(pickle.dumps(d.T))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/usr/lib64/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib64/python2.7/pickle.py", line 1217, in load_build
    setstate(state)
  File "venv/lib/python2.7/site-packages/pandas/core/internals.py", line 2063, in __setstate__
    placement=self.axes[0].get_indexer(items))
  File "venv/lib/python2.7/site-packages/pandas/core/index.py", line 3200, in get_indexer
    raise Exception('Reindexing only valid with uniquely valued Index '
Exception: Reindexing only valid with uniquely valued Index objects
@jreback
Contributor

jreback commented Jun 4, 2014

cc @immerrr

@immerrr
Contributor

immerrr commented Jun 4, 2014

The problem is that the pickle only contains block items which are not enough to tell which item must go where if they're non-unique. Luckily, there's no need to share items/ref_items anymore and thus blocks can be pickled/unpickled as usual objects. I hope I'll get some time later tonight to prepare a pull request with new pickle format.
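As a rough illustration of why labels alone cannot place items once they repeat, here is a minimal sketch using only the public `Index` API. It does not reproduce the internal `__setstate__` code from the traceback; the index values are made up for the example.

```python
import pandas as pd

# An axis with a repeated label, like the duplicated column labels in the report.
axis = pd.Index(["a", "b", "a"])

# Looking up positions by label is ambiguous for "a", so pandas refuses.
try:
    axis.get_indexer(["a", "b", "a"])
    lookup_failed = False
except Exception as exc:  # the exact exception class differs between pandas versions
    lookup_failed = True
    message = str(exc)

# The non-unique variant returns every matching position instead of guessing one.
positions, missing = axis.get_indexer_non_unique(["a"])
```

With duplicates, a label maps to several positions, so a format that stores only labels cannot say which stored block item belongs at which position.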

@jreback jreback added this to the 0.14.1 milestone Jun 4, 2014
@jreback jreback added the Compat label Jun 4, 2014
@jreback
Contributor

jreback commented Jun 4, 2014

@immerrr sounds good

@metakermit
Contributor

Can we hope for a new release with this fix anytime soon? I'm hitting this bug in 0.14.1.

@jreback
Contributor

jreback commented Jul 16, 2014

this is fixed in 0.14.1

what seems to be the problem?

@metakermit
Contributor

Yes, I see that it's still affecting me with the current master version. So, I'm trying to pd.read_pickle this pickle of a time series with duplicate indices and am getting the above Exception.

Here is the pickle in a gist or, if it's difficult to get a binary file from a gist, on Dropbox.

@jreback
Contributor

jreback commented Jul 16, 2014

how old is the pickle?

@metakermit
Contributor

I created it in Pandas 0.13.1 and I can read it normally with that version. I could serialise it to some different format, but the values are some custom objects, so pickling is really the fastest way. When I debug through the code, the offending line seems to be:

self.axes[0].get_indexer(items)

which seems odd, as the index is stored both in items and in self.axes[0]. The object seems OK, aside from the duplicate values.

ipdb> items
<class 'pandas.tseries.index.DatetimeIndex'>
[2010-01-02 03:00:00, ..., 2010-03-30 17:00:00]
Length: 124, Freq: None, Timezone: None

@jreback
Contributor

jreback commented Jul 16, 2014

this issue is not fixable for older pickles: pickles created prior to this fix cannot be unpickled

going forward, though, pickles with dup indices will work

you can try picking that commit and putting it in an older install then repickling

or export from an older version of pandas in another format

there is simply not enough information in the older pickles to recreate unambiguously

@jreback
Contributor

jreback commented Jul 16, 2014

much better is to put your ts in a frame
then reset_index and pickle
that will work
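A minimal sketch of that workaround, assuming it is run in the pandas version that can still read the data. The timestamps, values and the names `timestamp` and `value` are invented for the example, and `Series.to_frame` may not be available in every old release.

```python
import pickle

import pandas as pd

# A time series whose index contains a duplicate timestamp.
idx = pd.to_datetime(["2010-01-02 03:00", "2010-01-02 03:00", "2010-03-30 17:00"])
ts = pd.Series([1.0, 2.0, 3.0], index=idx, name="value")
ts.index.name = "timestamp"

# Move the duplicated labels into an ordinary column; the frame's own index
# becomes a unique 0..n-1 range, so nothing has to be located by label.
flat = ts.to_frame().reset_index()

payload = pickle.dumps(flat)

# After loading, rebuild the original series from the two columns.
restored = pickle.loads(payload).set_index("timestamp")["value"]
```

The duplicates survive as plain column data, which is why this form does not depend on the label lookup that fails in the traceback.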

@metakermit
Contributor

Ah, OK. I will try doing that. Thanks!

@immerrr
Contributor

immerrr commented Jul 16, 2014

Hmm, both cases mentioned in this issue concerned series rather than frame/panel containers. I've double checked, in 0.13 it is an error to pickle/unpickle a dataframe with non-unique columns:

In [7]: df = pd.DataFrame(np.arange(15).reshape(5,3), columns=['a','b','a'])

In [8]: pd.__version__
Out[8]: '0.13.1'

In [9]: df = pd.DataFrame(np.arange(15).reshape(5,3), columns=['a','b','a'])

In [10]: import pickle

In [11]: pickle.loads(pickle.dumps(df))
Out[11]: <repr(<pandas.core.frame.DataFrame at 0x7f8c827b6410>) failed: TypeError: 'NoneType' object has no attribute '__getitem__'>

but it's OK-ish for series:

In [14]: s = pd.Series(np.arange(3), index=['a','b','a'])

In [15]: pickle.loads(pickle.dumps(s))
Out[15]: 
a    0
b    1
a    2
dtype: int64

In [16]: s
Out[16]: 
a    0
b    1
a    2
dtype: int64

It should be relatively easy to unbreak compatibility for that particular use case provided that there's a guarantee that SingleBlockManager's items were always in sync with those of its Block, at least when restricted to public API only. That should be true for my implementation that went into 0.14, but I'm not that fluent in earlier version of that code. @jreback, can you confirm or deny that?

@jreback
Contributor

jreback commented Jul 16, 2014

@immerrr I think that is the case (they are always in sync). In fact as long as the dups don't cross block boundaries I think you could assume that.

FYI, also test with pd.read_pickle as well as pickle.loads
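A check along those lines could look like the sketch below, which round-trips a series with a duplicated index through both code paths. The file name is arbitrary, and this only covers pickles written and read by the same pandas version, not the legacy-pickle case discussed here.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

s = pd.Series(np.arange(3), index=["a", "b", "a"])

# Path 1: the standard library pickle module, in memory.
via_loads = pickle.loads(pickle.dumps(s))

# Path 2: pandas' own file-based helpers.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dup_series.pkl")
    s.to_pickle(path)
    via_read_pickle = pd.read_pickle(path)
```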

@code-of-kpp
Author

I confirm - unpickling old (created with 0.13.1) data with 0.14.1 leads to

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

@immerrr
Contributor

immerrr commented Jul 18, 2014

@jreback, what's the routine to add pickle data for an old release?

@jreback
Contributor

jreback commented Jul 18, 2014

@immerrr
Contributor

immerrr commented Jul 18, 2014

So, the routine is to check out 0.13.1, add the necessary series to be pickled, re-generate them files and then discard the modified script? My spidey senses are tingling...

@jreback
Contributor

jreback commented Jul 18, 2014

no, add the changes to master (in that script). That script should be completely portable across versions. Then run the script on 0.13.1 (in, say, a virtual environment; I just copy the script and run it somewhere it has no dependencies other than the installed pandas), take the generated output and put it in the appropriate place. This will be run as an additional test.

So you generate pickles as of that version.

@immerrr
Contributor

immerrr commented Jul 18, 2014

That script is portable, but unpickling the data it generates is not, since frames with non-unique columns were added to it, so the test will fail both in 0.13.1 and later on (for the reasons discussed in this issue). Or do I get it wrong?

@immerrr
Contributor

immerrr commented Jul 18, 2014

And yes, missed that part:

no, add the changes to master (in that script)

The changes are already there, both series and frames with non-unique items are generated AFAIR, just 0.13 pickles are not re-generated.

@jreback
Contributor

jreback commented Jul 18, 2014

I agree, and that's the problem!

I think it's just an incompatible break that we cannot do anything about. Maybe have a note in the documentation?

@immerrr
Contributor

immerrr commented Jul 18, 2014

Well, like I said, the series case can be fixed rather easily. It's just that it won't be tested for back compatibility for future releases unless 0.13 pickles are re-generated. So, putting an untested fix in now will grant (at least) three more months for users to migrate to 0.14.1.

Or I could put that particular use case into legacy_pickle directory on its own, that would look ugly, but should work. Ugh, that pickle support code is so inflexible...

@jreback
Contributor

jreback commented Jul 18, 2014

@immerrr if you can put a back-compat fix in place, then simply add the fix (and put the test in generate_legacy_pickles) for testing (with 0.13.1). that is what you are proposing right?

@immerrr
Contributor

immerrr commented Jul 18, 2014

The test is already in generate_legacy_pickles, but I cannot re-run that as-is on 0.13.1 because that will generate non-unpicklable frames and panels, too.

The solutions I see in no particular order are:

  • don't fix, write it down explicitly in a release note
  • put a fix, but don't add a test, pickle formats are rarely touched so this may buy quite some time for users to migrate
  • put a fix, checkout 0.13 and add dup-series to the generator (or equivalently, take current one and remove dup-frame and dup-panel ones), re-generate 0.13 legacy pickles, discard the changes to the generator
  • put a fix, write a pickle containing the dup-series only by hand and put it with the rest of legacy pickles, add an explanatory README for other developers
  • put a fix, fix/rewrite pickle testing infrastructure embracing the fact that some cases may be broken in earlier versions, re-generate all pickles in new infrastructure

None of the solutions is perfect, but I'd vote for improving user experience even if it means taking a hit in terms of maintainability.

@jreback
Contributor

jreback commented Jul 18, 2014

you don't need to do any of that.

just modify the generator (in master), to make it generate a dup-series (and not frame/panel) if we aren't saying that they are back-compat (you could put something in the generator to say only generate the pickles for frame/panel if it's run with a pandas >= 0.14.1). Then generate a 0.13.1 pickle. Done. It should fail until the fix is in master, then after the fix it should pass.
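The version gate suggested here might be sketched as follows. This is not the actual generator script: `version_tuple` and `build_dup_objects` are hypothetical names, and the data mirrors the small examples used earlier in the thread.

```python
import re

import numpy as np
import pandas as pd


def version_tuple(version):
    """Parse up to three leading numeric components, e.g. '0.14.1' -> (0, 14, 1)."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def build_dup_objects(version=pd.__version__):
    """Return the objects with duplicate labels that should be pickled (hypothetical helper)."""
    # The dup-series is always generated: it is the case meant to stay back-compatible.
    data = {"series": {"dup": pd.Series(np.arange(3), index=["a", "b", "a"])}}

    # Frames with non-unique columns are only generated from 0.14.1 on,
    # because older versions cannot round-trip them.
    if version_tuple(version) >= (0, 14, 1):
        data["frame"] = {
            "dup": pd.DataFrame(np.arange(15).reshape(5, 3), columns=["a", "b", "a"])
        }
    return data
```

Run under 0.13.1, such a generator would emit only the dup-series pickle, which is the one a back-compat test can be expected to load.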

@immerrr
Contributor

immerrr commented Jul 18, 2014

Yeah, that should do the trick.

4 participants