BUG: v14.0 Error when unpickling DF with non-unique column multiindex #7329
cc @immerrr

The problem is that the pickle only contains block items, which are not enough to tell which item must go where if they're non-unique. Luckily, there's no need to share items/ref_items anymore, and thus blocks can be pickled/unpickled as usual objects. I hope I'll get some time later tonight to prepare a pull request with a new pickle format.
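A toy sketch (plain Python, not the actual pandas internals) of the ambiguity described above: if a block only records the *labels* of its items, duplicate labels make it impossible to put each item back in its correct axis position.

```python
# Hypothetical illustration, not pandas internals: reconstructing column
# positions from labels alone fails once labels repeat.
axis_labels = ['a', 'b', 'a']   # the frame's column index
block_items = ['a', 'a', 'b']   # labels stored alongside a block's values

# The only lookup labels give us is "first position of this label":
positions = [axis_labels.index(item) for item in block_items]

print(positions)  # both 'a' items collapse onto position 0
```

This is why the new pickle format stores block placements (positions) rather than relying on label lookup.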
@immerrr sounds good

Can we hope for a new release with this fix anytime soon? I'm hitting this bug in 0.14.1.

This is fixed in 0.14.1; what seems to be the problem?

Yes, I see that it's still affecting me with the current master version. Here is the pickle in a gist, or on Dropbox if it's difficult to get a binary file from a gist.

How old is the pickle?
I created it in Pandas 0.13.1 and I can read it normally with that version. I could serialise it to some different format, but the values are some custom objects, so pickling is really the fastest way. When I debug through the code, the offending line seems to be:

which seems odd, as the index is stored both in items and in self.axes[0]. The object seems OK, aside from the duplicate values.
This issue is not fixable, in the sense that older pickles from before the fix are not unpicklable; going forward, it will allow pickles with duplicate indices to work. You could try cherry-picking that commit into an older install and then repickling, or export from an older version of pandas in another format. There is simply not enough information in the older pickles to recreate the object unambiguously.

Much better is to put your ts in a frame.

Ah, OK. I will try doing that. Thanks!
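The "put your ts in a frame" workaround might look like the sketch below, assuming a pandas version where frames with duplicate index labels pickle correctly (0.14.1 or later); the column name `'values'` is arbitrary.

```python
import pickle

import numpy as np
import pandas as pd

# A series with a duplicate index label, as in the reports above.
s = pd.Series(np.arange(3), index=['a', 'b', 'a'])

# Wrap it in a one-column frame before pickling.
df = s.to_frame(name='values')

restored = pickle.loads(pickle.dumps(df))
print(restored['values'].tolist())  # [0, 1, 2]
```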
Hmm, both cases mentioned in this issue concerned series rather than frame/panel containers. I've double checked: in 0.13 it is an error to pickle/unpickle a dataframe with non-unique columns:

```
In [8]: pd.__version__
Out[8]: '0.13.1'

In [9]: df = pd.DataFrame(np.arange(15).reshape(5,3), columns=['a','b','a'])

In [10]: import pickle

In [11]: pickle.loads(pickle.dumps(df))
Out[11]: <repr(<pandas.core.frame.DataFrame at 0x7f8c827b6410>) failed: TypeError: 'NoneType' object has no attribute '__getitem__'>
```

but it's OK-ish for series:

```
In [14]: s = pd.Series(np.arange(3), index=['a','b','a'])

In [15]: pickle.loads(pickle.dumps(s))
Out[15]:
a    0
b    1
a    2
dtype: int64

In [16]: s
Out[16]:
a    0
b    1
a    2
dtype: int64
```

It should be relatively easy to unbreak compatibility for that particular use case, provided that there's a guarantee that SingleBlockManager's items were always in sync with those of its Block, at least when restricted to the public API only. That should be true for my implementation that went into 0.14, but I'm not that fluent in earlier versions of that code. @jreback, can you confirm or deny that?
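For reference, with the fix in place (pandas 0.14.1 or later), the frame case from the session above round-trips. A quick check, assuming a current pandas install:

```python
import pickle

import numpy as np
import pandas as pd

# The same frame that fails to unpickle under 0.13.1.
df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['a', 'b', 'a'])

restored = pickle.loads(pickle.dumps(df))

# Duplicate column labels and the values both survive the round trip.
print(list(restored.columns))  # ['a', 'b', 'a']
```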
@immerrr I think that is the case (they are always in sync). In fact, as long as the dups don't cross block boundaries I think you could assume that. FYI, also test with
I confirm - unpickling old data (created with 0.13.1) with 0.14.1 leads to:
@jreback, what's the routine to add pickle data for an old release? |
So, the routine is to check out 0.13.1, add the necessary series to be pickled, re-generate the files and then discard the modified script? My spidey senses are tingling...
No, add the changes to master (in that script). That script should be completely portable across versions. Then run the script on 0.13.1 (in, say, a virtual environment; I just copy the script and run it somewhere it has no dependencies other than the installed pandas), take the generated output and put it in the appropriate place. This will be run as an additional test. So you generate pickles as of that version.
That script is portable, but unpickling the data generated by that script is not, ever since frames with non-unique columns were added to it, so the test will fail both in 0.13.1 and later on (for the reasons discussed in this issue). Or do I get it wrong?
And yes, missed that part:
The changes are already there; both series and frames with non-unique items are generated AFAIR, it's just that the 0.13 pickles have not been re-generated.
I agree, and that's the problem! I think it's just an incompatible break that we cannot do anything about. Maybe have a note in the documentation?
Well, like I said, the series case can be fixed rather easily. It's just that it won't be tested for back compatibility in future releases unless the 0.13 pickles are re-generated. So, putting in an untested fix now will grant (at least) three more months for users to migrate to 0.14.1. Or I could put that particular use case into the legacy_pickle directory on its own; that would look ugly, but should work. Ugh, that pickle support code is so inflexible...
@immerrr if you can put a back-compat fix in place, then simply add the fix (and put the test in generate_legacy_pickles) for testing (with 0.13.1). That is what you are proposing, right?
The test is already in generate_legacy_pickles, but I cannot re-run that as-is on 0.13.1, because that will generate non-unpicklable frames and panels, too. The solutions I see, in no particular order, are:

None of the solutions is perfect, but I'd vote for improving user experience even if it means taking a hit in terms of maintainability.
You don't need to do any of that. Just modify the generator (in master) to make it generate a dup-series (and not frame/panel) if we aren't saying that they are back-compat (you could put something in the generator to say only generate the pickles for frame/panel if it's run with a pandas >= 0.14.1). Then generate a 0.13.1 pickle. Done. It should fail until the fix is in master; then, after the fix, it should pass.

Yeah, that should do the trick.
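The version gate described above could be sketched in plain Python; the function names and the simple version parsing are made up for illustration and are not part of the actual generator script.

```python
def parse_version(version):
    """Turn 'X.Y.Z' into a comparable tuple of ints (release parts only)."""
    return tuple(int(part) for part in version.split('.')[:3])

def should_generate_dup_frame_pickles(pandas_version):
    # Frames/panels with duplicate columns only unpickle correctly from
    # 0.14.1 onwards, so the generator skips them under an older pandas.
    return parse_version(pandas_version) >= (0, 14, 1)

print(should_generate_dup_frame_pickles('0.13.1'))  # False
print(should_generate_dup_frame_pickles('0.14.1'))  # True
```

In the real script, `pandas_version` would come from `pandas.__version__` at generation time.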