Skip to content

BUG: compat_pickle should not modify global namespace #5661

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 8, 2013

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Dec 7, 2013

turns out was modifying the python pickle just by importing pandas

when sub classing have to copy a mutable property before modifying

http://stackoverflow.com/questions/20444593/pandas-compiled-from-source-default-pickle-behavior-changed

@acorbe
Copy link
Contributor

acorbe commented Dec 8, 2013

Hi,

I updated my question on SO (http://stackoverflow.com/questions/20444593/pandas-compiled-from-source-default-pickle-behavior-changed). I applied the patch, although things are not working yet.

Thanks for your support.

@jreback
Copy link
Contributor Author

jreback commented Dec 8, 2013

pls provide a link to your file. 100mb is no big deal. did u pickle this is 0.12? give me an example of what u r doing

@acorbe
Copy link
Contributor

acorbe commented Dec 8, 2013

Hi
pandas version (from canopy package manager)

Size: 7.32 MB
Version: 0.12.0
Build: 2
Dependencies:
 numpy 1.7.1
 python_dateutil
 pytz 2011n

  md5: 7dd4385bed058e6ac15b0841b312ae35

this is a link to the file https://www.dropbox.com/s/6g4ej7ru5244e35/pickle_L1cor_fd.pic

I am not quite sure this job alone can do the job.

What is pickled is a list of dicts, One or more dicts entries are DataFram; then there is more.

I can try to extract some minimal working thing from my code in case it is necessary.

As I wrote on SO, after changing the code I have this issue

In [4]: pickle.load(open('pickle_L1cor_s1.pic','rb'))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-88719f8f9506> in <module>()
----> 1 pickle.load(open('pickle_L1cor_s1.pic','rb'))

/home/acorbe/Canopy/appdata/canopy-1.1.0.1371.rh5-x86_64/lib/python2.7/pickle.pyc in load(file)
   1376
   1377 def load(file):
-> 1378     return Unpickler(file).load()
   1379
   1380 def loads(str):

/home/acorbe/Canopy/appdata/canopy-1.1.0.1371.rh5-x86_64/lib/python2.7/pickle.pyc in load(self)
    856             while 1:
    857                 key = read(1)
--> 858                 dispatch[key](self)
    859         except _Stop, stopinst:
    860             return stopinst.value

/home/acorbe/Canopy/appdata/canopy-1.1.0.1371.rh5-x86_64/lib/python2.7/pickle.pyc in             load_reduce(self)
   1131         args = stack.pop()
   1132         func = stack[-1]
-> 1133         value = func(*args)
   1134         stack[-1] = value
   1135     dispatch[REDUCE] = load_reduce

TypeError: _reconstruct: First argument must be a sub-type of ndarray

Let me know.
Thanks

jreback added a commit that referenced this pull request Dec 8, 2013
BUG: compat_pickle should not modify global namespace
@jreback jreback merged commit 98e48ca into pandas-dev:master Dec 8, 2013
@jreback
Copy link
Contributor Author

jreback commented Dec 8, 2013

I updated the SO question. You can just

pd.read_pickle(file_name) and it will work.

@ruidc
Copy link
Contributor

ruidc commented Jan 13, 2014

couldn't/shouldn't the unpickler handle both versions internally to keep external compatibility?

@jreback
Copy link
Contributor Author

jreback commented Jan 13, 2014

it does (it tries the unmodified version first then with modifications if that fails)

@ruidc
Copy link
Contributor

ruidc commented Jan 13, 2014

but there's still problems when using just pickle.loads(data) with a 0.12 pickle:

TypeError: ('_reconstruct: First argument must be a sub-type of ndarray', <built-in function _reconstruct>, (<class 'pandas.core.series.Series'>, (0,), 'b'))

we shouldn't have to now go to a file (or file-like) object and use read_pickle.

@jreback
Copy link
Contributor Author

jreback commented Jan 13, 2014

no way around that unless I modify the global namespace; just use pd.read_pickle instead; see the linked issue

@jreback
Copy link
Contributor Author

jreback commented Jan 13, 2014

see here as well: (and its in 0.13 whatsnew) http://pandas.pydata.org/pandas-docs/dev/io.html#io-pickle

@ruidc
Copy link
Contributor

ruidc commented Jan 13, 2014

Oh, I saw it, just thought there should be a way to avoid it.

@sadruddin
Copy link

To put it another way, how should one now read a pickle contained in a string? A common case for that is a pickle stored in a BLOB in a database, does one really have to first save the data in a file to be able to use pd.read_pickle?

@jreback
Copy link
Contributor Author

jreback commented Jan 13, 2014

pickle didn't before read from a file previously, you can just wrap it in a StringIO. You could make an issue for this if you want (to enable that enhancement).

e.g.

pd.read_pickle(StringIO(data))

@ruidc
Copy link
Contributor

ruidc commented Jan 13, 2014

which is what pickle.loads does internally - I'm more concerned about having to change our widespread existing pickle.loads() calls to pd.read_pickle, thinking that there would be a call to the pandas Unpickler somewhere in this process but there doesn't appear to be a way out without touching global namespace.

@jreback
Copy link
Contributor Author

jreback commented Jan 13, 2014

you could patch it internally if you want, esentially reverse this fix (see compat/pickle_compat.py)
Its basically a 1-line change (just remove the copy.copy).

@ruidc
Copy link
Contributor

ruidc commented Jan 13, 2014

or perhaps a pandas option to turn it back on? what were the negative consequences of having it that way?

@jreback
Copy link
Contributor Author

jreback commented Jan 13, 2014

we could add an option (would have to happen in 0.13.1 though). Their was a 'bug' because of how I was calling it.

More generally its 'bad practice' to monkey patch another module at run-time.

will create an issue for an option and for pickle read from a string (that is sort of a separate issue though)

@ruidc
Copy link
Contributor

ruidc commented Jan 14, 2014

Having that done in read_pickle is not much benefit, the real issue for me is pickle.load/s breaking for old pickles.

I tried your suggestion of commenting out the copy.copy and it had no impact:

  File "C:\Python27\lib\pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "C:\Python27\lib\pickle.py", line 858, in load
    dispatch[key](self)
  File "C:\Dev\src\pandas\pandas\compat\pickle_compat.py", line 29, in load_reduce
    value = func(*args)
TypeError: _reconstruct: First argument must be a sub-type of ndarray

I don't understand all the mechanics but saw your discussion with Wes on http://grokbase.com/t/python/pandas-dev/134n2f0xw2/pickle-is-evil that these are recurring issues with pickle stability and limitations in the way python does it. As it's numpy complaining, is there no way to improve the situation from there?

Otherwise we will probably be forced to update our pickled objects when this pops up.

@jreback
Copy link
Contributor Author

jreback commented Jan 14, 2014

yep....saving pickles is not a particularly efficient / good idea, but I know people do it. I always use HDF.

why can't you use read_pickle? this will read old pickles, if not pls post a reproducible example

@ruidc
Copy link
Contributor

ruidc commented Jan 14, 2014

well, that's what I'm trying to do at the moment:

  File "C:\Dev\intranet\rdc_test\test_unpickle.py", line 13, in corestone_loads
    return pandas.read_pickle(StringIO.StringIO(s))
  File "C:\Dev\src\pandas\pandas\io\pickle.py", line 49, in read_pickle
    return try_read(path)
  File "C:\Dev\src\pandas\pandas\io\pickle.py", line 45, in try_read
    with open(path, 'rb') as fh:
TypeError: coercing to Unicode: need string or buffer, instance found

where s is the pickled object string.
To elaborate, we have pickles which may include pandas objects, and use cPickle.loads to save and load them into database, so am attempting to replace our calls to cPickle.loads with a wrapper to test for this and call read_pickle.

@ruidc
Copy link
Contributor

ruidc commented Jan 14, 2014

...perhaps we can have a read_pickle equivalent that takes a file-like object? Or is this what you intended in Issue #5924 ?

@jreback
Copy link
Contributor Author

jreback commented Jan 14, 2014

yes....that is what #5924 is for....its pretty easy to do though

read_pickle(StringIO.StringIO(string))

@ruidc
Copy link
Contributor

ruidc commented Jan 14, 2014

well, in case this helps anyone else, we'll end up monkey patching cPickle.loads instead in sitecustomize.py:

import cStringIO
import pandas
print pandas.__version__
from pandas.compat import pickle_compat
import cPickle

original_loads = cPickle.loads

def our_loads(s):
    try:
        return original_loads(s)
    except TypeError as e:
        if "_reconstruct: First argument must be a sub-type of ndarray" in str(e):
            return pickle_compat.load(cStringIO.StringIO(s), compat=True)
        else:
            raise

cPickle.loads = our_loads   

in combination with raising a warning to at least to buy us time to replace our 0.12 pickles

@jreback
Copy link
Contributor Author

jreback commented Jan 14, 2014

ok...looks reasonable

@jreback
Copy link
Contributor Author

jreback commented Jan 15, 2014

@ruidc not sure if you saw this new feature in 0.13: http://pandas.pydata.org/pandas-docs/dev/io.html#io-msgpack. Pretty competetive with pickle and easier on compat.

@ruidc
Copy link
Contributor

ruidc commented Jan 15, 2014

Thanks.  I think we'll wait for the stability this time around:)

.

@jreback
Copy link
Contributor Author

jreback commented Jan 15, 2014

@ruidc haha! though that's the selling point of msgpack...that its backwards compatible (e.g. it doesn't care about sub-classing for example)...

@ruidc
Copy link
Contributor

ruidc commented Jan 15, 2014

sounds great to me, will just wait until it's no longer experimental in pandas - can you tell i've been bitten from sailing too close to the wind lately? it's good to see pandas making great progress and using best-in-breed solutions though.


From: jreback [email protected]
To: pydata/pandas [email protected]
Cc: ruidc [email protected]
Sent: Wednesday, 15 January 2014, 19:32
Subject: Re: [pandas] BUG: compat_pickle should not modify global namespace (#5661)

@ruidc haha! though that's the selling point of msgpack...that its backwards compatible (e.g. it doesn't care about sub-classing for example)...

Reply to this email directly or view it on GitHub.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants