Serializing Pandas Functions #12021
Why do you need to serialize the functions? If pandas is installed on the other machine, that shouldn't be necessary.
I need to communicate to a process on the other machine which pandas function to run.
@mrocklin: I'm not seeing
Some adjustments to
So the problem with the
how is the
In [1]: import pandas as pd
In [2]: import cloudpickle
In [3]: s = pd.Series([1, 2, 3])
In [4]: sum2 = cloudpickle.loads(cloudpickle.dumps(pd.Series.sum))
In [5]: pd.Series.sum(s)
Out[5]: 6
In [6]: sum2(s)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-49effaf88fb4> in <module>()
----> 1 sum2(s)
TypeError: unbound method stat_func() must be called with NoneType instance as first argument (got Series instance instead)
I think this was more an error with cloudpickle than with pandas. It was fixed in a recent cloudpickle PR.
So in your example above it's serializing an unbound method.
You can serialize a bound method, and then it works.
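As a rough illustration of the bound-method case, here is a minimal sketch that uses only the standard library's pickle on Python 3 and a toy Series class. Both are assumptions made for illustration: the thread itself concerns cloudpickle and the real pandas.Series, and Python 2's pickle does not handle bound methods this way. On Python 3 a bound method is pickled as a getattr call on the pickled instance, so the instance travels along with it.

```python
import pickle


class Series:
    """Toy stand-in for pandas.Series (hypothetical, for illustration only)."""

    def __init__(self, values):
        self.values = list(values)

    def sum(self):
        # Inside a method body, the bare name `sum` is the builtin.
        return sum(self.values)


s = Series([1, 2, 3])

# Pickling the bound method pickles the instance plus the method name;
# loading it rebuilds the instance and looks the method up again.
restored = pickle.loads(pickle.dumps(s.sum))
result = restored()
```

The restored method is bound to a copy of the instance, not to the original object, which may or may not be what a remote worker needs.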
In this case I want to serialize an unbound method.
so this is a
It can be handled in either. Like Mike said earlier, a little bit of help from Pandas can go a long way here. It's important to make sure that all functions maintain metadata like
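To make the point about metadata concrete, here is a small sketch of how a wrapper can lose or keep a function's identifying attributes. It uses functools.wraps from the standard library; the function and wrapper names are hypothetical, and nothing here is taken from the pandas source.

```python
import functools


def plain_wrapper(func):
    # No metadata is copied: the result reports itself as "wrapper".
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def preserving_wrapper(func):
    # functools.wraps copies __name__, __qualname__, __module__ and __doc__.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def total(values):
    """Add up the values."""
    return sum(values)


lossy = plain_wrapper(total)
kept = preserving_wrapper(total)
```

Serializers that inspect a function's name and module see only what these attributes report, so `lossy` looks like an anonymous local function while `kept` still looks like `total`.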
no problem with changing things. maybe @mmckerns has an example of how you are constructing the closures. maybe we aren't setting something up correctly. (we are setting
Not sure what
I guess, in short, it's not so much about how
Messing with
The problem is more extensive. I suspect you would need to be more careful about how you wrap things. Here is a fail case not handled by that PR.
In [1]: import pandas as pd
In [2]: from cloudpickle import dumps, loads
In [3]: pd.Series.sum
Out[3]: <function pandas.core.generic._make_stat_function.<locals>.stat_func>
In [4]: loads(dumps(pd.Series.sum))
Out[4]: <function stat_func>
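The `<locals>` in the repr above is what defeats by-reference pickling: the function's qualified name points inside the factory that built it, not at the class attribute it was later assigned to. A minimal sketch with the standard library's pickle on Python 3 and a toy class shows the same shape. The factory and class below are hypothetical stand-ins, not the pandas code.

```python
import pickle


def _make_stat_function():
    # Hypothetical factory: builds the method at runtime as a nested function.
    def stat_func(self):
        return sum(self.values)
    return stat_func


class Series:
    """Toy stand-in for pandas.Series (illustration only)."""

    def __init__(self, values):
        self.values = list(values)


# Attach the generated function to the class after the fact.
Series.sum = _make_stat_function()

# The qualified name still points inside the factory, not at Series.sum.
qualname = Series.sum.__qualname__

try:
    pickle.dumps(Series.sum)
    pickled_by_reference = True
except (pickle.PicklingError, AttributeError):
    # pickle cannot resolve a "<locals>" path by attribute lookup.
    pickled_by_reference = False
```

Both exception types are caught because the exact error raised for local objects differs between Python versions and between the C and pure-Python pickle implementations.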
I checked out 0.2.1 of
Try Python 3
so if you have
how do I set the name instead of
Looks like the attributes
@kawochen do you have a reference on how/what to do with the
yes I think you just change
@mrocklin I updated #12372 can you see if this works for you? I could make this anything actually. What do you think
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.18.0rc1+28.gdcfadad'
In [3]: from cloudpickle import dumps, loads
In [4]: pd.Series.sum
Out[4]: <function pandas.core.generic._make_stat_function.<locals>.sum>
In [5]: loads(dumps(pd.Series.sum))
Out[5]: <function stat_func>
In [6]: s = pd.Series([1, 2, 3])
In [7]: pd.Series.sum(s)
Out[7]: 6
In [8]: loads(dumps(pd.Series.sum))(s)
Out[8]: 6
In [9]: pd.Series.sum.__name__
Out[9]: 'sum'
In [10]: pd.Series.sum.__module__
Out[10]: 'pandas.core.generic'
In [11]: pd.Series.sum.__qualname__
Out[11]: '_make_stat_function.<locals>.sum'
In [12]: loads(dumps(pd.Series.sum)).__name__
Out[12]: 'stat_func'
In [13]: loads(dumps(pd.Series.sum)).__module__
In [14]: loads(dumps(pd.Series.sum)).__qualname__
Out[14]: 'stat_func'
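In the session above, `__name__` is already 'sum' but `__qualname__` still contains `<locals>`. For comparison, here is a sketch of what the standard library's pickle needs on Python 3 in order to pickle such a function by reference: a `__qualname__` that can be resolved by attribute lookup from the module. The toy class and factory are hypothetical, and this is not a claim about what #12372 does or about how cloudpickle decides what to serialize.

```python
import pickle


class Series:
    """Toy stand-in for pandas.Series (illustration only)."""

    def __init__(self, values):
        self.values = list(values)


def _make_stat_function(cls, name):
    # Hypothetical factory that also fixes up the generated function's identity.
    def stat_func(self):
        return sum(self.values)

    stat_func.__name__ = name
    # Point the qualified name at the class attribute the function will live on.
    stat_func.__qualname__ = f"{cls.__qualname__}.{name}"
    return stat_func


Series.sum = _make_stat_function(Series, "sum")

# pickle now stores only "module + Series.sum" and looks it up again on load.
restored = pickle.loads(pickle.dumps(Series.sum))
```

Because the pickle holds only a reference, the receiving process must be able to import the same module and find the same attribute there, which is the situation described at the top of the thread where pandas is installed on both machines.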
ok latest push does this, but still not sure what
I just tried adding on
I added breakpoints inside of cloudpickle to see where this would end up. Arrived at the
Although it also ended up there seven times for one call to
Cloudpickle is only 700 lines. I think it's worth skimming it to see the kinds of things they look for.
yep, it looks at first glance that the module is good if
yeh, I think we are not setting it up correctly to save a closure that is actually an instancemethod. It tries to save like a module level function I think.
If you want to see what
@mmckerns FYI I've run into strange behavior when using
Providing traces though is really slick.
thanks @mmckerns ok will get to this prob next week. In any event it seems that what we are doing is using a static function which we then assign to a class at run-time (so it's a method now at least in py3, in py2 have this bound nonsense...). I am thinking that on deserialization it cannot be found (even though I am setting the
is that a reasonable description?
@jreback: I think you have a reasonable picture of it. Note that the trace starts by going into
@mrocklin: I'm guessing you mean that
I merged the fixes for
lmk if anything else.
In recent efforts using Pandas on multiple machines I've found that some of the functions are tricky to serialize. Apparently this might be due to runtime generation. Here are a few examples of serialization breaking, occasionally in unpleasant ways:
Lest you think that this is just a problem with pickle (which has many flaws), dill, a much more robust function serialization library, also fails (the failure here is py35 only). (cc @mmckerns)
In this particular case, though, cloudpickle will work.
Other functions have this problem as well. Consider the series methods:
In this case, concerningly, cloudpickle completes but returns a wrong result:
I've been able to fix some of these in cloudpipe/cloudpickle#46 but generally speaking I'm running into a number of problems here. It would be useful if, during the generation of these functions, we could at least pay attention to assigning metadata like __name__ correctly. This one in particular confused me for a while:
What would help?