DEPR: Special casing of NumPy and Python builtin functions #53425
Comments
A few questions to start the discussion:
|
I hope not - as a pandas user it had never occurred to me to pass The OP of #53426 demonstrates different floating point results with
Edit: The example below fails because it uses Code
|
I think #52042 means the DataFrameGroupBy cases here will be emitting FutureWarnings anyway, so affected users will need to update their code regardless. |
I would be +1 for not doing this special casing |
I was a bit incorrect in #53425 (comment); I've updated that comment (see the crossed out line) and added this as another example of not-great current behavior to the OP. |
I think I can remember that this may have been done for performance reasons, i.e. the builtin methods are faster. Could you check on that? |
Yes - the builtin methods are quite a bit faster. But users can still use them by passing e.g. |
I tried this:
>>> df = pd.DataFrame({"a": [1,1,1,2,2], "b": [4,5,6,4,pd.NA]}, dtype="int32[pyarrow]")
>>> df.agg(np.sum, axis=0)
a 7
b 19
dtype: int64
Seems to work after removing calls to I think I'm probably ok with it being deprecated though. |
Ack! My original example used |
Cool 👍 . |
After this deprecation has been enforced: Just a heads up that this needs to be fixed before v3.0. Implementing #54747 v2 (I will push today) will fix this, but you should probably be aware of it in case something stops #54747. |
@topper-123 - how are you concluding this? If I enforce the deprecation here as well as the one in #53325, then I get:
Am I doing something wrong? |
Currently various code paths detect whether a user passes e.g. Python's builtin `sum` or `np.sum` and replace the operation with the pandas equivalent. I think we shouldn't alias a potentially valid UDF to the pandas version; this prevents users from using the NumPy version of the method if that is indeed what they want. This can lead to some surprising behavior: for example, NumPy `ddof` defaults to 0 whereas pandas defaults to 1. We also only do the switch when there are no args/kwargs provided. Thus:
It can also lead to different floating point behavior:
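As a rough illustration of the `ddof` and floating point differences described here, the sketch below calls the two implementations directly rather than through `agg`/`apply`. The data and variable names are made up for illustration. The array is converted with `to_numpy()` so that NumPy's own implementation runs instead of dispatching back to the Series method.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0, 4.0])

# pandas computes the sample standard deviation (ddof=1) by default.
pandas_std = ser.std()

# NumPy computes the population standard deviation (ddof=0) by default.
# Converting to an ndarray first makes sure NumPy's implementation is used.
numpy_std = np.std(ser.to_numpy())

print(pandas_std)  # approx. 1.2910
print(numpy_std)   # approx. 1.1180

# Summation order can differ between implementations, so results may
# disagree in the last few bits. This is not guaranteed for every input.
values = pd.Series([0.1] * 10)
print(sum(values), values.sum())
```

Whether the two sums printed at the end actually differ depends on the input and on the installed versions, so no particular output is claimed for that line.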
However, using the Python builtin functions on DataFrames can give some surprising behavior:
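A minimal sketch of this kind of surprise, using a small made-up frame: iterating over a DataFrame yields its column labels, so the builtin `max` compares labels, whereas iterating over a Series yields its values.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 0]})

# Iterating a DataFrame yields column labels, so the builtin compares
# the strings "a" and "b" rather than the data.
print(max(df))       # b

# Iterating a Series yields its values.
print(max(df["a"]))  # 2

# The pandas method reduces each column separately.
print(df.max())
```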
If we remove special casing of these builtins, then apply/agg/transform will also exhibit this behavior. In particular, whether a DataFrame is broken up into Series internally to compute the operation will determine whether we get a result like `2` or `b` above.
However, I think this is still okay to do as the change to users is straightforward: use `"max"` instead of `max`.
I've opened #53426 to see what impact this would have on our test suite.
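The suggested user-side change can be sketched as follows, assuming a small made-up frame with hypothetical column names `key` and `val`. Passing the string alias asks for the pandas implementation explicitly, so it does not rely on any special casing of the builtin.

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "x", "y"], "val": [1, 5, 3]})

# String alias on a DataFrame: reduces each selected column with pandas' max.
col_max = df[["val"]].agg("max")

# The same alias works for groupby aggregations.
group_max = df.groupby("key")["val"].agg("max")

print(col_max)
print(group_max)
```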
cc @topper-123 @mroeschke @jbrockmendel