Improve the apply/map APIs #61128
Comments
take |
One reason to prefer strings is the ability to expand on the options in the future, but I can't see a use for that here. +1 on
This was also proposed in #40112. Chatting privately with some core developers, there was some opposition that the change from
This was indeed the road map and the reason why cc @Dr-Irv |
IMHO, this issue has a mix of proposals for cleaning up the pandas API:
I'm a big fan of having clean APIs. But I have to ask whether we should separate this issue into the 4 issues above (with (2) already being covered), so we can then make decisions on each? |
Agree with @Dr-Irv. I think this issue has been good as a temperature check, but it's worth opening specific issues if they don't exist, so we can have a more focused discussion when needed. I'll take care of it. I may not open them all at once, but rather as I get to working on things. Thanks all for the great feedback. |
In mature projects like pandas, where a vast and diverse user base relies on established behaviors, the trade-off between modern design and legacy support is always challenging. While the drive for a cleaner and more consistent API is a worthy goal, these proposals must be balanced against the risks of breaking backward compatibility and the existing ecosystem’s familiarity with the current approach. It is important to note that even modest API changes can introduce subtle bugs or require significant code refactoring across projects, which underscores the need for careful evaluation. Previous discussions indicate that core developers are cautious about changes that offer only modest gains yet would impose a heavy migration burden on existing codebases. So on balance, @datapythonista, your strategy of consolidating these issues into a single, unified proposal appears wise. When examined individually, each change might seem insufficient to justify the associated churn; however, when combined, these incremental improvements can collectively yield a significantly more consistent and intuitive API, thereby justifying the transition. The downside is this could lead to decisions that favor aggregate gains while overlooking important details that would have emerged in a more focused debate. So opening sub issues is a great idea, but probably also worth keeping this one as the master/tracker? |
Thanks for the feedback. I think many cases considered here have no compatibility issues, like adding the missing I think the changes that do have compatibility implications have a very nice transition path. Whether it's worth it is difficult to tell. But I think we've been saying for years "don't use apply, since it's slow", and now we may have the opportunity to make it fast in some cases, with non-numpy UDFs and JIT compilation. Very opinionated, but in my mind this could increase usage by one or two factors, and that requires a nice intuitive API. Assuming this is true (and I understand others will have much lower or different expectations of the potential of these functions), to me it's worth the clean up. But since I think people are happy with the general, let's have the discussions on an individual basis, and see how far we want to get with the changes. |
The APIs for the `apply` and `map` methods seem far from ideal. They were created in the very early days of pandas. Since then both pandas and Python have changed a lot, we have much more experience, and the environment is different (type checking, among other things).

A good first example is the `na_action` parameter of `map`. I assume it was designed thinking that different actions could be applied when dealing with missing values in an elementwise operation. In practice, more than 15 years later, none has been implemented. And the resulting API is in my opinion far from ideal.

This also makes type checking unnecessarily complex. A better API would be using just a boolean `skip_na` or `ignore_na`.

Another example is the inconsistency with `args` and `kwargs`. Some functions have both, some have just `kwargs`, and we've recently been adding a few that were missing. Also, when it exists, `args` is a regular parameter, while `kwargs` is a `**` parameter, which is by itself inconsistent, and also confusing, especially now that the number of parameters has slightly increased. I don't think even advanced pandas users would be able to easily tell what parameters will be passed to the function. A much clearer API would be:
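As a minimal sketch of that idea in plain Python (not the pandas implementation): the hypothetical helper `apply_explicit` below receives the UDF's positional and keyword arguments through explicit `args` and `kwargs` parameters, so they cannot be confused with the helper's own options. The helper, its `axis` option, and the `scale` UDF are all assumptions made for illustration.

```python
from typing import Any, Callable, Mapping, Optional, Sequence


def apply_explicit(
    rows: Sequence[Sequence[float]],
    func: Callable[..., Any],
    *,
    axis: int = 0,
    args: tuple = (),
    kwargs: Optional[Mapping[str, Any]] = None,
) -> list:
    """Toy axis-wise apply over a list of rows (hypothetical, not pandas)."""
    func_kwargs = dict(kwargs or {})
    if axis == 0:
        # One group per column.
        groups = [list(column) for column in zip(*rows)]
    elif axis == 1:
        # One group per row.
        groups = [list(row) for row in rows]
    else:
        raise ValueError("axis must be 0 or 1")
    # Only `args` and `kwargs` are forwarded to the user function.
    return [func(group, *args, **func_kwargs) for group in groups]


def scale(values, factor, *, offset=0):
    """Example UDF: sum the values, then scale and shift the total."""
    return sum(values) * factor + offset


data = [[1, 2], [3, 4]]
# `axis` belongs to apply_explicit; `args` and `kwargs` belong to `scale`.
result = apply_explicit(data, scale, axis=0, args=(10,), kwargs={"offset": 1})
```

With this shape, a UDF could even take its own keyword named `axis` without clashing with the helper's option, since it would travel inside `kwargs`.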
I think in this call it's immediately clear to users which are `apply` arguments and which are `func` arguments.

Another inconsistency is the `arg`/`func` parameter in `Series.map` and `DataFrame.map`. While the functions are conceptually the same, just applying the operation to either a `Series` or a `DataFrame`, the signature and the behavior differ slightly, as `Series` will accept a dictionary and `DataFrame` won't. Given that a dictionary can be converted to a function by just appending `.get` to it, I think it'd be better to make the functions consistently accept Python callables or numpy ufuncs.

Finally, the methods have had their evolution, including the existence and deletion of `applymap`, but at this point it is also probably a good idea to deprecate the legacy behavior of `Series.apply` behaving as `Series.map` depending on the value of `by_row`, which is the default. This is a bit tricky for backward compatibility reasons, but I think it eventually needs to be done, as it makes the API very counter-intuitive. `map` being always elementwise, and `apply` being always axis-wise, will make users' lives much easier, and the usage much easier to learn and explain.

We can also discuss `result_type` and `by_row` in `DataFrame.apply`, which are very hard to understand.
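The boolean `skip_na` and the dictionary points can be sketched in plain Python. The hypothetical `map_values` below is not a pandas function. It assumes missing values are represented by `None` or a float NaN, takes a boolean `skip_na` instead of a string option, and accepts only callables, with a dictionary passed as its bound `.get` method.

```python
import math
from typing import Any, Callable, Iterable, List


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def map_values(
    values: Iterable[Any],
    func: Callable[[Any], Any],
    *,
    skip_na: bool = False,
) -> List[Any]:
    """Toy elementwise map (hypothetical, not pandas).

    With skip_na=True, missing values are passed through untouched
    instead of being handed to `func`.
    """
    return [v if skip_na and is_missing(v) else func(v) for v in values]


codes = {"a": 1, "b": 2}
# A dictionary becomes a callable by using its `.get` method.
mapped = map_values(["a", "b", "c"], codes.get)
# `len(None)` would raise, so the missing entry is skipped.
lengths = map_values(["ab", None, "cde"], len, skip_na=True)
```

In this sketch, keys missing from the dictionary come back as `None`, because that is what `dict.get` returns by default.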