Improve the apply/map APIs #61128

Open
datapythonista opened this issue Mar 15, 2025 · 6 comments
Assignees
Labels
API - Consistency Internal Consistency of API/Behavior API Design Apply Apply, Aggregate, Transform, Map Needs Discussion Requires discussion from core team before further action

Comments

@datapythonista
Member

The APIs for the apply and map methods are not ideal. They were designed in the very early days of pandas; since then both pandas and Python have changed substantially, we have much more experience, and the environment is different (type checking, among other things).

A good first example is the na_action parameter of map. I assume it was designed with the idea that different actions could be applied when dealing with missing values in an elementwise operation. In practice, more than 15 years later, no action other than "ignore" has been implemented, and the resulting API is in my opinion far from ideal:

df.map(func, na_action=None)
df.map(func, na_action="ignore")

This also makes type checking unnecessarily complex. A better API would use just a boolean skip_na or ignore_na:

df.map(func, skip_na=False)
df.map(func, skip_na=True)
df.map(func, skip_na=action == "ignore")
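To make the contrast concrete, here is a small sketch of today's na_action behavior next to the proposed boolean. Note that the boolean parameter (skipna) is hypothetical and does not exist in pandas today:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Today's API: na_action is either None (call func on NaN too)
# or the string "ignore" (propagate NaN without calling func).
doubled = s.map(lambda x: x * 2, na_action="ignore")

# The proposed boolean would express the same intent directly
# (hypothetical parameter, not part of pandas):
# s.map(lambda x: x * 2, skipna=True)
```

With na_action="ignore", the NaN passes through untouched and the lambda only sees the non-missing values.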

Another example is the inconsistency with args and kwargs. Some methods have both, some have just kwargs, and we have recently been adding a few missing ones. Also, where it exists, args is a regular parameter while kwargs is a ** parameter, which is inconsistent in itself and also confusing now that the number of parameters has slightly increased. For example:

df.apply(func, 0, result_type=None, result_format="reduction", engine=numba.njit, engine_params={"val": 0})

I don't think even advanced pandas users would be able to easily tell what parameters will be passed to the function. A much clearer API would be:

df.apply(func, args=("reduction",), kwargs={"engine_params": {"val": 0}}, axis=0, result_type=None, engine=numba.njit)

I think in this call it is immediately clear to users which arguments belong to apply and which are passed to func.
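The split can be seen with pandas' current signature, where positional extras travel through args but keyword extras are swallowed by **kwargs. The explicit kwargs dict shown in a comment is the hypothetical variant proposed above, not current pandas:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

def add(col, offset, scale=1):
    return (col + offset) * scale

# Today: positional extras go through ``args`` while keyword extras are
# collected by ``**kwargs`` -- two different mechanisms in one call:
mixed = df.apply(add, args=(10,), scale=2)

# With an explicit ``kwargs`` dict, the boundary between pandas' own
# parameters and the UDF's parameters would be unambiguous (hypothetical):
# df.apply(add, args=(10,), kwargs={"scale": 2})
```

In the current call, a reader has to know apply's full signature to tell that scale is forwarded to add rather than consumed by pandas.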

Another inconsistency is the arg / func parameter in Series.map and DataFrame.map. While the methods are conceptually the same, just applying the operation elementwise to either a Series or a DataFrame, the signature and behavior change slightly: Series will accept a dictionary, and DataFrame won't. Given that a dictionary can be converted to a function by just using its .get method, I think it would be better to have func consistently accept Python callables or numpy ufuncs.
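The dict-to-callable conversion mentioned above is a one-liner today, which is part of the argument that dict support in Series.map is redundant (the two forms differ only in how unmatched keys surface, NaN vs None):

```python
import pandas as pd

s = pd.Series(["cat", "dog", "fish"])
mapping = {"cat": "feline", "dog": "canine"}

# Series.map accepts a dict directly; unmatched keys become NaN:
via_dict = s.map(mapping)

# Essentially the same behavior from a plain callable, a form that
# works anywhere a function is accepted:
via_get = s.map(mapping.get)
```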

Finally, the methods have their own evolution, including the existence and later removal of applymap, but at this point it is also probably a good idea to deprecate the legacy behavior of Series.apply acting as Series.map depending on the value of by_row, which is the default. This is a bit tricky for backward compatibility reasons, but I think it eventually needs to be done, as it makes the API very counter-intuitive. map being always elementwise, and apply being always axis-wise, will make users' lives much easier, and the API much easier to learn and explain.

We can also discuss result_type and by_row in DataFrame.apply, which are very hard to understand.

@datapythonista datapythonista added the Apply Apply, Aggregate, Transform, Map label Mar 15, 2025
@HoqueUM

HoqueUM commented Apr 1, 2025

take

@rhshadrach
Member

rhshadrach commented Apr 14, 2025

A better API would be using just a boolean skip_na or ignore_na

One reason to prefer strings is the ability to expand on the options in the future, but I can't see a use for that here. +1 on skipna (and not skip_na for API consistency).

Another example is the inconsistency with args and kwargs.

This was also proposed in #40112. Chatting privately with some core developers, there was some opposition that the change from **kwargs to kwargs would be too noisy for little gain. I am not personally of that opinion, but was the reason I stopped pushing for it.

but at this point is also probably a good idea to deprecate the legacy behavior of Series.apply behaving as Series.map depending on the value of by_row, which is the default. This is a bit tricky for backward compatibility reasons, but I think it eventually needs to be done, as it makes the API very counter-intuitive. map being always elementwise, and apply being always axis-wise, will make users life much easier, and the usage much easier to learn and explain.

This was indeed the roadmap and the reason why by_row was added in the first place: to make the transition smoother. However, that plan came to a halt when there were strong objections to the code churn that making users move from Series.apply to Series.map would cause. The inconsistencies between Series.apply and DataFrame.apply cannot be resolved without this.

cc @Dr-Irv

@rhshadrach rhshadrach added API Design Needs Discussion Requires discussion from core team before further action API - Consistency Internal Consistency of API/Behavior labels Apr 14, 2025
@Dr-Irv
Contributor

Dr-Irv commented Apr 14, 2025

IMHO, this issue has a mix of proposals for cleaning up the pandas API:

  1. Creating consistency on how NA handling is done, e.g., na_action, skip_na, etc.
  2. Creating consistency in use of args and kwargs (which I think is what API: Signature of UDF methods #40112 is about)
  3. Inconsistency between Series.map and DataFrame.map in the arguments that are accepted
  4. Deprecating the legacy behavior of Series.apply behaving as Series.map depending on the value of by_row

I'm a big fan of having clean APIs.

But I have to ask whether we should separate this issue into the 4 issues above (with (2) already being covered), so we can then make decisions on each?

@datapythonista
Member Author

Agree with @Dr-Irv. I think this issue has been good as a temperature check, but it is worth opening specific issues if they don't exist, so we can have a more focused discussion when needed. I'll take care of it. I may not open them all at once, but rather as I intend to work on things. Thanks all for the great feedback.

@simonjayhawkins
Member

Agree with @Dr-Irv. I think this issue has been good as a temperature check, but it is worth opening specific issues if they don't exist, so we can have a more focused discussion when needed.

In mature projects like pandas, where a vast and diverse user base relies on established behaviors, the trade-off between modern design and legacy support is always challenging.

While the drive for a cleaner and more consistent API is a worthy goal, these proposals must be balanced against the risks of breaking backward compatibility and the existing ecosystem’s familiarity with the current approach. It is important to note that even modest API changes can introduce subtle bugs or require significant code refactoring across projects, which underscores the need for careful evaluation.

Previous discussions indicate that core developers are cautious about changes that offer only modest gains yet would impose a heavy migration burden on existing codebases.

So on balance, @datapythonista, your strategy of consolidating these issues into a single, unified proposal appears wise. When examined individually, each change might seem insufficient to justify the associated churn; however, when combined, these incremental improvements can collectively yield a significantly more consistent and intuitive API, thereby justifying the transition.

The downside is that this could lead to decisions that favor aggregate gains while overlooking important details that would have emerged in a more focused debate. So opening sub-issues is a great idea, but it is probably also worth keeping this one as the master/tracker?

@datapythonista
Member Author

Thanks for the feedback. I think many of the cases considered here have no compatibility issues, like adding the missing args to methods that don't have them. This has actually been done recently in one method when a user needed it.

I think the changes that do have compatibility implications have a very nice transition path. Whether it is worth it is difficult to tell. But we've been saying for years "don't use apply, it's slow", and now we may have the opportunity to make it fast in some cases, with non-numpy UDFs and JIT compilation. Very opinionated, but in my mind this could increase usage by an order of magnitude or two, and that requires a nice, intuitive API. Assuming this is true (and I understand others will have much lower or different expectations of the potential of these functions), to me the clean-up is worth it.

But since I think people are happy with the general direction, let's have the discussions on an individual basis and see how far we want to take the changes.
