Skip to content

ENH: Allow s.map(d, na_action='raise') #60482

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

kopytjuk
Copy link
Contributor

@kopytjuk kopytjuk commented Dec 3, 2024

Addresses #14210

Often users use s.map() blindly (without checking the original series beforehand due to their expectations) and are surprised by downstream errors which are caused by the implicit replacement of values to np.nan if they are not part of the mapping dictionary:

import pandas as pd

s = pd.Series(["cat", "dog", "mouse"])
d = {"cat": "kitten", "dog": "puppy"}

res = s.map(d)

print(res)
# 0    kitten
# 1     puppy
# 2       NaN
# dtype: object

This example here also shows how a typo in the input data can cause unwanted trouble.

This PR adds a possibility to raise an error via s.map(d, na_option='raise'), if the mapping does not cover all values in the array.

Copy link
Contributor

github-actions bot commented Jan 3, 2025

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 3, 2025
@kopytjuk kopytjuk force-pushed the add-raise-to-series-map branch from 756b49f to 5efce7b Compare March 7, 2025 08:05
@kopytjuk kopytjuk changed the title ENH: Implement map_array(na_action='raise') ENH: Implement s.map(d, na_action='raise') Mar 7, 2025
@kopytjuk kopytjuk changed the title ENH: Implement s.map(d, na_action='raise') ENH: Allow s.map(d, na_action='raise') Mar 7, 2025
@kopytjuk
Copy link
Contributor Author

kopytjuk commented Mar 7, 2025

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Still interested!

@kopytjuk
Copy link
Contributor Author

kopytjuk commented Mar 8, 2025

Sorry for pinging @mroeschke - can you please spend a quick glimpse on that, if the design and the goal is worth following ... thank you in advance!

@datapythonista datapythonista added Apply Apply, Aggregate, Transform, Map and removed Stale labels Apr 20, 2025
@datapythonista
Copy link
Member

Sorry @kopytjuk nobody checked this earlier. While I'm ok to find a solution to the use case being addressed here (raise when the dict in map doesn't contain all the values in the series/dataframe), I'm -1 in the approach here.

The main reason is that I think we should get rid of the na_action parameter. Which I find cumbersome, and reading the issue descriptions confirms that not even very experienced pandas core devs understand what the parameter does without proper research.

Second, na_action refers to the action to so when missing values are found in the data. So, na_action="ignore" means that missing values should be ignored and not passed to the function. If things are logical and consistent, to me na_action="raise" would mean that if missing values are found in the original data, we would raise. Which is different to what's being implemented here. Adding even more confusion to an already confusing API.

I'm personally not a big fan of adding a new parameter to .map. In particular because .map is already two functions in one, mapping with a dictionary, and applying a function elementwise, and the new parameter would be tricky when considered for both cases. But still this would be a better option than using na_action.

I think the best solution here is that you have a validation after the map to ensure not missing values are there. I don't think there is a pure pandas one-liner to do it, and maybe adding it is not a bad idea. But I think there are third party libraries to ensure data consistency in pandas, maybe that's the way to go.

What do you think?

@kopytjuk
Copy link
Contributor Author

kopytjuk commented Apr 20, 2025

Thanks for your reply @datapythonista!

If things are logical and consistent, to me na_action="raise" would mean that if missing values are found in the original data, we would raise. Which is different to what's being implemented here.

I agree on that, the "na_action" is not a fitting argument name for what I implemented, since the argument relates to the original (series) data and not the keys of the mapping.

I think the best solution here is that you have a validation after the map to ensure not missing values are there. I don't think there is a pure pandas one-liner to do it, and maybe adding it is not a bad idea. But I think there are third party libraries to ensure data consistency in pandas, maybe that's the way to go.

I agree on that as well, libraries like pandera are often used in data engineering use-cases to ensure that the resulting dataframe (after a chain of operations) follows a certain schema (column names, datatypes and existence of NaNs).

In addition to your points you made: a novice user won't activate the proposed na_action="raise" argument anyway, which he does not understand (as you stated, the understanding of the mechanics na_action is not straighforward) ...

So in summary, after weeks of distance to my proposal, I agree - I am also "-1" on the na_action="raise" approach.

Do you think, raising a warning would be a good alternative approach? Especially in explorative contexts (in which a developer looks at the data in a notebook) such a warning would be helpful, to make sure that the user does not miss a value in the mapping. I think warnings help to develop a more stable code, because people start looking into their data more deeply and take a look at their code, especially if parts of it are LLM generated.

What do you guys think about the "warning-only-without-any-additional-arguments" approach in case the mapping does not cover the original data?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants