DOC: DataFrame.replace() regex contract inconsistent with usage #55570

Closed
1 task done
schivmeister opened this issue Oct 18, 2023 · 2 comments · Fixed by #55762
Labels: Docs, Needs Triage

Comments

schivmeister commented Oct 18, 2023

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.replace.html

Documentation problem

From https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.replace.html I quote:

regex : bool or same types as to_replace, default False
Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.
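For reference, the two call forms that the quoted paragraph describes can be sketched as follows. This is a minimal illustration, not part of the original report: the DataFrame contents and the pattern are assumptions chosen only to make the result easy to check.

```python
import pandas as pd

# Illustrative data (an assumption, not taken from the issue).
df = pd.DataFrame({"A": ["bat", "foo", "bait"], "B": ["abc", "bar", "xyz"]})

# Form 1: regex=True, with the pattern passed as a string in to_replace.
out_flag = df.replace(to_replace=r"^ba.$", value="new", regex=True)

# Form 2: the pattern is passed in regex itself, and to_replace is left as None.
out_pattern = df.replace(regex=r"^ba.$", value="new")

print(out_flag)
print(out_pattern)
```

Both calls replace the three-letter values starting with "ba" ("bat" and "bar") and leave "bait" untouched, since the pattern is anchored at both ends.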

I want to bring your attention to the first condition:

If this is True then to_replace must be a string.

However, in every pandas version I've tried, this is not the case. The to_replace argument need not be a string when regex=True is supplied, and a nested dict passed as to_replace works as expected.

MWE:

import pandas as pd

# Create a sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['apple', 'banana', 'cherry', 'date', 'elderberry']
}
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Test regex replacements using a nested dictionary
nested_dict = {
    'C': {
        r'^[a-d]': 'fruit_',  # Replace a leading letter in the range 'a' to 'd' with 'fruit_'
        r'[^aeiou]+$': 'berry',  # Replace a trailing run of non-vowel characters with 'berry'
    }
}

# Apply regex replacements using the nested dictionary
df.replace(to_replace=nested_dict, regex=True, inplace=True)

# Display the modified DataFrame with regex replacements
print("\nModified DataFrame:")
print(df)
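A reduced variant of the MWE above, using a single pattern so the outcome can be verified by hand, might look like the sketch below. It reuses the same column data; the variable names `single_pattern` and `result` are introduced here for illustration, and the call is made without inplace=True so the original frame stays available for comparison.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [10, 20, 30, 40, 50],
    "C": ["apple", "banana", "cherry", "date", "elderberry"],
})

# Nested dict: outer key is the column label, inner key is the regex pattern.
single_pattern = {"C": {r"^[a-d]": "fruit_"}}

# to_replace is a dict (not a string) while regex=True.
result = df.replace(to_replace=single_pattern, regex=True)

print(result["C"].tolist())
```

Only the matched leading letter is substituted, so "apple" becomes "fruit_pple", while "elderberry" is unchanged because it does not start with a letter in the range 'a' to 'd'. The numeric columns are not targeted by the dict and remain as they were.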

Suggested fix for documentation

Update documentation or API behaviour.

schivmeister added the Docs and Needs Triage labels Oct 18, 2023

jpgianfaldoni commented Oct 25, 2023

@schivmeister I found this test, which passes regex=True and a nested dict just as you pointed out in your issue, so the problem seems to be in the documentation.

[image: screenshot of the test]

I'm not 100% sure yet, but maybe the documentation should read:

Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace keys must be a string.

Any thoughts on this theory?
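To make the proposed wording concrete, here is a hedged sketch of two dict-based calls that work with regex=True. The data and pattern are illustrative assumptions, not taken from the test mentioned above. In the nested form the inner keys are the regex patterns; in the column-to-pattern form the keys are column labels and the patterns are the dict values.

```python
import pandas as pd

# Illustrative data (an assumption, not taken from the thread).
df = pd.DataFrame({"A": ["bat", "foo", "bait"], "B": ["abc", "bar", "xyz"]})

# Nested dict: {column: {pattern: replacement}}; the inner keys are patterns.
nested = df.replace({"A": {r"^ba.$": "new"}}, regex=True)

# Column-to-pattern dict plus a matching value dict; the patterns are values.
by_column = df.replace({"A": r"^ba.$"}, {"A": "new"}, regex=True)

print(nested)
print(by_column)
```

In both calls only column A is searched, so "bar" in column B is left alone even though it matches the pattern.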

schivmeister commented

@jpgianfaldoni thanks for digging that out. I think you are on the right track about the wording.
