BUG: Series(category).replace to maintain order of categories #51057

Merged
merged 2 commits into from
Feb 1, 2023

Conversation

lukemanley
Member

No whatsnew entry, as this is a regression on main only.

lukemanley added the Bug and Categorical (Categorical Data Type) labels on Jan 29, 2023

# GH51016
dtype = pd.CategoricalDtype([0, 1, 2], ordered=True)
ser = pd.Series([0, 1, 2], dtype=dtype)
result = ser.replace(0, 2)
Member

Do we have a test where a missing value is replaced?

Member Author

I don't see any. Series(category).replace doesn't actually work for replacing missing values (see below). I checked 1.5 and it doesn't replace them there either.

In [13]: import pandas as pd

In [14]: import numpy as np

In [15]: dtype = pd.CategoricalDtype(["A", "B"])

In [16]: ser = pd.Series(["A", None], dtype=dtype)

In [17]: ser
Out[17]: 
0      A
1    NaN
dtype: category
Categories (2, object): ['A', 'B']

In [18]: ser.replace(np.nan, "C")           # <- this doesn't actually replace the missing value
Out[18]: 
0      A
1    NaN
dtype: category
Categories (2, object): ['A', 'B']
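
For anyone who needs the missing value filled in the meantime, one possible workaround is sketched below. It is not part of this PR, and it assumes the replacement value should become a new category: add the category first, then use fillna.

```python
import pandas as pd

dtype = pd.CategoricalDtype(["A", "B"])
ser = pd.Series(["A", None], dtype=dtype)

# Assumption: "C" should become a new category. fillna on a categorical
# only accepts values that are already categories, so add it first.
filled = ser.cat.add_categories("C").fillna("C")

print(filled.tolist())
print(list(filled.cat.categories))
```

The new category is appended after the existing ones, so the order of "A" and "B" is left unchanged.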

new_categories = Index(ser.drop_duplicates(keep="first"))

# GH51016: maintain order of existing categories
idxr = cat.categories.get_indexer_for(all_values)
Member

Do we need to do all this if ordered=False?

Member Author

1.5.x maintains the existing order of categories regardless of whether ordered is True or False, so I'd lean towards keeping that behavior. The cost of the new ordering logic is barely measurable relative to the whole replace operation.
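
To illustrate what "maintaining the existing order" can mean in practice, here is a minimal standalone sketch. It is not the code merged in this PR: `order_like_existing` is a hypothetical helper, and placing unknown values last is an assumption made for the example. It uses `Index.get_indexer_for` to look up each candidate value's position in the existing categories and sorts by that position.

```python
import numpy as np
import pandas as pd


def order_like_existing(old_categories: pd.Index, new_values: pd.Index) -> pd.Index:
    """Hypothetical helper: reorder new_values to follow old_categories.

    Values not found in old_categories are placed last (an assumption
    made for this sketch), keeping their relative order.
    """
    # Position of each new value in the old categories; -1 if absent.
    idxr = old_categories.get_indexer_for(new_values)
    # Push absent values past the end so they sort after known ones.
    locs = np.where(idxr == -1, len(old_categories), idxr)
    # A stable sort keeps the relative order of ties (the absent values).
    order = np.argsort(locs, kind="stable")
    return new_values.take(order)


old = pd.Index(["low", "mid", "high"])
new = pd.Index(["high", "extra", "low"])
ordered = order_like_existing(old, new)
print(list(ordered))
```

The lookup and the sort both operate on the categories rather than on the Series values, so the extra work scales with the number of categories, not with the length of the data.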

mroeschke added this to the 2.0 milestone on Feb 1, 2023
mroeschke merged commit 6478e70 into pandas-dev:main on Feb 1, 2023
@mroeschke
Member

Thanks @lukemanley

lukemanley deleted the categorical-replace-order branch on February 23, 2023 01:38

Successfully merging this pull request may close these issues.

BUG: replace is changing the order of ordered categories