BUG/PERF: Series.replace with dtype="category" #49404

Merged: 4 commits, Jan 18, 2023

Conversation

lukemanley
Member

Refactor of Categorical._replace to fix a few bugs with Series(..., dtype="category").replace and improve performance.

BUG 1: overlap between to_replace and value:

pd.Series([1, 2, 3], dtype="category").replace({1: 2, 2: 3, 3: 4})


# main:

0    4
1    4
2    4
dtype: category
Categories (1, int64): [4]


# PR:

0    2
1    3
2    4
dtype: category
Categories (3, int64): [2, 3, 4]
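
The collapse to a single category on main is what one would expect if the pairs in the mapping were applied one after another, so that each replacement sees the output of the previous one. The sketch below is plain Python, not the pandas implementation. `replace_sequential` and `replace_simultaneous` are hypothetical helper names used only for illustration, and the assumption that main applied the pairs sequentially is inferred from the output above rather than stated in the PR.

```python
def replace_sequential(values, mapping):
    """Apply each old -> new pair in turn (assumed model of the buggy behaviour)."""
    out = list(values)
    for old, new in mapping.items():
        # Values produced by an earlier pair can be matched again by a later pair.
        out = [new if v == old else v for v in out]
    return out


def replace_simultaneous(values, mapping):
    """Look up every original value exactly once, so pairs cannot chain."""
    return [mapping.get(v, v) for v in values]


values = [1, 2, 3]
mapping = {1: 2, 2: 3, 3: 4}

print(replace_sequential(values, mapping))    # [4, 4, 4]
print(replace_simultaneous(values, mapping))  # [2, 3, 4]
```

The second helper matches the result shown under "# PR": each element is mapped from its original value only.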

BUG 2: losing nullable dtypes of underlying categories:

pd.Series(["a", "b"], dtype="string").astype("category").replace("b", "c")


# main:

0    a
1    c
dtype: category
Categories (2, object): ['a', 'c']


# PR:

0    a
1    c
dtype: category
Categories (2, string): [a, c]

Perf improvements:

import pandas as pd
import numpy as np

arr = np.repeat(np.arange(1000), 1000)
ser = pd.Series(arr, dtype="category")

%timeit ser.replace(np.arange(200), 5)

681 ms ± 9.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   <- main
11 ms ± 690 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR
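
The PR does not spell out where the speedup comes from. One plausible source, given that a categorical stores a small array of categories plus integer codes, is doing the replacement on the categories (1,000 entries in the benchmark) and then remapping the codes, instead of comparing against all 1,000,000 values. The sketch below is a minimal pure-Python model of that idea under this assumption. `replace_in_categories` is a hypothetical name, not a pandas function, and missing values are modelled with the code `-1`.

```python
def replace_in_categories(categories, codes, to_replace, value):
    """Replace on the category level, then remap the integer codes.

    categories: list of unique values
    codes: list of indices into categories, with -1 marking a missing value
    """
    to_replace = set(to_replace)
    new_categories = []
    position = {}   # new category -> its index in new_categories
    remap = []      # old code -> new code

    # One pass over the (small) list of categories.
    for cat in categories:
        target = value if cat in to_replace else cat
        if target not in position:
            position[target] = len(new_categories)
            new_categories.append(target)
        remap.append(position[target])

    # One cheap lookup per element; no value comparisons against to_replace.
    new_codes = [remap[c] if c >= 0 else -1 for c in codes]
    return new_categories, new_codes


cats, codes = replace_in_categories(
    categories=[0, 1, 2, 5, 7],
    codes=[0, 1, 2, 3, 4, 0, -1],
    to_replace=[0, 1],
    value=5,
)
print(cats)   # [5, 2, 7]
print(codes)  # [0, 0, 1, 0, 2, 0, -1]
```

Categories that become equal after replacement are merged, which is why the result has three categories rather than five.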
@lukemanley lukemanley added the Bug, Refactor (internal refactoring of code), Performance (memory or execution speed performance), and Categorical (Categorical Data Type) labels Oct 31, 2022
@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 14, 2022
@mroeschke mroeschke removed the Stale label Dec 19, 2022
Member

@mroeschke mroeschke left a comment

LGTM. Merge when ready, @jbrockmendel

@jbrockmendel
Member

Hopefully I'll get time to take a good look this week, but don't wait for me if this is a blocker.

@mroeschke mroeschke merged commit a063af0 into pandas-dev:main Jan 18, 2023
@mroeschke
Member

Thanks @lukemanley

@mroeschke mroeschke added this to the 2.0 milestone Jan 18, 2023
phofl added a commit that referenced this pull request Jan 18, 2023
@phofl
Member

phofl commented Jan 18, 2023

This broke the CI. Opened #50848 to revert if there is no quick win here.

phofl added a commit that referenced this pull request Jan 18, 2023
Revert "BUG/PERF: Series.replace with dtype="category" (#49404)"

This reverts commit a063af0.
pooja-subramaniam pushed a commit to pooja-subramaniam/pandas that referenced this pull request Jan 25, 2023
…0848)

Revert "BUG/PERF: Series.replace with dtype="category" (pandas-dev#49404)"

This reverts commit a063af0.
Labels
Bug, Categorical (Categorical Data Type), Performance (memory or execution speed performance), Refactor (internal refactoring of code)
4 participants