REF: avoid object-casting in _get_codes_for_values #45117

Merged
jreback merged 2 commits into pandas-dev:master from ref-ensure_data on Dec 30, 2021

Conversation

jbrockmendel
Member

Index.get_indexer_for ends up going through the same hashtable-based lookup anyway. We add a little constructor overhead, but avoid object-dtype casting in cases where the common dtype is, e.g., numeric.
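
For readers unfamiliar with the lookup being described, here is a minimal sketch of how Index.get_indexer_for maps values to category positions. It only uses public API and is not the actual body of _get_codes_for_values; the names `categories`, `values`, and `codes` are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the PR): numeric categories and numeric values.
categories = pd.Index([10, 20, 30])
values = np.array([20, 10, 40, 30, 20])  # 40 is not among the categories

# get_indexer_for returns the position of each value within `categories`,
# or -1 where the value is absent. Both sides are integer dtype here,
# so no cast to object dtype is involved.
codes = categories.get_indexer_for(values)
print(codes)

# The public constructor should produce the same codes, with -1 marking
# values that fall outside the given categories.
cat = pd.Categorical(values, categories=categories)
print(cat.codes)
```

The -1 entries correspond to values that end up as missing in the resulting Categorical.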

@jreback
Contributor

jreback commented Dec 29, 2021

wow, how's performance?

@jreback jreback added the Categorical (Categorical Data Type) and Refactor (Internal refactoring of code) labels Dec 30, 2021
@jbrockmendel
Member Author

> how's performance?

Looks like a small improvement; I expect we could come up with cases that would go the other way:

import numpy as np
import pandas as pd

cats = np.arange(100)
values = np.arange(200).repeat(1000)

%timeit pd.Categorical(values, categories=cats)
911 µs ± 99.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <- master
880 µs ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <- PR

%timeit pd.Categorical(values, categories=["a", "b", "c"])
9.95 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- master
8.8 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    # <- PR

objvals = np.repeat(np.array(["a", "b", "c", "d"]), 1000)

%timeit pd.Categorical(objvals, categories=["a", "b", "c"])
478 µs ± 9.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <- master
443 µs ± 4.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <- PR
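
The %timeit lines above need IPython. As a rough sketch, the same three cases could be timed from a plain script with the standard-library timeit module; the helper `bench` and its `number`/`repeat` defaults are assumptions made for this example, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def bench(values, categories, number=20, repeat=3):
    """Return the best per-call time, in seconds, for building a Categorical."""
    timer = timeit.Timer(lambda: pd.Categorical(values, categories=categories))
    # Take the minimum over repeats to reduce noise from other processes.
    return min(timer.repeat(repeat=repeat, number=number)) / number


# Same inputs as the benchmarks above.
int_values = np.arange(200).repeat(1000)
str_values = np.repeat(np.array(["a", "b", "c", "d"]), 1000)

results = {
    "int values, int categories": bench(int_values, np.arange(100)),
    "int values, str categories": bench(int_values, ["a", "b", "c"]),
    "str values, str categories": bench(str_values, ["a", "b", "c"]),
}

for label, seconds in results.items():
    print(f"{label}: {seconds * 1e6:.0f} us per call")
```

Running the script once on each branch gives a side-by-side comparison similar to the master/PR pairs shown above.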

@jbrockmendel
Member Author

CI failure is unrelated

@jreback jreback added this to the 1.4 milestone Dec 30, 2021
@jreback jreback merged commit d2a9a22 into pandas-dev:master Dec 30, 2021
@jbrockmendel jbrockmendel deleted the ref-ensure_data branch December 30, 2021 15:54