BUG: RecursionError with CategoricalIndex.get_indexer #42089

Merged (1 commit) on Jun 17, 2021

jbrockmendel
Member

@jreback added the Categorical (Categorical Data Type) label Jun 17, 2021
@jreback added this to the 1.3 milestone Jun 17, 2021
@jreback
Contributor

jreback commented Jun 17, 2021

cool. merge & backport on green.

@jreback merged commit 1a3daf4 into pandas-dev:master Jun 17, 2021
@jreback
Contributor

jreback commented Jun 17, 2021

@meeseeksdev backport 1.3.x

@lumberbot-app

lumberbot-app bot commented Jun 17, 2021

Something went wrong ... Please have a look at my logs.

@jbrockmendel deleted the bug-cat-get_idnexer branch June 17, 2021 21:58
jreback pushed a commit that referenced this pull request Jun 17, 2021
@jorisvandenbossche
Member

This fix seems to have caused a large slowdown on the get_indexer benchmarks: https://pandas.pydata.org/speed/pandas/#indexing.CategoricalIndexIndexing.time_get_indexer_list?python=3.8&Cython=0.29.21&p-index='monotonic_incr'&commits=cf5852bf-fce7f9eb

The regression overview (https://pandas.pydata.org/speed/pandas/#regressions?sort=1&dir=desc) lists it as a 1000x slowdown, but that's only because #42042 first improved the performance a lot (which might be a bit suspicious?). Compared to the timing before that, it's only a 4-5x slowdown. With the code below, I see a roughly 8x slowdown locally on master compared to 1.2.5.

import itertools
import string

import pandas as pd

# Unique CategoricalIndex over every 3-character permutation of string.printable
data_unique = pd.CategoricalIndex(
    ["".join(perm) for perm in itertools.permutations(string.printable, 3)]
)
cat_list = ["a", "c"]

%timeit data_unique.get_indexer(cat_list)
52.8 ms ± 5.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <-- pandas 1.2.5
417 ms ± 22.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <-- master

I think this is because we previously called Engine.get_indexer on the codes, whereas the base class version now does it on the .categories. In this case that means both self and target are cast to object dtype, so the lookup goes through the Engine.get_indexer for object dtype.
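
To make the two lookup paths concrete, here is a rough sketch using only public pandas API. It is not the internal implementation on either side of this PR. The names `via_codes` and `via_objects` and the tiny four-element index are assumptions chosen for illustration. The first path translates the target labels to integer codes and matches integers against integers; the second casts the index to object dtype and matches on the labels themselves.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the benchmark index (illustrative values, not the benchmark data).
ci = pd.CategoricalIndex(["a", "b", "c", "d"])
target = ["a", "c", "z"]  # "z" is not a category

# Path 1 (code-based, sketch): map target labels to category codes,
# then match those integer codes against the index's own integer codes.
target_codes = ci.categories.get_indexer(target)  # -1 for labels that are not categories
via_codes = pd.Index(ci.codes).get_indexer(target_codes)
# Labels that are not categories must stay "not found", even if the index held NaN (code -1).
via_codes = np.where(target_codes == -1, -1, via_codes)

# Path 2 (object-based, sketch): cast the index to object dtype and match on labels.
via_objects = ci.astype(object).get_indexer(target)

# Both paths and the public method agree: [0, 2, -1]
print(via_codes.tolist(), via_objects.tolist(), ci.get_indexer(target).tolist())
```

Both paths return the same positions. The difference that matters for the benchmark is what gets hashed: small integers in the first path, Python string objects in the second.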

@jorisvandenbossche
Member

I opened #42249 to keep track of this.

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Labels
Categorical Categorical Data Type
Successfully merging this pull request may close these issues.

CI: categorical failures
3 participants