PERF: Faster CategoricalIndex from categorical #17513

TomAugspurger · 2017-09-13T16:55:24Z

Master:

In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10, size=500000)]; s = pd.Series(arr).astype('category')

In [3]: %timeit pd.CategoricalIndex(s)
69.4 ms ± 946 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

HEAD:

In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10, size=500000)]; s = pd.Series(arr).astype('category')

In [3]: %timeit pd.CategoricalIndex(s)
9.06 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Is that a 7000x speedup, or did I mess up the math?
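
A quick check of the arithmetic, using only the two mean timings reported above (69.4 ms on master, 9.06 µs on HEAD). This is just the ratio of the means and ignores the reported standard deviations; the variable names are illustrative.

```python
# Mean timings copied from the %timeit output above, converted to seconds.
master_s = 69.4e-3  # 69.4 ms on master
head_s = 9.06e-6    # 9.06 µs on HEAD

speedup = master_s / head_s
print(f"speedup: about {speedup:,.0f}x")
```

The ratio comes out a little above 7,600, so "7000x" is the right order of magnitude.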

codecov · 2017-09-13T19:25:29Z

Codecov Report

Merging #17513 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17513      +/-   ##
==========================================
- Coverage   91.22%    91.2%   -0.02%     
==========================================
  Files         163      163              
  Lines       49586    49588       +2     
==========================================
- Hits        45233    45227       -6     
- Misses       4353     4361       +8

Flag	Coverage Δ
#multiple	`88.99% <100%> (ø)`	⬆️
#single	`40.2% <50%> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/category.py	`98.55% <100%> (ø)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.77% <0%> (-0.1%)`	⬇️
pandas/core/indexes/datetimes.py	`95.53% <0%> (+0.09%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0097cb7...7cad4f8.

TomAugspurger · 2017-09-13T19:52:11Z

pandas/core/indexes/category.py

@@ -130,6 +130,9 @@ def _create_categorical(self, data, categories=None, ordered=None):
        -------
        Categorical
        """
+        if isinstance(data, ABCSeries) and is_categorical_dtype(data):


The isinstance(data, ABCSeries) check is maybe a bit too restrictive. It rules out, e.g., CategoricalIndex. But at the moment the CategoricalIndex constructor takes a different (faster) path for that input, so it wouldn't benefit from this change anyway.

hmm, does it still pass if you add ABCCategoricalIndex here as well? it is more consistent

How about isinstance(data, (ABCSeries, type(self)))? That'll catch Series(dtype='category') and CategoricalIndex, but not Categorical.
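
To make it concrete which inputs that check would match, here is a small sketch. It assumes the public classes pd.Series and pd.CategoricalIndex as stand-ins for the internal ABCSeries and type(self), and uses an isinstance test against pd.CategoricalDtype in place of is_categorical_dtype, so that it runs outside the pandas codebase. The function name takes_fast_path is hypothetical and is not part of the diff.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "a"])
inputs = {
    "Series(dtype='category')": pd.Series(cat),
    "CategoricalIndex": pd.CategoricalIndex(cat),
    "Categorical": cat,
    "Series of plain strings": pd.Series(["a", "b", "a"]),
}


def takes_fast_path(data):
    # Hypothetical stand-in for:
    #   isinstance(data, (ABCSeries, type(self))) and is_categorical_dtype(data)
    is_container = isinstance(data, (pd.Series, pd.CategoricalIndex))
    has_cat_dtype = isinstance(getattr(data, "dtype", None), pd.CategoricalDtype)
    return is_container and has_cat_dtype


matches = {name: takes_fast_path(obj) for name, obj in inputs.items()}
for name, matched in matches.items():
    print(f"{name:26s} {matched}")
```

Only the categorical Series and the CategoricalIndex match; a bare Categorical and a non-categorical Series do not.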

how about (is_categorical_dtype and not ABCCategorical)?

maybe add a helper method, _ensure_categorical, which passes a Categorical through, does .values on a CategoricalIndex or Series, and raises otherwise?
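
A minimal sketch of what such a helper could look like, following the three behaviours listed in the suggestion. This is not code from the PR (the helper was not added here): the name ensure_categorical, the choice of TypeError, and the use of the public pd.Series / pd.CategoricalIndex / pd.CategoricalDtype classes instead of pandas' internal ABC types are all assumptions made for illustration.

```python
import pandas as pd


def ensure_categorical(data):
    """Hypothetical helper: return a Categorical for categorical input, else raise."""
    if isinstance(data, pd.Categorical):
        # Already a Categorical: pass it through.
        return data
    has_cat_dtype = isinstance(getattr(data, "dtype", None), pd.CategoricalDtype)
    if has_cat_dtype and isinstance(data, (pd.Series, pd.CategoricalIndex)):
        # Categorical-dtype Series or CategoricalIndex: unwrap to the Categorical.
        return data.values
    # Anything else is rejected (the exception type is an assumption).
    raise TypeError(f"expected categorical data, got {type(data).__name__}")


cat = pd.Categorical(["a", "b", "a"])
from_series = ensure_categorical(pd.Series(cat))
from_index = ensure_categorical(pd.CategoricalIndex(cat))
passed_through = ensure_categorical(cat)
```

The condition in the second branch is the "(is_categorical_dtype and not ABCCategorical)" idea from the comment above, spelled with public classes.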

> maybe add a helper method

I'd prefer to wait till we need that. I haven't come across any other cases where I need to treat CategoricalIndex and Series(dtype=category) differently than Categorical.

topper-123 · 2017-09-13T21:26:33Z

I am able to replicate this result (159 ms -> 9.88 µs).
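
For anyone who wants to reproduce the measurement outside IPython, the benchmark from the PR description can be sketched as a plain script with the standard-library timeit module. The data setup is copied from the description; the repeat count of 20 and the variable names are arbitrary choices, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same setup as in the PR description: 500,000 strings drawn from 50,000 labels.
n = 500_000
arr = ["s%04d" % i for i in np.random.randint(0, n // 10, size=n)]
s = pd.Series(arr).astype("category")

# Time the constructor call that the PR targets.
number = 20
total_s = timeit.timeit(lambda: pd.CategoricalIndex(s), number=number)
per_call_us = total_s / number * 1e6
print(f"{per_call_us:.1f} µs per call")
```

Running the same script on a build before and after the change gives the two numbers to compare.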

jreback · 2017-09-14T22:54:13Z

lgtm. merge on green.

TomAugspurger · 2017-09-14T23:29:36Z

Unrelated test failure https://travis-ci.org/pandas-dev/pandas/jobs/275540546#L1608 (already an issue at #17510)

TomAugspurger added the Categorical and Performance labels Sep 13, 2017

PERF: Faster CategoricalIndex from categorical

7cad4f8

TomAugspurger force-pushed the ci-ctor-from-categorical branch from d77b095 to 7cad4f8 Compare September 14, 2017 16:26

jorisvandenbossche approved these changes Sep 14, 2017

jreback added this to the 0.21.0 milestone Sep 14, 2017

TomAugspurger merged commit 94266d4 into pandas-dev:master Sep 14, 2017

TomAugspurger deleted the ci-ctor-from-categorical branch September 14, 2017 23:29

alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017

PERF: Faster CategoricalIndex from categorical (pandas-dev#17513)

28d4744

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017

PERF: Faster CategoricalIndex from categorical (pandas-dev#17513)

c5c929a
