PERF: Avoid materializing values in Categorical.set_categories #17515

TomAugspurger · 2017-09-13T18:42:48Z

Master:

In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10,
                                                      size=500000)]
s = pd.Series(arr).astype('category')

In [3]: %timeit s.cat.set_categories(s.cat.categories)
68.5 ms ± 846 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

HEAD:

In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10,
                                                      size=500000)]
s = pd.Series(arr).astype('category')

In [3]: %timeit s.cat.set_categories(s.cat.categories)
7.43 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Closes #17508

I'll rebase #16015 on top of this. Running an ASV now.

TomAugspurger · 2017-09-13T18:57:43Z

Here's the ASV:

[100.00%] ··· Running categoricals.Categoricals2.time_set_categories    80.8±1ms

       before           after         ratio
     [f11bbf2f]       [6d72836e]
-        80.8±1ms       15.4±0.3ms     0.19  categoricals.Categoricals2.time_set_categories

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

pep8speaks · 2017-09-13T19:46:32Z

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on September 14, 2017 at 10:55 Hours UTC

codecov · 2017-09-13T22:07:20Z

Codecov Report

Merging #17515 into master will decrease coverage by 0.04%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17515      +/-   ##
==========================================
- Coverage   91.24%    91.2%   -0.05%     
==========================================
  Files         163      163              
  Lines       49582    49586       +4     
==========================================
- Hits        45242    45225      -17     
- Misses       4340     4361      +21

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `88.99% <100%> (-0.03%)` | ⬇️ |
| #single | `40.2% <18.18%> (-0.06%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/dtypes/concat.py | `98.26% <100%> (-0.03%)` | ⬇️ |
| pandas/core/categorical.py | `95.57% <100%> (+0.04%)` | ⬆️ |
| pandas/io/gbq.py | `25% <0%> (-58.34%)` | ⬇️ |
| pandas/plotting/_converter.py | `63.23% <0%> (-1.82%)` | ⬇️ |
| pandas/core/frame.py | `97.77% <0%> (-0.1%)` | ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 97abd2c...d238e3e. Read the comment docs.

jreback · 2017-09-13T23:24:41Z

pandas/core/categorical.py

+
+    Examples
+    --------
+    >>> old_cat = pd.Index(['b', 'a', 'c'])


isn't this just a special case of union_categoricals?

These both remap codes; it seems like they should share an implementation.

I see. There was a little bit I could extract from `union_categoricals`. That also simplified things a bit by using `take_1d`. See my latest commit.

looks nice.

TomAugspurger · 2017-09-14T10:39:16Z

I don't think so. `union_categoricals` takes the union of the two categories. This replaces the old categories with the new.
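To make the distinction concrete, here is a minimal sketch using only the public API. The sample values are made up for illustration and are not from the PR.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Illustrative data (an assumption, not taken from the PR)
a = pd.Categorical(["b", "a", "b"], categories=["b", "a"])
b = pd.Categorical(["c", "a"], categories=["a", "c"])

# union_categoricals concatenates the values and takes the union
# of the categories of all inputs.
combined = union_categoricals([a, b])
union_cats = combined.categories.tolist()

# set_categories replaces the categories outright; values whose
# category is absent from the new set become missing (code -1).
replaced = a.set_categories(["a", "c"])
replaced_cats = replaced.categories.tolist()
replaced_codes = replaced.codes.tolist()

print(union_cats)
print(replaced_cats, replaced_codes)
```

In the first case no value is lost, because every old category survives in the union. In the second, `"b"` is dropped from the categories, so its rows are recoded to `-1`.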

…

On Wed, Sep 13, 2017 at 6:24 PM, Jeff Reback ***@***.***> wrote:

> In pandas/core/categorical.py:
>
> > +def _recode_for_categories(codes, old_categories, new_categories):
> > +    """
> > +    Convert a set of codes for to a new set of categories
> > +
> > +    Parameters
> > +    ----------
> > +    codes : array
> > +    old_categories, new_categories : Index
> > +
> > +    Returns
> > +    -------
> > +    new_codes : array
> > +
> > +    Examples
> > +    --------
> > +    >>> old_cat = pd.Index(['b', 'a', 'c'])
>
> isn't this just a special case of union_categoricals?
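The docstring quoted above describes recoding integer codes from one set of categories to another without rebuilding the underlying values. A rough sketch of that idea follows. The function name `recode_codes` is hypothetical, and this is not the PR's implementation (which, per the thread, uses `take_1d`); `new_cat` and `codes` are illustrative values that do not appear in the thread.

```python
import numpy as np
import pandas as pd


def recode_codes(codes, old_categories, new_categories):
    """Hypothetical helper: map codes for old_categories to new_categories."""
    codes = np.asarray(codes)
    # Position of each old category within the new categories,
    # or -1 when the old category is absent from the new set.
    indexer = new_categories.get_indexer(old_categories)
    # Append a trailing -1 so that a missing code (-1) indexes the
    # last slot and therefore stays -1.
    lookup = np.append(indexer, -1)
    return lookup[codes]


old_cat = pd.Index(['b', 'a', 'c'])
new_cat = pd.Index(['a', 'b'])   # illustrative, not from the thread
codes = np.array([0, 1, 1, 2])   # illustrative, not from the thread

new_codes = recode_codes(codes, old_cat, new_cat)
print(new_codes.tolist())  # [1, 0, 0, -1]
```

The work here scales with the number of categories plus the number of codes, and never touches the (possibly large, object-dtype) values, which is presumably where the speedup reported in the PR description comes from.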


jreback · 2017-09-14T16:09:34Z

thanks @TomAugspurger


TomAugspurger added the Categorical (Categorical Data Type) and Performance (Memory or execution speed performance) labels Sep 13, 2017

TomAugspurger force-pushed the set_categories-perf branch from d848c95 to 6d72836 September 13, 2017 18:47

TomAugspurger force-pushed the set_categories-perf branch 2 times, most recently from 17dd9f9 to f74da7e September 13, 2017 19:46

TomAugspurger force-pushed the set_categories-perf branch from f74da7e to 9b311f4 September 13, 2017 19:47

jreback added this to the 0.21.0 milestone Sep 13, 2017

jreback reviewed Sep 13, 2017


TomAugspurger force-pushed the set_categories-perf branch from 9b311f4 to d238e3e September 14, 2017 10:55

jreback merged commit 0097cb7 into pandas-dev:master Sep 14, 2017

TomAugspurger deleted the set_categories-perf branch September 14, 2017 23:30
