PERF: Avoid materializing values in Categorical.set_categories #17515
Here's the ASV:
Hello @TomAugspurger! Thanks for updating the PR. Cheers! There are no PEP8 issues in this pull request. 🍻 Comment last updated on September 14, 2017 at 10:55 UTC.
Codecov Report
@@            Coverage Diff             @@
##           master   #17515      +/-   ##
==========================================
- Coverage   91.24%    91.2%   -0.05%
==========================================
  Files         163      163
  Lines       49582    49586       +4
==========================================
- Hits        45242    45225      -17
- Misses       4340     4361      +21
Examples
--------
>>> old_cat = pd.Index(['b', 'a', 'c'])
Isn't this just a special case of `union_categoricals`?
These both remap codes; it seems like they should share code.
I see. There was a little bit I could extract from `union_categoricals`. I also simplified things a bit by using `take_1d`. See my latest commit.
looks nice.
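For readers following the thread, here is a minimal sketch of the code-remapping idea discussed above. It is not the code from this PR: `recode_for_categories` is a hypothetical stand-in for the private `_recode_for_categories`, and it uses the public `Index.get_indexer` plus plain NumPy indexing in place of the internal `take_1d`. The example inputs beyond `old_cat` (the new categories and the codes) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def recode_for_categories(codes, old_categories, new_categories):
    """Map integer codes from old_categories onto new_categories.

    Hypothetical helper for illustration; not the implementation in this PR.
    """
    codes = np.asarray(codes)
    if len(old_categories) == 0:
        # With no old categories every code is -1 (missing); nothing to remap.
        return codes.copy()
    # Position of each old category within the new categories (-1 if dropped).
    indexer = new_categories.get_indexer(old_categories)
    # Look up every code in the indexer, keeping missing values (-1) as -1.
    return np.where(codes == -1, -1, indexer[codes])


old_cat = pd.Index(['b', 'a', 'c'])
new_cat = pd.Index(['a', 'b'])      # assumed new categories
codes = np.array([0, 1, 1, 2])      # assumed codes into old_cat
new_codes = recode_for_categories(codes, old_cat, new_cat)
```

The only lookup against the category values happens once per category in `get_indexer`; the per-element work is an integer take over `codes`, so the values themselves are never materialized.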
I don't think so. `union_categoricals` takes the union of the two
categories. This replaces the old categories with the new.
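To make the distinction concrete, a small sketch using only public pandas APIs (`pandas.api.types.union_categoricals` and `Categorical.set_categories`). The variable names and sample values are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(['b', 'a', 'c'], categories=['b', 'a', 'c'])
b = pd.Categorical(['a', 'b'], categories=['a', 'b'])

# union_categoricals concatenates the values and unions the categories.
unioned = union_categoricals([a, b])

# set_categories replaces the categories; values whose category is
# dropped ('c' here) become missing, i.e. their code becomes -1.
replaced = a.set_categories(['a', 'b'])
```

Both operations remap codes, which is why some of the remapping logic can be shared even though the resulting categories differ.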
…On Wed, Sep 13, 2017 at 6:24 PM, Jeff Reback ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/core/categorical.py
<#17515 (comment)>:
> +def _recode_for_categories(codes, old_categories, new_categories):
+ """
+ Convert a set of codes to a new set of categories
+
+ Parameters
+ ----------
+ codes : array
+ old_categories, new_categories : Index
+
+ Returns
+ -------
+ new_codes : array
+
+ Examples
+ --------
+ >>> old_cat = pd.Index(['b', 'a', 'c'])
isn't this just a special case of union_categoricals?
Master:

```python
In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10, size=500000)]; s = pd.Series(arr).astype('category')

In [3]: %timeit s.cat.set_categories(s.cat.categories)
68.5 ms ± 846 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

HEAD:

```python
In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10, size=500000)]; s = pd.Series(arr).astype('category')

In [3]: %timeit s.cat.set_categories(s.cat.categories)
7.43 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Closes pandas-dev#17508
Thanks @TomAugspurger
I'll rebase #16015 on top of this. Running an ASV now.
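For reference, the timing from the PR description could be expressed as an asv benchmark roughly like this. This is only a sketch: the class and method names are hypothetical and it is not necessarily the benchmark that was run for this PR. It relies on the asv convention that `setup` runs before timing and that methods prefixed with `time_` are timed.

```python
import numpy as np
import pandas as pd


class SetCategoriesSameCategories:
    # Hypothetical benchmark class; the name is not taken from the pandas
    # asv suite.

    def setup(self):
        # Same data as in the PR description: 500,000 strings drawn from
        # 50,000 distinct values, stored as a categorical Series.
        n = 500000
        arr = ['s%04d' % i for i in np.random.randint(0, n // 10, size=n)]
        self.s = pd.Series(arr).astype('category')

    def time_set_categories(self):
        # Setting the categories to the existing categories is the case
        # this PR speeds up.
        self.s.cat.set_categories(self.s.cat.categories)
```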