PERF: Avoid materializing values in Categorical.set_categories #17515

Merged: 1 commit, Sep 14, 2017

@TomAugspurger TomAugspurger commented Sep 13, 2017

Master:

In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10,
                                                      size=500000)];
s = pd.Series(arr).astype('category')

In [3]: %timeit s.cat.set_categories(s.cat.categories)
68.5 ms ± 846 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

HEAD:

In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10,
                                                      size=500000)]
s = pd.Series(arr).astype('category')

In [3]: %timeit s.cat.set_categories(s.cat.categories)
7.43 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Closes #17508

I'll rebase #16015 on top of this. Running an ASV now.
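
For anyone who wants to repeat the comparison outside IPython, here is a minimal sketch of the same measurement as a plain script. The helper names `make_series` and `time_set_categories`, the fixed seed, and the repeat counts are assumptions made for illustration; they are not part of this PR or of the ASV suite.

```python
import timeit

import numpy as np
import pandas as pd


def make_series(n=500000, seed=0):
    """Build a categorical Series of n strings drawn from about n // 10 labels."""
    rng = np.random.RandomState(seed)  # fixed seed is an assumption, for repeatability
    arr = ['s%04d' % i for i in rng.randint(0, n // 10, size=n)]
    return pd.Series(arr).astype('category')


def time_set_categories(s, number=10, repeat=3):
    """Return the best per-call time in seconds for a no-op set_categories."""
    timer = timeit.Timer(lambda: s.cat.set_categories(s.cat.categories))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == '__main__':
    s = make_series()
    print('%.2f ms per call' % (time_set_categories(s) * 1e3))
```

Running the script on master and on this branch should give numbers comparable to the `%timeit` results above, although absolute timings depend on the machine.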

@TomAugspurger added the Categorical and Performance labels Sep 13, 2017

@TomAugspurger (Author) commented:

Here's the ASV:

[100.00%] ··· Running categoricals.Categoricals2.time_set_categories    80.8±1ms
       before           after         ratio
     [f11bbf2f]       [6d72836e]
-        80.8±1ms       15.4±0.3ms     0.19  categoricals.Categoricals2.time_set_categories

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@TomAugspurger force-pushed the set_categories-perf branch 2 times, most recently from 17dd9f9 to f74da7e, September 13, 2017 19:46

pep8speaks commented Sep 13, 2017

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on September 14, 2017 at 10:55 Hours UTC

codecov bot commented Sep 13, 2017

Codecov Report

Merging #17515 into master will decrease coverage by 0.04%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17515      +/-   ##
==========================================
- Coverage   91.24%    91.2%   -0.05%     
==========================================
  Files         163      163              
  Lines       49582    49586       +4     
==========================================
- Hits        45242    45225      -17     
- Misses       4340     4361      +21

| Flag | Coverage Δ |
|---|---|
| #multiple | 88.99% <100%> (-0.03%) ⬇️ |
| #single | 40.2% <18.18%> (-0.06%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/dtypes/concat.py | 98.26% <100%> (-0.03%) ⬇️ |
| pandas/core/categorical.py | 95.57% <100%> (+0.04%) ⬆️ |
| pandas/io/gbq.py | 25% <0%> (-58.34%) ⬇️ |
| pandas/plotting/_converter.py | 63.23% <0%> (-1.82%) ⬇️ |
| pandas/core/frame.py | 97.77% <0%> (-0.1%) ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 97abd2c...d238e3e.

@jreback jreback added this to the 0.21.0 milestone Sep 13, 2017

Examples
--------
>>> old_cat = pd.Index(['b', 'a', 'c'])
Contributor:

isn't this just a special case of union_categoricals?

Contributor:

these both remap codes; seems like they should share code

Contributor Author:

I see. There was a little bit I could extract from union_categoricals. I also simplified things a bit by using take_1d. See my latest commit.

Contributor:

looks nice.
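
For readers following this thread, the idea of remapping codes without materializing the values can be sketched with public pandas and NumPy calls. This is an illustration only: `recode_codes` is a hypothetical name, the `new_cat` value is an assumed example, and plain NumPy indexing stands in for the internal `take_1d` mentioned above.

```python
import numpy as np
import pandas as pd


def recode_codes(codes, old_categories, new_categories):
    """Map codes that index old_categories to codes that index new_categories.

    Hypothetical helper for illustration; not the function added in this PR.
    """
    codes = np.asarray(codes)
    # Position of each old category within the new categories (-1 if absent).
    indexer = pd.Index(new_categories).get_indexer(pd.Index(old_categories))
    # Append -1 so that the missing-value code -1 looks up -1 again.
    lookup = np.append(indexer, -1)
    return lookup[codes]


old_cat = pd.Index(['b', 'a', 'c'])
new_cat = pd.Index(['a', 'b'])        # assumed example, not taken from the PR
codes = np.array([0, 1, 1, 2, -1])    # b, a, a, c, NaN
new_codes = recode_codes(codes, old_cat, new_cat)
print(new_codes)  # [ 1  0  0 -1 -1]
```

Only the small category indexes are compared here. The full array of values is never rebuilt, which is where the speedup in the benchmarks comes from.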

@jreback jreback merged commit 0097cb7 into pandas-dev:master Sep 14, 2017

jreback commented Sep 14, 2017

thanks @TomAugspurger

@TomAugspurger TomAugspurger deleted the set_categories-perf branch September 14, 2017 23:30
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017