PERF: improve perf. of Categorical.searchsorted #28795

Merged (3 commits) on Oct 6, 2019

Conversation

topper-123
Contributor

@topper-123 topper-123 commented Oct 4, 2019

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Improves performance of Categorical.searchsorted by avoiding expensive data conversions.

>>> n = 100_000
>>> c = pd.Categorical(['a'] * n + ['b'] * n + ['c'] * n)
>>> %timeit c.searchsorted('b')
259 µs ± 2.95 µs per loop  # master
5.5 µs ± 165 ns per loop  # this PR
>>> %timeit c.searchsorted(['b', 'c'])
240 µs ± 4.24 µs per loop  # master
9.9 µs ± 166 ns per loop  # this PR

Also, CategoricalIndex.searchsorted now calls self.values.searchsorted directly instead of going through algorithms.searchsorted, which always ended up calling self.values.searchsorted anyway. This brings the timing down to 5.5 µs from 12 µs.
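
To illustrate the general idea of searching without converting the data, here is a minimal standalone sketch. It is not the pandas implementation: `searchsorted_codes` is a hypothetical helper, and the plain lists `categories` and `codes` only stand in for a categorical's labels and its integer codes. The point is that the searched value is translated to its integer code once, and the binary search then runs on the sorted codes, rather than converting every element back to a label.

```python
from bisect import bisect_left, bisect_right


def searchsorted_codes(categories, codes, value, side="left"):
    """Hypothetical helper: find insertion points in sorted integer codes.

    `categories` lists the labels in order; `codes` holds, for each element,
    the position of its label in `categories` and is assumed to be sorted.
    """
    # Map each label to its integer code (cheap: one entry per category).
    code_of = {cat: i for i, cat in enumerate(categories)}
    search = bisect_left if side == "left" else bisect_right

    if isinstance(value, (list, tuple)):
        # List-like input: one binary search per requested label.
        return [search(codes, code_of[v]) for v in value]
    # Scalar input: a single lookup and a single binary search.
    return search(codes, code_of[value])


categories = ["a", "b", "c"]
n = 4
codes = [0] * n + [1] * n + [2] * n  # stands in for ['a'] * n + ['b'] * n + ['c'] * n

left = searchsorted_codes(categories, codes, "b")
right = searchsorted_codes(categories, codes, "b", side="right")
many = searchsorted_codes(categories, codes, ["b", "c"])
print(left, right, many)  # 4 8 [4, 8]
```

The work per lookup depends on the number of categories and the logarithm of the number of elements, not on converting all elements, which is consistent with the kind of speedup reported above.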

@topper-123 topper-123 added the Performance (memory or execution speed performance) and Categorical (Categorical Data Type) labels Oct 4, 2019
@@ -162,6 +162,7 @@ Performance improvements
- Performance improvement in :meth:`DataFrame.corr` when ``method`` is ``"spearman"`` (:issue:`28139`)
- Performance improvement in :meth:`DataFrame.replace` when provided a list of values to replace (:issue:`28099`)
- Performance improvement in :meth:`DataFrame.select_dtypes` by using vectorization instead of iterating over a loop (:issue:`28317`)
- Performance improvement in :meth:`Categorical.searchsorted` and :meth:`CategoricalIndex.searchsorted` when searching for a single scalar value (:issue:`XXXXX`)
Member

Just reference the PR as the issue

Contributor Author

Yeah, fixed.

@topper-123 topper-123 force-pushed the Categorical.searchsorted_II branch from 0f46d60 to 27bd6f7 Compare October 5, 2019 08:58
@jreback jreback added this to the 1.0 milestone Oct 5, 2019
Contributor

@jreback jreback left a comment

lgtm, small comment, ping on green.


codes = codes[0] if is_scalar(value) else codes

if is_scalar(value):
Contributor

lgtm, i would add a comment here that this is perf sensitive

@topper-123
Contributor Author

Comments addressed.

@topper-123 topper-123 changed the title from "PERF: improve perf. of Categorical.searchesorted" to "PERF: improve perf. of Categorical.searchsorted" Oct 6, 2019
@jreback jreback merged commit 66918d0 into pandas-dev:master Oct 6, 2019
@jreback
Contributor

jreback commented Oct 6, 2019

thanks @topper-123

@topper-123 topper-123 deleted the Categorical.searchsorted_II branch October 6, 2019 22:33
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020