PERF/REGR: isin slowdown for masked type #42714

Merged 4 commits into pandas-dev:master on Jul 26, 2021

mzeitlin11 (Member) commented Jul 25, 2021

xref #42708

  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

Fixes isin slowdowns mentioned in https://simonjayhawkins.github.io/fantastic-dollop/#regressions?sort=3&dir=desc&branch=1.3.x

Benchmarks (columns: before, after, ratio, benchmark):

-         550±7ms         495±10ms     0.90  algos.isin.IsInLongSeriesValuesDominate.time_isin('Float64', 'random')
-         514±8ms          449±6ms     0.87  algos.isin.IsInLongSeriesValuesDominate.time_isin('Float64', 'monotone')
-     2.22±0.07ms      1.68±0.05ms     0.76  algos.isin.IsInFloat64.time_isin('Float64', 'many_different_values')
-         208±9ms          147±8ms     0.71  algos.isin.IsInLongSeriesValuesDominate.time_isin('Int64', 'monotone')
-        395±20μs          279±4μs     0.71  algos.isin.IsIn.time_isin_categorical('boolean')
-        304±10μs          200±2μs     0.66  algos.isin.IsIn.time_isin_empty('boolean')
-        341±10μs         218±10μs     0.64  algos.isin.IsIn.time_isin('Int64')
-        339±10μs          212±3μs     0.63  algos.isin.IsIn.time_isin_categorical('Int64')
-        258±10μs          147±3μs     0.57  algos.isin.IsIn.time_isin_empty('Int64')
-         221±5μs          104±3μs     0.47  algos.isin.IsIn.time_isin('boolean')
-         155±5μs         63.4±1μs     0.41  algos.isin.IsIn.time_isin_mismatched_dtype('boolean')
-         166±7μs         63.4±1μs     0.38  algos.isin.IsIn.time_isin_mismatched_dtype('Int64')
-         499±6ms          176±2ms     0.35  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Float64', 1000, 'random_misses')
-         500±8ms          173±2ms     0.35  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Float64', 1000, 'monotone_misses')
-         125±1ms       33.2±0.8ms     0.27  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Int64', 5, 'random_hits')
-         437±6ms          116±1ms     0.26  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Float64', 1000, 'random_hits')
-         125±1ms       32.9±0.8ms     0.26  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Int64', 5, 'random_misses')
-       131±0.9ms         33.3±2ms     0.25  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Int64', 5, 'monotone_hits')
-        443±10ms        109±0.6ms     0.25  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Int64', 1000, 'monotone_misses')
-        417±10ms       81.0±0.6ms     0.19  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Float64', 1000, 'monotone_hits')
-         298±8ms       41.2±0.1ms     0.14  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Int64', 1000, 'monotone_hits')
-        343±10ms       41.0±0.3ms     0.12  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Int64', 1000, 'random_hits')
-         370±7ms       39.3±0.2ms     0.11  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Int64', 1000, 'random_misses')
-        371±10ms       36.6±0.3ms     0.10  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Int64', 5, 'monotone_misses')
-        350±10ms       33.8±0.5ms     0.10  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Float64', 5, 'monotone_misses')
-        351±10ms       32.7±0.7ms     0.09  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Float64', 5, 'monotone_hits')
-        347±10ms       32.0±0.8ms     0.09  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Float64', 5, 'random_misses')
-        357±10ms       31.7±0.5ms     0.09  algos.isin.IsInLongSeriesLookUpDominates.time_isin('Float64', 5, 'random_hits')
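As a quick sanity check on how to read the table, the ratio column appears to be `after / before`, so a ratio of 0.09 corresponds to roughly an 11x speedup. The sketch below recomputes the ratio for a few rows copied from the output above; the `ratio` helper and the `rows` dictionary are illustrative names, not part of asv or pandas.

```python
# Rows copied from the benchmark output above: (before, after, reported ratio).
# Units are the same within each row (μs or ms), so they cancel in the ratio.
rows = {
    "IsIn.time_isin('Int64')": (341.0, 218.0, 0.64),
    "IsIn.time_isin('boolean')": (221.0, 104.0, 0.47),
    "LookUpDominates('Int64', 1000, 'monotone_hits')": (298.0, 41.2, 0.14),
    "LookUpDominates('Float64', 5, 'random_hits')": (357.0, 31.7, 0.09),
}


def ratio(before, after):
    """Ratio as read from the table: after / before, rounded to two decimals."""
    return round(after / before, 2)


for name, (before, after, reported) in rows.items():
    speedup = before / after
    print(f"{name}: ratio={ratio(before, after)} "
          f"(reported {reported}), speedup={speedup:.1f}x")
```

Values below 1.0 mean the benchmark got faster on this branch.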

I'm not sure whether we do this anywhere else, but based on this slowdown it seems that calling empty_like/zeros_like on an extension array should be avoided: the call has extra overhead from dispatching, and it also triggers an astype somewhere, which accounts for most of the performance difference.

As a side note, it seems that empty_like/zeros_like should be avoided in general in favor of empty/zeros; see https://stackoverflow.com/questions/27464039/why-the-performance-difference-between-numpy-zeros-and-numpy-zeros-like

That said, switching zeros_like -> zeros makes a much smaller difference than switching zeros_like(self) -> zeros_like(self._data).
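A minimal sketch of the three allocation variants discussed above, assuming a nullable `Int64` array. It is not the code changed in this PR: `_data` is a private pandas attribute used here only for illustration, and `time_call` is a hypothetical helper. All three calls produce the same all-False boolean mask; only the cost should differ, and the timings depend on the machine and on the pandas and NumPy versions.

```python
import timeit

import numpy as np
import pandas as pd

# A nullable integer array with some missing values (illustrative data).
arr = pd.array([1, 2, None, 4] * 25_000, dtype="Int64")

# Variant 1: pass the extension array itself. NumPy has to convert it to an
# ndarray first just to learn its shape, which is where the extra cost comes from.
mask_from_ea = np.zeros_like(arr, dtype=bool)

# Variant 2: pass the underlying ndarray buffer (private attribute, assumption).
mask_from_data = np.zeros_like(arr._data, dtype=bool)

# Variant 3: allocate directly from the shape.
mask_direct = np.zeros(arr.shape, dtype=bool)


def time_call(func, number=20):
    """Average wall-clock seconds per call over `number` runs."""
    return timeit.timeit(func, number=number) / number


timings = {
    "zeros_like(arr)": time_call(lambda: np.zeros_like(arr, dtype=bool)),
    "zeros_like(arr._data)": time_call(lambda: np.zeros_like(arr._data, dtype=bool)),
    "zeros(arr.shape)": time_call(lambda: np.zeros(arr.shape, dtype=bool)),
}
for label, seconds in timings.items():
    print(f"{label}: {seconds * 1e6:.1f} μs per call")
```

If the observation in this comment holds, the first variant should be noticeably slower than the other two, while the last two should be close to each other.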

@mzeitlin11 added labels on Jul 25, 2021:
  • isin (isin method)
  • NA - MaskedArrays (Related to pd.NA and nullable extension arrays)
  • Performance (Memory or execution speed performance)
  • Regression (Functionality that used to work in a prior pandas version)
@mzeitlin11 added this to the 1.3.2 milestone Jul 25, 2021
@@ -14,6 +14,7 @@ including other versions of pandas.

Fixed regressions
~~~~~~~~~~~~~~~~~
- Performance regression in :meth:`DataFrame.isin` and :meth:`Series.isin` for nullable data types (:issue:`42708`)
Contributor

Can you change this to this PR's number?

Member Author

changed

@jreback merged commit 9731fd0 into pandas-dev:master Jul 26, 2021
jreback (Contributor) commented Jul 26, 2021

thanks @mzeitlin11

jreback (Contributor) commented Jul 26, 2021

@meeseeksdev backport 1.3.x

lumberbot-app (bot) commented Jul 26, 2021

Something went wrong ... Please have a look at my logs.

@mzeitlin11 deleted the regr_isin_masked branch July 26, 2021 14:35
simonjayhawkins pushed a commit that referenced this pull request Jul 26, 2021
debnathshoham pushed a commit to debnathshoham/pandas that referenced this pull request Jul 27, 2021
CGe0516 pushed a commit to CGe0516/pandas that referenced this pull request Jul 29, 2021
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021