Skip to content

PERF: Fixes performance regression in DataFrame[bool_indexer] (#33924) #34199

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
Jun 1, 2020

Conversation

mproszewska
Copy link
Contributor

Insted of calling construction.array in check_array_indexer, creates array with dtype=bool before calling check_array_indexer.

Fixes performance regression after commit b9bcdc3.

setup = """
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100000, 5))
bool_indexer = [True] * 50000 + [False] * 50000
"""
import timeit
timeit.timeit("df[bool_indexer]",setup=setup, number=1000)

# master
# 27.29123323399108
# now
# 8.814320757053792

@pep8speaks
Copy link

pep8speaks commented May 15, 2020

Hello @mproszewska! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-31 23:38:07 UTC

result = np.asarray(result, dtype=bool)
result = check_array_indexer(index, result)
else:
if not is_array_like(result):
# GH 33924
result = pd_array(result, dtype=bool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just np.array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because of nullable input

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the input is nullable, wouldn't not is_array_like(result) be False?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that this array/list can have nullable elements.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an asv that hits this case & a whatsnew note

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels May 18, 2020
@mproszewska
Copy link
Contributor Author

indexing.DataFrameNumericIndexing.time_bool_indexer looks smiliar to the case that linked issue describes, so I'm not sure if additional asv is needed

@jreback jreback added this to the 1.1 milestone May 25, 2020
@jreback
Copy link
Contributor

jreback commented May 25, 2020

can you run and post the indexing the asvs

@mproszewska
Copy link
Contributor Author

[ 50.00%] · For pandas commit e0a34f8f <perf-bool> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 55.00%] ··· indexing.DataFrameNumericIndexing.time_bool_indexer                                                                                    1.29±0.3ms
[ 60.00%] ··· indexing.DataFrameNumericIndexing.time_iloc                                                                                              147±20μs
[ 65.00%] ··· indexing.DataFrameNumericIndexing.time_iloc_dups                                                                                         204±20μs
[ 70.00%] ··· indexing.DataFrameNumericIndexing.time_loc                                                                                               83.2±4μs
[ 75.00%] ··· indexing.DataFrameNumericIndexing.time_loc_dups                                                                                        4.73±0.3ms
[ 75.00%] · For pandas commit 5f26c342 <master^2> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython0.29.16-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 80.00%] ··· indexing.DataFrameNumericIndexing.time_bool_indexer                                                                                    3.63±0.1ms
[ 85.00%] ··· indexing.DataFrameNumericIndexing.time_iloc                                                                                               116±6μs
[ 90.00%] ··· indexing.DataFrameNumericIndexing.time_iloc_dups                                                                                         180±20μs
[ 95.00%] ··· indexing.DataFrameNumericIndexing.time_loc                                                                                               85.1±9μs
[100.00%] ··· indexing.DataFrameNumericIndexing.time_loc_dups                                                                                        5.86±0.8ms

@mproszewska
Copy link
Contributor Author

I could change size of tested DataFrame, so that benchmarks would change more significantly.

@jreback
Copy link
Contributor

jreback commented May 31, 2020

I could change size of tested DataFrame, so that benchmarks would change more significantly.

sure as long as benchamarks are << 1s its ok (e.g. in ms). please post the actual asv run. otherwise looks fine. ping on green.

@mproszewska
Copy link
Contributor Author

After asv change

 before           after         ratio
     [5f26c342]       [e0a34f8f]
     <master^2>       <perf-bool>
-      26.3±0.3ms      7.27±0.03ms     0.28  indexing.DataFrameNumericIndexing.time_bool_indexer

@jreback
Copy link
Contributor

jreback commented Jun 1, 2020

lgtm ping on green

@mproszewska
Copy link
Contributor Author

lgtm ping on green

ping

@jreback jreback merged commit c6c273a into pandas-dev:master Jun 1, 2020
@jreback
Copy link
Contributor

jreback commented Jun 1, 2020

thanks @mproszewska

@mproszewska mproszewska deleted the perf-bool branch June 1, 2020 01:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance regression in DataFrame[bool_indexer]
4 participants