PERF: Fixes performance regression in DataFrame[bool_indexer] (#33924) #34199

mproszewska · 2020-05-15T20:25:45Z

closes Performance regression in DataFrame[bool_indexer] #33924
tests passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Instead of calling construction.array inside check_array_indexer, this creates the array with dtype=bool before calling check_array_indexer.

Fixes performance regression after commit b9bcdc3.
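A rough sketch of the idea, not the actual patch: when a plain Python list is turned into an array without a dtype, pandas has to inspect the elements to infer one, whereas passing dtype=bool states the target type up front. The variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

values = [True, False] * 5

# Without a dtype, pd.array inspects the elements to infer a type
# and ends up with the nullable "boolean" extension dtype.
inferred = pd.array(values)

# With dtype=bool the target type is known up front.
explicit = pd.array(values, dtype=bool)

# Either result can be viewed as a plain NumPy mask for indexing.
mask = np.asarray(explicit)

print(inferred.dtype)
print(mask.dtype)
```

How much time the explicit dtype saves depends on the pandas version and the input size, so treat the timings in this PR as the reference, not this sketch.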

setup = """
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100000, 5))
bool_indexer = [True] * 50000 + [False] * 50000
"""
import timeit
timeit.timeit("df[bool_indexer]", setup=setup, number=1000)

# master
# 27.29123323399108
# now
# 8.814320757053792
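For context, a small sanity check (my own, not part of the PR) that a list of booleans and the equivalent NumPy mask select the same rows. The frame here is deliberately tiny, so it says nothing about speed; it only shows the behaviour the benchmark above exercises.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2))
bool_indexer = [True] * 5 + [False] * 5

# Index once with the plain Python list and once with a NumPy bool array.
via_list = df[bool_indexer]
via_array = df[np.array(bool_indexer)]

# Both paths keep exactly the rows where the mask is True.
print(via_list.equals(via_array))
print(len(via_list))
```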

pep8speaks · 2020-05-15T20:25:48Z

Hello @mproszewska! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-31 23:38:07 UTC

jbrockmendel · 2020-05-15T20:55:08Z

pandas/core/indexing.py

        result = np.asarray(result, dtype=bool)
        result = check_array_indexer(index, result)
    else:
+        if not is_array_like(result):
+            # GH 33924
+            result = pd_array(result, dtype=bool)


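jbrockmendel (May 15, 2020): Why not just np.array?

mproszewska (May 16, 2020): Because of nullable input.

jbrockmendel (May 16, 2020): If the input is nullable, wouldn't not is_array_like(result) be False?

mproszewska (May 17, 2020): I meant that this array/list can have nullable elements.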
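To illustrate the general concern about lists with missing elements (this is my own sketch, and the "boolean" dtype string is my choice, not something used in the diff above): NumPy's bool dtype cannot represent a missing value, while pandas' nullable boolean dtype can. What the merged code does with such input is not shown in this thread.

```python
import numpy as np
import pandas as pd

values = [True, None, False]

# NumPy's plain bool dtype has no missing-value slot: None is coerced to False.
as_numpy = np.array(values, dtype=bool)

# pandas' nullable "boolean" extension dtype keeps the missing entry as pd.NA.
as_nullable = pd.array(values, dtype="boolean")

print(as_numpy.tolist())
print(as_nullable.isna().tolist())
```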

jreback

can you add an asv that hits this case & a whatsnew note

mproszewska · 2020-05-24T02:19:23Z

indexing.DataFrameNumericIndexing.time_bool_indexer looks similar to the case that the linked issue describes, so I'm not sure an additional asv is needed.
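For readers unfamiliar with asv, a benchmark is a class with a setup method and one or more time_ methods. A minimal sketch of a benchmark that hits the list-of-bools case could look like the following; the class and method names are hypothetical and this is not the code of the existing time_bool_indexer benchmark.

```python
import numpy as np
import pandas as pd


class BoolListIndexing:  # hypothetical benchmark name
    def setup(self):
        # Same shape as the timeit example in the PR description.
        n = 100_000
        self.df = pd.DataFrame(np.random.randn(n, 5))
        # A plain Python list, not an ndarray, so the list path is exercised.
        self.bool_indexer = [True] * (n // 2) + [False] * (n // 2)

    def time_bool_list_indexer(self):
        # asv times the body of every method whose name starts with "time_".
        self.df[self.bool_indexer]
```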

jreback · 2020-05-25T17:18:53Z

can you run and post the indexing asvs

mproszewska · 2020-05-31T22:36:04Z

[ 50.00%] · For pandas commit e0a34f8f <perf-bool> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 55.00%] ··· indexing.DataFrameNumericIndexing.time_bool_indexer                                                                                    1.29±0.3ms
[ 60.00%] ··· indexing.DataFrameNumericIndexing.time_iloc                                                                                              147±20μs
[ 65.00%] ··· indexing.DataFrameNumericIndexing.time_iloc_dups                                                                                         204±20μs
[ 70.00%] ··· indexing.DataFrameNumericIndexing.time_loc                                                                                               83.2±4μs
[ 75.00%] ··· indexing.DataFrameNumericIndexing.time_loc_dups                                                                                        4.73±0.3ms
[ 75.00%] · For pandas commit 5f26c342 <master^2> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython0.29.16-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 80.00%] ··· indexing.DataFrameNumericIndexing.time_bool_indexer                                                                                    3.63±0.1ms
[ 85.00%] ··· indexing.DataFrameNumericIndexing.time_iloc                                                                                               116±6μs
[ 90.00%] ··· indexing.DataFrameNumericIndexing.time_iloc_dups                                                                                         180±20μs
[ 95.00%] ··· indexing.DataFrameNumericIndexing.time_loc                                                                                               85.1±9μs
[100.00%] ··· indexing.DataFrameNumericIndexing.time_loc_dups                                                                                        5.86±0.8ms

mproszewska · 2020-05-31T22:37:28Z

I could change the size of the tested DataFrame so that the benchmarks change more significantly.

jreback · 2020-05-31T22:51:12Z

> I could change the size of the tested DataFrame so that the benchmarks change more significantly.

sure, as long as the benchmarks are << 1s it's ok (e.g. in ms). please post the actual asv run. otherwise looks fine. ping on green.

mproszewska · 2020-05-31T23:38:45Z

After asv change

       before           after         ratio
     [5f26c342]       [e0a34f8f]
     <master^2>       <perf-bool>
-      26.3±0.3ms      7.27±0.03ms     0.28  indexing.DataFrameNumericIndexing.time_bool_indexer
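The ratio column is simply after divided by before. Redoing the arithmetic from the row above (ignoring the ± spread):

```python
before_ms = 26.3
after_ms = 7.27

# asv reports after / before, so values below 1 mean the new code is faster.
ratio = after_ms / before_ms

# The inverse expresses the same result as a speedup factor.
speedup = before_ms / after_ms

print(f"ratio   = {ratio:.2f}")
print(f"speedup = {speedup:.1f}x")
```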

jreback · 2020-06-01T00:49:04Z

lgtm ping on green

mproszewska · 2020-06-01T01:36:07Z

> lgtm ping on green

ping

jreback · 2020-06-01T01:42:40Z

thanks @mproszewska

mproszewska added 3 commits May 15, 2020 17:38

PERF: Remove unnecessary copies in sorting functions

c94b45e

PERF: Create array from list with given dtype=bool

4ba7472

Run black

509f74a

jbrockmendel reviewed May 15, 2020

mproszewska added 4 commits May 16, 2020 19:06

Run tests

0ab450b

Run tests

54c7304

Run tests

c7de90b

Fix imports

0e75426

jreback requested changes May 17, 2020

jreback reviewed May 17, 2020

jreback added the Indexing and Performance labels May 18, 2020

mproszewska and others added 10 commits May 22, 2020 23:19

Add asv

6d72a34

Run black

5ba54a6

Remove asv

2766270

Add requested changes

cb1312c

Merge branch 'master' into perf-bool

202f409

Run black

74cb449

Merge branch 'perf-bool' of https://github.com/mproszewska/pandas int…

6440281

…o perf-bool

Delete newline

915dec8

Fix whatsnew

6925c29

Merge branch 'perf'

91176ca

jreback added this to the 1.1 milestone May 25, 2020

jreback requested changes May 25, 2020

mproszewska added 3 commits May 28, 2020 17:00

Add requested changes

0734a87

Merge branch 'master' into perf-bool

2def933

Fix

70a266c

mproszewska and others added 7 commits May 28, 2020 17:03

Fix

8bed9b5

Fix typo

cfa6b9e

Merge branch 'master' into perf-bool

013495d

Fix

8707be4

Merge remote-tracking branch 'upstream/master'

f748b78

Conflict resolve

5b9fe28

Merge branch 'perf-bool' of https://github.com/mproszewska/pandas int…

e0a34f8

…o perf-bool

Update asv

2ce27c8

jreback approved these changes Jun 1, 2020

jreback merged commit c6c273a into pandas-dev:master Jun 1, 2020

mproszewska deleted the perf-bool branch June 1, 2020 01:51
