PERF: performance regression in slicing a Series #33323

jorisvandenbossche · 2020-04-06T14:19:57Z

All of the slice indexing benchmarks are showing regressions, eg https://pandas.pydata.org/speed/pandas/#indexing.NumericSeriesIndexing.time_loc_slice?p-index_dtype=%3Cclass%20'pandas.core.indexes.numeric.Int64Index'%3E&p-index_structure='unique_monotonic_inc'&commits=80d37adc-4f89c261

Replicating one of them with a small snippet confirms this:

In [1]: pd.__version__   
Out[1]: '1.1.0.dev0+1122.gc7c640ec7'

In [2]: N = 10 ** 6  

In [3]: data = pd.Series(np.random.rand(N), index=pd.Int64Index(range(N))) 

In [4]: %timeit data.iloc[:800000] 
103 ms ± 8.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit data[:800000] 
97.2 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

vs

In [1]: pd.__version__
Out[1]: '1.0.3'

In [2]: N = 10 ** 6  

In [3]: data = pd.Series(np.random.rand(N), index=pd.Int64Index(range(N))) 

In [4]: %timeit data.iloc[:800000]  
55.7 µs ± 2.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit data[:800000]   
69.6 µs ± 2.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Profiling with snakeviz shows that on master, almost all of the time taken by the indexing operation is spent setting the mgr_locs (creating the BlockPlacement object), which is certainly not expected.

Putting a breakpoint in the Block init, just before the BlockPlacement is created, shows that the value being passed on master is a range object, while I assume it should be a slice object.
Going up the stack to see where this range object is created points to:

pandas/pandas/core/internals/managers.py

Line 1571 in a9c105a

block = blk.make_block_same_class(array, placement=range(len(array)))

According to git blame, this line was last touched by #32421. And indeed, that PR added a range object there instead of a slice (other places in the PR did use a slice, so this was probably a typo).
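To illustrate why passing a range instead of a slice could be so costly, here is a minimal sketch using only the standard library. `ToyPlacement` is a hypothetical stand-in written for this example, not pandas' actual `BlockPlacement`; it only assumes that a slice can be stored as-is while any other iterable (including a range) has to be materialized element by element. Both inputs describe the same positions, which is why the swap would go unnoticed functionally.

```python
from typing import Iterable, List, Union


class ToyPlacement:
    """Hypothetical stand-in for a block placement; NOT pandas' BlockPlacement."""

    def __init__(self, locs: Union[slice, Iterable[int]]):
        if isinstance(locs, slice):
            # Fast path: keep the slice object as-is, O(1) whatever its length.
            self.as_slice = locs
            self.as_list = None
        else:
            # Slow path: any other iterable (including range) is materialized
            # one element at a time, O(N) in time and memory.
            self.as_slice = None
            self.as_list = [int(i) for i in locs]

    def positions(self, length: int) -> List[int]:
        """Return the concrete positions this placement selects."""
        if self.as_slice is not None:
            return list(range(*self.as_slice.indices(length)))
        return self.as_list


n = 1_000
from_range = ToyPlacement(range(n))     # what the regressed line passed
from_slice = ToyPlacement(slice(0, n))  # what was presumably intended

# Both describe the same positions; only the construction cost differs.
same_positions = from_range.positions(n) == from_slice.positions(n)
print(same_positions)
```

Under these assumptions, the cost of the range path grows with the length of the sliced data, which would be consistent with a slice of 800000 elements going from microseconds to milliseconds.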


jbrockmendel · 2020-04-06T15:56:28Z

sounds like a typo, yah

jorisvandenbossche added the Indexing, Performance, and Regression labels Apr 6, 2020

jorisvandenbossche added this to the 1.1 milestone Apr 6, 2020

jorisvandenbossche mentioned this issue Apr 6, 2020

PERF: fix placement when slicing a Series #33324

Merged

jreback closed this as completed in #33324 Apr 6, 2020
