Skip to content

PERF: optimize index.__getitem__ for slice & boolean mask indexers #6440

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 28, 2014

Conversation

immerrr
Copy link
Contributor

@immerrr immerrr commented Feb 22, 2014

This patch fixes performance issues discussed in and closes #6370.

There's an API change though: no type inference will now happen upon resulting index, e.g. here's what it would look like on current master:

In [1]: pd.Index([1,2,3, 'a', 'b', 'c'])
Out[1]: Index([1, 2, 3, u'a', u'b', u'c'], dtype='object')

In [2]: _1[[0,1,2]]
Out[2]: Int64Index([1, 2, 3], dtype='int64')

And this is how it looks with the patch:

In [1]: pd.Index([1,2,3, 'a', 'b', 'c'])
Out[1]: Index([1, 2, 3, u'a', u'b', u'c'], dtype='object')

In [2]: _1[[0,1,2]]
Out[2]: Index([1, 2, 3], dtype='object')

On the bright side, I'd say using mixed-type indices seems a quite rare scenario and the performance improvement is worth it.

Performance Before:

In [1]: import pandas.util.testing as tm

In [2]: idx = tm.makeStringIndex(1000000)

In [3]: mask = np.arange(1000000) % 3 == 0

In [4]: series_mask = pd.Series(mask); mask
Out[4]: array([ True, False, False, ..., False, False,  True], dtype=bool)

In [5]: timeit idx[:-1]
100000 loops, best of 3: 1.57 µs per loop

In [6]: timeit idx[::2]
100 loops, best of 3: 14.6 ms per loop

In [7]: timeit idx[mask]
100 loops, best of 3: 15.4 ms per loop

In [8]: timeit idx[series_mask]
100 loops, best of 3: 15.4 ms per loop

In [9]: pd.__version__
Out[9]: '0.13.1-278-gaf63b99'

Performance After:

In [1]: import pandas.util.testing as tm

In [2]: idx = tm.makeStringIndex(1000000)

In [3]: mask = np.arange(1000000) % 3 == 0

In [4]: series_mask = pd.Series(mask)

In [5]: timeit idx[:-1]
1000000 loops, best of 3: 1.56 µs per loop

In [6]: timeit idx[::2]
100000 loops, best of 3: 4.89 µs per loop

In [7]: timeit idx[mask]
100 loops, best of 3: 11.4 ms per loop

In [8]: timeit idx[series_mask]
100 loops, best of 3: 11.4 ms per loop

In [9]: pd.__version__
Out[9]: '0.13.1-279-gbc810f0'

@jreback
Copy link
Contributor

jreback commented Feb 22, 2014

can u post any significant changes in a full vbench run

return result
except (IndexError):
if not len(key):
return self.__class__(np.empty(0, dtype=self.dtype),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use Index here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With or without fastpath?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think u can just do Index([])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause losing specific index type, i.e. something like that:

In [4]: pd.Index([1,2,3], name='int_index')
Out[4]: Int64Index([1, 2, 3], dtype='int64')

In [5]: _4[[]]
Out[5]: Int64Index([], dtype='int64')

will return generic Index, not Int64Index as above. Is that OK?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO you're fixing the behaviour above where a sub-Series can have different class its super Series, for consistency this should also share class when empty. :)

Maybe self._constructor is better ?

Maybe class should be determined from dtype passed to the Index constructor:

In [11]: pd.Index([], dtype=int)  # expect Int64Index ?
Out[11]: Index([], dtype='object')

But I guess that is a sep issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe self._constructor is better ?

I see it's implemented as self.__class__ in Index, so I assume it provides some extra functionality in other classes, but I can't grip the rationale: for Series it always will return Series, even if you subclass it, same for DataFrame. What exactly is it supposed to do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it back, they are equivalent.

I think it's supposed to do just that, be used instead of Index/Series/DataFrame directly so these can be subclassed and methods will respect subclassing.

Anyway, this PR is really good (and a bug fix as well as perf).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's supposed to do just that, be used instead of Index/Series/DataFrame directly so these can be subclassed and methods will respect subclassing.

This is exactly what seems strange to me: I usually use self.__class__ or type(self) if I need a base class method to recognize it can be subclassed, but OTOH looking at how _constructor is implemented in Series it appears to seal class hierarchy (unless you override it in derived class) and the whole point of this is unclear to me.

@immerrr
Copy link
Contributor Author

immerrr commented Feb 22, 2014

Here's full vbench log: https://gist.github.com/immerrr/9153429

Interesting parts:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_frame_apply_overhead                 |   8.4306 |  11.8597 |   0.7109 |
groupby_frame_apply                          |  45.0277 |  60.4376 |   0.7450 |
index_str_boolean_indexer                    |  10.9810 |  14.5193 |   0.7563 |
index_str_boolean_series_indexer             |  11.0783 |  14.5703 |   0.7603 |
....
frame_mask_bools                             |  14.3187 |  12.2263 |   1.1711 |
datamatrix_getitem_scalar                    |   0.0110 |   0.0083 |   1.3269 |

@jreback
Copy link
Contributor

jreback commented Feb 22, 2014

can u see if the get item scalar vbench repeats when u rerun?
eg is it an actual change

secondly why does the type inference fail on a selected series? that doesn't break any tests?

@immerrr
Copy link
Contributor Author

immerrr commented Feb 22, 2014

can u see if the get item scalar vbench repeats when u rerun?

can you elaborate what "get item scalar vbench" means?

secondly why does the type inference fail on a selected series? that doesn't break any tests?

No, it doesn't break tests (as run by nosetests -w pandas on my Linux system).

As for why type inference fails, it's the type inference that actually causes the 1000x slowdown. I've got rid of it and the slice operation has become O(1) instead of O(n) again.

As for if this inference is necessary or not, my reasoning is as follows: one inference already happens when the initial index object is created and given that most of the time index keys come from one source and thus have same dtypes, it's very unlikely that slicing will change index type. If this assumption is true then in order to support such rare cases each index slicing operation must suffer O(n) penalty (which can easily be 1000x for 1M elements as shown).

@jreback
Copy link
Contributor

jreback commented Feb 22, 2014

datamatrix_getitem_scalar | 0.0110 | 0.0083 | 1.3269 |

is what I meant (this is a hard to measure test its in mu....that's why I don't think it will repeat)

ok on the type inference then.

Can you add a test that uses the example from above (and shows that the resultant is an Index type). Just (and doc to this issue), just that it is referenced and know and will break if its changed later.

@immerrr
Copy link
Contributor Author

immerrr commented Feb 22, 2014

Will do tomorrow.

@jreback jreback added this to the 0.14.0 milestone Feb 22, 2014
@jreback
Copy link
Contributor

jreback commented Feb 22, 2014

ahh ok

then leave it like it is

@immerrr
Copy link
Contributor Author

immerrr commented Feb 23, 2014

Apparently, datamatrix_getitem_scalar result is not relevant:

$ ./test_perf.sh -H -N 5 -r datamatrix_getitem_scalar
CPU affinity set to 1


*** LOG_FILE = /home/immerrr/sources/pandas/vb_suite.log


Performing 1 benchmarks (5 runs each)
.

*** 
Invoked with :
--ncalls: 3
--repeats: 3


-----------------------------------------------------------------------------------------------------
Test name                                    |    #0    |    #1    |    #2    |    #3    |    #4    |
-----------------------------------------------------------------------------------------------------
datamatrix_getitem_scalar                    |   0.0083 |   0.0087 |   0.0087 |   0.0083 |   0.0083 |
-----------------------------------------------------------------------------------------------------
Test name                                    |    #0    |    #1    |    #2    |    #3    |    #4    |
-----------------------------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [8801089] : TST: make sure index type doesn't change

@jreback
Copy link
Contributor

jreback commented Feb 23, 2014

can you add a release notes mention..otherwise looks good to go

@immerrr
Copy link
Contributor Author

immerrr commented Feb 25, 2014

@jreback would you please look here:

9f0dc3b#diff-dc96e62847cdfe77b570cf9c6b0e8086R629

It appears that the if not len(key): result = [] piece is part of allowing iloc array indexers to include out-of-bounds indices without raising an error. According to #6296, that's supposed to be rolled back in #6443 and I'd guess it should be removed here as well, right?

@jreback
Copy link
Contributor

jreback commented Feb 25, 2014

prove it with a test case

their are so many paths fife inspection often is faulty

@immerrr
Copy link
Contributor Author

immerrr commented Feb 27, 2014

I've removed the special handling for empty key case and added a test for it. If that's OK, I'll squash and bump the changelog.

Btw, speaking of changelog, as @hayd pointed out, aside from perf improvements this also changes API somewhat. How do I categorize this PR?

@jreback
Copy link
Contributor

jreback commented Feb 27, 2014

put it in the release.rst API section (the 1-liner reffing this issue). Plus do a small example in v0.14.0.txt in the API section. Show a 'common' operation, e.g. create a frame with a mixed index, then slice it and show that the index is not inferred as before (the before should be a code-block)

@hayd
Copy link
Contributor

hayd commented Feb 27, 2014

IMO the old behaviour was undocumented so current categorization is all good.

Perhaps worth saying in release notes you can/must be explicit if you want to convert the sub-index (e.g in your example above using _1[[0,1,2]].astype(int)).

@immerrr
Copy link
Contributor Author

immerrr commented Feb 28, 2014

Rebased & bumped the changelog.

Some commits look rather independent from the others to me, do you want them folded into one?

@jreback
Copy link
Contributor

jreback commented Feb 28, 2014

you can squash them all into 1
or 2 if u want

@immerrr
Copy link
Contributor Author

immerrr commented Feb 28, 2014

Done

jreback added a commit that referenced this pull request Feb 28, 2014
PERF: optimize index.__getitem__ for slice & boolean mask indexers
@jreback jreback merged commit f149927 into pandas-dev:master Feb 28, 2014
@jreback
Copy link
Contributor

jreback commented Feb 28, 2014

thanks!

@immerrr immerrr deleted the index-getitem-performance branch February 28, 2014 16:15
@immerrr immerrr restored the index-getitem-performance branch March 3, 2014 17:14
@immerrr immerrr deleted the index-getitem-performance branch March 3, 2014 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: Index.__getitem__ performance issue
3 participants