-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: optimize index.__getitem__ for slice & boolean mask indexers #6440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
can u post any significant changes in a full vbench run |
return result | ||
except (IndexError): | ||
if not len(key): | ||
return self.__class__(np.empty(0, dtype=self.dtype), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use Index here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With or without fastpath?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think u can just do Index([])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will cause losing specific index type, i.e. something like that:
In [4]: pd.Index([1,2,3], name='int_index')
Out[4]: Int64Index([1, 2, 3], dtype='int64')
In [5]: _4[[]]
Out[5]: Int64Index([], dtype='int64')
will return generic Index
, not Int64Index
as above. Is that OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO you're fixing the behaviour above where a sub-Series can have different class its super Series, for consistency this should also share class when empty. :)
Maybe self._constructor
is better ?
Maybe class should be determined from dtype passed to the Index constructor:
In [11]: pd.Index([], dtype=int) # expect Int64Index ?
Out[11]: Index([], dtype='object')
But I guess that is a sep issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe self._constructor is better ?
I see it's implemented as self.__class__
in Index
, so I assume it provides some extra functionality in other classes, but I can't grip the rationale: for Series
it always will return Series
, even if you subclass it, same for DataFrame
. What exactly is it supposed to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take it back, they are equivalent.
I think it's supposed to do just that, be used instead of Index/Series/DataFrame directly so these can be subclassed and methods will respect subclassing.
Anyway, this PR is really good (and a bug fix as well as perf).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's supposed to do just that, be used instead of Index/Series/DataFrame directly so these can be subclassed and methods will respect subclassing.
This is exactly what seems strange to me: I usually use self.__class__
or type(self)
if I need a base class method to recognize it can be subclassed, but OTOH looking at how _constructor
is implemented in Series
it appears to seal class hierarchy (unless you override it in derived class) and the whole point of this is unclear to me.
Here's full vbench log: https://gist.github.com/immerrr/9153429 Interesting parts:
|
can u see if the get item scalar vbench repeats when u rerun? secondly why does the type inference fail on a selected series? that doesn't break any tests? |
can you elaborate what "get item scalar vbench" means?
No, it doesn't break tests (as run by As for why type inference fails, it's the type inference that actually causes the 1000x slowdown. I've got rid of it and the slice operation has become As for if this inference is necessary or not, my reasoning is as follows: one inference already happens when the initial index object is created and given that most of the time index keys come from one source and thus have same dtypes, it's very unlikely that slicing will change index type. If this assumption is true then in order to support such rare cases each index slicing operation must suffer |
is what I meant (this is a hard to measure test its in mu....that's why I don't think it will repeat) ok on the type inference then. Can you add a test that uses the example from above (and shows that the resultant is an |
Will do tomorrow. |
ahh ok then leave it like it is |
Apparently,
|
can you add a release notes mention..otherwise looks good to go |
@jreback would you please look here: 9f0dc3b#diff-dc96e62847cdfe77b570cf9c6b0e8086R629 It appears that the |
prove it with a test case their are so many paths fife inspection often is faulty |
I've removed the special handling for empty key case and added a test for it. If that's OK, I'll squash and bump the changelog. Btw, speaking of changelog, as @hayd pointed out, aside from perf improvements this also changes API somewhat. How do I categorize this PR? |
put it in the release.rst API section (the 1-liner reffing this issue). Plus do a small example in v0.14.0.txt in the API section. Show a 'common' operation, e.g. create a frame with a mixed index, then slice it and show that the index is not inferred as before (the before should be a code-block) |
IMO the old behaviour was undocumented so current categorization is all good. Perhaps worth saying in release notes you can/must be explicit if you want to convert the sub-index (e.g in your example above using |
Rebased & bumped the changelog. Some commits look rather independent from the others to me, do you want them folded into one? |
you can squash them all into 1 |
Done |
PERF: optimize index.__getitem__ for slice & boolean mask indexers
thanks! |
This patch fixes performance issues discussed in and closes #6370.
There's an API change though: no type inference will now happen upon resulting index, e.g. here's what it would look like on current master:
And this is how it looks with the patch:
On the bright side, I'd say using mixed-type indices seems a quite rare scenario and the performance improvement is worth it.
Performance Before:
Performance After: