-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Add masked engine #49420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add masked engine #49420
Conversation
@@ -9,6 +9,11 @@ | |||
|
|||
from pandas._libs import index as libindex | |||
|
|||
from pandas import ( | |||
NA, | |||
Series, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file shouldn’t depend on series (see docstring)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need to construct the data that is passed to the Engine constructor in some way. For the other engines that is a plain numpy array (so not depending on anything outside of pd._libs), but for those new engines for masked arrays that are the data and mask numpy arrays. While it is possible to construct those manually, it is easier to do that through pd.Series(..)
(or pd.array(..)
would be sufficient), I think? (and this is only for the setup
, not for the actual benchmark)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx missed that docstring
We don't need the actual array since we are passing the underlying NumPy arrays anyway, we can just construct the mask and the data separately
pandas/_libs/index.pyx
Outdated
@@ -158,6 +168,8 @@ cdef class IndexEngine: | |||
return self._get_loc_duplicates(val) | |||
|
|||
try: | |||
if self.mask is not None and val is C_NA: | |||
return self.mapping.get_na() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this go outside the try/except?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thx
pandas/core/indexes/base.py
Outdated
): | ||
return libindex.ExtensionEngine(target_values) | ||
if isinstance(target_values, ExtensionArray): | ||
is_masked = hasattr(target_values, "_mask") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you check directly for e.g. BaseMaskedArray here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, have to import BaseMaskedArray locally though
# Conflicts: # asv_bench/benchmarks/indexing.py # pandas/tests/indexes/numeric/test_indexing.py
pandas/core/indexes/base.py
Outdated
): | ||
return libindex.ExtensionEngine(target_values) | ||
if isinstance(target_values, ExtensionArray): | ||
from pandas.core.arrays import BaseMaskedArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cant do this up top?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, circular import...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm that sucks. not for this PR but i bet either Categorical or IntervalArray are importing something from indexes and it would be reasonable to make the runtime imports happen there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep sounds sensible
pandas/core/indexes/base.py
Outdated
if isinstance(vals, StringArray): | ||
# GH#45652 much more performant than ExtensionEngine | ||
return vals._ndarray | ||
if type(self) is Index and isinstance(self._values, ExtensionArray): | ||
return self._values.astype(object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC _can_use_libjoin will make it so we don't go down this path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but we can not use this in one case, because the no use lib join path does not work there for non interval, so we have to go through object and end up here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 4444
This is less ugly now :) |
pandas/core/indexes/base.py
Outdated
@@ -4762,8 +4779,9 @@ def _get_engine_target(self) -> ArrayLike: | |||
# GH#45652 much more performant than ExtensionEngine | |||
return vals._ndarray | |||
if type(self) is Index and isinstance(self._values, ExtensionArray): | |||
# TODO(ExtensionIndex): remove special-case, just use self._values | |||
return self._values.astype(object) | |||
if not isinstance(self._values, BaseMaskedArray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an this be combined with the previous condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, found this a bit easier to read, but no strong opinion
cc @jbrockmendel Could you have another look? |
asv_bench/benchmarks/indexing.py
Outdated
indices = { | ||
True: Index(range(N), dtype=dtype), | ||
False: Index( | ||
list(range(50)) + [54, 53, 52, 51] + list(range(55, N - 1)), dtype=dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NBD but i think if you do this at the class level outside of setup it might only construct once. might add up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we run into risks of caching stuff on the Index level that could impact performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its possible. could just create the arrays instead of the Index objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, moved the list construction outside. We should probably think about going over the asvs, we are doing this in a lot of places.
@@ -9,6 +9,8 @@ | |||
|
|||
from pandas._libs import index as libindex | |||
|
|||
from pandas.core.arrays import BaseMaskedArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this renders the file-level docstring not-quite accurate. It shouldn't affect the main point of the docstring, but might be worth noting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks for being patient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Could use a whatsnew note describing the performance improvement of indexing with masked dtypes
Added |
Thanks @phofl |
Nice! Way to press through |
Thanks! I'll work on deduplicating some parts of the code soonish (as discussed sometime during the review) |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Lets see if ci passes