-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: further improve take (reindex/unstack) for ArrayManager #40300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# check for promotion based on types only (do this first because | ||
# it's faster than computing a mask) | ||
dtype, fill_value = maybe_promote(arr.dtype, fill_value) | ||
if dtype != arr.dtype and (out is None or out.dtype != dtype): | ||
# check if promotion is actually required based on indexer | ||
mask = indexer == -1 | ||
needs_masking = mask.any() | ||
mask_info = mask, needs_masking | ||
if needs_masking: | ||
if out is not None and out.dtype != dtype: | ||
raise TypeError("Incompatible type for fill_value") | ||
else: | ||
# if not, then depromote, set fill_value to dummy | ||
# (it won't be used but we don't want the cython code | ||
# to crash when trying to cast it to dtype) | ||
dtype, fill_value = arr.dtype, arr.dtype.type() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing changed in those lines of the code; it's only a change in indentation (which github does't seem to detect nicely).
Only the first lines:
if indexer is None:
indexer = np.arange(arr.shape[axis], dtype=np.int64)
dtype, fill_value = arr.dtype, arr.dtype.type()
else:
indexer = ensure_int64(indexer, copy=False)
have been moved out into _take_nd_ndarray
if indexer is None: | ||
indexer = np.arange(arr.shape[axis], dtype=np.int64) | ||
dtype, fill_value = arr.dtype, arr.dtype.type() | ||
else: | ||
indexer = ensure_int64(indexer, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part is copied from _take_preprocess_indexer_and_fill_value
, so we can avoid doing that in take_1d
@@ -999,11 +1000,16 @@ def _reindex_indexer( | |||
|
|||
else: | |||
validate_indices(indexer, len(self._axes[0])) | |||
indexer = ensure_int64(indexer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you converting to int64? we use intp for indexing via take
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's existing code (just cut and paste). And our cython take algos always use int64:
const int64_t[:] indexer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should they though? i expect they ought to be intp_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im increasingly confident that intp_t is the correct thing to do, but its out of scope for this PR, xref #40390
if (indexer == -1).any(): | ||
allow_fill = True | ||
else: | ||
allow_fill = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche in looking at the benchmarks you posted in #39146 it seems like some of them involve optimizations implemented for AM that could be implemented for BM but weren't. is there anything here that falls into that category?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in looking at the benchmarks you posted in #39146 it seems like some of them involve optimizations implemented for AM that could be implemented for BM but weren't
Are you thinking of something specific?
The optimization I do here above can probably be useful if you have lots of non-consolidated blocks, but for the benchmark cases with single blocks posted in #39146, that wouldn't make a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Series has an algos.take_nd call i think could be take_1d. Also several of the EA.take implementations
Are there further comments here? |
indexer, dtype, fill_value, mask_info = _take_preprocess_indexer_and_fill_value( | ||
arr, indexer, 0, None, fill_value, allow_fill | ||
arr, indexer, None, fill_value, allow_fill |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we ever get to _take_preprocess_indexer_and_fill_value with allow_fill=False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's currently possible in take_nd
, and a quick search reveals that we have a few cases throughout pandas that calls take_nd
specifying this argument. That's probably something to optimize though.
yeah i think generally we should be using intp for indexing as we we should change our cython routines to do this (this maybe a non trivial ask) it is likely this is a perf bottleneck (as we are moving between intp and int64 a fair amount) i would be wrong though and this is not really an issue |
checks are failing (maybe just rebase needed) |
arent these aliases on 64bit builds? |
Can/should we just have take_nd do an earlier check for ndim == 1 and then dispatch to take_1d? In NDArrayBackedExtensionArray.take if i change take_nd to
then on the time_full_product benchmark referenced #39146 (comment) we shave off 25-30% relative to master (non-AM) |
That might be possible (although the check for EA and |
im fine with that, just suggesting if we have a clear optimization for the general |
Certainly, see #40405 |
For the rest, any remaining comments before merging this? |
ill take another look this afternoon |
new_indexer2D = new_indexer.reshape(*unstacker.full_shape) | ||
new_indexer2D = ensure_int64(new_indexer2D) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this chunk of code be made to match Block._unstack (or vice-versa) more closely?
e.g. on L1094 you check unstacker.mask.all() which I think is related to the comment in Block._unstack # TODO: in all tests we have mask.all(); can we rely on that?
but bc mask
is defined there differently it is not obvious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming you are referring to the EA-specific Block.unstack
(as the other one has even less in common): that's for a single column while this is for multiple columns at once, and needs to deal with creating blocks, with placement, etc. So personally I don't see much possibilities to share code.
e.g. on L1094 you check unstacker.mask.all() which I think is related to the comment in Block._unstack
As far as I understand the Block version, I don't think that's related. For Block/BlockManager.unstack, that mask
value somehow indicates which of the columns need to end up in the result (I don't really understand that code, though, as in the ArrayManager I am simply using the result of unstacker.get_new_columns
without any filtering of that afterwards), while here I only check the mask
to potentially use a faster take
for performance-reasons, but that doesn't influence the end result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's related
OK.
(I don't really understand that code, though
at some point we should make sure that one of us (active maintainers, not just the two of us) have a handle on this.
lgtm @jbrockmendel merge when ready |
LGTM. Merging this evening absent other comments. |
thanks @jorisvandenbossche |
A few related changes to
take_1d
/ArrayManager.unstack
:ArrayManager.unstack
whether there will be missing values introduced or not, and if not, passallow_fill=False
to thetake
function, so this can take a more performant path (and doesn't need to check repeatedly forindexer == -1
). And the same forArrayManager.reindex_indexer
.take_1d
, ifallow_fill=False
, use the numpytake
instead of our own cython implementation (numpy is a bit faster).ensure_int64
check for the indexer out of the shared_take_preprocess_indexer_and_fill_value
. This way, we can do this once in ArrayManager.unstack and avoid doing it repeatedly for the same indexer insidetake_1d
.Case from the
frame_methods.py::Unstack.time_full_product
ASV benchmark:with the combined changes, this gives: