PERF/REGR: astype changing order of some 2d data #42475
Conversation
```
@@ -1094,14 +1093,12 @@ def astype_nansafe(
        The dtype was a datetime64/timedelta64 dtype, but it had no unit.
        """
    if arr.ndim > 1:
        # Make sure we are doing non-copy ravel and reshape.
        flags = arr.flags
        flat = arr.ravel("K")
```
won't this mean a perf hit in cases where `arr` is F-contiguous?
Yep (some discussion about that in the PR body). It would be a regression from master (but not from 1.2.x). The other fix for this regression is to just use `arr.ravel(order)` where `order` is the same as used for `reshape`. That avoids the hit you're talking about, but doesn't fix the perf regression (although that perf regression is a bit suspicious; I was trying to profile it, but on larger data sizes the regression doesn't show up anymore).
EDIT: The regression does show up even with larger data; the perf hit comes from the reduction step, e.g. `bottleneck.reduce.nanmean` on a call like `.mean(axis=0)`, or `'reduce' of 'numpy.ufunc'` with `.sum(axis=0)`. The only difference I can see in the data going in is F-contiguity on master vs. C-contiguity on this PR (and 1.2.x).
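A hedged sketch of how one might measure that difference (sizes are illustrative, and `bottleneck.reduce.nanmean` is the kernel named above; this is my repro, not one from the PR):

```python
import bottleneck as bn
import numpy as np

# Same values, two memory layouts.
data = np.random.randn(10_000, 100)
c_arr = np.ascontiguousarray(data)  # C-contiguous, as on 1.2.x / this PR
f_arr = np.asfortranarray(data)     # F-contiguous, as on master

# In IPython, compare the reduction kernel on each layout:
#   %timeit bn.reduce.nanmean(c_arr, axis=0)
#   %timeit bn.reduce.nanmean(f_arr, axis=0)
# The discussion above reports the F-contiguous input as the slower one.
```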
does `arr.ravel(order)` fix this?
Yeah, `arr.ravel(order)` fixes the regression in behavior, but not the performance regression (discussed a bit in the comments above). Seems like it comes down to computations like `bottleneck.reduce.nanmean` being slower on F-contiguous data.
> Seems like it comes down to computations like `bottleneck.reduce.nanmean` being slower on F-contiguous data.
yes, for sure that would be true.
ok, so the question remains: does this go back to 1.2.x perf? are the F-contig routines hit? when do we hit these?
Yep exactly -> the benchmarks shown at the top of the PR are back to 1.2.x perf. With the perf regression on master, those benchmarks hit reduction routines with F-contig data. With this PR and 1.2.x, we hit those routines with C-contig data.
Thanks @mzeitlin11, generally lgtm. Some comments, but none are blockers for the bug fix and perf regression fix.
pandas/core/dtypes/cast.py
Outdated
```
# Make sure we are doing non-copy ravel and reshape.
flags = arr.flags
flat = arr.ravel("K")
# TODO: try to use contiguity to avoid potentially copying here, see #42475
```
maybe lose this comment. trying to do this caused the regression.
have removed
```
  # error: Item "ExtensionArray" of "Union[ExtensionArray, ndarray]" has no
  # attribute "reshape"
- return result.reshape(arr.shape, order=order)  # type: ignore[union-attr]
+ return result.reshape(arr.shape)  # type: ignore[union-attr]
```
if we revert all of #38507, do we remove the need for an ignore? i.e. is there another issue lurking here.
Reverting would remove the need for the ignore; in the previous logic the reshape was inside an `isinstance(ndarray)` check. AFAICT there is no issue here though - the problematic case would be if `astype_nansafe` is called on a 2D ndarray and we try to convert to an `ExtensionDtype`. But based on
Lines 5731 to 5738 in 87d7855:

```
elif is_extension_array_dtype(dtype) and self.ndim > 1:
    # GH 18099/22869: columnwise conversion to extension dtype
    # GH 24704: use iloc to handle duplicate column names
    # TODO(EA2D): special case not needed with 2D EAs
    results = [
        self.iloc[:, i].astype(dtype, copy=copy)
        for i in range(len(self.columns))
    ]
```
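For context, a small sketch of what that columnwise path means in practice (the nullable `"Int64"` dtype here is just an illustrative extension dtype):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))

# DataFrame.astype dispatches extension-dtype conversion column by
# column, so astype_nansafe never sees a 2D ndarray together with an
# ExtensionDtype.
result = df.astype("Int64")
print(result.dtypes)  # Int64 for every column
```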
sure. but let's leave that out of this PR which is being backported
```
data = np.arange(16).reshape(4, 4)
df = DataFrame(data, dtype=np.intp)

result = df.iloc[:step1, :step2].astype("int16").astype(np.intp)
```
using "step" as the parameter name is confusing in this part of the test
thanks, have refactored test to clean that up
```
expected = df.iloc[:step1, :step2]
tm.assert_frame_equal(result, expected)

result = df.iloc[::step1, ::step2].astype("int16").astype(np.intp)
```
ideally the result should just be the op under test. There are 2 astypes here.
Makes sense, have changed up assert to avoid the need for this
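A hedged sketch of the kind of round-trip check being discussed (the slice and dtypes mirror the quoted test; the final refactored test in the PR may differ):

```python
import numpy as np
from pandas import DataFrame
import pandas._testing as tm

df = DataFrame(np.arange(16).reshape(4, 4), dtype=np.intp)
sliced = df.iloc[::2, ::2]

# The bug made this int16 round-trip reorder the underlying data.
result = sliced.astype("int16").astype(np.intp)
tm.assert_frame_equal(result, sliced)
```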
I'm a bit confused as to what parts of this are about a regression[bug] vs a regression[perf]. @mzeitlin11 will this be obvious once I get some caffeine?
```
# error: Item "ExtensionArray" of "Union[ExtensionArray, ndarray]" has no
# attribute "reshape"
return result.reshape(arr.shape, order=order)  # type: ignore[union-attr]
```
IIRC the point of the `flags` stuff was to a) avoid a copy in ravel and b) retain the same contiguity in the returned array. Is the problem that these goals aren't being achieved, or something else?

(I'm pretty sure I used the same flags-based pattern in the datetimelike arrays, so whatever fix is used here might need to be applied there too.)
For the behavior regression, what was happening was the following (a concrete sketch follows the list):

- Data which is not contiguous gets passed in.
- Depending on the structure of this data, `arr.ravel("K")` may choose to index with `F` ordering, while `order = "F" if flags.f_contiguous else "C"` will give `C` order since the input is not contiguous.
- So the `reshape` will index with `C` ordering, changing the data ordering.
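A minimal sketch of that sequence (values mirror the `arange(16).reshape(4, 4)` test above; the transpose is my stand-in for how a DataFrame block ends up non-contiguous):

```python
import numpy as np

# A sliced transpose is neither C- nor F-contiguous, but its elements
# sit in memory in F-like order.
arr = np.arange(16).reshape(4, 4).T[::2, ::2]
print(arr)             # [[ 0  8]
                       #  [ 2 10]]

flat = arr.ravel("K")  # reads in memory order -> [ 0  2  8 10]

# The old code picked "C" because the input is not contiguous ...
order = "F" if arr.flags.f_contiguous else "C"
# ... so the reshape reads `flat` in C order and reorders the data:
print(flat.reshape(arr.shape, order=order))  # [[ 0  2]
                                             #  [ 8 10]]
```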
so what if we change L1100 to something like `order = "F" if flags.f_contiguous else "C" if flags.c_contiguous else None`?
`reshape` doesn't have a concept of `None` -> it can default to `C` if no argument is passed, otherwise one of `C`, `F`, `A`.

To be clear, the `order` argument for `reshape` doesn't have to do with contiguity - it has to do with the order in which elements are read from the input and written to the output (so specifying the order `C` does not mean that the underlying input must be C-contiguous, it means that the input will be read with `C` ordering).
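A quick illustration of that point (values are arbitrary):

```python
import numpy as np

# F-contiguous input: values [[0, 1, 2], [3, 4, 5]],
# laid out in memory as 0, 3, 1, 4, 2, 5.
a = np.asfortranarray(np.arange(6).reshape(2, 3))

# order= controls traversal, not the required memory layout:
print(a.reshape(3, 2, order="C"))  # reads 0,1,2,3,4,5 -> [[0 1] [2 3] [4 5]]
print(a.reshape(3, 2, order="F"))  # reads 0,3,1,4,2,5 -> [[0 4] [3 2] [1 5]]
```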
Got it, thanks for walking me through this
Haha, I'll try to clarify :)
ok thanks for the explanation @mzeitlin11
@meeseeksdev backport 1.3.x
Something went wrong ... Please have a look at my logs.
This just lets numpy stick with the default of `order="C"` for both `ravel` and `reshape`, so no reordering occurs.

Benchmarks:
Will look to come up with a benchmark hitting this regression more squarely (which would help test if a more clever contiguity approach than always defaulting to "C" can improve performance).
EDIT: after regression is fixed, might be worth looking into reinstating something like the original approach. Some timings show performance benefit we lose here:
EDIT: This slowdown comes down to the ravel copying when the contiguity doesn't match the requested order. Let me know if it's better to try to fix the slowdown shown above in this PR or to leave it as a followup for 1.4.
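A small sketch of where that copy comes from (size is illustrative):

```python
import numpy as np

# With the fix, ravel() always traverses in C order; on F-ordered
# input the memory doesn't match that traversal, so numpy must copy.
f_arr = np.asfortranarray(np.ones((1000, 1000)))

flat_c = f_arr.ravel()       # order="C": forces a copy here
flat_k = f_arr.ravel("K")    # follows memory order: just a view

print(flat_c.base is f_arr)  # False -> fresh buffer
print(flat_k.base is f_arr)  # True  -> no copy
```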