
PERF/REGR: astype changing order of some 2d data #42475


Merged
jreback merged 9 commits into pandas-dev:master from mzeitlin11:regr_astype on Jul 13, 2021

Conversation

@mzeitlin11 mzeitlin11 (Member) commented Jul 10, 2021

This just lets numpy stick with the default of order="C" for both ravel and reshape, so no reordering occurs.
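
In code terms, the 2D path now amounts to something like the following (a minimal sketch, not the exact pandas source; astype_2d is an illustrative name):

import numpy as np

def astype_2d(arr: np.ndarray, dtype) -> np.ndarray:
    # numpy defaults: ravel() and reshape() both use order="C", so elements
    # are read and written in the same order and no reordering can occur
    flat = arr.ravel()
    return flat.astype(dtype).reshape(arr.shape)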

Benchmarks:

       before           after         ratio
     [87d78559]       [0c9d5496]
     <master>         <regr_astype>
-      9.34±0.7ms       4.95±0.9ms     0.53  stat_ops.FrameOps.time_op('kurt', 'int', 1)
-      5.08±0.8ms       2.63±0.1ms     0.52  stat_ops.FrameOps.time_op('skew', 'Int64', 0)
-      4.46±0.2ms       2.25±0.3ms     0.50  stat_ops.FrameOps.time_op('std', 'int', 1)
-      5.76±0.2ms       2.73±0.6ms     0.47  stat_ops.FrameOps.time_op('var', 'int', 1)
-      11.5±0.7ms       5.09±0.3ms     0.44  stat_ops.FrameOps.time_op('sem', 'int', 0)
-      1.36±0.2ms         600±50μs     0.44  stat_ops.FrameOps.time_op('prod', 'int', 0)
-        17.5±5ms       7.60±0.7ms     0.43  stat_ops.FrameOps.time_op('skew', 'int', 1)
-      8.64±0.7ms       3.37±0.4ms     0.39  stat_ops.FrameOps.time_op('kurt', 'int', 0)
-      9.03±0.6ms       3.21±0.2ms     0.36  stat_ops.FrameOps.time_op('skew', 'int', 0)
-      9.49±0.6ms       3.27±0.4ms     0.34  stat_ops.FrameOps.time_op('mad', 'int', 0)
-     1.31±0.08ms         420±20μs     0.32  stat_ops.FrameOps.time_op('sum', 'int', 1)
-      5.87±0.5ms       1.87±0.5ms     0.32  stat_ops.FrameOps.time_op('var', 'int', 0)
-     1.49±0.09ms         467±20μs     0.31  stat_ops.FrameOps.time_op('sum', 'int', 0)
-      4.44±0.3ms       1.39±0.1ms     0.31  stat_ops.FrameOps.time_op('std', 'int', 0)
-     1.90±0.06ms         569±70μs     0.30  stat_ops.FrameOps.time_op('mean', 'int', 0)
-      1.22±0.1ms          346±4μs     0.28  stat_ops.FrameOps.time_op('prod', 'int', 1)
-      1.86±0.1ms         519±60μs     0.28  stat_ops.FrameOps.time_op('mean', 'int', 1)

Will look to come up with a benchmark hitting this regression more squarely (which would help test if a more clever contiguity approach than always defaulting to "C" can improve performance).

EDIT: after the regression is fixed, it might be worth looking into reinstating something like the original approach. Some timings show the performance benefit we lose here:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: data = np.ones((2000, 2000), order="F")

In [4]: df = pd.DataFrame(data)

In [5]: %timeit df.astype(np.int32)
5.63 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # this pr
5.7 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # master

In [6]: data = np.ones((2000, 2000), order="C")

In [7]: df = pd.DataFrame(data)

In [8]: %timeit df.astype(np.int32)
26.8 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # this pr
5.63 ms ± 77.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # master

EDIT: This slowdown comes down to the ravel copying because the contiguity doesn't match. Let me know whether it's better to try to fix the slowdown shown above in this PR or to leave it as a follow-up for 1.4.
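
The copy can be seen directly at the numpy level (a small check; note it is an assumption about the 1.3.x block layout that pandas stores the block transposed, so a C-ordered DataFrame input arrives F-contiguous here):

import numpy as np

a_f = np.ones((2000, 2000), order="F")
assert np.shares_memory(a_f.ravel("K"), a_f)   # memory-order ravel: no copy
assert not np.shares_memory(a_f.ravel(), a_f)  # C-order ravel of F data: full copy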

@mzeitlin11 mzeitlin11 added the Dtype Conversions, Performance, Reduction Operations, and Regression labels Jul 10, 2021
@mzeitlin11 mzeitlin11 added this to the 1.3.1 milestone Jul 10, 2021
@@ -1094,14 +1093,12 @@ def astype_nansafe(
         The dtype was a datetime64/timedelta64 dtype, but it had no unit.
     """
     if arr.ndim > 1:
-        # Make sure we are doing non-copy ravel and reshape.
-        flags = arr.flags
-        flat = arr.ravel("K")
Member
won't this mean a perf hit in cases where arr is F-contiguous?

Member Author
@mzeitlin11 mzeitlin11 Jul 10, 2021

Yep (some discussion about that in the PR body). It would be a regression from master (but not from 1.2.x). The other fix for this regression is to just use arr.ravel(order), where order is the same as used for reshape.

That avoids the hit you're talking about, but doesn't fix the perf regression (although that perf regression is a bit suspicious; I was trying to profile it, but on larger data sizes the regression doesn't show up anymore).

EDIT: The regression does show up even with larger data; the perf hit comes from the reduction step, e.g. bottleneck.reduce.nanmean on a call like .mean(axis=0), or 'reduce' of 'numpy.ufunc' with .sum(axis=0). The only difference I can see in the data going in is F-contiguity on master vs C-contiguity on this PR (and 1.2.x).
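
To isolate where the hit comes from, one could time the reduction kernel directly on identical data in the two layouts (a sketch; assumes bottleneck is installed):

import numpy as np
import bottleneck as bn

a_c = np.ones((2000, 2000))          # C-contiguous
a_f = np.asfortranarray(a_c)         # same values, F-contiguous

# In IPython:
#   %timeit bn.nanmean(a_c, axis=1)
#   %timeit bn.nanmean(a_f, axis=1)
# Only the memory layout differs, so any timing gap is layout-driven.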

Contributor

does arr.ravel(order) fix this?

Member Author

Yeah, arr.ravel(order) fixes the regression in behavior, but not the performance regression (discussed a bit in the comments above). It seems to come down to computations like bottleneck.reduce.nanmean being slower on F-contiguous data.
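
For reference, that alternative would look something like this (an illustrative sketch, not the exact pandas source):

import numpy as np

def astype_2d_keep_order(arr: np.ndarray, dtype) -> np.ndarray:
    # match the ravel order to the reshape order so the round trip can't
    # reorder elements; F input stays F-contiguous (hence the perf hit)
    order = "F" if arr.flags.f_contiguous else "C"
    flat = arr.ravel(order)
    return flat.astype(dtype).reshape(arr.shape, order=order)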

Contributor

> Seems like it comes down to computations like bottleneck.reduce.nanmean being slower on F-contiguous data.

yes, for sure that would be true.

ok, so the question remains: does this go back to 1.2.x perf? are the F-contig routines hit? when do we hit these?

Member Author

Yep, exactly -> the benchmarks shown at the top of the PR are back to 1.2.x perf. With the perf regression on master, those benchmarks hit the reduction routines with F-contig data. With this PR and 1.2.x, we hit those routines with C-contig data.

Member
@simonjayhawkins simonjayhawkins left a comment

Thanks @mzeitlin11, generally lgtm. Some comments, but none are blockers for the bug fix and perf regression fix.

-        # Make sure we are doing non-copy ravel and reshape.
-        flags = arr.flags
-        flat = arr.ravel("K")
+        # TODO: try to use contiguity to avoid potentially copying here, see #42475
Member

maybe lose this comment; trying to do this caused the regression.

Member Author

have removed

         # error: Item "ExtensionArray" of "Union[ExtensionArray, ndarray]" has no
         # attribute "reshape"
-        return result.reshape(arr.shape, order=order)  # type: ignore[union-attr]
+        return result.reshape(arr.shape)  # type: ignore[union-attr]
Member

if we revert all of #38507, do we remove the need for an ignore? i.e. is there another issue lurking here?

Member Author

Reverting would remove the need for the ignore; in the previous logic the reshape was inside an isinstance(ndarray) check.

AFAICT there is no issue here though - the problematic case would be if astype_nansafe is called on a 2D ndarray and we try to convert to an ExtensionDtype. But based on pandas/core/generic.py, lines 5731 to 5738 in 87d7855:

elif is_extension_array_dtype(dtype) and self.ndim > 1:
    # GH 18099/22869: columnwise conversion to extension dtype
    # GH 24704: use iloc to handle duplicate column names
    # TODO(EA2D): special case not needed with 2D EAs
    results = [
        self.iloc[:, i].astype(dtype, copy=copy)
        for i in range(len(self.columns))
    ]
that should be impossible. So I think we could get rid of the ignore with an assert, if desired.

Member

sure. but let's leave that out of this PR, which is being backported

data = np.arange(16).reshape(4, 4)
df = DataFrame(data, dtype=np.intp)

result = df.iloc[:step1, :step2].astype("int16").astype(np.intp)
Member

using "step" as the parameter name is confusing in this part of the test

Member Author

thanks, have refactored test to clean that up

expected = df.iloc[:step1, :step2]
tm.assert_frame_equal(result, expected)

result = df.iloc[::step1, ::step2].astype("int16").astype(np.intp)
Member

ideally the result should just be the op under test; there are 2 astypes here.

Member Author

Makes sense, have changed up the assert to avoid the need for this

@jbrockmendel (Member)

im a bit confused as to what parts of this are about a regression[bug] vs a regression[perf]. @mzeitlin11 will this be obvious once i get some caffeine?

         # error: Item "ExtensionArray" of "Union[ExtensionArray, ndarray]" has no
         # attribute "reshape"
-        return result.reshape(arr.shape, order=order)  # type: ignore[union-attr]
Member

IIRC the point of the flags stuff was to a) avoid a copy in ravel and b) retain the same contiguity in the returned array. Is the problem that these goals aren't being achieved, or something else?

(im pretty sure i used the same flags-based pattern in the datetimelike arrays, so whatever fix is used here might need to be applied there too)

Member Author

For the behavior regression, what was happening was the following (a reproduction sketch follows the list):

  1. Data which is not contiguous gets passed in.
  2. Depending on the structure of this data, arr.ravel("K") may choose to index with F ordering.
  3. order = "F" if flags.f_contiguous else "C" will give C order, since the input is not contiguous.
  4. So the reshape will index with C ordering, changing the data ordering.
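
A minimal numpy-only reproduction of those steps (illustrative, not the pandas source):

import numpy as np

base = np.asfortranarray(np.arange(16).reshape(4, 4))
arr = base[::2, ::2]                  # non-contiguous view: [[0, 2], [8, 10]]

flags = arr.flags                     # neither C- nor F-contiguous
flat = arr.ravel("K")                 # memory order -> F-like: [0, 8, 2, 10]
order = "F" if flags.f_contiguous else "C"   # "C", since arr isn't contiguous

garbled = flat.reshape(arr.shape, order=order)
# garbled is [[0, 8], [2, 10]]: transposed relative to arr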

Member

so what if we change L1100 to something like order = "F" if flags.f_contiguous else "C" if flags.c_contiguous else None?

Member Author

reshape doesn't have a concept of None -> it defaults to C if no argument is passed; otherwise it must be one of C, F, A.

To be clear, the order argument for reshape doesn't have to do with contiguity - it has to do with the order in which elements are read from the input and written to the output (so specifying order C does not mean that the underlying input must be C-contiguous; it means that the input will be read with C ordering).
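
A small illustration of that point:

import numpy as np

a = np.arange(6).reshape(2, 3)   # C-contiguous: [[0, 1, 2], [3, 4, 5]]

# order controls how elements are read and written, not the input's layout
a.reshape(3, 2, order="C")       # [[0, 1], [2, 3], [4, 5]]
a.reshape(3, 2, order="F")       # [[0, 4], [3, 2], [1, 5]]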

Member

Got it, thanks for walking me through this

@mzeitlin11 mzeitlin11 (Member Author) commented Jul 13, 2021

> im a bit confused as to what parts of this are about a regression[bug] vs a regression[perf]. @mzeitlin11 will this be obvious once i get some caffeine?

Haha, I'll try to clarify :)

  • Bug: The ravel and reshape could happen with different orders, garbling the data. This can be fixed by specifying the same order in ravel as used for reshape (described in more detail below your other comment). However, this will not fix the perf regression.
  • Perf: With the behavior fix above, we can end up with F-contiguous data, depending on what contiguity goes into the astype call. For the reduction regressions at the top of the issue, this F-contiguity is what caused the slowdown - the routines are just slower on F-contig data. Just leaving the default order ensures C-contiguity, fixing the perf regression (see the sketch below).
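
A quick check of that last point (illustrative numpy only):

import numpy as np

arr = np.asfortranarray(np.ones((4, 4)))   # F-contiguous input

# default-order ravel/reshape always yields a C-contiguous result,
# whatever the input layout, so downstream reductions see C-contiguous data
out = arr.ravel().astype(np.int32).reshape(arr.shape)
assert out.flags["C_CONTIGUOUS"]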

@jreback jreback (Contributor) commented Jul 13, 2021

ok thanks for the explanation @mzeitlin11

@jreback jreback merged commit 6455912 into pandas-dev:master Jul 13, 2021
@jreback jreback (Contributor) commented Jul 13, 2021

@meeseeksdev backport 1.3.x

lumberbot-app bot commented Jul 13, 2021

Something went wrong ... Please have a look at my logs.

@mzeitlin11 mzeitlin11 deleted the regr_astype branch July 14, 2021 00:36
mzeitlin11 added a commit to mzeitlin11/pandas that referenced this pull request Jul 14, 2021
mzeitlin11 added a commit to mzeitlin11/pandas that referenced this pull request Jul 14, 2021
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021