
PERF/REGR: astype changing order of some 2d data #42475


Merged
jreback merged 9 commits into pandas-dev:master from mzeitlin11:regr_astype on Jul 13, 2021

Conversation

@mzeitlin11 mzeitlin11 (Member) commented Jul 10, 2021

This just lets numpy stick with the default of order="C" for both ravel and reshape, so no reordering occurs.
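
In code terms, the 2D path now amounts to something like the following (a minimal sketch, not the exact pandas source; astype_2d is an illustrative name):

import numpy as np

def astype_2d(arr: np.ndarray, dtype) -> np.ndarray:
    # numpy defaults: ravel() and reshape() both use order="C", so elements
    # are read and written in the same order and no reordering can occur
    flat = arr.ravel()
    return flat.astype(dtype).reshape(arr.shape)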

Benchmarks:

       before           after         ratio
     [87d78559]       [0c9d5496]
     <master>         <regr_astype>
-      9.34±0.7ms       4.95±0.9ms     0.53  stat_ops.FrameOps.time_op('kurt', 'int', 1)
-      5.08±0.8ms       2.63±0.1ms     0.52  stat_ops.FrameOps.time_op('skew', 'Int64', 0)
-      4.46±0.2ms       2.25±0.3ms     0.50  stat_ops.FrameOps.time_op('std', 'int', 1)
-      5.76±0.2ms       2.73±0.6ms     0.47  stat_ops.FrameOps.time_op('var', 'int', 1)
-      11.5±0.7ms       5.09±0.3ms     0.44  stat_ops.FrameOps.time_op('sem', 'int', 0)
-      1.36±0.2ms         600±50μs     0.44  stat_ops.FrameOps.time_op('prod', 'int', 0)
-        17.5±5ms       7.60±0.7ms     0.43  stat_ops.FrameOps.time_op('skew', 'int', 1)
-      8.64±0.7ms       3.37±0.4ms     0.39  stat_ops.FrameOps.time_op('kurt', 'int', 0)
-      9.03±0.6ms       3.21±0.2ms     0.36  stat_ops.FrameOps.time_op('skew', 'int', 0)
-      9.49±0.6ms       3.27±0.4ms     0.34  stat_ops.FrameOps.time_op('mad', 'int', 0)
-     1.31±0.08ms         420±20μs     0.32  stat_ops.FrameOps.time_op('sum', 'int', 1)
-      5.87±0.5ms       1.87±0.5ms     0.32  stat_ops.FrameOps.time_op('var', 'int', 0)
-     1.49±0.09ms         467±20μs     0.31  stat_ops.FrameOps.time_op('sum', 'int', 0)
-      4.44±0.3ms       1.39±0.1ms     0.31  stat_ops.FrameOps.time_op('std', 'int', 0)
-     1.90±0.06ms         569±70μs     0.30  stat_ops.FrameOps.time_op('mean', 'int', 0)
-      1.22±0.1ms          346±4μs     0.28  stat_ops.FrameOps.time_op('prod', 'int', 1)
-      1.86±0.1ms         519±60μs     0.28  stat_ops.FrameOps.time_op('mean', 'int', 1)

Will look to come up with a benchmark hitting this regression more squarely (which would help test if a more clever contiguity approach than always defaulting to "C" can improve performance).

EDIT: after the regression is fixed, it might be worth looking into reinstating something like the original approach. Some timings show the performance benefit we lose here:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: data = np.ones((2000, 2000), order="F")

In [4]: df = pd.DataFrame(data)

In [5]: %timeit df.astype(np.int32)
5.63 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # this pr
5.7 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # master

In [6]: data = np.ones((2000, 2000), order="C")

In [7]: df = pd.DataFrame(data)

In [8]: %timeit df.astype(np.int32)
26.8 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # this pr
5.63 ms ± 77.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # master

EDIT: This slowdown comes down to the ravel copying because the contiguity doesn't match. Let me know whether it's better to try to fix the slowdown shown above in this PR or to leave it as a follow-up for 1.4.
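
The copy can be seen directly at the numpy level (a small check; note it is an assumption about the 1.3.x block layout that pandas stores the block transposed, so a C-ordered DataFrame input arrives F-contiguous here):

import numpy as np

a_f = np.ones((2000, 2000), order="F")
assert np.shares_memory(a_f.ravel("K"), a_f)   # memory-order ravel: no copy
assert not np.shares_memory(a_f.ravel(), a_f)  # C-order ravel of F data: full copy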

@mzeitlin11 mzeitlin11 added the Dtype Conversions, Performance, Reduction Operations, and Regression labels Jul 10, 2021
@mzeitlin11 mzeitlin11 added this to the 1.3.1 milestone Jul 10, 2021
@@ -1094,14 +1093,12 @@ def astype_nansafe(
         The dtype was a datetime64/timedelta64 dtype, but it had no unit.
     """
     if arr.ndim > 1:
-        # Make sure we are doing non-copy ravel and reshape.
-        flags = arr.flags
-        flat = arr.ravel("K")
Member
won't this mean a perf hit in cases where arr is F-contiguous?

Member Author
@mzeitlin11 mzeitlin11 Jul 10, 2021

Yep (some discussion about that in the PR body). It would be a regression from master (but not from 1.2.x). The other fix for this regression is to just use arr.ravel(order), where order is the same as used for reshape.

That avoids the hit you're talking about, but doesn't fix the perf regression (although that perf regression is a bit suspicious; I was trying to profile it, but on larger data sizes the regression doesn't show up anymore).

EDIT: The regression does show up even with larger data; the perf hit comes from the reduction step, e.g. bottleneck.reduce.nanmean on a call like .mean(axis=0), or 'reduce' of 'numpy.ufunc' with .sum(axis=0). The only difference I can see in the data going in is F-contiguity on master vs C-contiguity on this PR (and 1.2.x).
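
To isolate where the hit comes from, one could time the reduction kernel directly on identical data in the two layouts (a sketch; assumes bottleneck is installed):

import numpy as np
import bottleneck as bn

a_c = np.ones((2000, 2000))          # C-contiguous
a_f = np.asfortranarray(a_c)         # same values, F-contiguous

# In IPython:
#   %timeit bn.nanmean(a_c, axis=1)
#   %timeit bn.nanmean(a_f, axis=1)
# Only the memory layout differs, so any timing gap is layout-driven.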

Contributor

does arr.ravel(order) fix this?

Member Author

Yeah, arr.ravel(order) fixes the regression in behavior, but not the performance regression (discussed a bit in the comments above). It seems to come down to computations like bottleneck.reduce.nanmean being slower on F-contiguous data.
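
For reference, that alternative would look something like this (an illustrative sketch, not the exact pandas source):

import numpy as np

def astype_2d_keep_order(arr: np.ndarray, dtype) -> np.ndarray:
    # match the ravel order to the reshape order so the round trip can't
    # reorder elements; F input stays F-contiguous (hence the perf hit)
    order = "F" if arr.flags.f_contiguous else "C"
    flat = arr.ravel(order)
    return flat.astype(dtype).reshape(arr.shape, order=order)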

Contributor

> Seems like it comes down to computations like bottleneck.reduce.nanmean being slower on F-contiguous data.

yes, for sure that would be true.

ok, so the question remains: does this go back to 1.2.x perf? are the F-contig routines hit? when do we hit these?

Member Author

Yep, exactly -> the benchmarks shown at the top of the PR are back to 1.2.x perf. With the perf regression on master, those benchmarks hit the reduction routines with F-contig data. With this PR and 1.2.x, we hit those routines with C-contig data.

Member
@simonjayhawkins simonjayhawkins left a comment

Thanks @mzeitlin11, generally lgtm. Some comments, but none are blockers for the bug fix and perf regression fix.

-        # Make sure we are doing non-copy ravel and reshape.
-        flags = arr.flags
-        flat = arr.ravel("K")
+        # TODO: try to use contiguity to avoid potentially copying here, see #42475
Member

maybe lose this comment; trying to do this caused the regression.

Member Author

have removed

         # error: Item "ExtensionArray" of "Union[ExtensionArray, ndarray]" has no
         # attribute "reshape"
-        return result.reshape(arr.shape, order=order)  # type: ignore[union-attr]
+        return result.reshape(arr.shape)  # type: ignore[union-attr]
Member

if we revert all of #38507, do we remove the need for an ignore? i.e. is there another issue lurking here?

Member Author

Reverting would remove the need for the ignore; in the previous logic the reshape was inside an isinstance(ndarray) check.

AFAICT there is no issue here though - the problematic case would be if astype_nansafe is called on a 2D ndarray and we try to convert to an ExtensionDtype. But based on pandas/core/generic.py, lines 5731 to 5738 in 87d7855:

elif is_extension_array_dtype(dtype) and self.ndim > 1:
    # GH 18099/22869: columnwise conversion to extension dtype
    # GH 24704: use iloc to handle duplicate column names
    # TODO(EA2D): special case not needed with 2D EAs
    results = [
        self.iloc[:, i].astype(dtype, copy=copy)
        for i in range(len(self.columns))
    ]
that should be impossible. So I think we could get rid of the ignore with an assert, if desired.

Member

sure. but let's leave that out of this PR, which is being backported

data = np.arange(16).reshape(4, 4)
df = DataFrame(data, dtype=np.intp)

result = df.iloc[:step1, :step2].astype("int16").astype(np.intp)
Member

using "step" as the parameter name is confusing in this part of the test

Member Author

thanks, have refactored test to clean that up

expected = df.iloc[:step1, :step2]
tm.assert_frame_equal(result, expected)

result = df.iloc[::step1, ::step2].astype("int16").astype(np.intp)
Member

ideally the result should just be the op under test; there are 2 astypes here.

Member Author

Makes sense, have changed up the assert to avoid the need for this

@jbrockmendel (Member)

im a bit confused as to what parts of this are about a regression[bug] vs a regression[perf]. @mzeitlin11 will this be obvious once i get some caffeine?

         # error: Item "ExtensionArray" of "Union[ExtensionArray, ndarray]" has no
         # attribute "reshape"
-        return result.reshape(arr.shape, order=order)  # type: ignore[union-attr]
Member

IIRC the point of the flags stuff was to a) avoid a copy in ravel and b) retain the same contiguity in the returned array. Is the problem that these goals aren't being achieved, or something else?

(im pretty sure i used the same flags-based pattern in the datetimelike arrays, so whatever fix is used here might need to be applied there too)

Member Author

For the behavior regression, what was happening was the following (a reproduction sketch follows the list):

  1. Data which is not contiguous gets passed in.
  2. Depending on the structure of this data, arr.ravel("K") may choose to index with F ordering.
  3. order = "F" if flags.f_contiguous else "C" will give C order, since the input is not contiguous.
  4. So the reshape will index with C ordering, changing the data ordering.
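
A minimal numpy-only reproduction of those steps (illustrative, not the pandas source):

import numpy as np

base = np.asfortranarray(np.arange(16).reshape(4, 4))
arr = base[::2, ::2]                  # non-contiguous view: [[0, 2], [8, 10]]

flags = arr.flags                     # neither C- nor F-contiguous
flat = arr.ravel("K")                 # memory order -> F-like: [0, 8, 2, 10]
order = "F" if flags.f_contiguous else "C"   # "C", since arr isn't contiguous

garbled = flat.reshape(arr.shape, order=order)
# garbled is [[0, 8], [2, 10]]: transposed relative to arr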

Member

so what if we change L1100 to something like order = "F" if flags.f_contiguous else "C" if flags.c_contiguous else None?

Member Author

reshape doesn't have a concept of None -> it defaults to C if no argument is passed; otherwise it must be one of C, F, A.

To be clear, the order argument for reshape doesn't have to do with contiguity - it has to do with the order in which elements are read from the input and written to the output (so specifying order C does not mean that the underlying input must be C-contiguous; it means that the input will be read with C ordering).
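
A small illustration of that point:

import numpy as np

a = np.arange(6).reshape(2, 3)   # C-contiguous: [[0, 1, 2], [3, 4, 5]]

# order controls how elements are read and written, not the input's layout
a.reshape(3, 2, order="C")       # [[0, 1], [2, 3], [4, 5]]
a.reshape(3, 2, order="F")       # [[0, 4], [3, 2], [1, 5]]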

Member

Got it, thanks for walking me through this

@mzeitlin11 mzeitlin11 (Member Author) commented Jul 13, 2021

> im a bit confused as to what parts of this are about a regression[bug] vs a regression[perf]. @mzeitlin11 will this be obvious once i get some caffeine?

Haha, I'll try to clarify :)

  • Bug: The ravel and reshape could happen with different orders, garbling the data. This can be fixed by specifying the same order in ravel as used for reshape (described in more detail below your other comment). However, this will not fix the perf regression.
  • Perf: With the behavior fix above, we can end up with F-contiguous data, depending on what contiguity goes into the astype call. For the reduction regressions at the top of the issue, this F-contiguity is what caused the slowdown - the routines are just slower on F-contig data. Just leaving the default order ensures C-contiguity, fixing the perf regression (see the sketch below).
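
A quick check of that last point (illustrative numpy only):

import numpy as np

arr = np.asfortranarray(np.ones((4, 4)))   # F-contiguous input

# default-order ravel/reshape always yields a C-contiguous result,
# whatever the input layout, so downstream reductions see C-contiguous data
out = arr.ravel().astype(np.int32).reshape(arr.shape)
assert out.flags["C_CONTIGUOUS"]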

@jreback jreback (Contributor) commented Jul 13, 2021

ok thanks for the explanation @mzeitlin11

@jreback jreback merged commit 6455912 into pandas-dev:master Jul 13, 2021
@jreback jreback (Contributor) commented Jul 13, 2021

@meeseeksdev backport 1.3.x

lumberbot-app bot commented Jul 13, 2021

Something went wrong ... Please have a look at my logs.

@mzeitlin11 mzeitlin11 deleted the regr_astype branch July 14, 2021 00:36
mzeitlin11 added a commit to mzeitlin11/pandas that referenced this pull request Jul 14, 2021
mzeitlin11 added a commit to mzeitlin11/pandas that referenced this pull request Jul 14, 2021
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021