PERF/REGR: astype changing order of some 2d data #42475


Merged: 9 commits, merged Jul 13, 2021
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.3.1.rst
@@ -18,6 +18,8 @@ Fixed regressions
 - :class:`DataFrame` constructed with an older version of pandas could not be unpickled (:issue:`42345`)
 - Performance regression in constructing a :class:`DataFrame` from a dictionary of dictionaries (:issue:`42338`)
 - Fixed regression in :meth:`DataFrame.agg` dropping values when the DataFrame had an Extension Array dtype, a duplicate index, and ``axis=1`` (:issue:`42380`)
+- Fixed regression in :meth:`DataFrame.astype` changing the order of noncontiguous data (:issue:`42396`)
+- Performance regression in :class:`DataFrame` in reduction operations requiring casting such as :meth:`DataFrame.mean` on integer data (:issue:`38592`)
 - Performance regression in :meth:`DataFrame.to_dict` and :meth:`Series.to_dict` when the ``orient`` argument is one of "records", "dict", or "split" (:issue:`42352`)
 - Fixed regression in indexing with a ``list`` subclass incorrectly raising ``TypeError`` (:issue:`42433`, :issue:`42461`)
 -
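The regression being fixed can be reproduced from the user's side with a noncontiguous slice; a minimal sketch of the fixed behavior (on pandas 1.3.0 the cast could silently reorder values):

```python
import numpy as np
import pandas as pd

# A strided slice of a frame is backed by a noncontiguous view of the data.
df = pd.DataFrame(np.arange(16).reshape(4, 4))
view = df.iloc[::2, ::2]

# With this fix, casting preserves the element order of the view.
result = view.astype("int16")
assert (result.to_numpy() == view.to_numpy()).all()
```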
8 changes: 2 additions & 6 deletions pandas/core/dtypes/cast.py
@@ -14,7 +14,6 @@
 from typing import (
     TYPE_CHECKING,
     Any,
-    Literal,
     Sized,
     TypeVar,
     cast,
@@ -1093,14 +1092,11 @@ def astype_nansafe(
     The dtype was a datetime64/timedelta64 dtype, but it had no unit.
     """
     if arr.ndim > 1:
-        # Make sure we are doing non-copy ravel and reshape.
-        flags = arr.flags
-        flat = arr.ravel("K")
Member: won't this mean a perf hit in cases where arr is F-contiguous?

mzeitlin11 (Member, Author), Jul 10, 2021:
Yep (some discussion about that in the PR body). It would be a regression from master (but not from 1.2.x). The other fix for this regression is to just use arr.ravel(order), where order is the same as used for the reshape.

That avoids the hit you're talking about, but doesn't fix the perf regression (although that perf regression is a bit suspicious; I was trying to profile it, but on larger data sizes the regression doesn't show up anymore).

EDIT: The regression does show up even with larger data; the perf hit comes from the reduction step, e.g. bottleneck.reduce.nanmean on a call like .mean(axis=0), or 'reduce' of 'numpy.ufunc' with .sum(axis=0). The only difference I can see in the data going in is F-contiguity on master vs C-contiguity on this PR (and 1.2.x).

Contributor:
does arr.ravel(order) fix this?

mzeitlin11 (Member, Author):
Yeah, arr.ravel(order) fixes the regression in behavior, but not the performance regression (discussed a bit in the comments above). Seems like it comes down to computations like bottleneck.reduce.nanmean being slower on F-contiguous data.

Contributor:
> Seems like it comes down to computations like bottleneck.reduce.nanmean being slower on F-contiguous data.

Yes, for sure that would be true.

OK, so the question remains: does this go back to 1.2.x perf? Are the F-contig routines hit? When do we hit these?

mzeitlin11 (Member, Author):
Yep, exactly -> the benchmarks shown at the top of the PR are back to 1.2.x perf. With the perf regression on master, those benchmarks hit the reduction routines with F-contig data. With this PR and 1.2.x, we hit those routines with C-contig data.
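The layout effect discussed in this thread can be probed directly. A rough sketch, using np.nanmean as a stand-in for the bottleneck routine (timings are machine- and library-dependent, so no direction is asserted here):

```python
import timeit

import numpy as np

c_arr = np.random.rand(1000, 1000)   # C-contiguous
f_arr = np.asfortranarray(c_arr)     # same values, F-contiguous

# A column reduction walks memory with a different access pattern
# depending on layout, which is where the perf gap can come from.
t_c = timeit.timeit(lambda: np.nanmean(c_arr, axis=0), number=20)
t_f = timeit.timeit(lambda: np.nanmean(f_arr, axis=0), number=20)
print(f"C-contiguous: {t_c:.4f}s  F-contiguous: {t_f:.4f}s")
```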

+        flat = arr.ravel()
         result = astype_nansafe(flat, dtype, copy=copy, skipna=skipna)
-        order: Literal["C", "F"] = "F" if flags.f_contiguous else "C"
         # error: Item "ExtensionArray" of "Union[ExtensionArray, ndarray]" has no
         # attribute "reshape"
-        return result.reshape(arr.shape, order=order)  # type: ignore[union-attr]
Member:
IIRC the point of the flags stuff was to 1) avoid a copy in ravel and 2) retain the same contiguity in the returned array. Is the problem that these goals aren't being achieved, or something else?

(I'm pretty sure I used the same flags-based pattern in the datetimelike arrays, so whatever fix is used here might need to be applied there too.)

mzeitlin11 (Member, Author):
For the behavior regression, what was happening was the following:

  1. Data which is not contiguous gets passed in.
  2. Depending on the structure of this data, arr.ravel("K") may choose to index with F ordering.
  3. order = "F" if flags.f_contiguous else "C" will give C order, since the input is not contiguous.
  4. So the reshape will index with C ordering, changing the data ordering.
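These steps can be reproduced with plain NumPy (the array values here are just illustrative):

```python
import numpy as np

base = np.arange(16).reshape(4, 4)
view = base.T[::2, ::2]        # noncontiguous view: [[0, 8], [2, 10]]
assert not (view.flags.c_contiguous or view.flags.f_contiguous)

# Step 2: ravel("K") reads elements in memory order, not C index order.
flat_k = view.ravel("K")       # [0, 2, 8, 10]

# Steps 3-4: flags say "not F-contiguous", so the reshape uses C order,
# which no longer matches how the elements were read out.
bad = flat_k.reshape(view.shape, order="C")
assert not np.array_equal(bad, view)   # the data got reordered

# The fix: a plain C-order ravel pairs correctly with a C-order reshape.
good = view.ravel().reshape(view.shape)
assert np.array_equal(good, view)
```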

Member:
so what if we change L1100 to something like order = "F" if flags.f_contiguous else "C" if flags.c_contiguous else None?

mzeitlin11 (Member, Author):
reshape doesn't have a concept of None -> it defaults to C if no argument is passed, otherwise it accepts one of C, F, A.

To be clear, the order argument for reshape doesn't have to do with contiguity - it has to do with the order in which elements are read from the input and written to the output (so specifying order C does not mean that the underlying input must be C-contiguous; it means that the input will be read with C ordering).
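That distinction between read order and memory layout can be seen directly:

```python
import numpy as np

# An F-contiguous array whose *logical* values are 0..5 in row-major order.
f_arr = np.asfortranarray(np.arange(6).reshape(2, 3))
assert f_arr.flags.f_contiguous

# order="C" reads the input in row-major index order regardless of layout:
assert f_arr.reshape(6, order="C").tolist() == [0, 1, 2, 3, 4, 5]

# order="F" reads in column-major index order instead:
assert f_arr.reshape(6, order="F").tolist() == [0, 3, 1, 4, 2, 5]
```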

Member:
Got it, thanks for walking me through this

+        return result.reshape(arr.shape)  # type: ignore[union-attr]
Member:
if we revert all of #38507, do we remove the need for an ignore? i.e. is there another issue lurking here.

mzeitlin11 (Member, Author):
Reverting would remove the need for the ignore; in the previous logic the reshape was inside an isinstance(ndarray) check.

AFAICT there is no issue here though - the problematic case would be if astype_nansafe is called on a 2D ndarray and we try to convert to an ExtensionDtype. But based on

pandas/core/generic.py, lines 5731 to 5738 at 87d7855:
        elif is_extension_array_dtype(dtype) and self.ndim > 1:
            # GH 18099/22869: columnwise conversion to extension dtype
            # GH 24704: use iloc to handle duplicate column names
            # TODO(EA2D): special case not needed with 2D EAs
            results = [
                self.iloc[:, i].astype(dtype, copy=copy)
                for i in range(len(self.columns))
            ]
that should be impossible. So I think we could get rid of the ignore with an assert if desired
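A small sketch of that columnwise path from the caller's side (using the nullable Int64 extension dtype as an example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2))

# Casting a whole frame to an extension dtype is split into per-column
# casts by the generic.py branch quoted above, so astype_nansafe is never
# handed a 2D ndarray together with an ExtensionDtype target.
result = df.astype("Int64")
assert all(str(dtype) == "Int64" for dtype in result.dtypes)
```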

Member:
Sure, but let's leave that out of this PR, which is being backported.


     # We get here with 0-dim from sparse
     arr = np.atleast_1d(arr)
20 changes: 20 additions & 0 deletions pandas/tests/frame/methods/test_astype.py
@@ -670,6 +670,26 @@ def test_astype_bytes(self):
         result = DataFrame(["foo", "bar", "baz"]).astype(bytes)
         assert result.dtypes[0] == np.dtype("S3")
 
+    @pytest.mark.parametrize(
+        "index_slice",
+        [
+            np.s_[:2, :2],
+            np.s_[:1, :2],
+            np.s_[:2, :1],
+            np.s_[::2, ::2],
+            np.s_[::1, ::2],
+            np.s_[::2, ::1],
+        ],
+    )
+    def test_astype_noncontiguous(self, index_slice):
+        # GH#42396
+        data = np.arange(16).reshape(4, 4)
+        df = DataFrame(data)
+
+        result = df.iloc[index_slice].astype("int16")
+        expected = df.iloc[index_slice]
+        tm.assert_frame_equal(result, expected, check_dtype=False)
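The np.s_ helper used in the parametrization simply builds reusable slice tuples, which is why one parametrized value can be applied to the frame later:

```python
import numpy as np

# np.s_[...] returns the slice objects that would be passed to __getitem__.
sl = np.s_[::2, ::2]
assert sl == (slice(None, None, 2), slice(None, None, 2))

arr = np.arange(16).reshape(4, 4)
assert (arr[sl] == arr[::2, ::2]).all()
```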


class TestAstypeCategorical:
def test_astype_from_categorical3(self):