Skip to content

ENH: Add lazy copy to astype #50802

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
Feb 9, 2023
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions pandas/core/dtypes/astype.py
Original file line number Diff line number Diff line change
Expand Up @@ -271,6 +271,10 @@ def astype_is_view(dtype: DtypeObj, new_dtype: DtypeObj) -> bool:
if is_string_dtype(dtype) and is_string_dtype(new_dtype):
return True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think this is not always guaranteed to be a view? (but of course safer to return True than incorrectly assume it is always a copy)

is_string_dtype also returns true for generic "object", and converting object to string is not necessarily no-copy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this branch, I see the following wrong behaviour in case you have mixed objects:

In [1]: s = pd.Series(['a', 'b', 1])

In [2]: s2 = s.astype("string")

In [3]: s.values
Out[3]: array(['a', 'b', 1], dtype=object)

In [4]: pd.options.mode.copy_on_write = True

In [5]: s = pd.Series(['a', 'b', 1])

In [6]: s2 = s.astype("string")

In [7]: s.values
Out[7]: array(['a', 'b', '1'], dtype=object)

Because of not taking a copy, the ensure_string_array actually mutates the original values in place.

For the case of "object -> some extension dtype" casting, we should probably always do copy=True, because I don't think we can rely on _from_sequence to do the correct thing (although for this specific case I also don't think that StringArray._from_sequence(arr, copy=False) should mutate arr in place. I would expect to only not make a copy if it's not needed, i.e. if it can just view the original data and no modification needs to be done)

Copy link
Member Author

@phofl phofl Jan 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #51073 to address the inplace modification.

You are correct, it is not guaranteed to be a no-copy op, but it could be and I couldn't figure out a more precise check. We can still optimise in a follow-up to get this stricter, for now I was aiming to get as many cases as possible without making it overly complex

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can still optimise in a follow-up to get this stricter, for now I was aiming to get as many cases as possible without making it overly complex

Can you add a comment explaining the True and to note something like that?


elif is_object_dtype(dtype) and new_dtype.kind == "O":
# When the underlying array has dtype object, we don't have to make a copy
return True
Comment on lines +275 to +277
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our own extension types fall in this category? (eg casting object column of Period objects to period dtype, although this specific example is actually never a view)

Rather for a follow-up, would it be worth to have some more custom logic here for our own EAs? Both IntervalDtype and PeriodDtype have kind of O, but I think neither of them can ever cast from object dtype without making a copy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep would prefer to do this as a follow up with specific tests


elif is_string_dtype(dtype) or is_string_dtype(new_dtype):
return False

Expand All @@ -280,6 +284,8 @@ def astype_is_view(dtype: DtypeObj, new_dtype: DtypeObj) -> bool:
elif getattr(dtype, "numpy_dtype", dtype) == getattr(
new_dtype, "numpy_dtype", new_dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for our masked arrays? If so, can you add some comments explaining that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyarrow as well, both don't copy

):
# If underlying numpy dtype is the same, no copy is made, e.g.
# int64 -> Int64 or int64[pyarrow]
return True

return False
6 changes: 1 addition & 5 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -6302,11 +6302,7 @@ def astype(
for i, (col_name, col) in enumerate(self.items()):
cdt = dtype_ser.iat[i]
if isna(cdt):
if using_copy_on_write():
# Make a shallow copy even if copy=False for CoW
res_col = col.copy(deep=copy)
else:
res_col = col if copy is False else col.copy()
res_col = col.copy(deep=copy)
else:
try:
res_col = col.astype(dtype=cdt, copy=copy, errors=errors)
Expand Down
11 changes: 11 additions & 0 deletions pandas/tests/arrays/test_array.py
Original file line number Diff line number Diff line change
Expand Up @@ -421,3 +421,14 @@ def test_array_to_numpy_na():
result = arr.to_numpy(na_value=True, dtype=bool)
expected = np.array([True, True])
tm.assert_numpy_array_equal(result, expected)


def test_array_copy_on_write(using_copy_on_write):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is about pd.array(..) I think, so I would move it either to the decimal tests (extension/decimal/test_decimal), or to the other copy_view astype tests and import the DecimalDtype there

df = pd.DataFrame({"a": [decimal.Decimal(2), decimal.Decimal(3)]}, dtype="object")
df2 = df.astype(DecimalDtype())
df.iloc[0, 0] = 0
if using_copy_on_write:
expected = pd.DataFrame(
{"a": [decimal.Decimal(2), decimal.Decimal(3)]}, dtype=DecimalDtype()
)
tm.assert_equal(df2.values, expected.values)
8 changes: 5 additions & 3 deletions pandas/tests/copy_view/test_constructors.py
Original file line number Diff line number Diff line change
@@ -1,18 +1,20 @@
import numpy as np
import pytest

from pandas import Series

# -----------------------------------------------------------------------------
# Copy/view behaviour for Series / DataFrame constructors


def test_series_from_series(using_copy_on_write):
@pytest.mark.parametrize("dtype", [None, "int64"])
def test_series_from_series(dtype, using_copy_on_write):
# Case: constructing a Series from another Series object follows CoW rules:
# a new object is returned and thus mutations are not propagated
ser = Series([1, 2, 3], name="name")

# default is copy=False -> new Series is a shallow copy / view of original
result = Series(ser)
result = Series(ser, dtype=dtype)

# the shallow copy still shares memory
assert np.shares_memory(ser.values, result.values)
Expand All @@ -34,7 +36,7 @@ def test_series_from_series(using_copy_on_write):
assert np.shares_memory(ser.values, result.values)

# the same when modifying the parent
result = Series(ser)
result = Series(ser, dtype=dtype)

if using_copy_on_write:
# mutating original doesn't mutate new series
Expand Down
22 changes: 16 additions & 6 deletions pandas/tests/copy_view/test_methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -545,6 +545,11 @@ def test_astype_single_dtype(using_copy_on_write):
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c"))
tm.assert_frame_equal(df, df_orig)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tm.assert_frame_equal(df, df_orig)
tm.assert_frame_equal(df, df_orig)
# mutating parent also doesn't update result
df2 = df.astype("float64")
df.iloc[0, 2] = 5.5
tm.assert_frame_equal(df2, df_orig.astype("float64")

We don't test this consistently for all methods here, but astype seems a sufficiently complicated case (not just based on a copy(deep=False) under the hood) that it's probably good to be complete.

(same for the ones below)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep makes sense


# mutating parent also doesn't update result
df2 = df.astype("float64")
df.iloc[0, 2] = 5.5
tm.assert_frame_equal(df2, df_orig.astype("float64"))


@pytest.mark.parametrize("dtype", ["int64", "Int64"])
@pytest.mark.parametrize("new_dtype", ["int64", "Int64", "int64[pyarrow]"])
Expand All @@ -558,8 +563,6 @@ def test_astype_avoids_copy(using_copy_on_write, dtype, new_dtype):
if using_copy_on_write:
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
else:
if new_dtype == "int64[pyarrow]":
pytest.skip("Does not make a copy")
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

# mutating df2 triggers a copy-on-write for that column/block
Expand All @@ -568,6 +571,11 @@ def test_astype_avoids_copy(using_copy_on_write, dtype, new_dtype):
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
tm.assert_frame_equal(df, df_orig)

# mutating parent also doesn't update result
df2 = df.astype(new_dtype)
df.iloc[0, 0] = 100
tm.assert_frame_equal(df2, df_orig.astype(new_dtype))


@pytest.mark.parametrize("dtype", ["float64", "int32", "Int32", "int32[pyarrow]"])
def test_astype_different_target_dtype(using_copy_on_write, dtype):
Expand All @@ -577,14 +585,16 @@ def test_astype_different_target_dtype(using_copy_on_write, dtype):
df_orig = df.copy()
df2 = df.astype(dtype)

if using_copy_on_write:
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
else:
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
assert df2._mgr._has_no_reference(0)

Maybe also explicitly check that df2._mgr._has_no_reference(0)? Because the shares_memory and iloc setitem test doesn't ensure we didn't incorrectly trigger CoW unnecessarily

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is a very good check, would have made sure that I did not miss the numpy dtype thing below


df2.iloc[0, 0] = 5
tm.assert_frame_equal(df, df_orig)

# mutating parent also doesn't update result
df2 = df.astype(dtype)
df.iloc[0, 0] = 100
tm.assert_frame_equal(df2, df_orig.astype(dtype))


@pytest.mark.parametrize(
"dtype, new_dtype", [("object", "string"), ("string", "object")]
Expand Down