ENH: Add lazy copy to astype #50802

Merged · 28 commits · Feb 9, 2023

1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
@@ -224,6 +224,7 @@ Copy-on-Write improvements
- :meth:`DataFrame.truncate`
- :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize`
- :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects`
- :meth:`DataFrame.astype` / :meth:`Series.astype`
- :func:`concat`

These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution.
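As an illustration of the new behaviour for astype (a minimal sketch based on the tests added in this PR; the exact session output is an assumption):

In [1]: import numpy as np; import pandas as pd

In [2]: pd.options.mode.copy_on_write = True

In [3]: s = pd.Series([1.5, 2.5, 3.5])

In [4]: s2 = s.astype("float64")   # same dtype: the data is not copied eagerly

In [5]: np.shares_memory(s.values, s2.values)
Out[5]: True

In [6]: s2.iloc[0] = 9.9           # the first write triggers the deferred copy

In [7]: np.shares_memory(s.values, s2.values)
Out[7]: False
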
53 changes: 53 additions & 0 deletions pandas/core/dtypes/astype.py
@@ -26,6 +26,7 @@
is_dtype_equal,
is_integer_dtype,
is_object_dtype,
is_string_dtype,
is_timedelta64_dtype,
pandas_dtype,
)
@@ -246,3 +247,55 @@ def astype_array_safe(
raise

return new_values


def astype_is_view(dtype: DtypeObj, new_dtype: DtypeObj) -> bool:
"""Checks if astype avoided copying the data.

Parameters
----------
dtype : Original dtype
new_dtype : Target dtype

Returns
-------
True if new data is a view or not guaranteed to be a copy, False otherwise
"""
if dtype == new_dtype:
return True

elif isinstance(dtype, np.dtype) and isinstance(new_dtype, np.dtype):
# Only equal numpy dtypes avoid a copy
return False

elif is_string_dtype(dtype) and is_string_dtype(new_dtype):
# Potentially! a view when converting from object to string
return True

Member:

I would think this is not always guaranteed to be a view? (but of course safer to return True than incorrectly assume it is always a copy)

is_string_dtype also returns True for generic "object", and converting object to string is not necessarily no-copy.

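For context on why this branch is conservative, is_string_dtype also matches plain object dtype (a minimal check, assuming pandas 2.0 behaviour):

In [1]: import numpy as np
In [2]: from pandas.api.types import is_string_dtype
In [3]: is_string_dtype(np.dtype(object))
Out[3]: True
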
Member:

Using this branch, I see the following wrong behaviour in case you have mixed objects:

In [1]: s = pd.Series(['a', 'b', 1])

In [2]: s2 = s.astype("string")

In [3]: s.values
Out[3]: array(['a', 'b', 1], dtype=object)

In [4]: pd.options.mode.copy_on_write = True

In [5]: s = pd.Series(['a', 'b', 1])

In [6]: s2 = s.astype("string")

In [7]: s.values
Out[7]: array(['a', 'b', '1'], dtype=object)

Because no copy is taken, ensure_string_array actually mutates the original values in place.

For the case of "object -> some extension dtype" casting, we should probably always do copy=True, because I don't think we can rely on _from_sequence to do the correct thing (although for this specific case I also don't think that StringArray._from_sequence(arr, copy=False) should mutate arr in place. I would expect to only not make a copy if it's not needed, i.e. if it can just view the original data and no modification needs to be done)

Member Author (@phofl, Jan 30, 2023):

I opened #51073 to address the inplace modification.

You are correct, it is not guaranteed to be a no-copy op, but it could be, and I couldn't figure out a more precise check. We can still optimise in a follow-up to get this stricter; for now I was aiming to get as many cases as possible without making it overly complex.

Member:

> We can still optimise in a follow-up to get this stricter; for now I was aiming to get as many cases as possible without making it overly complex.

Can you add a comment explaining the True and noting something like that?


elif is_object_dtype(dtype) and new_dtype.kind == "O":
# When the underlying array has dtype object, we don't have to make a copy
return True
Comment on lines +275 to +277
Member:

Our own extension types fall in this category? (e.g. casting an object column of Period objects to period dtype, although this specific example is actually never a view)

Rather for a follow-up, but would it be worth having some more custom logic here for our own EAs? Both IntervalDtype and PeriodDtype have kind "O", but I think neither of them can ever cast from object dtype without making a copy.

Member Author:

Yep, would prefer to do this as a follow-up with specific tests.


elif dtype.kind in "mM" and new_dtype.kind in "mM":
dtype = getattr(dtype, "numpy_dtype", dtype)

Member:

@phofl I finally got around to seeing this. This numpy_dtype getattr check looks like it is for arrow/masked arrays. Can we avoid this pattern? It makes it difficult to tell where we are special-casing.

Member Author:

Would you prefer special casing for isinstance(dtype, ExtensionDtype) explicitly?

Member:

I'd prefer for these to be as explicit as possible. I suspect this is intended for something more specific than just ExtensionDtype.

Member Author:

Not really; basically all extension dtypes that have a numpy_dtype.

Member:

Are you sure? The logic here looks like it could be broken pretty easily by a 3rd-party EA that happened to have a dtype.numpy_dtype.

Member Author:

They could, yes, but right now this is implemented defensively to track more references than necessary if we are not able to determine with certainty that a copy happened.

new_dtype = getattr(new_dtype, "numpy_dtype", new_dtype)
return getattr(dtype, "unit", None) == getattr(new_dtype, "unit", None)

numpy_dtype = getattr(dtype, "numpy_dtype", None)
new_numpy_dtype = getattr(new_dtype, "numpy_dtype", None)

if numpy_dtype is None and isinstance(dtype, np.dtype):
numpy_dtype = dtype

if new_numpy_dtype is None and isinstance(new_dtype, np.dtype):
new_numpy_dtype = new_dtype

if numpy_dtype is not None and new_numpy_dtype is not None:
# if both have NumPy dtype or one of them is a numpy dtype
# they are only a view when the numpy dtypes are equal, e.g.
# int64 -> Int64 or int64[pyarrow]
# int64 -> Int32 copies
return numpy_dtype == new_numpy_dtype

# Assume this is a view since we don't know for sure if a copy was made
return True
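
A rough illustration of what the helper reports for a few dtype pairs (a sketch; the expected results are inferred from the comments in the function and the tests below, and the import path is the module added in this diff):

In [1]: import numpy as np; import pandas as pd
In [2]: from pandas.core.dtypes.astype import astype_is_view

In [3]: astype_is_view(np.dtype("int64"), np.dtype("int64"))    # equal dtypes -> view
Out[3]: True

In [4]: astype_is_view(np.dtype("int64"), np.dtype("int32"))    # two differing numpy dtypes -> copy
Out[4]: False

In [5]: astype_is_view(np.dtype("int64"), pd.Int64Dtype())      # same underlying numpy dtype -> view possible
Out[5]: True

In [6]: astype_is_view(np.dtype("int64"), pd.Int32Dtype())      # different underlying numpy dtype -> copy
Out[6]: False
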
6 changes: 3 additions & 3 deletions pandas/core/generic.py
@@ -6120,7 +6120,7 @@ def dtypes(self):
return self._constructor_sliced(data, index=self._info_axis, dtype=np.object_)

def astype(
self: NDFrameT, dtype, copy: bool_t = True, errors: IgnoreRaise = "raise"
self: NDFrameT, dtype, copy: bool_t | None = None, errors: IgnoreRaise = "raise"
) -> NDFrameT:
"""
Cast a pandas object to a specified dtype ``dtype``.
@@ -6257,7 +6257,7 @@ def astype(
for i, (col_name, col) in enumerate(self.items()):
cdt = dtype_ser.iat[i]
if isna(cdt):
res_col = col.copy() if copy else col
res_col = col.copy(deep=copy)
else:
try:
res_col = col.astype(dtype=cdt, copy=copy, errors=errors)
@@ -6284,7 +6284,7 @@

# GH 33113: handle empty frame or series
if not results:
return self.copy()
return self.copy(deep=None)

# GH 19920: retain column metadata after concat
result = concat(results, axis=1, copy=False)
5 changes: 4 additions & 1 deletion pandas/core/internals/array_manager.py
@@ -366,7 +366,10 @@ def fillna(self: T, value, limit, inplace: bool, downcast) -> T:
"fillna", value=value, limit=limit, inplace=inplace, downcast=downcast
)

def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
def astype(self: T, dtype, copy: bool | None = False, errors: str = "raise") -> T:
if copy is None:
copy = True

return self.apply(astype_array_safe, dtype=dtype, copy=copy, errors=errors)

def convert(self: T, copy: bool | None) -> T:
20 changes: 17 additions & 3 deletions pandas/core/internals/blocks.py
@@ -41,7 +41,10 @@
from pandas.util._decorators import cache_readonly
from pandas.util._validators import validate_bool_kwarg

from pandas.core.dtypes.astype import astype_array_safe
from pandas.core.dtypes.astype import (
astype_array_safe,
astype_is_view,
)
from pandas.core.dtypes.cast import (
LossySetitemError,
can_hold_element,
@@ -470,7 +473,11 @@ def dtype(self) -> DtypeObj:

@final
def astype(
self, dtype: DtypeObj, copy: bool = False, errors: IgnoreRaise = "raise"
self,
dtype: DtypeObj,
copy: bool = False,
errors: IgnoreRaise = "raise",
using_cow: bool = False,
) -> Block:
"""
Coerce to the new dtype.
@@ -483,6 +490,8 @@ def astype(
errors : str, {'raise', 'ignore'}, default 'raise'
- ``raise`` : allow exceptions to be raised
- ``ignore`` : suppress exceptions. On error return original object
using_cow : bool, default False
Signaling whether copy-on-write logic is used.

Returns
-------
@@ -493,7 +502,12 @@
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)

new_values = maybe_coerce_values(new_values)
newb = self.make_block(new_values)

refs = None
if using_cow and astype_is_view(values.dtype, new_values.dtype):
refs = self.refs

newb = self.make_block(new_values, refs=refs)
if newb.shape != self.shape:
raise TypeError(
f"cannot set astype for copy = [{copy}] for dtype "
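When astype_is_view reports that the cast can be a view, the new block is created with refs=self.refs, so the result keeps a reference to the parent's values and a later write to either object triggers the deferred copy. A sketch of the visible effect (the output values are inferred from the reference-tracking tests added below, not guaranteed):

In [1]: import pandas as pd

In [2]: pd.options.mode.copy_on_write = True

In [3]: df = pd.DataFrame({"a": [1, 2, 3]})

In [4]: df.astype("Int64")._mgr._has_no_reference(0)   # view-like cast: result still references df's block
Out[4]: False

In [5]: df.astype("int32")._mgr._has_no_reference(0)   # real copy: no reference needs to be tracked
Out[5]: True
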
16 changes: 14 additions & 2 deletions pandas/core/internals/managers.py
@@ -421,8 +421,20 @@ def fillna(self: T, value, limit, inplace: bool, downcast) -> T:
"fillna", value=value, limit=limit, inplace=inplace, downcast=downcast
)

def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
def astype(self: T, dtype, copy: bool | None = False, errors: str = "raise") -> T:
if copy is None:
if using_copy_on_write():
copy = False
else:
copy = True

return self.apply(
"astype",
dtype=dtype,
copy=copy,
errors=errors,
using_cow=using_copy_on_write(),
)

def convert(self: T, copy: bool | None) -> T:
if copy is None:
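With the new default copy=None (see the generic.py change above), the manager resolves the copy per mode: an eager copy without Copy-on-Write, a lazy one with it. A sketch of the user-visible consequence when CoW stays disabled (the 2.0 default); the session output is an assumption:

In [1]: import numpy as np; import pandas as pd

In [2]: pd.options.mode.copy_on_write
Out[2]: False

In [3]: s = pd.Series([1, 2, 3])

In [4]: np.shares_memory(s.values, s.astype("int64").values)
Out[4]: False   # copy=None resolved to copy=True, so the data is still copied
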
195 changes: 195 additions & 0 deletions pandas/tests/copy_view/test_astype.py
@@ -0,0 +1,195 @@
import numpy as np
import pytest

from pandas.compat import pa_version_under7p0

from pandas import (
DataFrame,
Series,
Timestamp,
date_range,
)
import pandas._testing as tm
from pandas.tests.copy_view.util import get_array


def test_astype_single_dtype(using_copy_on_write):
df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1.5})
df_orig = df.copy()
df2 = df.astype("float64")

if using_copy_on_write:
assert np.shares_memory(get_array(df2, "c"), get_array(df, "c"))
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
else:
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c"))
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

# mutating df2 triggers a copy-on-write for that column/block
df2.iloc[0, 2] = 5.5
if using_copy_on_write:
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c"))
tm.assert_frame_equal(df, df_orig)

# mutating parent also doesn't update result
df2 = df.astype("float64")
df.iloc[0, 2] = 5.5
tm.assert_frame_equal(df2, df_orig.astype("float64"))


@pytest.mark.parametrize("dtype", ["int64", "Int64"])
@pytest.mark.parametrize("new_dtype", ["int64", "Int64", "int64[pyarrow]"])
def test_astype_avoids_copy(using_copy_on_write, dtype, new_dtype):
if new_dtype == "int64[pyarrow]" and pa_version_under7p0:
pytest.skip("pyarrow not installed")
df = DataFrame({"a": [1, 2, 3]}, dtype=dtype)
df_orig = df.copy()
df2 = df.astype(new_dtype)

if using_copy_on_write:
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
else:
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

# mutating df2 triggers a copy-on-write for that column/block
df2.iloc[0, 0] = 10
if using_copy_on_write:
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
tm.assert_frame_equal(df, df_orig)

# mutating parent also doesn't update result
df2 = df.astype(new_dtype)
df.iloc[0, 0] = 100
tm.assert_frame_equal(df2, df_orig.astype(new_dtype))


@pytest.mark.parametrize("dtype", ["float64", "int32", "Int32", "int32[pyarrow]"])
def test_astype_different_target_dtype(using_copy_on_write, dtype):
if dtype == "int32[pyarrow]" and pa_version_under7p0:
pytest.skip("pyarrow not installed")
df = DataFrame({"a": [1, 2, 3]})
df_orig = df.copy()
df2 = df.astype(dtype)

assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
if using_copy_on_write:
assert df2._mgr._has_no_reference(0)

df2.iloc[0, 0] = 5
tm.assert_frame_equal(df, df_orig)

# mutating parent also doesn't update result
df2 = df.astype(dtype)
df.iloc[0, 0] = 100
tm.assert_frame_equal(df2, df_orig.astype(dtype))


@pytest.mark.parametrize(
"dtype, new_dtype", [("object", "string"), ("string", "object")]
)
def test_astype_string_and_object(using_copy_on_write, dtype, new_dtype):
df = DataFrame({"a": ["a", "b", "c"]}, dtype=dtype)
df_orig = df.copy()
df2 = df.astype(new_dtype)

if using_copy_on_write:
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
else:
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

df2.iloc[0, 0] = "x"
tm.assert_frame_equal(df, df_orig)


@pytest.mark.parametrize(
"dtype, new_dtype", [("object", "string"), ("string", "object")]
)
def test_astype_string_and_object_update_original(
using_copy_on_write, dtype, new_dtype
):
df = DataFrame({"a": ["a", "b", "c"]}, dtype=dtype)
df2 = df.astype(new_dtype)
df_orig = df2.copy()

if using_copy_on_write:
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
else:
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

df.iloc[0, 0] = "x"
tm.assert_frame_equal(df2, df_orig)


def test_astype_dict_dtypes(using_copy_on_write):
df = DataFrame(
{"a": [1, 2, 3], "b": [4, 5, 6], "c": Series([1.5, 1.5, 1.5], dtype="float64")}
)
df_orig = df.copy()
df2 = df.astype({"a": "float64", "c": "float64"})

if using_copy_on_write:
assert np.shares_memory(get_array(df2, "c"), get_array(df, "c"))
assert np.shares_memory(get_array(df2, "b"), get_array(df, "b"))
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
else:
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c"))
assert not np.shares_memory(get_array(df2, "b"), get_array(df, "b"))
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

# mutating df2 triggers a copy-on-write for that column/block
df2.iloc[0, 2] = 5.5
if using_copy_on_write:
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c"))

df2.iloc[0, 1] = 10
if using_copy_on_write:
assert not np.shares_memory(get_array(df2, "b"), get_array(df, "b"))
tm.assert_frame_equal(df, df_orig)


def test_astype_different_datetime_resos(using_copy_on_write):
df = DataFrame({"a": date_range("2019-12-31", periods=2, freq="D")})
result = df.astype("datetime64[ms]")

assert not np.shares_memory(get_array(df, "a"), get_array(result, "a"))
if using_copy_on_write:
assert result._mgr._has_no_reference(0)


def test_astype_different_timezones(using_copy_on_write):
df = DataFrame(
{"a": date_range("2019-12-31", periods=5, freq="D", tz="US/Pacific")}
)
result = df.astype("datetime64[ns, Europe/Berlin]")
if using_copy_on_write:
assert not result._mgr._has_no_reference(0)
assert np.shares_memory(get_array(df, "a").asi8, get_array(result, "a").asi8)


def test_astype_different_timezones_different_reso(using_copy_on_write):
df = DataFrame(
{"a": date_range("2019-12-31", periods=5, freq="D", tz="US/Pacific")}
)
result = df.astype("datetime64[ms, Europe/Berlin]")
if using_copy_on_write:
assert result._mgr._has_no_reference(0)
assert not np.shares_memory(
get_array(df, "a").asi8, get_array(result, "a").asi8
)


@pytest.mark.skipif(pa_version_under7p0, reason="pyarrow not installed")
def test_astype_arrow_timestamp(using_copy_on_write):
df = DataFrame(
{
"a": [
Timestamp("2020-01-01 01:01:01.000001"),
Timestamp("2020-01-01 01:01:01.000001"),
]
},
dtype="M8[ns]",
)
result = df.astype("timestamp[ns][pyarrow]")
if using_copy_on_write:
assert not result._mgr._has_no_reference(0)
assert np.shares_memory(get_array(df, "a").asi8, get_array(result, "a")._data)