ENH: Add lazy copy to astype #50802
Changes from 26 commits
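For context, a minimal sketch of the user-facing behaviour this change targets, assuming a pandas build that includes this branch with copy-on-write enabled (the example data is hypothetical):

import numpy as np
import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1.5, 2.5, 3.5]})
df2 = df.astype("float64")  # same dtype: no eager copy is made

# The result initially shares memory with the original...
assert np.shares_memory(df2["a"].to_numpy(), df["a"].to_numpy())

# ...and the data is only copied once one side is actually modified.
df2.iloc[0, 0] = 10.0
assert not np.shares_memory(df2["a"].to_numpy(), df["a"].to_numpy())
assert df.loc[0, "a"] == 1.5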
@@ -26,6 +26,7 @@
    is_dtype_equal,
    is_integer_dtype,
    is_object_dtype,
    is_string_dtype,
    is_timedelta64_dtype,
    pandas_dtype,
)
@@ -246,3 +247,55 @@ def astype_array_safe(
            raise

    return new_values


def astype_is_view(dtype: DtypeObj, new_dtype: DtypeObj) -> bool:
    """Checks if astype avoided copying the data.

    Parameters
    ----------
    dtype : Original dtype
    new_dtype : target dtype

    Returns
    -------
    True if new data is a view or not guaranteed to be a copy, False otherwise
    """
    if dtype == new_dtype:
        return True

    elif isinstance(dtype, np.dtype) and isinstance(new_dtype, np.dtype):
        # Only equal numpy dtypes avoid a copy
        return False

    elif is_string_dtype(dtype) and is_string_dtype(new_dtype):
        # Potentially! a view when converting from object to string
        return True

    elif is_object_dtype(dtype) and new_dtype.kind == "O":
        # When the underlying array has dtype object, we don't have to make a copy
        return True
Comment on lines +275 to +277

Review comment: Our own extension types fall in this category? (e.g. casting an object column of Period objects to period dtype, although this specific example is actually never a view). Rather something for a follow-up, but would it be worth having some more custom logic here for our own EAs? Both IntervalDtype and PeriodDtype have a kind of "O".

Reply: Yep, would prefer to do this as a follow-up with specific tests.
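For reference, a quick check of that point, assuming pandas' current dtype definitions:

import pandas as pd

# Both extension dtypes report kind "O", so an object -> period/interval cast
# takes the branch above even though that cast always materialises new data.
assert pd.PeriodDtype("M").kind == "O"
assert pd.IntervalDtype("int64").kind == "O"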

    elif dtype.kind in "mM" and new_dtype.kind in "mM":
        # Item "dtype[Any]" of "Union[dtype[Any], ExtensionDtype]" has no
        # attribute "unit"
        return dtype.unit == new_dtype.unit  # type: ignore[union-attr]
Review comment: This apparently doesn't work for the pyarrow-backed one (something we should fix as well, I suppose); see the test failures on CI.

Reply: Yikes, thanks. This needs the same logic as the nullables below.

    numpy_dtype = getattr(dtype, "numpy_dtype", None)
    new_numpy_dtype = getattr(new_dtype, "numpy_dtype", None)

    if numpy_dtype is None and isinstance(dtype, np.dtype):
        numpy_dtype = dtype

    if new_numpy_dtype is None and isinstance(new_dtype, np.dtype):
        new_numpy_dtype = new_dtype

    if numpy_dtype is not None and new_numpy_dtype is not None:
        # if both have NumPy dtype or one of them is a numpy dtype
        # they are only a view when the numpy dtypes are equal, e.g.
        # int64 -> Int64 or int64[pyarrow]
        # int64 -> Int32 copies
        return numpy_dtype == new_numpy_dtype

    # Assume this is a view since we don't know for sure if a copy was made
    return True
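As a rough illustration of the classification above, a sketch only; the import path is an assumption based on the hunk extending the module that defines astype_array_safe:

import numpy as np
import pandas as pd

from pandas.core.dtypes.astype import astype_is_view  # assumed location

assert astype_is_view(np.dtype("int64"), np.dtype("int64"))        # identical dtypes are a view
assert not astype_is_view(np.dtype("int64"), np.dtype("float64"))  # numpy -> numpy cast copies
assert astype_is_view(np.dtype("int64"), pd.Int64Dtype())          # same underlying int64 values
assert not astype_is_view(pd.Int64Dtype(), pd.Int32Dtype())        # different widths copy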
@@ -0,0 +1,176 @@
import numpy as np
import pytest

from pandas.compat import pa_version_under7p0

from pandas import (
    DataFrame,
    Series,
    date_range,
)
import pandas._testing as tm
from pandas.tests.copy_view.util import get_array


def test_astype_single_dtype(using_copy_on_write):
    df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1.5})
    df_orig = df.copy()
    df2 = df.astype("float64")

    if using_copy_on_write:
        assert np.shares_memory(get_array(df2, "c"), get_array(df, "c"))
        assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
    else:
        assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c"))
        assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

    # mutating df2 triggers a copy-on-write for that column/block
    df2.iloc[0, 2] = 5.5
    if using_copy_on_write:
        assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c"))
    tm.assert_frame_equal(df, df_orig)

    # mutating parent also doesn't update result
    df2 = df.astype("float64")
    df.iloc[0, 2] = 5.5
    tm.assert_frame_equal(df2, df_orig.astype("float64"))
@pytest.mark.parametrize("dtype", ["int64", "Int64"]) | ||
@pytest.mark.parametrize("new_dtype", ["int64", "Int64", "int64[pyarrow]"]) | ||
def test_astype_avoids_copy(using_copy_on_write, dtype, new_dtype): | ||
if new_dtype == "int64[pyarrow]" and pa_version_under7p0: | ||
pytest.skip("pyarrow not installed") | ||
df = DataFrame({"a": [1, 2, 3]}, dtype=dtype) | ||
df_orig = df.copy() | ||
df2 = df.astype(new_dtype) | ||
|
||
if using_copy_on_write: | ||
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||
else: | ||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||
|
||
# mutating df2 triggers a copy-on-write for that column/block | ||
df2.iloc[0, 0] = 10 | ||
if using_copy_on_write: | ||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||
tm.assert_frame_equal(df, df_orig) | ||
|
||
# mutating parent also doesn't update result | ||
df2 = df.astype(new_dtype) | ||
df.iloc[0, 0] = 100 | ||
tm.assert_frame_equal(df2, df_orig.astype(new_dtype)) | ||
|
||
|
||
@pytest.mark.parametrize("dtype", ["float64", "int32", "Int32", "int32[pyarrow]"]) | ||
def test_astype_different_target_dtype(using_copy_on_write, dtype): | ||
if dtype == "int32[pyarrow]" and pa_version_under7p0: | ||
pytest.skip("pyarrow not installed") | ||
df = DataFrame({"a": [1, 2, 3]}) | ||
df_orig = df.copy() | ||
df2 = df.astype(dtype) | ||
|
||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||
assert df2._mgr._has_no_reference(0) | ||
|
||
df2.iloc[0, 0] = 5 | ||
tm.assert_frame_equal(df, df_orig) | ||
|
||
# mutating parent also doesn't update result | ||
df2 = df.astype(dtype) | ||
df.iloc[0, 0] = 100 | ||
tm.assert_frame_equal(df2, df_orig.astype(dtype)) | ||
|
||
|
||


@pytest.mark.parametrize(
    "dtype, new_dtype", [("object", "string"), ("string", "object")]
)
def test_astype_string_and_object(using_copy_on_write, dtype, new_dtype):
    df = DataFrame({"a": ["a", "b", "c"]}, dtype=dtype)
    df_orig = df.copy()
    df2 = df.astype(new_dtype)

    if using_copy_on_write:
        assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
    else:
        assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

    df2.iloc[0, 0] = "x"
    tm.assert_frame_equal(df, df_orig)


@pytest.mark.parametrize(
    "dtype, new_dtype", [("object", "string"), ("string", "object")]
)
def test_astype_string_and_object_update_original(
    using_copy_on_write, dtype, new_dtype
):
    df = DataFrame({"a": ["a", "b", "c"]}, dtype=dtype)
    df2 = df.astype(new_dtype)
    df_orig = df2.copy()

    if using_copy_on_write:
        assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
    else:
        assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

    df.iloc[0, 0] = "x"
    tm.assert_frame_equal(df2, df_orig)


def test_astype_dict_dtypes(using_copy_on_write):
    df = DataFrame(
        {"a": [1, 2, 3], "b": [4, 5, 6], "c": Series([1.5, 1.5, 1.5], dtype="float64")}
    )
    df_orig = df.copy()
    df2 = df.astype({"a": "float64", "c": "float64"})

    if using_copy_on_write:
        assert np.shares_memory(get_array(df2, "c"), get_array(df, "c"))
        assert np.shares_memory(get_array(df2, "b"), get_array(df, "b"))
        assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
    else:
        assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c"))
        assert not np.shares_memory(get_array(df2, "b"), get_array(df, "b"))
        assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

    # mutating df2 triggers a copy-on-write for that column/block
    df2.iloc[0, 2] = 5.5
    if using_copy_on_write:
        assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c"))

    df2.iloc[0, 1] = 10
    if using_copy_on_write:
        assert not np.shares_memory(get_array(df2, "b"), get_array(df, "b"))
    tm.assert_frame_equal(df, df_orig)


def test_astype_different_datetime_resos(using_copy_on_write):
    df = DataFrame({"a": date_range("2019-12-31", periods=2, freq="D")})
    result = df.astype("datetime64[ms]")

    assert not np.shares_memory(get_array(df, "a"), get_array(result, "a"))
    if using_copy_on_write:
        assert result._mgr._has_no_reference(0)


def test_astype_different_timezones(using_copy_on_write):
    df = DataFrame(
        {"a": date_range("2019-12-31", periods=5, freq="D", tz="US/Pacific")}
    )
    result = df.astype("datetime64[ns, Europe/Berlin]")
    if using_copy_on_write:
        assert not result._mgr._has_no_reference(0)
        assert np.shares_memory(get_array(df, "a").asi8, get_array(result, "a").asi8)


def test_astype_different_timezones_different_reso(using_copy_on_write):
    df = DataFrame(
        {"a": date_range("2019-12-31", periods=5, freq="D", tz="US/Pacific")}
    )
    result = df.astype("datetime64[ms, Europe/Berlin]")
    if using_copy_on_write:
        assert result._mgr._has_no_reference(0)
        assert not np.shares_memory(
            get_array(df, "a").asi8, get_array(result, "a").asi8
        )

Review comment: I would think this is not always guaranteed to be a view? (But of course it is safer to return True than to incorrectly assume it is always a copy.) is_string_dtype also returns True for generic "object", and converting object to string is not necessarily no-copy.
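A minimal check of that point, assuming pandas.api.types.is_string_dtype keeps this behaviour:

import numpy as np
from pandas.api.types import is_string_dtype

# Plain object dtype is also reported as a string dtype, which is why an
# object -> string (or string -> object) cast reaches the branch in question.
assert is_string_dtype(np.dtype("object"))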
Review comment: Using this branch, I see the following wrong behaviour in case you have mixed objects: because of not taking a copy, ensure_string_array actually mutates the original values in place. For the case of "object -> some extension dtype" casting, we should probably always do copy=True, because I don't think we can rely on _from_sequence to do the correct thing (although for this specific case I also don't think that StringArray._from_sequence(arr, copy=False) should mutate arr in place; I would expect it to skip the copy only if it's not needed, i.e. if it can just view the original data and no modification needs to be done).
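A minimal sketch of the kind of reproduction being described (the original snippet is not preserved here; the column contents are hypothetical):

import pandas as pd

df = pd.DataFrame({"a": ["x", 1, 2.5]}, dtype=object)  # mixed object values
df2 = df.astype("string")

# On this branch the cast is treated as potentially no-copy, so
# ensure_string_array stringifies the elements of the *original* object
# array in place and df is silently modified as well.
print(df)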
Reply: I opened #51073 to address the in-place modification. You are correct, it is not guaranteed to be a no-copy op, but it could be, and I couldn't figure out a more precise check. We can still optimise in a follow-up to make this stricter; for now I was aiming to cover as many cases as possible without making it overly complex.
Review comment: Can you add a comment explaining the True here and noting something like that?