-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add lazy copy to astype #50802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add lazy copy to astype #50802
Changes from 3 commits
ed2e322
1967a2a
45ea989
a2e1e53
8f29fda
55fc3bc
a1847e8
75bdf1d
528b0fb
f29a0be
35b9fa4
664c31b
49a2029
e1dc971
894aaa8
273b5aa
31557be
c0f5003
031aaa5
174e2d3
876d293
ff37e58
5492c10
620181c
4f6f954
95a8296
7640c36
469383d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -271,6 +271,10 @@ def astype_is_view(dtype: DtypeObj, new_dtype: DtypeObj) -> bool: | |
if is_string_dtype(dtype) and is_string_dtype(new_dtype): | ||
return True | ||
|
||
elif is_object_dtype(dtype) and new_dtype.kind == "O": | ||
# When the underlying array has dtype object, we don't have to make a copy | ||
return True | ||
Comment on lines
+275
to
+277
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Our own extension types fall in this category? (eg casting object column of Period objects to period dtype, although this specific example is actually never a view) Rather for a follow-up, would it be worth to have some more custom logic here for our own EAs? Both IntervalDtype and PeriodDtype have kind of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep would prefer to do this as a follow up with specific tests |
||
|
||
elif is_string_dtype(dtype) or is_string_dtype(new_dtype): | ||
return False | ||
|
||
|
@@ -280,6 +284,8 @@ def astype_is_view(dtype: DtypeObj, new_dtype: DtypeObj) -> bool: | |
elif getattr(dtype, "numpy_dtype", dtype) == getattr( | ||
new_dtype, "numpy_dtype", new_dtype | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is for our masked arrays? If so, can you add some comments explaining that? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. pyarrow as well, both don't copy |
||
): | ||
# If underlying numpy dtype is the same, no copy is made, e.g. | ||
# int64 -> Int64 or int64[pyarrow] | ||
return True | ||
|
||
return False |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -421,3 +421,14 @@ def test_array_to_numpy_na(): | |
result = arr.to_numpy(na_value=True, dtype=bool) | ||
expected = np.array([True, True]) | ||
tm.assert_numpy_array_equal(result, expected) | ||
|
||
|
||
def test_array_copy_on_write(using_copy_on_write): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This file is about pd.array(..) I think, so I would move it either to the decimal tests (extension/decimal/test_decimal), or to the other copy_view astype tests and import the DecimalDtype there |
||
df = pd.DataFrame({"a": [decimal.Decimal(2), decimal.Decimal(3)]}, dtype="object") | ||
df2 = df.astype(DecimalDtype()) | ||
df.iloc[0, 0] = 0 | ||
if using_copy_on_write: | ||
expected = pd.DataFrame( | ||
{"a": [decimal.Decimal(2), decimal.Decimal(3)]}, dtype=DecimalDtype() | ||
) | ||
tm.assert_equal(df2.values, expected.values) |
Original file line number | Diff line number | Diff line change | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -545,6 +545,11 @@ def test_astype_single_dtype(using_copy_on_write): | |||||||||||||||
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c")) | ||||||||||||||||
tm.assert_frame_equal(df, df_orig) | ||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
We don't test this consistently for all methods here, but astype seems a sufficiently complicated case (not just based on a copy(deep=False) under the hood) that it's probably good to be complete. (same for the ones below) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep makes sense |
||||||||||||||||
|
||||||||||||||||
# mutating parent also doesn't update result | ||||||||||||||||
df2 = df.astype("float64") | ||||||||||||||||
df.iloc[0, 2] = 5.5 | ||||||||||||||||
tm.assert_frame_equal(df2, df_orig.astype("float64")) | ||||||||||||||||
|
||||||||||||||||
|
||||||||||||||||
@pytest.mark.parametrize("dtype", ["int64", "Int64"]) | ||||||||||||||||
@pytest.mark.parametrize("new_dtype", ["int64", "Int64", "int64[pyarrow]"]) | ||||||||||||||||
|
@@ -558,8 +563,6 @@ def test_astype_avoids_copy(using_copy_on_write, dtype, new_dtype): | |||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
else: | ||||||||||||||||
if new_dtype == "int64[pyarrow]": | ||||||||||||||||
pytest.skip("Does not make a copy") | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
|
||||||||||||||||
# mutating df2 triggers a copy-on-write for that column/block | ||||||||||||||||
|
@@ -568,6 +571,11 @@ def test_astype_avoids_copy(using_copy_on_write, dtype, new_dtype): | |||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
tm.assert_frame_equal(df, df_orig) | ||||||||||||||||
|
||||||||||||||||
# mutating parent also doesn't update result | ||||||||||||||||
df2 = df.astype(new_dtype) | ||||||||||||||||
df.iloc[0, 0] = 100 | ||||||||||||||||
tm.assert_frame_equal(df2, df_orig.astype(new_dtype)) | ||||||||||||||||
|
||||||||||||||||
|
||||||||||||||||
@pytest.mark.parametrize("dtype", ["float64", "int32", "Int32", "int32[pyarrow]"]) | ||||||||||||||||
def test_astype_different_target_dtype(using_copy_on_write, dtype): | ||||||||||||||||
|
@@ -577,14 +585,16 @@ def test_astype_different_target_dtype(using_copy_on_write, dtype): | |||||||||||||||
df_orig = df.copy() | ||||||||||||||||
df2 = df.astype(dtype) | ||||||||||||||||
|
||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
else: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Maybe also explicitly check that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes this is a very good check, would have made sure that I did not miss the numpy dtype thing below |
||||||||||||||||
|
||||||||||||||||
df2.iloc[0, 0] = 5 | ||||||||||||||||
tm.assert_frame_equal(df, df_orig) | ||||||||||||||||
|
||||||||||||||||
# mutating parent also doesn't update result | ||||||||||||||||
df2 = df.astype(dtype) | ||||||||||||||||
df.iloc[0, 0] = 100 | ||||||||||||||||
tm.assert_frame_equal(df2, df_orig.astype(dtype)) | ||||||||||||||||
|
||||||||||||||||
|
||||||||||||||||
@pytest.mark.parametrize( | ||||||||||||||||
"dtype, new_dtype", [("object", "string"), ("string", "object")] | ||||||||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think this is not always guaranteed to be a view? (but of course safer to return True than incorrectly assume it is always a copy)
is_string_dtype
also returns true for generic "object", and converting object to string is not necessarily no-copyThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using this branch, I see the following wrong behaviour in case you have mixed objects:
Because of not taking a copy, the
ensure_string_array
actually mutates the original values in place.For the case of "object -> some extension dtype" casting, we should probably always do
copy=True
, because I don't think we can rely on_from_sequence
to do the correct thing (although for this specific case I also don't think thatStringArray._from_sequence(arr, copy=False)
should mutatearr
in place. I would expect to only not make a copy if it's not needed, i.e. if it can just view the original data and no modification needs to be done)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #51073 to address the inplace modification.
You are correct, it is not guaranteed to be a no-copy op, but it could be and I couldn't figure out a more precise check. We can still optimise in a follow-up to get this stricter, for now I was aiming to get as many cases as possible without making it overly complex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment explaining the True and to note something like that?