-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add lazy copy to astype #50802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add lazy copy to astype #50802
Changes from 2 commits
ed2e322
1967a2a
45ea989
a2e1e53
8f29fda
55fc3bc
a1847e8
75bdf1d
528b0fb
f29a0be
35b9fa4
664c31b
49a2029
e1dc971
894aaa8
273b5aa
31557be
c0f5003
031aaa5
174e2d3
876d293
ff37e58
5492c10
620181c
4f6f954
95a8296
7640c36
469383d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,9 +11,12 @@ | |
cast, | ||
final, | ||
) | ||
import weakref | ||
|
||
import numpy as np | ||
|
||
from pandas._config import using_copy_on_write | ||
|
||
from pandas._libs import ( | ||
Timestamp, | ||
internals as libinternals, | ||
|
@@ -152,6 +155,7 @@ class Block(PandasObject): | |
is_extension = False | ||
_can_consolidate = True | ||
_validate_ndim = True | ||
_ref = None | ||
|
||
@final | ||
@cache_readonly | ||
|
@@ -496,6 +500,10 @@ def astype( | |
f"({self.dtype.name} [{self.shape}]) to different shape " | ||
f"({newb.dtype.name} [{newb.shape}])" | ||
) | ||
if using_copy_on_write(): | ||
if not copy: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see 2 options here: Creating a function that checks if original and target dtype can be cast to each other without making a copy or figuring out a way to check for shared memory. Thoughts? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Checking numpy's astype, I think it always returns a copy except if the dtype is exactly the same (eg casting int64 to datetime64[ns], would could be a view, is still a copy even with So if we don't have special fast paths for such case ourselves, it might be possible to rely on just checking if dtypes are equal (but we would have to check our extension types is to check that) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Casting from int64 to Int64 is a view unfortunately, so this has to be more intelligent |
||
# This tracks more references than necessary. | ||
newb._ref = weakref.ref(self) | ||
return newb | ||
|
||
@final | ||
|
Original file line number | Diff line number | Diff line change | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -417,6 +417,52 @@ def test_to_frame(using_copy_on_write): | |||||||||||||||
tm.assert_frame_equal(df, expected) | ||||||||||||||||
|
||||||||||||||||
|
||||||||||||||||
def test_astype_single_dtype(using_copy_on_write): | ||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shall we move the astype tests to a dedicated file instead of in the middle of the other methods? My hunch is that we might need to add some more astype tests (if we specialize more for our own dtypes), and test_methods.py is already getting long There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah sounds good |
||||||||||||||||
df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1.5}) | ||||||||||||||||
df_orig = df.copy() | ||||||||||||||||
df2 = df.astype("float64") | ||||||||||||||||
|
||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert np.shares_memory(get_array(df2, "c"), get_array(df, "c")) | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
else: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c")) | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
|
||||||||||||||||
# mutating df2 triggers a copy-on-write for that column/block | ||||||||||||||||
df2.iloc[0, 2] = 5.5 | ||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c")) | ||||||||||||||||
tm.assert_frame_equal(df, df_orig) | ||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
We don't test this consistently for all methods here, but astype seems a sufficiently complicated case (not just based on a copy(deep=False) under the hood) that it's probably good to be complete. (same for the ones below) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep makes sense |
||||||||||||||||
|
||||||||||||||||
|
||||||||||||||||
def test_astype_dict_dtypes(using_copy_on_write): | ||||||||||||||||
df = DataFrame( | ||||||||||||||||
{"a": [1, 2, 3], "b": [4, 5, 6], "c": Series([1.5, 1.5, 1.5], dtype="float64")} | ||||||||||||||||
) | ||||||||||||||||
df_orig = df.copy() | ||||||||||||||||
df2 = df.astype({"a": "float64", "c": "float64"}) | ||||||||||||||||
|
||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert np.shares_memory(get_array(df2, "c"), get_array(df, "c")) | ||||||||||||||||
assert np.shares_memory(get_array(df2, "b"), get_array(df, "b")) | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
else: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c")) | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "b"), get_array(df, "b")) | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a")) | ||||||||||||||||
|
||||||||||||||||
# mutating df2 triggers a copy-on-write for that column/block | ||||||||||||||||
df2.iloc[0, 2] = 5.5 | ||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c")) | ||||||||||||||||
|
||||||||||||||||
df2.iloc[0, 1] = 10 | ||||||||||||||||
if using_copy_on_write: | ||||||||||||||||
assert not np.shares_memory(get_array(df2, "b"), get_array(df, "b")) | ||||||||||||||||
tm.assert_frame_equal(df, df_orig) | ||||||||||||||||
|
||||||||||||||||
|
||||||||||||||||
@pytest.mark.parametrize("ax", ["index", "columns"]) | ||||||||||||||||
def test_swapaxes_noop(using_copy_on_write, ax): | ||||||||||||||||
df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) | ||||||||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can simplify this to just
res_col = col.copy(deep=copy)
. The resulting Series is still passed toconcat
, so it's not that we ever actually return this Series, so I would think that for the non-CoW case it shouldn't matter if the original column or a shallow copy of it is passed to concat?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, true. Is a bit slower, but just a little bit :)