ENH: Add lazy copy to astype #50802

phofl · 2023-01-17T21:27:50Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

phofl · 2023-01-17T22:22:45Z

pandas/core/internals/blocks.py

@@ -496,6 +500,10 @@ def astype(
                f"({self.dtype.name} [{self.shape}]) to different shape "
                f"({newb.dtype.name} [{newb.shape}])"
            )
+        if using_copy_on_write():
+            if not copy:


cc @jorisvandenbossche

I see 2 options here: Creating a function that checks if original and target dtype can be cast to each other without making a copy or figuring out a way to check for shared memory. Thoughts?

Checking numpy's astype, I think it always returns a copy except if the dtype is exactly the same (eg casting int64 to datetime64[ns], would could be a view, is still a copy even with copy=False).

So if we don't have special fast paths for such case ourselves, it might be possible to rely on just checking if dtypes are equal (but we would have to check our extension types is to check that)

Casting from int64 to Int64 is a view unfortunately, so this has to be more intelligent

phofl · 2023-01-27T03:13:31Z

cc @jorisvandenbossche

I think this is ready for another look

jorisvandenbossche

Another round of review!

jorisvandenbossche · 2023-01-30T14:07:36Z

pandas/tests/copy_view/test_methods.py

+    df2.iloc[0, 2] = 5.5
+    if using_copy_on_write:
+        assert not np.shares_memory(get_array(df2, "c"), get_array(df, "c"))
+    tm.assert_frame_equal(df, df_orig)


Suggested change

tm.assert_frame_equal(df, df_orig)

tm.assert_frame_equal(df, df_orig)

# mutating parent also doesn't update result

df2 = df.astype("float64")

df.iloc[0, 2] = 5.5

tm.assert_frame_equal(df2, df_orig.astype("float64")

We don't test this consistently for all methods here, but astype seems a sufficiently complicated case (not just based on a copy(deep=False) under the hood) that it's probably good to be complete.

(same for the ones below)

Yep makes sense

jorisvandenbossche · 2023-01-30T14:13:27Z

pandas/tests/copy_view/test_methods.py

+    if using_copy_on_write:
+        assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
+    else:
+        assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))


Suggested change

if using_copy_on_write:

assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

else:

assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

This will never be a view (also not in the future), so we can have the simpler version?

Not sure why I did not simplify this, was a mistake

jorisvandenbossche · 2023-01-30T14:20:10Z

pandas/core/generic.py

+                        # Make a shallow copy even if copy=False for CoW
+                        res_col = col.copy(deep=copy)
+                    else:
+                        res_col = col if copy is False else col.copy()


I think we can simplify this to just res_col = col.copy(deep=copy). The resulting Series is still passed to concat, so it's not that we ever actually return this Series, so I would think that for the non-CoW case it shouldn't matter if the original column or a shallow copy of it is passed to concat?

Yeah, true. Is a bit slower, but just a little bit :)

jorisvandenbossche · 2023-01-30T14:25:30Z

pandas/core/dtypes/astype.py

+        return False
+
+    if is_string_dtype(dtype) and is_string_dtype(new_dtype):
+        return True


I would think this is not always guaranteed to be a view? (but of course safer to return True than incorrectly assume it is always a copy)

is_string_dtype also returns true for generic "object", and converting object to string is not necessarily no-copy

Using this branch, I see the following wrong behaviour in case you have mixed objects:

In [1]: s = pd.Series(['a', 'b', 1]) In [2]: s2 = s.astype("string") In [3]: s.values Out[3]: array(['a', 'b', 1], dtype=object) In [4]: pd.options.mode.copy_on_write = True In [5]: s = pd.Series(['a', 'b', 1]) In [6]: s2 = s.astype("string") In [7]: s.values Out[7]: array(['a', 'b', '1'], dtype=object)

Because of not taking a copy, the ensure_string_array actually mutates the original values in place.

For the case of "object -> some extension dtype" casting, we should probably always do copy=True, because I don't think we can rely on _from_sequence to do the correct thing (although for this specific case I also don't think that StringArray._from_sequence(arr, copy=False) should mutate arr in place. I would expect to only not make a copy if it's not needed, i.e. if it can just view the original data and no modification needs to be done)

I opened #51073 to address the inplace modification.

You are correct, it is not guaranteed to be a no-copy op, but it could be and I couldn't figure out a more precise check. We can still optimise in a follow-up to get this stricter, for now I was aiming to get as many cases as possible without making it overly complex

We can still optimise in a follow-up to get this stricter, for now I was aiming to get as many cases as possible without making it overly complex

Can you add a comment explaining the True and to note something like that?

jorisvandenbossche · 2023-01-30T14:26:11Z

pandas/core/dtypes/astype.py

+        return True
+
+    elif getattr(dtype, "numpy_dtype", dtype) == getattr(
+        new_dtype, "numpy_dtype", new_dtype


This is for our masked arrays? If so, can you add some comments explaining that?

pyarrow as well, both don't copy

jorisvandenbossche · 2023-01-30T14:45:41Z

pandas/tests/series/test_constructors.py

+        if using_copy_on_write:
+            assert s[1] == 2
+        else:
+            assert s[1] == 5


Hmm, so this wasn't yet covered when updating the Series(series) constructor (#49524). Could you add an explicit test for copy/view behaviour with a case like above to copy_view/test_constructors.py ?

Just parametrizing the first test there should be sufficient to cover this case:

diff --git a/pandas/tests/copy_view/test_constructors.py b/pandas/tests/copy_view/test_constructors.py index c04c733e5e..7793c6cad5 100644 --- a/pandas/tests/copy_view/test_constructors.py +++ b/pandas/tests/copy_view/test_constructors.py @@ -1,3 +1,5 @@ +import pytest + import numpy as np from pandas import Series @@ -6,13 +8,14 @@ from pandas import Series # Copy/view behaviour for Series / DataFrame constructors -def test_series_from_series(using_copy_on_write): +@pytest.mark.parametrize("dtype", [None, "int64"]) +def test_series_from_series(dtype, using_copy_on_write): # Case: constructing a Series from another Series object follows CoW rules: # a new object is returned and thus mutations are not propagated ser = Series([1, 2, 3], name="name") # default is copy=False -> new Series is a shallow copy / view of original - result = Series(ser) + result = Series(ser, dtype=dtype) # the shallow copy still shares memory assert np.shares_memory(ser.values, result.values) @@ -34,7 +37,7 @@ def test_series_from_series(using_copy_on_write): assert np.shares_memory(ser.values, result.values) # the same when modifying the parent - result = Series(ser) + result = Series(ser, dtype=dtype) if using_copy_on_write: # mutating original doesn't mutate new series

We should still add a test that ensure that if the cast requires a copy, we do not track references (to avoid a unnecessary copy later on), but that can be done later.

The copy case is more for your pr I guess? Adjusted the test accordingly

jorisvandenbossche · 2023-01-30T15:03:35Z

For the case of "object -> some extension dtype" casting, we should probably always do copy=True, because I don't think we can rely on _from_sequence to do the correct thing (although for this specific case I also don't think that StringArray._from_sequence(arr, copy=False) should mutate arr in place. I would expect to only not make a copy if it's not needed, i.e. if it can just view the original data and no modification needs to be done)

For generic extension dtypes, we can test with the test decimal extension dtype (although first have to fix its constructor to not ignore the copy keyword). This is also a case of storing an object-dtype array under the hood, so object->decimal can be a view if we pass copy=False down.

But it might also not only be an issue for object dtype. For any numpy dtype -> extension dtype, this eventually goes through EA._from_sequence(..). So we should maybe pass copy=True for all those cases (for generic EAs, we can special case for our own internal ones)

phofl · 2023-01-30T21:58:42Z

Added a hacky decimal test. What's the most efficient way of figuring out our own eas?

jorisvandenbossche · 2023-02-01T21:00:21Z

pandas/tests/arrays/test_array.py

@@ -421,3 +421,14 @@ def test_array_to_numpy_na():
    result = arr.to_numpy(na_value=True, dtype=bool)
    expected = np.array([True, True])
    tm.assert_numpy_array_equal(result, expected)
+
+
+def test_array_copy_on_write(using_copy_on_write):


This file is about pd.array(..) I think, so I would move it either to the decimal tests (extension/decimal/test_decimal), or to the other copy_view astype tests and import the DecimalDtype there

jorisvandenbossche · 2023-02-01T21:52:31Z

So the reason that the decimal test you added works without actual code change is because the ExtensionDtype.kind attribute for that dtype is also object (that's the default on the base class ExtensionDtype). And then you get the elif is_object_dtype(dtype) and new_dtype.kind == "O": returning True, so we keep track of references. But if the kind is not object, we might end up in the final return False, and we might be incorrectly assuming the result is a copy and not keep track of references.

And I think that all our own extension dtypes (that are not the nullable ones) have a kind of "O" (categorical, period, interval, ..), and so those don't run into the issue.
We have one test extension dtype that doesn't have to be of kind object, and that is the FloatAttrDtype. If I update that dtype to have the correct kind of float:

diff --git a/pandas/tests/extension/array_with_attr/array.py b/pandas/tests/extension/array_with_attr/array.py
index d9327ca9f2..94923b1b32 100644
--- a/pandas/tests/extension/array_with_attr/array.py
+++ b/pandas/tests/extension/array_with_attr/array.py
@@ -20,6 +20,7 @@ class FloatAttrDtype(ExtensionDtype):
     type = float
     name = "float_attr"
     na_value = np.nan
+    kind = "f"
 
     @classmethod
     def construct_array_type(cls) -> type_t[FloatAttrArray]:

then I observe the following bug with the current branch:

In [1]: from pandas.tests.extension.array_with_attr import FloatAttrDtype

In [2]: s = pd.Series([1, 2, 3], dtype="float")

In [3]: s2 = s.astype(FloatAttrDtype())

In [4]: s[0] = 10.0   # <- this is fine, s2 not changed (since a copy was made)

In [5]: s2
Out[5]: 
0    1.0
1    2.0
2    3.0
dtype: float_attr

In [6]: pd.options.mode.copy_on_write = True

In [7]: s = pd.Series([1, 2, 3], dtype="float")

In [8]: s2 = s.astype(FloatAttrDtype())

In [9]: s[0] = 10.0   # but with CoW enabled, this does mutate s2

In [10]: s2
Out[10]: 
0    10.0
1     2.0
2     3.0
dtype: float_attr

I don't know if there are many EAs out there that would run into this, though .. But when we end up calling EA._from_sequence (calling astype(EAdtype) with a dtype we don't implement ourselves), I think we should either assume that the result can be view (and so track refs), or either in this case do pass copy=True to _from_sequence, instead of copy=False now when CoW is enabled.

jorisvandenbossche · 2023-02-01T21:53:58Z

(I am maybe also going a bit too far with my review here, it doesn't necessarily have to be perfect for all corner cases in the first PR :))

phofl · 2023-02-01T22:00:53Z

Yeah I agree with third party eas. I actually like your suggestion of tracking refs more than passing copy=True. I don't know right now how to determine if we have a 3rd party ea, that's why I did not implement it yet.

jorisvandenbossche · 2023-02-01T22:11:27Z

I don't know right now how to determine if we have a 3rd party ea, that's why I did not implement it yet.

I think all our internal EAs will be captured by one of the specific checks in astype_is_view, so whenever you end up in the end but the target dtype is an ExtensionDtype?

The drawback of not passing copy=True but keeping a ref, is that we might be doing an unnecessary copy later one. Of course with the benefit we might be avoiding a copy during the astype. No idea which one of the two would be most common, though. So given that, either solution is probably fine for now.

# Conflicts: # doc/source/development/copy_on_write.rst # pandas/_libs/internals.pyx # pandas/core/internals/blocks.py

phofl · 2023-02-08T09:20:59Z

I adjusted the view check a bit to return True more often. This should avoid missing no-copy ops. Can also iterate later if we want to do an eager copy In astype and set the references to False. But considering

df = df.astype()

deferring the copy makes the most sense to me

jorisvandenbossche

A few small comments, and some follow-up ideas/questions

pandas/core/dtypes/astype.py

jorisvandenbossche · 2023-02-08T09:56:53Z

pandas/core/dtypes/astype.py

+    elif is_object_dtype(dtype) and new_dtype.kind == "O":
+        # When the underlying array has dtype object, we don't have to make a copy
+        return True


Our own extension types fall in this category? (eg casting object column of Period objects to period dtype, although this specific example is actually never a view)

Rather for a follow-up, would it be worth to have some more custom logic here for our own EAs? Both IntervalDtype and PeriodDtype have kind of O, but I think neither of them can ever cast from object dtype without making a copy.

Yep would prefer to do this as a follow up with specific tests

jorisvandenbossche · 2023-02-08T09:58:31Z

pandas/core/dtypes/astype.py

+    elif dtype.kind in "mM" and new_dtype.kind in "mM":
+        return True


Also to note for a follow-up, I assume now with multiple resolutions being supported for datetime64, we should maybe check those, since if you have different units, this is never a view?

Actually, I am wondering: what is the purpose of this check? Because if you cast datetime64[ns]->datetime64[ns], that's already covered by the equal dtype case. Mixing datetime and timedelta is something we disallow explicitly (numpy allows it). So this is for DatetimeTZDtype?

Yep tz aware stuff here. Removing this check causes test failures

Added a bunch of additional tests and added a unit check here.

jorisvandenbossche · 2023-02-08T10:52:48Z

pandas/core/dtypes/astype.py

+    new_numpy_dtype = getattr(new_dtype, "numpy_dtype", None)
+
+    if numpy_dtype is not None and new_numpy_dtype is not None:
+        # if both have NumPy dtype then they are only views if they are equal


Can you add back the part of the comment that you had before that this is for example for nullable dtypes?

Also, this will currently only work for eg Int64->Int32 (both nullable), and not catch the mixed of numpy/nullable dtype (eg int64 -> Int64 (view), or int64 -> Int32 (copy))

Yeah I missed this case. Will adjust accordingly, since this should work out of the box imo

jorisvandenbossche · 2023-02-08T14:13:05Z

pandas/tests/copy_view/test_methods.py

+    df_orig = df.copy()
+    df2 = df.astype(dtype)
+
+    assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))


Suggested change

assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

assert df2._mgr._has_no_reference(0)

Maybe also explicitly check that df2._mgr._has_no_reference(0)? Because the shares_memory and iloc setitem test doesn't ensure we didn't incorrectly trigger CoW unnecessarily

Yes this is a very good check, would have made sure that I did not miss the numpy dtype thing below

jorisvandenbossche · 2023-02-08T14:16:37Z

pandas/tests/copy_view/test_methods.py

@@ -527,6 +529,138 @@ def test_to_frame(using_copy_on_write):
        tm.assert_frame_equal(df, expected)


+def test_astype_single_dtype(using_copy_on_write):


Shall we move the astype tests to a dedicated file instead of in the middle of the other methods? My hunch is that we might need to add some more astype tests (if we specialize more for our own dtypes), and test_methods.py is already getting long

Yeah sounds good

jorisvandenbossche · 2023-02-08T14:32:58Z

In general: let's get this merged with the current version of astype_is_view, and fine tune that in follow-up PRs?

Co-authored-by: Joris Van den Bossche <[email protected]>

phofl · 2023-02-08T14:57:38Z

Yeah I think we cover most general cases now. I'd suggest tracking further optimisations in an issue.

Edit: I think we don't miss any potential views now while we already cover most cases.

jorisvandenbossche · 2023-02-08T16:18:16Z

pandas/core/dtypes/astype.py

+    elif dtype.kind in "mM" and new_dtype.kind in "mM":
+        #  Item "dtype[Any]" of "Union[dtype[Any], ExtensionDtype]" has no
+        #  attribute "unit"
+        return dtype.unit == new_dtype.unit  # type: ignore[union-attr]


This apparently doesn't work for the pyarrow backed one (something we should fix as well I suppose), see the test failures on CI

Yikes, thx. This needs the same logic as the nullables below

phofl · 2023-02-08T22:09:32Z

@jorisvandenbossche

Are you ok with merging this when greenish? One test was failing for array manager, is fixed now.

This opens up a few other things.

jorisvandenbossche · 2023-02-09T10:48:49Z

Yes, certainly, that will also make my constructor PR easier to update.

jbrockmendel · 2023-03-22T23:45:37Z

pandas/core/dtypes/astype.py

+        return True
+
+    elif dtype.kind in "mM" and new_dtype.kind in "mM":
+        dtype = getattr(dtype, "numpy_dtype", dtype)


@phofl i finally got around to seeing this. this numpy_dtype getattr check looks like it is for arrow/masked arrays. can we avoid this pattern? it makes it difficult to tell where we are special-casing

Would you prefer special casing for isinstance(dtype, ExtensionDtype) explicitly?

id prefer for these to be as explicit as possible. i suspect this is intended for something more specific than just ExtensionDtype

Not really, basically all extension dtypes that have a numpy dtype

Are you sure? the logic here looks like it could be broken pretty easily by a 3rd-party EA that happened to have a dtype.numpy_dtpye

They could yes, but right now this is implemented defensively to track more references than necessary if we are not able to determine with certainty that a copy happened.

phofl added 3 commits January 17, 2023 22:03

ENH: Add lazy copy to astype

ed2e322

Fix

1967a2a

Fix test

45ea989

phofl commented Jan 17, 2023

View reviewed changes

Fix array manager

a2e1e53

mroeschke added the Copy / view semantics label Jan 18, 2023

Narrow down

8f29fda

This was referenced Jan 23, 2023

ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473

Closed

Handle CoW in BlockManager.apply #50948

Closed

phofl added 6 commits January 25, 2023 19:51

Merge remote-tracking branch 'upstream/main' into cow_astype

55fc3bc

Update astype

a1847e8

Fix datetime case

75bdf1d

Skip

528b0fb

Fix test

f29a0be

Refactor

35b9fa4

jorisvandenbossche reviewed Jan 30, 2023

View reviewed changes

jorisvandenbossche mentioned this pull request Jan 30, 2023

API / CoW: DataFrame(<dict of Series>, copy=False) constructor now gives lazy copy #50777

Merged

4 tasks

phofl added 3 commits January 30, 2023 22:13

Address review

664c31b

Merge remote-tracking branch 'upstream/main' into cow_astype

49a2029

Fix decimal case

e1dc971

jorisvandenbossche reviewed Feb 1, 2023

View reviewed changes

phofl added 2 commits February 7, 2023 21:31

Merge remote-tracking branch 'upstream/main' into cow_astype

894aaa8

Merge CoW ref tracking

273b5aa

phofl added 4 commits February 7, 2023 21:47

Refactor for new cow mechanism and move test

31557be

Merge remote-tracking branch 'upstream/main' into cow_astype

c0f5003

# Conflicts: # doc/source/development/copy_on_write.rst # pandas/_libs/internals.pyx # pandas/core/internals/blocks.py

Remove merge problem

031aaa5

Add comment

174e2d3

jorisvandenbossche reviewed Feb 8, 2023

View reviewed changes

phofl and others added 4 commits February 8, 2023 15:34

Update pandas/core/dtypes/astype.py

876d293

Co-authored-by: Joris Van den Bossche <[email protected]>

Check numpy non numpy conversion better

ff37e58

Add explicit timezone tests

5492c10

Adjust comment

620181c

phofl added 2 commits February 8, 2023 15:58

Merge remote-tracking branch 'upstream/main' into cow_astype

4f6f954

Add to whatsnew

95a8296

jorisvandenbossche reviewed Feb 8, 2023

View reviewed changes

phofl added 2 commits February 8, 2023 17:44

Fix tests

7640c36

Fix array manager

469383d

jorisvandenbossche approved these changes Feb 9, 2023

View reviewed changes

jorisvandenbossche merged commit 13db83a into pandas-dev:main Feb 9, 2023

phofl deleted the cow_astype branch February 10, 2023 10:44

jbrockmendel reviewed Mar 22, 2023

View reviewed changes

-    tm.assert_frame_equal(df, df_orig)
+    tm.assert_frame_equal(df, df_orig)
+    # mutating parent also doesn't update result
+    df2 = df.astype("float64")
+    df.iloc[0, 2] = 5.5
+    tm.assert_frame_equal(df2, df_orig.astype("float64")

		elif dtype.kind in "mM" and new_dtype.kind in "mM":
		return True

	assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
	assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
	assert df2._mgr._has_no_reference(0)

		@@ -527,6 +529,138 @@ def test_to_frame(using_copy_on_write):
		tm.assert_frame_equal(df, expected)


		def test_astype_single_dtype(using_copy_on_write):

Uh oh!

ENH: Add lazy copy to astype #50802

ENH: Add lazy copy to astype #50802

Uh oh!

Conversation

phofl commented Jan 17, 2023 • edited by jorisvandenbossche Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

phofl commented Jan 27, 2023

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

phofl Jan 30, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche commented Jan 30, 2023

Uh oh!

phofl commented Jan 30, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche commented Feb 1, 2023

Uh oh!

jorisvandenbossche commented Feb 1, 2023

Uh oh!

phofl commented Feb 1, 2023

Uh oh!

jorisvandenbossche commented Feb 1, 2023

Uh oh!

phofl commented Feb 8, 2023

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

phofl commented Jan 17, 2023 •

edited by jorisvandenbossche

Loading

phofl Jan 30, 2023 •

edited

Loading

phofl commented Feb 8, 2023 •

edited

Loading