API / CoW: return read-only numpy arrays in .values/to_numpy() #51082

jorisvandenbossche · 2023-01-31T09:34:15Z

Context: with the Copy-on-Write implementation (see the overview follow-up issue #48998), we can ensure that mutating one pandas object doesn't update another pandas object. But users can still easily get a numpy array that is a view on the data, and mutate that one. At that point, we don't have any control over how this mutation propagates (it might update more objects than just the one from which the user obtained it, for example if other Series/DataFrames were sharing data with this object through CoW).

This is a draft PR just starting to explore this (to see how much fails in our own test suite and get an idea of the impact).
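
To make the hazard from the context paragraph concrete, here is a minimal numpy-only sketch. The names `shared`, `owner_a` and `owner_b` are hypothetical stand-ins for two pandas objects sharing one buffer under CoW; this is an illustration of the idea, not how pandas is implemented internally.

```python
import numpy as np

# One buffer with two "owners": a stand-in for two pandas objects that
# share data under Copy-on-Write (illustration only, not pandas internals).
shared = np.arange(5, dtype="int64")
owner_a = shared.view()
owner_b = shared.view()

# A writeable view handed out to the user: the write leaks into both owners.
leaky = owner_a.view()
leaky[0] = 100
leaked = bool(owner_b[0] == 100)

# A read-only view: the same kind of write is rejected by numpy.
guarded = owner_a.view()
guarded.flags.writeable = False
try:
    guarded[1] = 200
    write_rejected = False
except ValueError:
    write_rejected = True

# An explicit copy gives the user something mutable without side effects.
private = guarded.copy()
private[1] = 200
```

With a read-only result, a user who really wants to mutate the values has to take an explicit copy first, which keeps the other owners untouched.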

jorisvandenbossche · 2023-02-14T21:11:12Z

This needs some more tests, and currently doesn't yet handle extension arrays (EAs), e.g. np.asarray(Series[Int64]).

phofl

looks good!

pandas/core/generic.py

phofl · 2023-03-04T14:00:14Z

pandas/tests/frame/methods/test_transpose.py

+            request.node.add_marker(
+                pytest.mark.xfail(reason="transpose doesn't yet do CoW")
+            )
+            assert (float_frame.values[5:10] != 5).all()


transpose is doing CoW now, so I think you can remove the xfail (CI is failing here as well)

phofl · 2023-03-04T14:00:47Z

pandas/tests/indexing/test_chaining_and_caching.py

        # 10264
        df = DataFrame(
            np.zeros((5, 5), dtype="int64"),
            columns=["a", "b", "c", "d", "e"],
            index=range(5),
        )
        df["f"] = 0
+        df_orig = df.copy()
        # TODO(CoW) protect underlying values of being written to?


I think we can remove the TODO?

phofl · 2023-03-04T14:01:37Z

A couple of smaller comments. I think this is something we should document explicitly, but that can be done as a follow-up I guess.

jorisvandenbossche · 2023-03-06T10:35:06Z

I added an explicit test for the failing np.fix case (until now it was only hit incidentally in an unrelated indexing test).

The problem with np.fix is that this is basically implemented under the hood as:

res = np.asanyarray(np.ceil(x))
res = np.floor(x, out=res, where=np.greater_equal(x, 0))

So the np.asanyarray result is now a read-only array, but in the second step it is passed as the out parameter, which tries to write values into it.

If the np.asanyarray and np.ceil calls were swapped, this would work: the input would first be converted to a (read-only) numpy array, and the np.ceil operation would then return a new, writeable array.
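
The failure mode and the swapped order can be sketched with plain numpy. This is an approximation under stated assumptions: the read-only flag is set by hand to simulate what np.asanyarray returns for a Series after this PR, and the variable names are made up for the illustration.

```python
import numpy as np

x = np.array([-1.5, 0.5, 2.7])

# Simulate the array np.asanyarray(np.ceil(series)) would yield after this
# PR: correct values, but flagged read-only (flag set by hand here).
res = np.ceil(x)
res.flags.writeable = False

# Second step of the np.fix recipe: write into `res` through out=.
try:
    np.floor(x, out=res, where=np.greater_equal(x, 0))
    out_rejected = False
except ValueError:
    # numpy refuses to use a read-only array as an output
    out_rejected = True

# Swapped order: convert first, then call np.ceil on the plain ndarray.
x_readonly = x.view()
x_readonly.flags.writeable = False
arr = np.asanyarray(x_readonly)
res_ok = np.ceil(arr)  # a new array owned by numpy, hence writeable
np.floor(arr, out=res_ok, where=np.greater_equal(arr, 0))
```

Reading from a read-only input is fine; only using it as a write target fails, which is why the order of the two calls matters.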

phofl · 2023-03-13T17:50:00Z

thx @jorisvandenbossche

Backport PR #51082: API / CoW: return read-only numpy arrays in .values/to_numpy() (#51933). Co-authored-by: Joris Van den Bossche

jorisvandenbossche · 2023-08-13T21:41:05Z

cc @seberg @phofl my comment above at #51082 (comment) is the case I was remembering, where our change to read-only broke a legitimate case of calling a numpy function on a Series object. Normally, the np.ceil call should give a writeable result, but because np.ceil is a ufunc, the result is wrapped in a Series again. That new writeable array then becomes read-only again when it is converted back to numpy, and passing it to the out param of np.floor gives an error.
From the pandas user's point of view this is an actual regression, while it could still work. I don't know if numpy has a specific reason to do np.asanyarray(np.ceil(x)) instead of np.ceil(np.asanyarray(x))? Although I suppose it is because you want to allow the object to use its overridden version through __array_ufunc__, and converting it to an array first would defeat that.
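
A small sketch of the round trip described above, assuming pandas and numpy are installed. It only prints the writeable flag of the converted array rather than asserting it, since that value depends on the pandas version and on whether Copy-on-Write is active.

```python
import numpy as np
import pandas as pd

ser = pd.Series([-1.5, 0.5, 2.7])

# np.ceil is a ufunc, so pandas intercepts it through __array_ufunc__ and
# returns a Series instead of a plain ndarray.
ceiled = np.ceil(ser)
wrapped_again = isinstance(ceiled, pd.Series)

# Converting that Series back to numpy goes through pandas again, so pandas
# decides the writeable flag (read-only when CoW is active).
back = np.asanyarray(ceiled)
print(type(ceiled).__name__, "writeable:", back.flags.writeable)

# Converting first keeps the computation inside numpy: the result of np.ceil
# is then a new ndarray that numpy owns, which is writeable.
plain = np.ceil(np.asanyarray(ser))
```

The second form bypasses the object's `__array_ufunc__` override, which is the trade-off raised in the question.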

seberg · 2023-08-14T07:34:03Z

The asanyarray is probably really there for an obscure purpose (mostly numpy converting 0-D arrays to scalars). But yeah, it is also needed since only arrays are considered valid out= arguments (although I have to double check whether __array_prepare__ might allow it, but OTOH, I don't think it ever fully worked).

In either case, since we call it along the way, you are right that it would be fine to add an asanyarray call at the beginning.

API / CoW: return read-only numpy arrays in .values/to_numpy()

9fd459c

jorisvandenbossche mentioned this pull request Jan 31, 2023

Copy-on-Write (PDEP-7) follow-up overview issue #48998

Open

38 tasks

mroeschke added the Copy / view semantics label Jan 31, 2023

jorisvandenbossche mentioned this pull request Feb 10, 2023

BUG: avoid StringArray.__setitem__ to mutate the value being set as side-effect #51299

Merged

4 tasks

Merge remote-tracking branch 'upstream/main' into cow-to-numpy-readonly

5a59564

jorisvandenbossche mentioned this pull request Feb 10, 2023

TST: avoid mutating DataFrame.values in tests (use iloc instead) #51301

Merged

jorisvandenbossche added 3 commits February 14, 2023 21:43

Merge remote-tracking branch 'upstream/main' into cow-to-numpy-readonly

1939132

actually commit tests that I wrote

5ec2763

also handle Series/DataFrame __array__

3ade20d

jorisvandenbossche requested a review from phofl February 14, 2023 21:11

phofl reviewed Feb 14, 2023

jorisvandenbossche marked this pull request as ready for review February 15, 2023 21:31

jorisvandenbossche added 3 commits February 15, 2023 22:32

Merge remote-tracking branch 'upstream/main' into cow-to-numpy-readonly

ee3c9f8

fix tests

4123918

Merge remote-tracking branch 'upstream/main' into cow-to-numpy-readonly

3d08b97

phofl reviewed Mar 4, 2023

phofl reviewed Mar 4, 2023

jorisvandenbossche added 3 commits March 6, 2023 11:13

Merge remote-tracking branch 'upstream/main' into cow-to-numpy-readonly

fd9398a

update tests

11b8ecd

add failing test for np.fix

fbbd9aa

jorisvandenbossche added this to the 2.0 milestone Mar 13, 2023

jorisvandenbossche added 2 commits March 13, 2023 11:42

Merge remote-tracking branch 'upstream/main' into cow-to-numpy-readonly

2f08bf2

proper skip

1bc1fed

phofl approved these changes Mar 13, 2023

phofl merged commit bfa7e9f into pandas-dev:main Mar 13, 2023

meeseeksmachine mentioned this pull request Mar 13, 2023

Backport PR #51082 on branch 2.0.x (API / CoW: return read-only numpy arrays in .values/to_numpy()) #51933

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 13, 2023

Backport PR pandas-dev#51082: API / CoW: return read-only numpy array…

7ba6f74

…s in .values/to_numpy()

jorisvandenbossche deleted the cow-to-numpy-readonly branch March 13, 2023 19:01
