Skip to content

API / CoW: return read-only numpy arrays in .values/to_numpy() #51082

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Mar 13, 2023

Conversation

jorisvandenbossche
Copy link
Member

Context: with the Copy-on-Write implementation (see overview follow up issue #48998), we can avoid that mutating one pandas object doesn't update another pandas object. But users can still easily get a viewing numpy array, and mutate that one. And at that point, we don't have any control over how this mutation propagates (it might update more objects than just the one from which the user obtained it, for example if other Series/DataFrames were sharing data with this object with CoW).

This is a draft PR just starting to explore this (see how much fails in our own test suite, to get an idea of the impact)

@jorisvandenbossche
Copy link
Member Author

This needs some more tests, and currently also doesn't yet handle EAs (eg np.asarray(Series[Int64]))

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review February 15, 2023 21:31
request.node.add_marker(
pytest.mark.xfail(reason="transpose doesn't yet do CoW")
)
assert (float_frame.values[5:10] != 5).all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transpose if doing CoW now, I think you can remove the xfail (CI is failing as well here)

# 10264
df = DataFrame(
np.zeros((5, 5), dtype="int64"),
columns=["a", "b", "c", "d", "e"],
index=range(5),
)
df["f"] = 0
df_orig = df.copy()
# TODO(CoW) protect underlying values of being written to?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove the TODO?

@phofl
Copy link
Member

phofl commented Mar 4, 2023

Couple smaller comments. I think this is something we should document explicitly. But can do as follow up I guesss

@jorisvandenbossche
Copy link
Member Author

I added an explicit test for the failing np.fix case (currently this was called randomly in an unrelated indexing test).

The problem with np.fix is that this is basically implemented under the hood as:

res = np.asanyarray(np.ceil(x))
res = np.floor(x, out=res, where=np.greater_equal(x, 0))

And so the np.asanyarray result now gives a read-only array, but this is passed in a second step as the out parameter, trying to write values into it.

In this case if the np.asanyarray and np.ceil calls would be inversed, it would work (first converting to read-only numpy array, and then np.ceil operation would effectively return a new writeable array)

@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Mar 13, 2023
@phofl phofl merged commit bfa7e9f into pandas-dev:main Mar 13, 2023
@phofl
Copy link
Member

phofl commented Mar 13, 2023

thx @jorisvandenbossche

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 13, 2023
@jorisvandenbossche jorisvandenbossche deleted the cow-to-numpy-readonly branch March 13, 2023 19:01
mroeschke pushed a commit that referenced this pull request Mar 13, 2023
… arrays in .values/to_numpy()) (#51933)

Backport PR #51082: API / CoW: return read-only numpy arrays in .values/to_numpy()

Co-authored-by: Joris Van den Bossche <[email protected]>
@jorisvandenbossche
Copy link
Member Author

cc @seberg @phofl my comment above at #51082 (comment) is the case that I was remembering where our change to read-only broke a legit case of calling a numpy function on a Series object: normally, the np.ceil call should make it writeable, but because this is wrapped again in a Series (since np.ceil is a ufunc), this new writeable array becomes again read-only when converted back to numpy, and then passing this to the out param in np.floor gives an error.
This causes an actual regression from the pandas' user point of view, while it could still work. I don't know if for numpy there is a specific reason to do np.asanyarray(np.ceil(x)) instead of np.ceil(np.asanyarray(x))? Although I suppose because you want to allow the object to use its overridden version through __array_ufunc__, and converting it to an array first would defeat that.

@seberg
Copy link
Contributor

seberg commented Aug 14, 2023

The asanyarray is really there for an obscure purpose probably (numpy converting 0-D arrays to scalars mostly). But yeah, it is also needed since only arrays are considered valid out= arguments (although I have to double check __array_prepare__ might allow it, but OTOH, I don't think it ever fully worked.

In either case, since we call it along the way, you are right that it would be fine to add an asanyarray call at the beginning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants