Skip to content

BUG: Fix copy semantics in __array__ #60046

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Nov 4, 2024

Conversation

seberg
Copy link
Contributor

@seberg seberg commented Oct 15, 2024

This should fix the semantics of __array__. While rejecting copy=False is OK even if unnecessary, copy=True should never have been ignored and is dangerous.

Closes gh-57739, closes gh-59932, closes #59614

  • Still needs new tests.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

I have to figure out the tests. I think a test in form of:

arr1 = np.asarray(obj, copy=True)
arr2 = np.asarray(obj, copy=True)
assert not np.may_share_memory(arr1, arr2)
# Check that without copy always works:
assert_array_equal(arr1, np.asarray(obj))

if np_ver < 2:
    return  # copy=False semantics not supported

try:
   arr1 = np.asarray(obj, copy=False)
except ValueError:
   return  # An error is acceptable for `copy=False`

# If no error is given, multiple returns must be views:
arr2 = np.asarray(obj, copy=False)
assert np.may_share_memory(arr1, arr2)

will work nicely. But, right now I am not sure if there is a convenient pattern/parametrization to steal to cover all of the __array__ implementations here.

This fixes the semantics of ``__array__``.  While rejecting ``copy=False``
is pretty harmless, ``copy=True`` should never have been ignored and is
dangerous.
ret = take_nd(self.categories._values, self._codes)
if dtype and np.dtype(dtype) != self.categories.dtype:
return np.asarray(ret, dtype)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not understand why this is needed. If dtypes match, NumPy should make it a no-op? If dtype is None, it is the same as not passing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, also not sure why this is here

@ajfriend
Copy link

Thanks for addressing this! Could this be released as a bug fix for v2.x?

@seberg
Copy link
Contributor Author

seberg commented Oct 21, 2024

FWIW, if anyone knows how to add decent tests for this, pushing to the PR is appreciated. Otherwise, I may try to figure out where to add at least tests for a few of these paths.

@chaoyihu
Copy link
Contributor

Hi, Glad to see a PR addressing #59932 ! I was trying to reproduce that issue but couldn't install NumPy >= 2.0 with the main branch, which seems to be because the environment.yml on the current main branch requires numpy<2. How did you resolve it?

And by the way, I asked about the bug fix release on Slack:

Could this be released as a bug fix for v2.x?

Quoting the reply from @rhshadrach:

I think there won't be another 2.2.x release; the next release will be 2.3.0.

@seberg
Copy link
Contributor Author

seberg commented Oct 30, 2024

You can just pip install -U numpy afterwards. Or edit the file.

@seberg
Copy link
Contributor Author

seberg commented Oct 30, 2024

Found a test that seemed reasonably to expand. I doubt it covers all the branches changed, though (and no, I don't love expanding tests, but I was lazy to split out the parametrization).

asarray did not support `copy=` on older versions of NumPy
@seberg
Copy link
Contributor Author

seberg commented Oct 30, 2024

Ping @mroeschke just because it's been a while that it was idle without any tests. Not sure if this needs a milestoned or a release note.

@@ -1686,13 +1687,20 @@ def __array__(
>>> np.asarray(cat)
array(['a', 'b'], dtype=object)
"""
if copy is False:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If lib.is_range_indexer(self._codes) i.e. the self.categories._values are all unique then, a copy could be avoided?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, but I left it out, it doesn't seem vital to try and improve it. Also fixed things up, because take_nd should presumably always return a copy.

@mroeschke mroeschke added the Compat pandas objects compatability with Numpy or Python functions label Oct 30, 2024
@mroeschke mroeschke added this to the 2.3 milestone Oct 30, 2024
@mroeschke
Copy link
Member

Sorry for the delay. Overall looks good! Could you add a note to the v2.3.0.rst release notes?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @seberg for working in this!

Added some more comments (some corner cases (like period converted to int being zero-copy) might need additional specific tests, but also happy to work on those)

ret = take_nd(self.categories._values, self._codes)
if dtype and np.dtype(dtype) != self.categories.dtype:
return np.asarray(ret, dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, also not sure why this is here

A few of these were just wrong, a few others are enhancements to
allow the cases that clearly should work without copy to still pass.

Co-authored-by: Joris Van den Bossche <[email protected]>
@seberg
Copy link
Contributor Author

seberg commented Nov 4, 2024

Pushed fixes based on the review, thanks; it's quite tricky to know whether or not a copy may have been made.

Which also means that expandign the tests would be great (with the new changes also with dtype= which I am not sure how well it works in the current test).

@jorisvandenbossche would be great if you can look into tests (I can review), since I have to hunt/think a bit more about how to build examples. But if I don't see movement in the next days, I'll hopefully dive in.

@jorisvandenbossche
Copy link
Member

@seberg thanks for the update! I added a bunch of additional tests, that I think should now cover the corner cases for which you updated the code (i.e. PeriodArray and SparseArray corner cases that allow copy=False, MultiIndex never allowing copy=False).
Also expanded the existing test a bit and added a basic base extension array test.

@@ -507,7 +507,7 @@ def to_numpy(
else:
with warnings.catch_warnings():
warnings.filterwarnings("ignore", category=RuntimeWarning)
data = self._data.astype(dtype, copy=copy)
data = np.array(self._data, dtype=dtype, copy=copy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change together with allowing to get here with copy=False, to ensure if copy=False is not actually possible depending on the passed dtype, it still properly errors

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, but of course for to_numpy() itself we want to keep the behaviour of only attempting to avoid a copy, and not raising. Will have to move this logic into __array__ then..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change (together with the self._hasna change below) make sense to me. But unfortunately, to_numpy(copy=False) change the meaning of copy=False this way. And I think this is public API?

Maybe better to special case _hasna in the branch below? (could even just ignore dtype with a comment that NumPy is fine with that).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, will have to update this to not change to_numpy(copy=False), as that is indeed public API we want to keep working that way.

Comment on lines +35 to +39
# it always gives a copy by default, but the values are cached, so results
# are still sharing memory
result_copy1 = np.asarray(idx)
result_copy2 = np.asarray(idx)
assert np.may_share_memory(result_copy1, result_copy2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be an unfortunate side effect of our caching, because a user might expect here always to get a new object. But at least when explicitly passing copy=True, then it appears to do the right thing. But is this numpy that does that? (still copy even if copy=True was passed down to __array__?) We should probably not keep relying on that, and do an explicit copy on our side in case of copy=True?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing, but maybe you tested with NumPy 1.x, which I think is the default dev environment?

The test should be failing, and is failing locally to me. I think if copy=True, we'll have to add an np.array() call to the __array__ function with a comment that it is there due to caching.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems I am testing locally with numpy 2.0.2 (so not 1.x, but also not latest 2.1.x or main)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, maybe there were fixes with 2.1.

Copy link
Contributor Author

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, the test additions look good to me modulo a nitpick.

I am worried about the change of the meaning for .to_numpy(copy=False). But maybe this is private API?

If this is private API, then it is OK, but:

  • The docs should be changed and both branches should behave the same (i.e. raise an error in the first branch if copy=False).
  • The default should be copy=None.

If not, we can move that special case to the __array__.

@@ -507,7 +507,7 @@ def to_numpy(
else:
with warnings.catch_warnings():
warnings.filterwarnings("ignore", category=RuntimeWarning)
data = self._data.astype(dtype, copy=copy)
data = np.array(self._data, dtype=dtype, copy=copy)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change (together with the self._hasna change below) make sense to me. But unfortunately, to_numpy(copy=False) change the meaning of copy=False this way. And I think this is public API?

Maybe better to special case _hasna in the branch below? (could even just ignore dtype with a comment that NumPy is fine with that).

if not zero_copy:
with pytest.raises(ValueError, match="Unable to avoid copy while creating"):
# An error is always acceptable for `copy=False`
np.array(thing, copy=False)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice to make the test specific!

result_nocopy1 = np.array(data, copy=False)
except ValueError:
# An error is always acceptable for `copy=False`
return
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make this test specific presumably.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this specific case it is harder to make it specific, because we don't know the exact array object and its expected semantics (this is eg also used by external packages implementing an ExtensionArray)

@seberg
Copy link
Contributor Author

seberg commented Nov 4, 2024

Thanks Joris. With the last two commits, I think this should be good to go in.

@jorisvandenbossche
Copy link
Member

The whatsnew note can probably be improved, but going to merge this already so I can start on backporting the code to 2.3.x.

Thanks @seberg!

@jorisvandenbossche jorisvandenbossche merged commit eacf032 into pandas-dev:main Nov 4, 2024
50 of 51 checks passed
Copy link

lumberbot-app bot commented Nov 4, 2024

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.3.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 eacf0326efb709169ebc49f040834670dfe4beb3
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #60046: BUG: Fix copy semantics in ``__array__``'
  1. Push to a named branch:
git push YOURFORK 2.3.x:auto-backport-of-pr-60046-on-2.3.x
  1. Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60046 on branch 2.3.x (BUG: Fix copy semantics in __array__)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Nov 4, 2024
Co-authored-by: Joris Van den Bossche <[email protected]>
(cherry picked from commit eacf032)
@jorisvandenbossche
Copy link
Member

Manual backport -> #60189


if copy is True:
return arr
if copy is False or astype_is_view(values.dtype, arr.dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more issue I noticed while doing the backport: for a similar case in generic.py we changed the copy is False to copy is not True, which I think we should have done here as well.

Will try to come up with a test case that would fail because of this, and do a follow-up tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, ouch. This is clearly hard to get right without tests :/.

The release snippet looks fine to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also only matters for the copy=None case, which AFAIU is quite new, and so we didn't cover that when adding the logic here for making the resulting array read-only

Copy link
Contributor Author

@seberg seberg Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking closer, I think this is right? Note that it says copy is False or .... If copy is False we don't have to check whether it is a view (because otherwise it would have errored)?

EDIT: Not sure if it's worth to skip the check though. It seems like it may be more confusing then anything...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a good point. I assume the arr = np.array(values, dtype=dtype, copy=copy) line above will already have errored in case of copy=False if a zero-copy conversion was not possible.
So indeed the or is fine: if copy=False we know arr is always a view, and for copy=None (the one remaining case) we have astype_is_view to check it.

Maybe it's then in generic.py that we could change the copy is not True and .. to copy is False or .. (just to use a similar pattern in both cases)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking once more at it, we actually already have tests for this, because we test that we get a read-only array using np.asarray(ser), which essentially means passing copy=None.

I expanded the test with np.array(ser, copy=False) to also explicitly cover the copy=False case (and ensure this branch setting writeable to False is covered for that case) -> #60191

jorisvandenbossche added a commit that referenced this pull request Nov 5, 2024
…60189)

(cherry picked from commit eacf032)

Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Sebastian Berg <[email protected]>
@seberg seberg deleted the array-copy branch November 5, 2024 07:04
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 5, 2024

@seberg I am seeing some failures in the geopandas tests using pandas dev because of this change (https://github.com/geopandas/geopandas/actions/runs/11688928152/job/32550520328?pr=3455)

Essentially geopandas is using np.array(pandas_object, copy=False) in a few places. And in the past this has always worked, just silently returning a copy (I assume we just passed copy=False in the idea of avoiding a copy if not needed, it's not that the code relied on it returning a view).
I think for numpy itself you made this behaviour change of np.array (to start raising an error) on 2.0? Did you warn about it in advance? (I think I remember updating such a few cases as part of "make the package compatible with numpy 2.0")

I am wondering if we should leave the copy=False-raising-error part of the changes for pandas 3.0, instead of including that in pandas 2.3.

The copy=True part of the change (actually copying when asked) is I think the main bug fix we definitely should include in 2.3, as far as I understand (I think the issues people have been reporting in #57739 would be resolved by this part)

@seberg
Copy link
Contributor Author

seberg commented Nov 5, 2024

No, there was no warning about it, it was a hard change in 2.0...

You'll be the better judge for pandas users here. It is wrong to not raise, but it does seem worse to backport copy=False for sure, since copy=False will be exceedingly rare anyway in code.
If this was right from the start geopandas would have had to fix this as part of NumPy 2 transition, but that ship has sailed.

So I guess, you may only want to backport the copy=True fix, since it seems important, but not touch copy=False even if it is a bug.
If the next release is 3.0, I guess that is good enough to just do the change without deprecation then.

@simonjayhawkins
Copy link
Member

I am seeing some failures in the geopandas tests using pandas dev because of this change

@jorisvandenbossche now that this is merged and backported should we have a new issue for this and linked from the 2.3 release discussion for greater visibility?

I am wondering if we should leave the copy=False-raising-error part of the changes for pandas 3.0, instead of including that in pandas 2.3.

Seems reasonable to me that we probably don't want to introduce breaking changes in 2.3

@jorisvandenbossche
Copy link
Member

Yes, opened #60340 and labeled it for 2.3.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions
Projects
None yet
6 participants