-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix copy semantics in __array__
#60046
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This fixes the semantics of ``__array__``. While rejecting ``copy=False`` is pretty harmless, ``copy=True`` should never have been ignored and is dangerous.
ret = take_nd(self.categories._values, self._codes) | ||
if dtype and np.dtype(dtype) != self.categories.dtype: | ||
return np.asarray(ret, dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not understand why this is needed. If dtypes match, NumPy should make it a no-op? If dtype is None, it is the same as not passing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, also not sure why this is here
Thanks for addressing this! Could this be released as a bug fix for v2.x? |
FWIW, if anyone knows how to add decent tests for this, pushing to the PR is appreciated. Otherwise, I may try to figure out where to add at least tests for a few of these paths. |
Hi, Glad to see a PR addressing #59932 ! I was trying to reproduce that issue but couldn't install NumPy >= 2.0 with the main branch, which seems to be because the And by the way, I asked about the bug fix release on Slack:
Quoting the reply from @rhshadrach:
|
You can just |
Found a test that seemed reasonably to expand. I doubt it covers all the branches changed, though (and no, I don't love expanding tests, but I was lazy to split out the parametrization). |
asarray did not support `copy=` on older versions of NumPy
Ping @mroeschke just because it's been a while that it was idle without any tests. Not sure if this needs a milestoned or a release note. |
@@ -1686,13 +1687,20 @@ def __array__( | |||
>>> np.asarray(cat) | |||
array(['a', 'b'], dtype=object) | |||
""" | |||
if copy is False: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If lib.is_range_indexer(self._codes)
i.e. the self.categories._values
are all unique then, a copy could be avoided?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, but I left it out, it doesn't seem vital to try and improve it. Also fixed things up, because take_nd
should presumably always return a copy.
Sorry for the delay. Overall looks good! Could you add a note to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @seberg for working in this!
Added some more comments (some corner cases (like period converted to int being zero-copy) might need additional specific tests, but also happy to work on those)
ret = take_nd(self.categories._values, self._codes) | ||
if dtype and np.dtype(dtype) != self.categories.dtype: | ||
return np.asarray(ret, dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, also not sure why this is here
A few of these were just wrong, a few others are enhancements to allow the cases that clearly should work without copy to still pass. Co-authored-by: Joris Van den Bossche <[email protected]>
Pushed fixes based on the review, thanks; it's quite tricky to know whether or not a copy may have been made. Which also means that expandign the tests would be great (with the new changes also with @jorisvandenbossche would be great if you can look into tests (I can review), since I have to hunt/think a bit more about how to build examples. But if I don't see movement in the next days, I'll hopefully dive in. |
@seberg thanks for the update! I added a bunch of additional tests, that I think should now cover the corner cases for which you updated the code (i.e. PeriodArray and SparseArray corner cases that allow copy=False, MultiIndex never allowing copy=False). |
pandas/core/arrays/masked.py
Outdated
@@ -507,7 +507,7 @@ def to_numpy( | |||
else: | |||
with warnings.catch_warnings(): | |||
warnings.filterwarnings("ignore", category=RuntimeWarning) | |||
data = self._data.astype(dtype, copy=copy) | |||
data = np.array(self._data, dtype=dtype, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made this change together with allowing to get here with copy=False
, to ensure if copy=False
is not actually possible depending on the passed dtype
, it still properly errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, but of course for to_numpy()
itself we want to keep the behaviour of only attempting to avoid a copy, and not raising. Will have to move this logic into __array__
then..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change (together with the self._hasna
change below) make sense to me. But unfortunately, to_numpy(copy=False)
change the meaning of copy=False
this way. And I think this is public API?
Maybe better to special case _hasna
in the branch below? (could even just ignore dtype
with a comment that NumPy is fine with that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, will have to update this to not change to_numpy(copy=False)
, as that is indeed public API we want to keep working that way.
# it always gives a copy by default, but the values are cached, so results | ||
# are still sharing memory | ||
result_copy1 = np.asarray(idx) | ||
result_copy2 = np.asarray(idx) | ||
assert np.may_share_memory(result_copy1, result_copy2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be an unfortunate side effect of our caching, because a user might expect here always to get a new object. But at least when explicitly passing copy=True
, then it appears to do the right thing. But is this numpy that does that? (still copy even if copy=True
was passed down to __array__
?) We should probably not keep relying on that, and do an explicit copy on our side in case of copy=True
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is confusing, but maybe you tested with NumPy 1.x, which I think is the default dev environment?
The test should be failing, and is failing locally to me. I think if copy=True
, we'll have to add an np.array()
call to the __array__
function with a comment that it is there due to caching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems I am testing locally with numpy 2.0.2 (so not 1.x, but also not latest 2.1.x or main)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, maybe there were fixes with 2.1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot, the test additions look good to me modulo a nitpick.
I am worried about the change of the meaning for .to_numpy(copy=False)
. But maybe this is private API?
If this is private API, then it is OK, but:
- The docs should be changed and both branches should behave the same (i.e. raise an error in the first branch if
copy=False
). - The default should be
copy=None
.
If not, we can move that special case to the __array__
.
pandas/core/arrays/masked.py
Outdated
@@ -507,7 +507,7 @@ def to_numpy( | |||
else: | |||
with warnings.catch_warnings(): | |||
warnings.filterwarnings("ignore", category=RuntimeWarning) | |||
data = self._data.astype(dtype, copy=copy) | |||
data = np.array(self._data, dtype=dtype, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change (together with the self._hasna
change below) make sense to me. But unfortunately, to_numpy(copy=False)
change the meaning of copy=False
this way. And I think this is public API?
Maybe better to special case _hasna
in the branch below? (could even just ignore dtype
with a comment that NumPy is fine with that).
if not zero_copy: | ||
with pytest.raises(ValueError, match="Unable to avoid copy while creating"): | ||
# An error is always acceptable for `copy=False` | ||
np.array(thing, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice to make the test specific!
result_nocopy1 = np.array(data, copy=False) | ||
except ValueError: | ||
# An error is always acceptable for `copy=False` | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can make this test specific presumably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this specific case it is harder to make it specific, because we don't know the exact array object and its expected semantics (this is eg also used by external packages implementing an ExtensionArray)
Thanks Joris. With the last two commits, I think this should be good to go in. |
The whatsnew note can probably be improved, but going to merge this already so I can start on backporting the code to 2.3.x. Thanks @seberg! |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
Co-authored-by: Joris Van den Bossche <[email protected]> (cherry picked from commit eacf032)
Manual backport -> #60189 |
|
||
if copy is True: | ||
return arr | ||
if copy is False or astype_is_view(values.dtype, arr.dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more issue I noticed while doing the backport: for a similar case in generic.py we changed the copy is False
to copy is not True
, which I think we should have done here as well.
Will try to come up with a test case that would fail because of this, and do a follow-up tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, ouch. This is clearly hard to get right without tests :/.
The release snippet looks fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also only matters for the copy=None
case, which AFAIU is quite new, and so we didn't cover that when adding the logic here for making the resulting array read-only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking closer, I think this is right? Note that it says copy is False or ...
. If copy is False
we don't have to check whether it is a view (because otherwise it would have errored)?
EDIT: Not sure if it's worth to skip the check though. It seems like it may be more confusing then anything...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's a good point. I assume the arr = np.array(values, dtype=dtype, copy=copy)
line above will already have errored in case of copy=False
if a zero-copy conversion was not possible.
So indeed the or
is fine: if copy=False
we know arr
is always a view, and for copy=None
(the one remaining case) we have astype_is_view
to check it.
Maybe it's then in generic.py
that we could change the copy is not True and ..
to copy is False or ..
(just to use a similar pattern in both cases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking once more at it, we actually already have tests for this, because we test that we get a read-only array using np.asarray(ser)
, which essentially means passing copy=None
.
I expanded the test with np.array(ser, copy=False)
to also explicitly cover the copy=False
case (and ensure this branch setting writeable to False is covered for that case) -> #60191
…60189) (cherry picked from commit eacf032) Co-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Sebastian Berg <[email protected]>
@seberg I am seeing some failures in the geopandas tests using pandas dev because of this change (https://github.com/geopandas/geopandas/actions/runs/11688928152/job/32550520328?pr=3455) Essentially geopandas is using I am wondering if we should leave the The |
No, there was no warning about it, it was a hard change in 2.0... You'll be the better judge for pandas users here. It is wrong to not raise, but it does seem worse to backport So I guess, you may only want to backport the |
@jorisvandenbossche now that this is merged and backported should we have a new issue for this and linked from the 2.3 release discussion for greater visibility?
Seems reasonable to me that we probably don't want to introduce breaking changes in 2.3 |
Yes, opened #60340 and labeled it for 2.3.0 |
This should fix the semantics of
__array__
. While rejectingcopy=False
is OK even if unnecessary,copy=True
should never have been ignored and is dangerous.Closes gh-57739, closes gh-59932, closes #59614
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.I have to figure out the tests. I think a test in form of:
will work nicely. But, right now I am not sure if there is a convenient pattern/parametrization to steal to cover all of the
__array__
implementations here.