BUG: Fix copy semantics in __array__ #60046

seberg · 2024-10-15T11:06:31Z

This should fix the semantics of __array__. While rejecting copy=False is OK even if unnecessary, copy=True should never have been ignored and is dangerous.

Closes gh-57739, closes gh-59932, closes #59614

Still needs new tests.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

I have to figure out the tests. I think a test in form of:

arr1 = np.asarray(obj, copy=True)
arr2 = np.asarray(obj, copy=True)
assert not np.may_share_memory(arr1, arr2)
# Check that without copy always works:
assert_array_equal(arr1, np.asarray(obj))

if np_ver < 2:
    return  # copy=False semantics not supported

try:
    arr1 = np.asarray(obj, copy=False)
except ValueError:
    return  # An error is acceptable for `copy=False`

# If no error is given, multiple returns must be views:
arr2 = np.asarray(obj, copy=False)
assert np.may_share_memory(arr1, arr2)

will work nicely. But, right now I am not sure if there is a convenient pattern/parametrization to steal to cover all of the __array__ implementations here.
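A minimal sketch of how the check above could be wrapped into a reusable helper so that it can be parametrized over objects. `ToyArrayLike` and `check_array_copy_semantics` are hypothetical names used only for illustration; neither is part of pandas or of this PR. The helper calls `np.array` rather than `np.asarray`, since `asarray` only accepts `copy=` on NumPy 2.0 and later.

```python
import numpy as np
from numpy.testing import assert_array_equal

NP_MAJOR = int(np.__version__.split(".")[0])


class ToyArrayLike:
    """Hypothetical stand-in for an object implementing ``__array__``."""

    def __init__(self, values):
        self._values = np.array(values)

    def __array__(self, dtype=None, copy=None):
        needs_cast = dtype is not None and np.dtype(dtype) != self._values.dtype
        if needs_cast:
            if copy is False:
                raise ValueError("Unable to avoid copy while creating an array.")
            return self._values.astype(dtype)  # a cast always yields a new array
        if copy is True:
            return self._values.copy()
        return self._values  # copy=None or copy=False: no copy needed


def check_array_copy_semantics(obj):
    # copy=True must hand out independent arrays every time.
    arr1 = np.array(obj, copy=True)
    arr2 = np.array(obj, copy=True)
    assert not np.may_share_memory(arr1, arr2)
    # Conversion without a copy argument must always work.
    assert_array_equal(arr1, np.asarray(obj))

    if NP_MAJOR < 2:
        return  # copy=False means "copy if needed" on NumPy 1.x

    try:
        arr1 = np.array(obj, copy=False)
    except ValueError:
        return  # an error is acceptable for copy=False
    # If no error is raised, repeated conversions must be views.
    arr2 = np.array(obj, copy=False)
    assert np.may_share_memory(arr1, arr2)


check_array_copy_semantics(ToyArrayLike([1, 2, 3]))
```

With a helper of this shape, covering more `__array__` implementations would mostly be a matter of passing different objects into it, for example through a pytest parametrization.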

seberg · 2024-10-15T13:02:22Z

pandas/core/arrays/categorical.py

        ret = take_nd(self.categories._values, self._codes)
-        if dtype and np.dtype(dtype) != self.categories.dtype:
-            return np.asarray(ret, dtype)


I did not understand why this is needed. If dtypes match, NumPy should make it a no-op? If dtype is None, it is the same as not passing.

Yeah, also not sure why this is here
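The no-op behaviour referred to in this thread can be checked directly on a plain NumPy array. This is only a small illustration with a made-up `ret` array, not the categorical code path itself:

```python
import numpy as np

ret = np.array(["a", "b"], dtype=object)

# dtype=None behaves like not passing a dtype: the input itself is returned.
assert np.asarray(ret, dtype=None) is ret
# A matching dtype is a no-op as well.
assert np.asarray(ret, dtype=object) is ret
# Only a different dtype produces a new array.
as_str = np.asarray(ret, dtype=str)
assert as_str is not ret
assert as_str.dtype.kind == "U"
```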

ajfriend · 2024-10-21T15:59:26Z

Thanks for addressing this! Could this be released as a bug fix for v2.x?

seberg · 2024-10-21T18:21:19Z

FWIW, if anyone knows how to add decent tests for this, pushing to the PR is appreciated. Otherwise, I may try to figure out where to add at least tests for a few of these paths.

chaoyihu · 2024-10-29T19:05:07Z

Hi, glad to see a PR addressing #59932! I was trying to reproduce that issue but couldn't install NumPy >= 2.0 with the main branch, apparently because the environment.yml on the current main branch requires numpy<2. How did you resolve it?

And by the way, I asked about the bug fix release on Slack:

Could this be released as a bug fix for v2.x?

Quoting the reply from @rhshadrach:

I think there won't be another 2.2.x release; the next release will be 2.3.0.

seberg · 2024-10-30T07:55:24Z

You can just pip install -U numpy afterwards. Or edit the file.

seberg · 2024-10-30T14:54:05Z

Found a test that seemed reasonable to expand. I doubt it covers all the branches changed, though (and no, I don't love expanding tests, but I was too lazy to split out the parametrization).

asarray did not support `copy=` on older versions of NumPy

seberg · 2024-10-30T20:53:52Z

Ping @mroeschke just because this has been idle for a while without any tests. Not sure if this needs a milestone or a release note.

mroeschke · 2024-10-30T21:06:24Z

pandas/core/arrays/categorical.py

@@ -1686,13 +1687,20 @@ def __array__(
        >>> np.asarray(cat)
        array(['a', 'b'], dtype=object)
        """
+        if copy is False:


If lib.is_range_indexer(self._codes), i.e. the self.categories._values are all unique, then a copy could be avoided?

Probably, but I left it out, it doesn't seem vital to try and improve it. Also fixed things up, because take_nd should presumably always return a copy.
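To illustrate why the take step can be treated as always copying, here is a small sketch that uses NumPy's own `take` as a stand-in for the pandas-internal `take_nd`, assuming the two behave alike in this respect. The range check is a plain NumPy comparison, not the pandas helper.

```python
import numpy as np

categories = np.array(["a", "b", "c"], dtype=object)
codes = np.arange(len(categories))  # identity indexer: 0, 1, 2

# Stand-in for a range-indexer check (not the pandas helper itself).
is_range = np.array_equal(codes, np.arange(len(codes)))
assert is_range

taken = categories.take(codes)
assert taken.tolist() == categories.tolist()
# Even for an identity indexer, take() allocates a new array.
assert not np.shares_memory(taken, categories)
```

Avoiding the copy in the range-indexer case would therefore need an explicit shortcut that returns the categories' values directly, which is the optimization that was left out here.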

mroeschke · 2024-10-30T21:17:18Z

Sorry for the delay. Overall looks good! Could you add a note to the v2.3.0.rst release notes?

jorisvandenbossche

Thanks a lot @seberg for working on this!

Added some more comments (some corner cases (like period converted to int being zero-copy) might need additional specific tests, but also happy to work on those)

pandas/core/arrays/arrow/array.py

jorisvandenbossche · 2024-10-31T13:35:22Z

pandas/core/arrays/categorical.py

        ret = take_nd(self.categories._values, self._codes)
-        if dtype and np.dtype(dtype) != self.categories.dtype:
-            return np.asarray(ret, dtype)


Yeah, also not sure why this is here

pandas/core/arrays/datetimelike.py

pandas/core/arrays/period.py

pandas/core/arrays/sparse/array.py

pandas/core/generic.py

pandas/core/indexes/multi.py

A few of these were just wrong, a few others are enhancements to allow the cases that clearly should work without copy to still pass. Co-authored-by: Joris Van den Bossche <[email protected]>

seberg · 2024-11-04T11:06:49Z

Pushed fixes based on the review, thanks; it's quite tricky to know whether or not a copy may have been made.

Which also means that expanding the tests would be great (with the new changes, also with dtype=, which I am not sure works well in the current test).

@jorisvandenbossche would be great if you can look into tests (I can review), since I have to hunt/think a bit more about how to build examples. But if I don't see movement in the next days, I'll hopefully dive in.

…r masked arrays in case of no NAs

jorisvandenbossche · 2024-11-04T15:20:22Z

@seberg thanks for the update! I added a bunch of additional tests, that I think should now cover the corner cases for which you updated the code (i.e. PeriodArray and SparseArray corner cases that allow copy=False, MultiIndex never allowing copy=False).
Also expanded the existing test a bit and added a basic base extension array test.

jorisvandenbossche · 2024-11-04T15:23:27Z

pandas/core/arrays/masked.py

@@ -507,7 +507,7 @@ def to_numpy(
        else:
            with warnings.catch_warnings():
                warnings.filterwarnings("ignore", category=RuntimeWarning)
-                data = self._data.astype(dtype, copy=copy)
+                data = np.array(self._data, dtype=dtype, copy=copy)


Made this change together with allowing this branch to be reached with copy=False, to ensure that it still errors properly when copy=False is not actually possible for the passed dtype.

Hmm, but of course for to_numpy() itself we want to keep the behaviour of only attempting to avoid a copy, and not raising. Will have to move this logic into __array__ then..

This change (together with the self._hasna change below) makes sense to me. But unfortunately, it changes the meaning of copy=False for to_numpy(copy=False). And I think this is public API?

Maybe better to special case _hasna in the branch below? (could even just ignore dtype with a comment that NumPy is fine with that).

Yeah, will have to update this to not change to_numpy(copy=False), as that is indeed public API we want to keep working that way.
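The two meanings of copy=False discussed in this thread can be contrasted in a short sketch. `ndarray.astype(..., copy=False)` is best-effort and never raises for lack of a zero-copy path, while `strict_no_copy` is a hypothetical helper showing the stricter behaviour expected from `__array__`; it is not the code that ended up in pandas.

```python
import numpy as np

data = np.array([1, 2, 3], dtype="int64")

# Best-effort semantics: copy=False avoids the copy when it can ...
same = data.astype("int64", copy=False)
assert same is data
# ... and silently copies when a cast is required.
cast = data.astype("float64", copy=False)
assert not np.shares_memory(cast, data)


def strict_no_copy(values, dtype=None):
    """Hypothetical helper: raise if a zero-copy result is impossible."""
    if dtype is not None and np.dtype(dtype) != values.dtype:
        raise ValueError("Unable to avoid copy while creating an array.")
    return values


assert strict_no_copy(data, "int64") is data
try:
    strict_no_copy(data, "float64")
except ValueError as exc:
    print("raised:", exc)
```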

jorisvandenbossche · 2024-11-04T15:27:01Z

pandas/tests/indexes/multi/test_conversion.py

+    # it always gives a copy by default, but the values are cached, so results
+    # are still sharing memory
+    result_copy1 = np.asarray(idx)
+    result_copy2 = np.asarray(idx)
+    assert np.may_share_memory(result_copy1, result_copy2)


This might be an unfortunate side effect of our caching, because a user might expect here always to get a new object. But at least when explicitly passing copy=True, then it appears to do the right thing. But is this numpy that does that? (still copy even if copy=True was passed down to __array__?) We should probably not keep relying on that, and do an explicit copy on our side in case of copy=True?

This is confusing, but maybe you tested with NumPy 1.x, which I think is the default dev environment?

The test should be failing, and is failing locally to me. I think if copy=True, we'll have to add an np.array() call to the __array__ function with a comment that it is there due to caching.

It seems I am testing locally with numpy 2.0.2 (so not 1.x, but also not latest 2.1.x or main)

Ah, maybe there were fixes with 2.1.
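A toy sketch of the caching situation described in this thread, calling `__array__` directly so that the result does not depend on the installed NumPy version. `CachedValues` is a hypothetical class, not the MultiIndex implementation; it assumes that values are materialized once, cached, and that a zero-copy result is never possible.

```python
import numpy as np


class CachedValues:
    """Toy object that materializes its values once and caches them."""

    def __init__(self, items):
        self._items = list(items)
        self._cache = None

    @property
    def values(self):
        if self._cache is None:
            self._cache = np.array(self._items)
        return self._cache

    def __array__(self, dtype=None, copy=None):
        if copy is False:
            # Values always have to be materialized, so zero-copy is impossible.
            raise ValueError("Unable to avoid copy while creating an array.")
        if copy is True:
            # Explicit copy: self.values is cached and shared between calls.
            return np.array(self.values, dtype=dtype, copy=True)
        return self.values if dtype is None else self.values.astype(dtype)


obj = CachedValues([1, 2, 3])
default1 = obj.__array__()
default2 = obj.__array__()
assert default1 is default2  # the cache is handed out as-is by default
explicit = obj.__array__(copy=True)
assert not np.shares_memory(explicit, default1)
```

Doing the copy inside `__array__` means the result is independent of the cache regardless of whether the calling NumPy version makes an additional copy of its own.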

seberg

Thanks a lot, the test additions look good to me modulo a nitpick.

I am worried about the change of the meaning for .to_numpy(copy=False). But maybe this is private API?

If this is private API, then it is OK, but:

The docs should be changed and both branches should behave the same (i.e. raise an error in the first branch if copy=False).
The default should be copy=None.

If not, we can move that special case to the __array__.

seberg · 2024-11-04T15:29:51Z

pandas/core/arrays/masked.py

@@ -507,7 +507,7 @@ def to_numpy(
        else:
            with warnings.catch_warnings():
                warnings.filterwarnings("ignore", category=RuntimeWarning)
-                data = self._data.astype(dtype, copy=copy)
+                data = np.array(self._data, dtype=dtype, copy=copy)


This change (together with the self._hasna change below) makes sense to me. But unfortunately, it changes the meaning of copy=False for to_numpy(copy=False). And I think this is public API?

Maybe better to special case _hasna in the branch below? (could even just ignore dtype with a comment that NumPy is fine with that).

seberg · 2024-11-04T15:33:16Z

pandas/tests/base/test_conversion.py

+    if not zero_copy:
+        with pytest.raises(ValueError, match="Unable to avoid copy while creating"):
+            # An error is always acceptable for `copy=False`
+            np.array(thing, copy=False)


Very nice to make the test specific!

seberg · 2024-11-04T15:34:29Z

pandas/tests/extension/base/interface.py

+            result_nocopy1 = np.array(data, copy=False)
+        except ValueError:
+            # An error is always acceptable for `copy=False`
+            return


You can make this test specific presumably.

For this specific case it is harder to make it specific, because we don't know the exact array object and its expected semantics (this is eg also used by external packages implementing an ExtensionArray)

seberg · 2024-11-04T17:53:15Z

Thanks Joris. With the last two commits, I think this should be good to go in.

jorisvandenbossche · 2024-11-04T20:19:43Z

The whatsnew note can probably be improved, but going to merge this already so I can start on backporting the code to 2.3.x.

Thanks @seberg!

lumberbot-app · 2024-11-04T20:20:38Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.3.x
git pull

Cherry-pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 eacf0326efb709169ebc49f040834670dfe4beb3

You will likely have some merge/cherry-pick conflict here, fix them and commit:

git commit -am 'Backport PR #60046: BUG: Fix copy semantics in ``__array__``'

Push to a named branch:

git push YOURFORK 2.3.x:auto-backport-of-pr-60046-on-2.3.x

Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60046 on branch 2.3.x (BUG: Fix copy semantics in __array__)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

Co-authored-by: Joris Van den Bossche <[email protected]> (cherry picked from commit eacf032)

jorisvandenbossche · 2024-11-04T20:32:33Z

Manual backport -> #60189

jorisvandenbossche · 2024-11-04T20:36:56Z

pandas/core/series.py

+
+        if copy is True:
+            return arr
+        if copy is False or astype_is_view(values.dtype, arr.dtype):


One more issue I noticed while doing the backport: for a similar case in generic.py we changed the copy is False to copy is not True, which I think we should have done here as well.

Will try to come up with a test case that would fail because of this, and do a follow-up tomorrow.

Thanks, ouch. This is clearly hard to get right without tests :/.

The release snippet looks fine to me.

It also only matters for the copy=None case, which AFAIU is quite new, and so we didn't cover that when adding the logic here for making the resulting array read-only

Looking closer, I think this is right? Note that it says copy is False or .... If copy is False we don't have to check whether it is a view (because otherwise it would have errored)?

EDIT: Not sure if it's worth skipping the check though. It seems like it may be more confusing than anything...

Ah, that's a good point. I assume the arr = np.array(values, dtype=dtype, copy=copy) line above will already have errored in case of copy=False if a zero-copy conversion was not possible.
So indeed the or is fine: if copy=False we know arr is always a view, and for copy=None (the one remaining case) we have astype_is_view to check it.

Maybe it's then in generic.py that we could change the copy is not True and .. to copy is False or .. (just to use a similar pattern in both cases)

Looking once more at it, we actually already have tests for this, because we test that we get a read-only array using np.asarray(ser), which essentially means passing copy=None.

I expanded the test with np.array(ser, copy=False) to also explicitly cover the copy=False case (and ensure this branch setting writeable to False is covered for that case) -> #60191
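The branch logic discussed in this thread can be sketched as follows. `series_like_array` is a hypothetical function, and `np.shares_memory` stands in for the pandas-internal `astype_is_view` check, so this is an approximation of the pattern and not the code in series.py.

```python
import numpy as np


def series_like_array(values, dtype=None, copy=None):
    """Hypothetical sketch of the read-only handling for views."""
    needs_cast = dtype is not None and np.dtype(dtype) != values.dtype
    if copy is False and needs_cast:
        raise ValueError("Unable to avoid copy while creating an array.")
    if copy is True:
        # A fresh copy is independent of the source and may stay writeable.
        return np.array(values, dtype=dtype, copy=True)
    arr = values.astype(dtype) if needs_cast else values
    # copy=False is known to be a view at this point (otherwise it raised);
    # for copy=None the memory check decides.
    if copy is False or np.shares_memory(arr, values):
        arr = arr.view()
        arr.flags.writeable = False
    return arr


values = np.array([1.0, 2.0, 3.0])
view = series_like_array(values)              # copy=None, no cast
strict = series_like_array(values, copy=False)
fresh = series_like_array(values, copy=True)
assert not view.flags.writeable and not strict.flags.writeable
assert fresh.flags.writeable and values.flags.writeable
```

Setting the flag on `arr.view()` rather than on `arr` itself keeps the source array writeable while the returned object is read-only.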

…60189) (cherry picked from commit eacf032) Co-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Sebastian Berg <[email protected]>

jorisvandenbossche · 2024-11-05T17:16:28Z

@seberg I am seeing some failures in the geopandas tests using pandas dev because of this change (https://github.com/geopandas/geopandas/actions/runs/11688928152/job/32550520328?pr=3455)

Essentially geopandas is using np.array(pandas_object, copy=False) in a few places. And in the past this has always worked, just silently returning a copy (I assume we just passed copy=False in the idea of avoiding a copy if not needed, it's not that the code relied on it returning a view).
I think for numpy itself you made this behaviour change of np.array (to start raising an error) in 2.0? Did you warn about it in advance? (I think I remember updating a few such cases as part of "make the package compatible with numpy 2.0")

I am wondering if we should leave the copy=False-raising-error part of the changes for pandas 3.0, instead of including that in pandas 2.3.

The copy=True part of the change (actually copying when asked) is I think the main bug fix we definitely should include in 2.3, as far as I understand (I think the issues people have been reporting in #57739 would be resolved by this part)
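For downstream code that passed copy=False only to mean "avoid a copy if possible", `np.asarray` expresses that intent on both NumPy 1.x and 2.x, as far as I know. A small illustration with a plain array (the same call accepts any object implementing `__array__`):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0])

# asarray returns the input when it already satisfies the request ...
assert np.asarray(values) is values
# ... and copies only when a conversion is needed, without raising.
converted = np.asarray(values, dtype="int64")
assert not np.shares_memory(converted, values)
assert converted.tolist() == [1, 2, 3]
```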

seberg · 2024-11-05T21:21:35Z

No, there was no warning about it, it was a hard change in 2.0...

You'll be the better judge for pandas users here. It is wrong to not raise, but it does seem worse to backport copy=False for sure, since copy=False will be exceedingly rare anyway in code.
If this was right from the start geopandas would have had to fix this as part of NumPy 2 transition, but that ship has sailed.

So I guess, you may only want to backport the copy=True fix, since it seems important, but not touch copy=False even if it is a bug.
If the next release is 3.0, I guess that is good enough to just do the change without deprecation then.

simonjayhawkins · 2024-11-15T12:17:52Z

I am seeing some failures in the geopandas tests using pandas dev because of this change

@jorisvandenbossche now that this is merged and backported should we have a new issue for this and linked from the 2.3 release discussion for greater visibility?

I am wondering if we should leave the copy=False-raising-error part of the changes for pandas 3.0, instead of including that in pandas 2.3.

Seems reasonable to me that we probably don't want to introduce breaking changes in 2.3

jorisvandenbossche · 2024-11-16T19:40:14Z

Yes, opened #60340 and labeled it for 2.3.0

BUG: Fix copy semantics in __array__

1c5195c

This fixes the semantics of ``__array__``. While rejecting ``copy=False`` is pretty harmless, ``copy=True`` should never have been ignored and is dangerous.

seberg force-pushed the array-copy branch from 9a02772 to 1c5195c Compare October 15, 2024 11:20

seberg added 3 commits October 15, 2024 13:34

BUG: Fix one more path not translating copy= correctly

2183861

BUG: Avoid asarray with copy= (it was added in 2.0)

404827b

More fixes found by typing checks (or working around them)

ec08728

seberg commented Oct 15, 2024

TST: Add test for __array__ copy behavior

4ac6323

TST: Fixup test to use array rather than asarray

9b6c209

asarray did not support `copy=` on older versions of NumPy

mroeschke reviewed Oct 30, 2024

mroeschke added the Compat pandas objects compatability with Numpy or Python functions label Oct 30, 2024

mroeschke added this to the 2.3 milestone Oct 30, 2024

jorisvandenbossche reviewed Oct 31, 2024

BUG: Fixup __array__ copy paths based on review

77058df

A few of these were just wrong, a few others are enhancements to allow the cases that clearly should work without copy to still pass. Co-authored-by: Joris Van den Bossche <[email protected]>

jorisvandenbossche added 5 commits November 4, 2024 15:39

fix period case and add specific test

6799f55

update test to be explicit about copy vs nocopy + allow copy=False fo…

4217baf

…r masked arrays in case of no NAs

add similar test to base extension tests

5e4cb87

add specific test for sparse corner case

357f8a0

add specific test for MultiIndex

9927903

jorisvandenbossche reviewed Nov 4, 2024

seberg commented Nov 4, 2024

jorisvandenbossche added 2 commits November 4, 2024 17:58

fix MultiIndex copy=True case for recent numpy

3b000be

fix copy=False case for masked array

421f904

mroeschke requested a review from jorisvandenbossche November 4, 2024 18:22

mroeschke approved these changes Nov 4, 2024

jorisvandenbossche added 2 commits November 4, 2024 20:40

add whatsnew note

d70405e

Merge remote-tracking branch 'upstream/main' into array-copy

5289a82

jorisvandenbossche approved these changes Nov 4, 2024

jorisvandenbossche merged commit eacf032 into pandas-dev:main Nov 4, 2024
50 of 51 checks passed

lumberbot-app bot added the Still Needs Manual Backport label Nov 4, 2024

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Nov 4, 2024

BUG: Fix copy semantics in __array__ (pandas-dev#60046)

3088043

Co-authored-by: Joris Van den Bossche <[email protected]> (cherry picked from commit eacf032)

jorisvandenbossche mentioned this pull request Nov 4, 2024

[backport 2.3.x] BUG: Fix copy semantics in __array__ (#60046) #60189

Merged

jorisvandenbossche removed the Still Needs Manual Backport label Nov 4, 2024

jorisvandenbossche reviewed Nov 4, 2024

seberg deleted the array-copy branch November 5, 2024 07:04

jorisvandenbossche mentioned this pull request Nov 5, 2024

TST: add extra test case for np.array(obj, copy=False) read-only behaviour #60191

Merged

jorisvandenbossche mentioned this pull request Nov 5, 2024

TST: fix imports tests with latest pytest-cov geopandas/geopandas#3455

Merged

simonjayhawkins mentioned this pull request Nov 15, 2024

BUG: numpy.ma.fix_invalid makes changes in-place in numpy 2.1.0 even with copy=True #59614

Closed

3 tasks

jorisvandenbossche mentioned this pull request Nov 16, 2024

DEPR: deprecate / warn about raising an error in __array__ when copy=False cannot be honore #60340

Closed
