CI: fix with new numpy nightly #50129

MarcoGorelli · 2022-12-08T16:40:30Z

xref CI: PY311 failures #50124 (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

MarcoGorelli · 2022-12-08T16:49:11Z

pandas/tests/frame/methods/test_compare.py

+    if (
+        val1 is pd.NA
+        and parse_version(np.__version__) > parse_version("1.24.0")
+        and "dev" in np.__version__
+    ):
+        # can't compare with numpy array if it contains pd.NA
+        with pytest.raises(TypeError, match="boolean value of NA is ambiguous"):
+            result = df1.compare(df2, keep_shape=True)
+    else:
+        result = df1.compare(df2, keep_shape=True)
+        tm.assert_frame_equal(result, expected)


@jbrockmendel @phofl ideas on what to do about this?

In the new numpy, comparisons with an array which contains pd.NA fail, see #50124 (comment)

pd.NA inside ndarray[object] is never going to work nicely. my inclination is to explicitly say it isnt supported and say that raising in this situation is expected

Is the main concern here that equality comparisons to an array with object dtype will return an object array instead of a boolean?

equality comparisons to an array with object dtype will return an object array

I may have misunderstood what you're asking, but comparisons with arrays of object dtype which contain pd.NA will raise:

In [1]: np.array([1, 3]) == np.array([4, pd.NA]) --------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[1], line 1 ----> 1 np.array([1, 3]) == np.array([4, pd.NA]) File ~/pandas-dev/pandas/_libs/missing.pyx:413, in pandas._libs.missing.NAType.__bool__() 411 412 def __bool__(self): --> 413 raise TypeError("boolean value of NA is ambiguous") 414 415 def __hash__(self): TypeError: boolean value of NA is ambiguous

Or are you suggesting that, on the pandas side, we catch this and return NA for this comparison?

specifically, this line

pandas/pandas/core/arrays/masked.py

Line 730 in 0ca49fc

method = getattr(self._data, f"__{op.__name__}__")

is where the comparison would happen

I think we want consistency with scalar comparisons

>>> import pandas as pd >>> pd.NA == pd.NA <NA> >>> pd.NA == 3 <NA>

Not sure I would expect different behavior when contained in a numpy array

Is this possible? Because when we run

np.array([1, 3]) == np.array([4, pd.NA])

we get to NA.__eq__, which returns:

In [1]: pd.NA.__eq__(3) Out[1]: <NA>

After that, I presume (but haven't checked) numpy calls bool on the result, and we end up with NA.__bool__ being called, but that returns TypeError

pandas/pandas/_libs/missing.pyx

Lines 412 to 413 in d660831

def __bool__(self):

raise TypeError("boolean value of NA is ambiguous")

shall we move the conversation about what should be done to the issue, to not have it too scattered? #50124 (comment)

MarcoGorelli · 2022-12-08T16:58:18Z

pandas/tests/indexes/object/test_indexing.py

+    @pytest.mark.skipif(
+        parse_version(np.__version__) > parse_version("1.24.0")
+        and "dev" in np.__version__,
+        reason="GH50124",
+    )


I've marked this one as failing because in the new nightly numpy there's

1 nat_1 = np.datetime64('NaT', 'Y') 2 nat_2 = np.datetime64('NaT', 'ps') ----> 3 {nat_1, nat_2} OverflowError: Integer overflow getting a common metadata divisor for NumPy datetime metadata [Y] and [ps].

I don't see why we can't put them both in a set, there shouldn't be any comparision between them going on - reckon this is a bug in numpy? If so, is it OK to skip this test for now?

Also, is there a better way to write this condition? parse_version(np.__version__) > parse_version("1.25.0") returns False unfortunately

There's a pandas.compat.is_numpy_dev

pandas/pandas/compat/__init__.py

Line 26 in 0119bdc

is_numpy_dev,

MarcoGorelli · 2022-12-08T19:29:11Z

pandas/tests/dtypes/test_missing.py

@@ -460,7 +461,7 @@ def test_array_equivalent_series(val):
    cm = (
        # stacklevel is chosen to make sense when called from .equals
        tm.assert_produces_warning(FutureWarning, match=msg, check_stacklevel=False)
-        if isinstance(val, str)
+        if isinstance(val, str) and not is_numpy_dev


before we had:

In [1]: np.array([1, 3]) == 'abc' <ipython-input-1-4b7b4f567841>:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison np.array([1, 3]) == 'abc' Out[1]: False

With the new nightly:

In [2]: np.array([1, 3]) == 'abc' Out[2]: array([False, False])

MarcoGorelli · 2022-12-08T19:37:49Z

pandas/tests/series/methods/test_equals.py

-        if isinstance(val, str)
+        if isinstance(val, str) and not is_numpy_dev


same reason as https://github.com/pandas-dev/pandas/pull/50129/files#r1043742327

WillAyd · 2022-12-09T18:13:28Z

pandas/tests/frame/methods/test_compare.py

-    tm.assert_frame_equal(result, expected)
+    if val1 is pd.NA and is_numpy_dev:
+        # can't compare with numpy array if it contains pd.NA
+        with pytest.raises(TypeError, match="boolean value of NA is ambiguous"):


Hmm why wouldn't we allow this? The returned array should just be object dtype with pd.NA no?

comparing with an object ndarray which contains pd.NA now just raises #50124 (comment)

Brock had commented here about this here, saying it'd be his inclination to say that raising is expected #50129 (comment)

Could putting NA in numpy arrays be a sign of a bug/design flaw in the DataFrame.compare method? Maybe there should be a special path when dealing with masked arrays to take into account that bool(NA) will fail?

In this test, the column is being designed to be of dtype object with pd.NA in it:

arr = [4.0, val1] # val1 is pd.NA from the fixture df1 = pd.DataFrame({"a": arr, "b": [1.0, 2]})

So I think DataFrame.compare is doing the right thing - df1['a'] is an ordinary object array, not a masked one

Arguably it's the test that's doing something unsupported and should be rewritten so that df['a'] is a nullable type (which can handle pd.NA just fine)

Our extension arrays are 1D, as soon as you have to operate on more than one column, we coerce to object dtype. Not saying this is the case here, but this is inevitable in some places.

Adding to this: the DataFrame in the test has dtype object, eg no masked array

True, though fortunately the error's not come up in other places (yet! 😨 ). For EAs, if an error comes up as a result of this, then perhaps it's up to pandas to find a workaround

But for object arrays (non-EA) which contain pd.NA, do you agree that this is unsupported?

I was mostly responding to the design flaw bit. I totally agree that users shouldn’t do this

Our extension arrays are 1D, as soon as you have to operate on more than one column, we coerce to object dtype. Not saying this is the case here, but this is inevitable in some places.

This is a problem when processing masked arrays in Dataframe methods/functions. Maybe there could be made an internal dataframe-wide na mask method, so the we can deal with NA's properly on a dataframe level, for example in the compare method.

But for object arrays (non-EA) which contain pd.NA, do you agree that this is unsupported?

Yes, it's inherently usupported, as bool(NA) fails (and should fail). Seeing pd.NA in numpy array may be code smells that we should avoid and work with the masks instead.

topper-123 · 2022-12-10T14:58:02Z

pandas/tests/frame/methods/test_compare.py

-    tm.assert_frame_equal(result, expected)
+    if val1 is pd.NA and is_numpy_dev:
+        # can't compare with numpy array if it contains pd.NA
+        with pytest.raises(TypeError, match="boolean value of NA is ambiguous"):


Could putting NA in numpy arrays be a sign of a bug/design flaw in the DataFrame.compare method? Maybe there should be a special path when dealing with masked arrays to take into account that bool(NA) will fail?

topper-123 · 2022-12-10T15:05:08Z

pandas/tests/indexes/object/test_indexing.py

+                # Numpy will raise on uncomparable types, like
+                # np.datetime64('NaT', 'Y') and np.datetime64('NaT', 'ps')
+                # https://github.com/numpy/numpy/issues/22762
+                return


Is this failure only on numpy dev, i.e. numpy > 1.23.x? Could this be better by having a p_version_lte1p23 = _nlv <= Version("1.23") in compat.py and then do:

if p_version_lte1p23: return np_nat_fixture == np_nat_fixture2 ...

?

That would make this easier to find and remove when we drop support for numpy 1.23 at some point in the future).

true, though comparisions between timedelta64 NaT and datetime64 NaT will still work

maybe that can be split out - I'll do that, thanks!

Maybe in a followup, so we can get the CI to work again?

sure! noted to self

phofl · 2022-12-10T15:22:31Z

Can we merge this to get ci running again and address the topics raised in follow ups?

topper-123 · 2022-12-10T16:35:39Z

Can we merge this to get ci running again and address the topics raised in follow ups?

That's a good idea, IMO.

MarcoGorelli · 2022-12-10T16:37:13Z

OK, I've removed the closes keyword, so we can keep #50124 open to keep discussing this

MarcoGorelli · 2022-12-10T17:02:06Z

Can we merge this to get ci running again and address the topics raised in follow ups?

That's a good idea, IMO.

Merging then, let's take this conversation forwards in the issue #50124

lumberbot-app · 2022-12-10T17:03:24Z

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 1.5.x
git pull

Cherry pick the first parent branch of the this PR on top of the older branch:

git cherry-pick -x -m1 5fad2e4e90be46c14e947ac1a7eba3275c4b656d

You will likely have some merge/cherry-pick conflict here, fix them and commit:

git commit -am 'Backport PR #50129: CI: fix with new numpy nightly'

Push to a named branch:

git push YOURFORK 1.5.x:auto-backport-of-pr-50129-on-1.5.x

Create a PR against branch 1.5.x, I would have named this PR:

"Backport PR #50129 on branch 1.5.x (CI: fix with new numpy nightly)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

…50167) Backport PR #50129: CI: fix with new numpy nightly

MarcoGorelli added 2 commits December 8, 2022 15:53

numpy compat

3b105ff

use is_numpy_dev

c7a1ed2

MarcoGorelli force-pushed the new-numpy-compat branch 3 times, most recently from 84fc290 to b467e85 Compare December 8, 2022 16:47

MarcoGorelli commented Dec 8, 2022

View reviewed changes

MarcoGorelli marked this pull request as ready for review December 8, 2022 16:59

MarcoGorelli force-pushed the new-numpy-compat branch from b467e85 to c7a1ed2 Compare December 8, 2022 17:35

dont skip test

429b25c

MarcoGorelli mentioned this pull request Dec 8, 2022

CI: PY311 failures #50124

Closed

MarcoGorelli commented Dec 8, 2022

View reviewed changes

MarcoGorelli added CI Continuous Integration Compat pandas objects compatability with Numpy or Python functions labels Dec 8, 2022

WillAyd reviewed Dec 9, 2022

View reviewed changes

topper-123 reviewed Dec 10, 2022

View reviewed changes

MarcoGorelli modified the milestones: 1.5.3, 2.0 Dec 10, 2022

MarcoGorelli merged commit 5fad2e4 into pandas-dev:main Dec 10, 2022

lumberbot-app bot added the Still Needs Manual Backport label Dec 10, 2022

MarcoGorelli added a commit to MarcoGorelli/pandas that referenced this pull request Dec 10, 2022

Backport PR pandas-dev#50129: CI: fix with new numpy nightly

ee533dd

MarcoGorelli removed the Still Needs Manual Backport label Dec 10, 2022

MarcoGorelli added a commit that referenced this pull request Dec 10, 2022

Backport PR #50129 on branch 1.5.x (CI: fix with new numpy nightly) (#…

08a4e74

…50167) Backport PR #50129: CI: fix with new numpy nightly

	def __bool__(self):
	raise TypeError("boolean value of NA is ambiguous")

		if isinstance(val, str)
		if isinstance(val, str) and not is_numpy_dev

Uh oh!

CI: fix with new numpy nightly #50129

CI: fix with new numpy nightly #50129

Uh oh!

Conversation

MarcoGorelli commented Dec 8, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MarcoGorelli Dec 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MarcoGorelli Dec 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

phofl Dec 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

topper-123 Dec 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

topper-123 Dec 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

phofl commented Dec 10, 2022

Uh oh!

topper-123 commented Dec 10, 2022

Uh oh!

MarcoGorelli commented Dec 8, 2022 •

edited

Loading

MarcoGorelli Dec 9, 2022 •

edited

Loading

MarcoGorelli Dec 9, 2022 •

edited

Loading

phofl Dec 10, 2022 •

edited

Loading

topper-123 Dec 10, 2022 •

edited

Loading

topper-123 Dec 10, 2022 •

edited

Loading