Skip to content

CI: fix with new numpy nightly #50129

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 10, 2022

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Dec 8, 2022

@MarcoGorelli MarcoGorelli force-pushed the new-numpy-compat branch 3 times, most recently from 84fc290 to b467e85 Compare December 8, 2022 16:47
Comment on lines 261 to 268
if (
val1 is pd.NA
and parse_version(np.__version__) > parse_version("1.24.0")
and "dev" in np.__version__
):
# can't compare with numpy array if it contains pd.NA
with pytest.raises(TypeError, match="boolean value of NA is ambiguous"):
result = df1.compare(df2, keep_shape=True)
else:
result = df1.compare(df2, keep_shape=True)
tm.assert_frame_equal(result, expected)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel @phofl ideas on what to do about this?

In the new numpy, comparisons with an array which contains pd.NA fail, see #50124 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pd.NA inside ndarray[object] is never going to work nicely. my inclination is to explicitly say it isnt supported and say that raising in this situation is expected

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the main concern here that equality comparisons to an array with object dtype will return an object array instead of a boolean?

Copy link
Member Author

@MarcoGorelli MarcoGorelli Dec 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equality comparisons to an array with object dtype will return an object array

I may have misunderstood what you're asking, but comparisons with arrays of object dtype which contain pd.NA will raise:

In [1]: np.array([1, 3]) == np.array([4, pd.NA])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 1
----> 1 np.array([1, 3]) == np.array([4, pd.NA])

File ~/pandas-dev/pandas/_libs/missing.pyx:413, in pandas._libs.missing.NAType.__bool__()
    411 
    412     def __bool__(self):
--> 413         raise TypeError("boolean value of NA is ambiguous")
    414 
    415     def __hash__(self):

TypeError: boolean value of NA is ambiguous

Or are you suggesting that, on the pandas side, we catch this and return NA for this comparison?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specifically, this line

method = getattr(self._data, f"__{op.__name__}__")

is where the comparison would happen

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want consistency with scalar comparisons

>>> import pandas as pd
>>> pd.NA == pd.NA
<NA>
>>> pd.NA == 3
<NA>

Not sure I would expect different behavior when contained in a numpy array

Copy link
Member Author

@MarcoGorelli MarcoGorelli Dec 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this possible? Because when we run

np.array([1, 3]) == np.array([4, pd.NA])

we get to NA.__eq__, which returns:

In [1]: pd.NA.__eq__(3)
Out[1]: <NA>

After that, I presume (but haven't checked) numpy calls bool on the result, and we end up with NA.__bool__ being called, but that returns TypeError

def __bool__(self):
raise TypeError("boolean value of NA is ambiguous")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we move the conversation about what should be done to the issue, to not have it too scattered? #50124 (comment)

Comment on lines 96 to 100
@pytest.mark.skipif(
parse_version(np.__version__) > parse_version("1.24.0")
and "dev" in np.__version__,
reason="GH50124",
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've marked this one as failing because in the new nightly numpy there's

      1 nat_1 = np.datetime64('NaT', 'Y')
      2 nat_2 = np.datetime64('NaT', 'ps')
----> 3 {nat_1, nat_2}

OverflowError: Integer overflow getting a common metadata divisor for NumPy datetime metadata [Y] and [ps].

I don't see why we can't put them both in a set, there shouldn't be any comparision between them going on - reckon this is a bug in numpy? If so, is it OK to skip this test for now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is there a better way to write this condition? parse_version(np.__version__) > parse_version("1.25.0") returns False unfortunately

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a pandas.compat.is_numpy_dev

is_numpy_dev,

@MarcoGorelli MarcoGorelli marked this pull request as ready for review December 8, 2022 16:59
@MarcoGorelli MarcoGorelli mentioned this pull request Dec 8, 2022
@@ -460,7 +461,7 @@ def test_array_equivalent_series(val):
cm = (
# stacklevel is chosen to make sense when called from .equals
tm.assert_produces_warning(FutureWarning, match=msg, check_stacklevel=False)
if isinstance(val, str)
if isinstance(val, str) and not is_numpy_dev
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before we had:

In [1]: np.array([1, 3]) == 'abc'
<ipython-input-1-4b7b4f567841>:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  np.array([1, 3]) == 'abc'
Out[1]: False

With the new nightly:

In [2]: np.array([1, 3]) == 'abc'
Out[2]: array([False, False])

Comment on lines -53 to +54
if isinstance(val, str)
if isinstance(val, str) and not is_numpy_dev
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli MarcoGorelli added CI Continuous Integration Compat pandas objects compatability with Numpy or Python functions labels Dec 8, 2022
tm.assert_frame_equal(result, expected)
if val1 is pd.NA and is_numpy_dev:
# can't compare with numpy array if it contains pd.NA
with pytest.raises(TypeError, match="boolean value of NA is ambiguous"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm why wouldn't we allow this? The returned array should just be object dtype with pd.NA no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comparing with an object ndarray which contains pd.NA now just raises #50124 (comment)

Brock had commented here about this here, saying it'd be his inclination to say that raising is expected #50129 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could putting NA in numpy arrays be a sign of a bug/design flaw in the DataFrame.compare method? Maybe there should be a special path when dealing with masked arrays to take into account that bool(NA) will fail?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this test, the column is being designed to be of dtype object with pd.NA in it:

arr = [4.0, val1]  # val1 is pd.NA from the fixture
df1 = pd.DataFrame({"a": arr, "b": [1.0, 2]})

So I think DataFrame.compare is doing the right thing - df1['a'] is an ordinary object array, not a masked one

Arguably it's the test that's doing something unsupported and should be rewritten so that df['a'] is a nullable type (which can handle pd.NA just fine)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our extension arrays are 1D, as soon as you have to operate on more than one column, we coerce to object dtype. Not saying this is the case here, but this is inevitable in some places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding to this: the DataFrame in the test has dtype object, eg no masked array

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, though fortunately the error's not come up in other places (yet! 😨 ). For EAs, if an error comes up as a result of this, then perhaps it's up to pandas to find a workaround

But for object arrays (non-EA) which contain pd.NA, do you agree that this is unsupported?

Copy link
Member

@phofl phofl Dec 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mostly responding to the design flaw bit. I totally agree that users shouldn’t do this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our extension arrays are 1D, as soon as you have to operate on more than one column, we coerce to object dtype. Not saying this is the case here, but this is inevitable in some places.

This is a problem when processing masked arrays in Dataframe methods/functions. Maybe there could be made an internal dataframe-wide na mask method, so the we can deal with NA's properly on a dataframe level, for example in the compare method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But for object arrays (non-EA) which contain pd.NA, do you agree that this is unsupported?

Yes, it's inherently usupported, as bool(NA) fails (and should fail). Seeing pd.NA in numpy array may be code smells that we should avoid and work with the masks instead.

tm.assert_frame_equal(result, expected)
if val1 is pd.NA and is_numpy_dev:
# can't compare with numpy array if it contains pd.NA
with pytest.raises(TypeError, match="boolean value of NA is ambiguous"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could putting NA in numpy arrays be a sign of a bug/design flaw in the DataFrame.compare method? Maybe there should be a special path when dealing with masked arrays to take into account that bool(NA) will fail?

# Numpy will raise on uncomparable types, like
# np.datetime64('NaT', 'Y') and np.datetime64('NaT', 'ps')
# https://github.com/numpy/numpy/issues/22762
return
Copy link
Contributor

@topper-123 topper-123 Dec 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this failure only on numpy dev, i.e. numpy > 1.23.x? Could this be better by having a p_version_lte1p23 = _nlv <= Version("1.23") in compat.py and then do:

if p_version_lte1p23:
    return

np_nat_fixture == np_nat_fixture2
...

?

That would make this easier to find and remove when we drop support for numpy 1.23 at some point in the future).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, though comparisions between timedelta64 NaT and datetime64 NaT will still work

maybe that can be split out - I'll do that, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@topper-123 topper-123 Dec 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in a followup, so we can get the CI to work again?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure! noted to self

@phofl
Copy link
Member

phofl commented Dec 10, 2022

Can we merge this to get ci running again and address the topics raised in follow ups?

@topper-123
Copy link
Contributor

Can we merge this to get ci running again and address the topics raised in follow ups?

That's a good idea, IMO.

@MarcoGorelli
Copy link
Member Author

OK, I've removed the closes keyword, so we can keep #50124 open to keep discussing this

@MarcoGorelli
Copy link
Member Author

Can we merge this to get ci running again and address the topics raised in follow ups?

That's a good idea, IMO.

Merging then, let's take this conversation forwards in the issue #50124

@MarcoGorelli MarcoGorelli modified the milestones: 1.5.3, 2.0 Dec 10, 2022
@MarcoGorelli MarcoGorelli merged commit 5fad2e4 into pandas-dev:main Dec 10, 2022
@lumberbot-app
Copy link

lumberbot-app bot commented Dec 10, 2022

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 1.5.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 5fad2e4e90be46c14e947ac1a7eba3275c4b656d
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #50129: CI: fix with new numpy nightly'
  1. Push to a named branch:
git push YOURFORK 1.5.x:auto-backport-of-pr-50129-on-1.5.x
  1. Create a PR against branch 1.5.x, I would have named this PR:

"Backport PR #50129 on branch 1.5.x (CI: fix with new numpy nightly)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

MarcoGorelli added a commit that referenced this pull request Dec 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration Compat pandas objects compatability with Numpy or Python functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants