-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
CI: fix with new numpy nightly #50129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
84fc290
to
b467e85
Compare
if ( | ||
val1 is pd.NA | ||
and parse_version(np.__version__) > parse_version("1.24.0") | ||
and "dev" in np.__version__ | ||
): | ||
# can't compare with numpy array if it contains pd.NA | ||
with pytest.raises(TypeError, match="boolean value of NA is ambiguous"): | ||
result = df1.compare(df2, keep_shape=True) | ||
else: | ||
result = df1.compare(df2, keep_shape=True) | ||
tm.assert_frame_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel @phofl ideas on what to do about this?
In the new numpy, comparisons with an array which contains pd.NA
fail, see #50124 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pd.NA inside ndarray[object] is never going to work nicely. my inclination is to explicitly say it isnt supported and say that raising in this situation is expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the main concern here that equality comparisons to an array with object dtype will return an object array instead of a boolean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
equality comparisons to an array with object dtype will return an object array
I may have misunderstood what you're asking, but comparisons with arrays of object dtype which contain pd.NA will raise:
In [1]: np.array([1, 3]) == np.array([4, pd.NA])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[1], line 1
----> 1 np.array([1, 3]) == np.array([4, pd.NA])
File ~/pandas-dev/pandas/_libs/missing.pyx:413, in pandas._libs.missing.NAType.__bool__()
411
412 def __bool__(self):
--> 413 raise TypeError("boolean value of NA is ambiguous")
414
415 def __hash__(self):
TypeError: boolean value of NA is ambiguous
Or are you suggesting that, on the pandas side, we catch this and return NA
for this comparison?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specifically, this line
pandas/pandas/core/arrays/masked.py
Line 730 in 0ca49fc
method = getattr(self._data, f"__{op.__name__}__") |
is where the comparison would happen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want consistency with scalar comparisons
>>> import pandas as pd
>>> pd.NA == pd.NA
<NA>
>>> pd.NA == 3
<NA>
Not sure I would expect different behavior when contained in a numpy array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this possible? Because when we run
np.array([1, 3]) == np.array([4, pd.NA])
we get to NA.__eq__
, which returns:
In [1]: pd.NA.__eq__(3)
Out[1]: <NA>
After that, I presume (but haven't checked) numpy calls bool
on the result, and we end up with NA.__bool__
being called, but that returns TypeError
pandas/pandas/_libs/missing.pyx
Lines 412 to 413 in d660831
def __bool__(self): | |
raise TypeError("boolean value of NA is ambiguous") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we move the conversation about what should be done to the issue, to not have it too scattered? #50124 (comment)
@pytest.mark.skipif( | ||
parse_version(np.__version__) > parse_version("1.24.0") | ||
and "dev" in np.__version__, | ||
reason="GH50124", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've marked this one as failing because in the new nightly numpy there's
1 nat_1 = np.datetime64('NaT', 'Y')
2 nat_2 = np.datetime64('NaT', 'ps')
----> 3 {nat_1, nat_2}
OverflowError: Integer overflow getting a common metadata divisor for NumPy datetime metadata [Y] and [ps].
I don't see why we can't put them both in a set, there shouldn't be any comparision between them going on - reckon this is a bug in numpy? If so, is it OK to skip this test for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is there a better way to write this condition? parse_version(np.__version__) > parse_version("1.25.0")
returns False
unfortunately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a pandas.compat.is_numpy_dev
pandas/pandas/compat/__init__.py
Line 26 in 0119bdc
is_numpy_dev, |
b467e85
to
c7a1ed2
Compare
@@ -460,7 +461,7 @@ def test_array_equivalent_series(val): | |||
cm = ( | |||
# stacklevel is chosen to make sense when called from .equals | |||
tm.assert_produces_warning(FutureWarning, match=msg, check_stacklevel=False) | |||
if isinstance(val, str) | |||
if isinstance(val, str) and not is_numpy_dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before we had:
In [1]: np.array([1, 3]) == 'abc'
<ipython-input-1-4b7b4f567841>:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
np.array([1, 3]) == 'abc'
Out[1]: False
With the new nightly:
In [2]: np.array([1, 3]) == 'abc'
Out[2]: array([False, False])
if isinstance(val, str) | ||
if isinstance(val, str) and not is_numpy_dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tm.assert_frame_equal(result, expected) | ||
if val1 is pd.NA and is_numpy_dev: | ||
# can't compare with numpy array if it contains pd.NA | ||
with pytest.raises(TypeError, match="boolean value of NA is ambiguous"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm why wouldn't we allow this? The returned array should just be object dtype with pd.NA no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comparing with an object
ndarray which contains pd.NA
now just raises #50124 (comment)
Brock had commented here about this here, saying it'd be his inclination to say that raising is expected #50129 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could putting NA
in numpy arrays be a sign of a bug/design flaw in the DataFrame.compare
method? Maybe there should be a special path when dealing with masked arrays to take into account that bool(NA)
will fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this test, the column is being designed to be of dtype object
with pd.NA
in it:
arr = [4.0, val1] # val1 is pd.NA from the fixture
df1 = pd.DataFrame({"a": arr, "b": [1.0, 2]})
So I think DataFrame.compare
is doing the right thing - df1['a']
is an ordinary object
array, not a masked one
Arguably it's the test that's doing something unsupported and should be rewritten so that df['a']
is a nullable type (which can handle pd.NA
just fine)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our extension arrays are 1D, as soon as you have to operate on more than one column, we coerce to object dtype. Not saying this is the case here, but this is inevitable in some places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding to this: the DataFrame in the test has dtype object, eg no masked array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, though fortunately the error's not come up in other places (yet! 😨 ). For EAs, if an error comes up as a result of this, then perhaps it's up to pandas to find a workaround
But for object
arrays (non-EA) which contain pd.NA
, do you agree that this is unsupported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was mostly responding to the design flaw bit. I totally agree that users shouldn’t do this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our extension arrays are 1D, as soon as you have to operate on more than one column, we coerce to object dtype. Not saying this is the case here, but this is inevitable in some places.
This is a problem when processing masked arrays in Dataframe methods/functions. Maybe there could be made an internal dataframe-wide na mask method, so the we can deal with NA's properly on a dataframe level, for example in the compare
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But for object arrays (non-EA) which contain pd.NA, do you agree that this is unsupported?
Yes, it's inherently usupported, as bool(NA)
fails (and should fail). Seeing pd.NA in numpy array may be code smells that we should avoid and work with the masks instead.
tm.assert_frame_equal(result, expected) | ||
if val1 is pd.NA and is_numpy_dev: | ||
# can't compare with numpy array if it contains pd.NA | ||
with pytest.raises(TypeError, match="boolean value of NA is ambiguous"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could putting NA
in numpy arrays be a sign of a bug/design flaw in the DataFrame.compare
method? Maybe there should be a special path when dealing with masked arrays to take into account that bool(NA)
will fail?
# Numpy will raise on uncomparable types, like | ||
# np.datetime64('NaT', 'Y') and np.datetime64('NaT', 'ps') | ||
# https://github.com/numpy/numpy/issues/22762 | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this failure only on numpy dev, i.e. numpy > 1.23.x? Could this be better by having a p_version_lte1p23 = _nlv <= Version("1.23")
in compat.py and then do:
if p_version_lte1p23:
return
np_nat_fixture == np_nat_fixture2
...
?
That would make this easier to find and remove when we drop support for numpy 1.23 at some point in the future).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, though comparisions between timedelta64 NaT
and datetime64 NaT
will still work
maybe that can be split out - I'll do that, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe in a followup, so we can get the CI to work again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure! noted to self
Can we merge this to get ci running again and address the topics raised in follow ups? |
That's a good idea, IMO. |
OK, I've removed the |
Merging then, let's take this conversation forwards in the issue #50124 |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.