-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: membership checks on ExtensionArray containing NA values #37867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: membership checks on ExtensionArray containing NA values #37867
Conversation
589d0d3
to
625aa1f
Compare
pandas/core/arrays/base.py
Outdated
""" | ||
# comparisons of any item to pd.NA always return pd.NA, so e.g. "a" in [pd.NA] | ||
# would raise a TypeError. The implementation below works around that. | ||
if isna(item): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_valid_nat_for_dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand you here. Are you saying that pd.NaT in arr
should return False if the array is not a datetime-like array? Can you expand a bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying that pd.NaT in arr should return False if the array is not a datetime-like array?
correct (unless maybe object dtype). Some other examples:
dta = pd.date_range("2016-01-01", periods=3)._data # <-- DateTimeArray
dta[-1] = pd.NaT
pa = dta.to_period("D") # <-- PeriodArray
tda = dta - dta # <-- TimedeltaArray
arr = pd.array([1, 2 ,3, pd.NA]) # <-- IntegerArray
flt = pd.array([1.0, 2.0, 3.0, np.nan]) # <-- FloatArray
tdnat = np.timedelta64("NaT", "ns")
dtnat = np.timedelta64("NaT", "ns")
>>> tdnat in dta
False
>>> dtnat in dta
True # <-- actually this returns False, but thats a newly-discovered bug
>>> tdnat in pa
False
>>> dtnat in pa
False
>>> tdnat in tda
True # <-- actually this raises TypeError, but thats a newly-discovered bug
>>> dtnat in tda
False # <-- actually this raises TypeError, but thats a newly-discovered bug
>>> tdnat in flt
False
>>> dtnat in flt
False # <-- actually this raises TypeError, but thats a newly-discovered bug
>>> tdnat in arr
False
>>> dtnat in arr
False # <-- actually this raises TypeError, but thats a newly-discovered bug
side-note after checking all of these: yikes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't use datetime-likes very much, so I'm a bit weak on what these comparisons of nan-likes should return in datatime-likes and you certainly didn't help :-)
Maybe do as @jreback suggests, and handle these later/separately? An alternative could be in the base ExtensionArray
check for pd.NA
only. It sub-classes wan to do it differently, they can implement a __contains__
method themselves. That seems the cleanest to me.
ok so this looks fine. merge and then open issues / followup for recently discovered cases? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should first discuss / decide on what behaviour we actually want for containment involving missing values (as the item being checked).
As it seems (even apart from the noted errors with pd.NA) an inconsistent situation right now? (np.nan gives False, pd.NaT gives True)
Also, can you add a base extension test?
Now, I don't really know myself at the moment what I would expect for behaviour here. Listing all possible options (eg for
Quickly looking at some other languages to see what they do: R returns True: > NA %in% c(1, 2, 3, NA)
[1] TRUE Julia is "strict" about the unknown part of julia> arr = [1 2 3 missing]
1×4 Array{Union{Missing, Int64},2}:
1 2 3 missing
julia> 1 in arr
true
julia> 5 in arr
missing
julia> missing in arr
missing
julia> arr = [1 2 3]
1×3 Array{Int64,2}:
1 2 3
julia> 5 in arr
false
julia> missing in arr
missing |
Yes, @jorisvandenbossche, my doubts on nan-likes as well: There doesn't seem to be any inherantly correct choice, as it look to me. We have however already in pandas that: >>> arr2 = pd.array([pd.NA, "a"])
>>> pd.NA in arr2 # see OP
True
>>> dta = pd.date_range("2016-01-01", periods=2)._data
>>> dta[0] = pd.NaT
>>> pd.NaT in dta
True So from at backward compat stand point there is some argument for returning True in Pandas, if looking for a nan-like in the ExtensionArray. |
Specifically that The fact that it returns True for pd.NaT certainly has a much longer history, but I don't know to what extent this was intentional (@jbrockmendel ?). For example the numpy counterpart actually returns False:
Something else, the current |
I'm pretty sure that is unintentional. I've been fixing some
I'd advocate updating
That is surprising. My intuition is that |
For NaN/NaT, those values are not equal to itself, and I think numpy uses equality to implement |
My intuition is based more around our Index implementations that use What would the idiomatic way of checking |
I think there will be be a surprise if users check for nan-likes using
A solution to avoid ambiguity is to raise a TypeError, if a nan-like is supplied to def __contains__(self, item) -> bool:
if isna(item):
raise TypeError("NA value not allowed. Use .isna method to check for NA values.")
else:
return (item == self).any() This would at least ensure that |
My complaint about that is it doesn't distinguish between different NAs, most relevant for object dtype.
Short-term, can we fix this without taking a stance on the NA topic? My intuition is that |
agree with @jbrockmendel I think have NaN != itself is different then a membership test. we do not want special cases. |
How is fixing this not taking a stance? ;) The So that basically gives two possible interpretations / implementations of
While numpy uses the first one, it seems to make sense to use the second option for membership test like But, I think there is also something to say for using the "unknown value" interpretation of |
|
||
def test_contains(): | ||
# GH-37867 | ||
arr = pd.arrays.StringArray(np.array(["a", "b"], dtype=object)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arr = pd.arrays.StringArray(np.array(["a", "b"], dtype=object)) | |
arr = pd.array(["a", "b"], dtype="string") |
is a bit shorter to construct those
7f423a6
to
547388c
Compare
@topper-123 since you are adding a method on the base extension array class, I think we should also add a test for this in the generic base extension tests. There are fixtures for data with/without missing values and for the na_value, so it should be possible to test this in general. |
I've looked, but can't find the base extension tests. Could you say where they're located? |
In |
I was specifically referring to "This would at least ensure that |
Ah, yes, that's something we certainly agree on I think that this at least shouldn't raise! Now, I am fine with following our behaviour for Index containment / indexing, which is to treat NaN and pd.NA etc equal to itself (in contrast to |
I've updated the PR. For comparisons with nan_likes, it nows returns True if the na value is |
c0510e0
to
122a6fe
Compare
The Travis failure looks unrelated. I'll run the PR again after comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, some comments.
pandas/core/arrays/base.py
Outdated
# comparisons of any item to pd.NA always return pd.NA, so e.g. "a" in [pd.NA] | ||
# would raise a TypeError. The implementation below works around that. | ||
if item is self.dtype.na_value: | ||
return isna(self).any() if self._can_hold_na else False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use self.isna()
# the settled on rule is: `nan_like in arr` is True if nan_like is | ||
# arr.dtype.na_value and arr.isna().any() is True. Else the check returns False. | ||
|
||
for this_data in [data, data_missing]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comments below on how to actually parameterize this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't its possible to parametrize fixtures? do have an example or hint on how that's done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grep for
pytest.getfixturevalue in tests
we just had a use of this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't get it to work, I'll try it again tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I would simply leave it as is, using pytest.getfixturevalue
only makes it way more complicated as it needs to be
Co-authored-by: Joris Van den Bossche <[email protected]>
@@ -50,6 +50,10 @@ def test_view(self, data): | |||
# __setitem__ does not work, so we only have a smoke-test | |||
data.view() | |||
|
|||
@pytest.mark.xfail(raises=AssertionError, reason="Not implemented yet") | |||
def test_contains(self, data, data_missing): | |||
super().test_contains(data, data_missing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Arrow arrays have Arrow Nulls as the dtype.na_value, while data_missing[0] gives ´None´.
Maybe @TomAugspurger could look into that as part of his Arrow work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Arrow arrays have Arrow Nulls as the dtype.na_value, while data_missing[0] gives ´None´.
It's fine to simply skip it as you did. This are only test EAs, that were needed for certain specific tests, so no problem this don't fully work otherwise.
The actual public arrays using Arrow under the hood (eg the new ArrowStringArray) has a different implementation and is fully tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also base these test arrays on pd.NA
? Otherwise we can use the same code as in the string array to replace the scalar return value of None
with na_value
.
The travis failure looks unrelated and I can't reproduce it on my computer. So IMo this PR passes currently, but I could make another run, if there's agreement on the content. |
Yes, don't worry about the travis failure on this PR |
def test_contains(self, data, data_missing): | ||
# GH-37867 | ||
# na value handling in Categorical.__contains__ is deprecated. | ||
# See base.BaseInterFaceTests.test_contains for more details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This duplicates what you have in tests/arrays/categorical/test_operators.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately not: Categorical
already had a __contains__
method and it's more permissive than the new one. So, in this file we have (below) assert na_value_type in data_missing
, while the base tests method is assert na_value_type not in data_missing
(notice the not
).
na values is also more complicated in categoricals, because in some cases we want to accept pd.NaT
and in other cases not. I'd like to take it in another round (or let it slide)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, I was not referring to the base tests that this is overriding, but the original test you added to tests/arrays/categorical/test_operators.py
which also tests this more permissive behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I'll delete the ones in tests/arrays/categorical/test_operators.py
.
pandas/core/arrays/numpy_.py
Outdated
@@ -51,6 +51,13 @@ def numpy_dtype(self) -> np.dtype: | |||
""" | |||
return self._dtype | |||
|
|||
@property | |||
def na_value(self) -> object: | |||
if issubclass(self.type, np.floating): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this always nan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem was that np.float64("nan") is not np.nan
. However, I've fixed this, so this method is actuallt not necessary.
@@ -143,6 +143,11 @@ def test_custom_asserts(self): | |||
with pytest.raises(AssertionError, match=msg): | |||
self.assert_frame_equal(a.to_frame(), b.to_frame()) | |||
|
|||
@pytest.mark.xfail(reason="comparison method not implemented on JSONArray") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is than issue number for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue number is the first line in the method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the xfail pls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
5b6e12b
to
a1583e7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good and fine merging like this, but should change to use the nulls_fixture which is more comprehensive. if you can do this today can get this in
assert na_value in data_missing | ||
assert na_value not in data | ||
|
||
# the data can never contain other nan-likes than na_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might consider actually using the nulls_fixture as this is more comprehensive (sure it will run the other checks multiple times but no big deal).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am personally -1 on using a fixture for this (we could move the list of nulls that is currently used for defining the fixture to constant in _testing.py
and use that constant in both places, though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case?
we already have a nulls fixture and use it in a great many places
we need to be comprehensive and general in testing - specific cases are ok sometimes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok here's the fixture contents
@pytest.fixture(params=[None, np.nan, pd.NaT, float("nan"), pd.NA], ids=str)
so if you add float('nan')
then this should cover
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the generality of using the nulls_fixture too. I've added it in the newest commit.
assert na_value in data_missing | ||
assert na_value not in data | ||
|
||
# the data can never contain other nan-likes than na_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment here.
lgtm. thanks @topper-123 |
Membership checks on ExtensionArrays containing
NA
values raises ValueError in some circumstances (but not in other):So overall quite random failures. This PR fixes this problem by adding a custom
__contains__
method onExtensionArray
.I assume that we want
pd.NA in arr1
to keep returning True. Note however thatnp.nan in np.array([np.nan])
return False, so pandas' behaviour is different.