API: membership checks on ExtensionArray containing NA values #37867

topper-123 · 2020-11-15T17:22:18Z

Membership checks on ExtensionArrays containing NA values raise TypeError in some circumstances (but not in others):

>>> arr1 = pd.array(["a", pd.NA])
>>> arr2 = pd.array([pd.NA, "a"])
>>> "a" in arr1
True  # ok
>>> "a" in arr2
TypeError: boolean value of NA is ambiguous  # not ok
>>> pd.NA in arr1
TypeError: boolean value of NA is ambiguous  # not ok
>>> pd.NA in arr2
True  # ok

So overall quite random failures. This PR fixes this problem by adding a custom __contains__ method on ExtensionArray.

I assume that we want pd.NA in arr1 to keep returning True. Note, however, that np.nan in np.array([np.nan]) returns False, so pandas' behaviour is different.
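To see why the failures depend on element order, here is a minimal pure-Python sketch. It assumes a stand-in FakeNA class (hypothetical, not the pandas implementation) that mimics the two properties of pd.NA described in this thread: comparisons return NA, and bool(NA) raises TypeError. A plain list stands in for an array without a custom __contains__; Python's `in` then checks identity first and falls back to `==` element by element.

```python
class FakeNA:
    """Hypothetical stand-in for pd.NA (not the pandas implementation)."""

    def __eq__(self, other):
        # Any comparison involving NA yields NA again.
        return self

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")

    def __repr__(self):
        return "<NA>"


NA = FakeNA()


def check(item, values):
    """Run `item in values`, returning the result or the error message."""
    try:
        return item in values
    except TypeError as exc:
        return f"TypeError: {exc}"


values1 = ["a", NA]
values2 = [NA, "a"]

# "a" is found at position 0 before NA is ever compared.
print(check("a", values1))
# NA is compared first; ("a" == NA) is NA, and bool(NA) raises.
print(check("a", values2))
# NA is compared with "a" first, which again yields NA.
print(check(NA, values1))
# The identity shortcut (NA is NA) succeeds before any == is evaluated.
print(check(NA, values2))
```

The four calls follow the same pattern as the session above: success only when the match happens before an NA comparison has to be converted to a bool.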

pandas/core/arrays/base.py

jbrockmendel · 2020-11-15T23:06:44Z

pandas/core/arrays/base.py

+        """
+        # comparisons of any item to pd.NA always return pd.NA, so e.g. "a" in [pd.NA]
+        # would raise a TypeError. The implementation below works around that.
+        if isna(item):


is_valid_nat_for_dtype

I don't understand you here. Are you saying that pd.NaT in arr should return False if the array is not a datetime-like array? Can you expand a bit?

Are you saying that pd.NaT in arr should return False if the array is not a datetime-like array?

correct (unless maybe object dtype). Some other examples:

dta = pd.date_range("2016-01-01", periods=3)._data  # <-- DatetimeArray
dta[-1] = pd.NaT
pa = dta.to_period("D")  # <-- PeriodArray
tda = dta - dta  # <-- TimedeltaArray
arr = pd.array([1, 2, 3, pd.NA])  # <-- IntegerArray
flt = pd.array([1.0, 2.0, 3.0, np.nan])  # <-- FloatingArray
tdnat = np.timedelta64("NaT", "ns")
dtnat = np.datetime64("NaT", "ns")

>>> tdnat in dta
False
>>> dtnat in dta
True  # <-- actually this returns False, but that's a newly-discovered bug
>>> tdnat in pa
False
>>> dtnat in pa
False
>>> tdnat in tda
True  # <-- actually this raises TypeError, but that's a newly-discovered bug
>>> dtnat in tda
False  # <-- actually this raises TypeError, but that's a newly-discovered bug
>>> tdnat in flt
False
>>> dtnat in flt
False  # <-- actually this raises TypeError, but that's a newly-discovered bug
>>> tdnat in arr
False
>>> dtnat in arr
False  # <-- actually this raises TypeError, but that's a newly-discovered bug

side-note after checking all of these: yikes!

I don't use datetime-likes very much, so I'm a bit weak on what these comparisons of nan-likes should return for datetime-likes, and you certainly didn't help :-)

Maybe do as @jreback suggests, and handle these later/separately? An alternative could be to check for pd.NA only in the base ExtensionArray. If subclasses want to do it differently, they can implement a __contains__ method themselves. That seems the cleanest to me.

jreback · 2020-11-17T01:36:30Z

ok so this looks fine. merge and then open issues / followup for recently discovered cases?

jorisvandenbossche

I think we should first discuss / decide on what behaviour we actually want for containment involving missing values (as the item being checked).

It also seems (even apart from the noted errors with pd.NA) that the situation right now is inconsistent? (np.nan gives False, pd.NaT gives True)

Also, can you add a base extension test?

jorisvandenbossche · 2020-11-17T21:26:54Z

Now, I don't really know myself at the moment what behaviour I would expect here. Listing all possible options (e.g. for pd.NA in pd.array([1, 2, pd.NA])):

- Return True (because pd.NA is present in the array)
- Return False (because pd.NA is not equal to itself, so cannot be "found")
- Raise an error (because it is ambiguous, since there is an argument for both True and False, or because there is simply no "correct" answer, by the same logic as we also raise an error for bool(pd.NA))
  - Counterargument: in dictionaries pd.NA can be used as a key, so you could argue that containment is similar to dictionary key checking
- Return pd.NA (similar reasons as above, plus since pd.NA values are considered "unknown", it is also "unknown" whether the unknown value is contained in the array)

Quickly looking at some other languages to see what they do:

R returns True:

> NA %in% c(1, 2, 3, NA)
[1] TRUE

Julia is "strict" about the "unknown" meaning of missing: if a missing value is present in the array, then even a value that is otherwise absent from the array is not known to be absent (so the check does not return false, but missing):

julia> arr = [1 2 3 missing]
1×4 Array{Union{Missing, Int64},2}:
 1  2  3  missing

julia> 1 in arr
true

julia> 5 in arr
missing

julia> missing in arr
missing

julia> arr = [1 2 3]
1×3 Array{Int64,2}:
 1  2  3

julia> 5 in arr
false

julia> missing in arr
missing
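The Julia results above can be read as three-valued logic. As a rough illustration only (a sketch in plain Python, not how Julia or pandas implement it), the function below uses a hypothetical MISSING sentinel and returns True, False, or MISSING: a definite match wins, and otherwise any comparison involving an unknown value leaves the answer unknown.

```python
MISSING = object()  # hypothetical sentinel standing in for Julia's `missing`


def three_valued_in(item, values):
    """Return True, False, or MISSING under 'unknown value' semantics."""
    saw_unknown = False
    for value in values:
        if item is MISSING or value is MISSING:
            # Comparing with an unknown value has an unknown result.
            saw_unknown = True
        elif value == item:
            # A definite match settles the question.
            return True
    # No definite match: the answer is unknown if any comparison was unknown.
    return MISSING if saw_unknown else False


with_missing = [1, 2, 3, MISSING]
without_missing = [1, 2, 3]

print(three_valued_in(1, with_missing) is True)
print(three_valued_in(5, with_missing) is MISSING)
print(three_valued_in(MISSING, without_missing) is MISSING)
```

Under this reading, a plain False is only possible when neither the item nor any element is unknown.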

topper-123 · 2020-11-17T23:20:02Z

Yes, @jorisvandenbossche, those are my doubts about nan-likes as well: there doesn't seem to be any inherently correct choice, as it looks to me.

We have however already in pandas that:

>>> arr2 = pd.array([pd.NA, "a"])
>>> pd.NA in arr2  # see OP
True
>>> dta = pd.date_range("2016-01-01", periods=2)._data
>>> dta[0] = pd.NaT
>>> pd.NaT in dta
True

So from a backward-compat standpoint there is some argument for returning True in pandas when looking for a nan-like in the ExtensionArray.

jorisvandenbossche · 2020-11-18T08:41:17Z

Specifically, the fact that pd.NA in pd.array([pd.NA, "a"]) returns True is something I wouldn't use as a "prior art" argument, as I think this is mostly accidental behaviour and we never really considered the desired behaviour of __contains__ for nullable EAs (and those types are not yet considered stable, so we can change the behaviour if we decide this is needed).

The fact that it returns True for pd.NaT certainly has a much longer history, but I don't know to what extent this was intentional (@jbrockmendel ?). For example the numpy counterpart actually returns False:

In [20]: arr = np.array(["NaT"], dtype="datetime64[ns]")

In [21]: arr[0]
Out[21]: numpy.datetime64('NaT')

In [22]: arr[0] in arr
Out[22]: False

Something else, the current isna(item) check is also too broad, as (I think from looking at the code), it will make that things like None in pd.array([pd.NA, "a"]) or np.nan in pd.array([pd.NA, "a"]) will also return True, which I think we certainly don't want?

jbrockmendel · 2020-11-19T20:44:16Z

The fact that it returns True for pd.NaT certainly has a much longer history, but I don't know to what extent this was intentional

I'm pretty sure that is unintentional. I've been fixing some isna checks to use is_valid_nat_for_dtype, but haven't gotten all of them.

Something else, the current isna(item) check is also too broad, as (I think from looking at the code), it will make that things like None in pd.array([pd.NA, "a"]) or np.nan in pd.array([pd.NA, "a"]) will also return True, which I think we certainly don't want?

I'd advocate updating is_valid_nat_for_dtype to reflect those exclusions. (And renaming valid_nat -> valid_na)

In [22]: arr[0] in arr
Out[22]: False

That is surprising. My intuition is that arr[0] in arr should be True for any arr. I'm open to being convinced otherwise.

jorisvandenbossche · 2020-11-19T20:49:16Z

That is surprising. My intuition is that arr[0] in arr should be True for any arr. I'm open to being convinced otherwise.

For NaN/NaT, those values are not equal to itself, and I think numpy uses equality to implement __contains__

jbrockmendel · 2020-11-19T21:05:28Z

For NaN/NaT, those values are not equal to itself, and I think numpy uses equality to implement contains

My intuition is based more around our Index implementations that use get_loc.

What would the idiomatic way of checking any(x is pd.NA for x in arr) be?

topper-123 · 2020-11-21T23:05:08Z

What would the idiomatic way of checking any(x is pd.NA for x in arr) be?

arr.isna().any() IMO.

I think some users will be surprised if they check for nan-likes using __contains__, because there are two reasonable interpretations:

- pd.NA in arr should be interpreted as (pd.NA == arr).any(), resulting in False even if a nan-like is in arr
- pd.NA in arr should be interpreted as arr.isna().any(), resulting in True if a nan-like is in arr, but it will also return True for e.g. pd.NaT in arr, which may be unintended (or has to be decided upon).

A solution to avoid ambiguity is to raise a TypeError, if a nan-like is supplied to __contains__:

    def __contains__(self, item) -> bool:
        if isna(item):
            raise TypeError("NA value not allowed. Use .isna method to check for NA values.")
        else:
            return (item == self).any()

This would at least ensure that "a" in arr doesn't raise unexpectedly.

jbrockmendel · 2020-11-22T15:47:10Z

What would the idiomatic way of checking any(x is pd.NA for x in arr) be?

arr.isna().any() IMO.

My complaint about that is it doesn't distinguish between different NAs, most relevant for object dtype.

This would at least ensure that "a" in arr doesn't raise unexpectedly.

Short-term, can we fix this without taking a stance on the NA topic?

My intuition is that obj in pd.array(arr) should match obj in pd.Index(arr) and obj in pd.Series(arr). Is this controversial?

jreback · 2020-11-24T14:05:53Z

What would the idiomatic way of checking any(x is pd.NA for x in arr) be?

arr.isna().any() IMO.

My complaint about that is it doesn't distinguish between different NAs, most relevant for object dtype.

This would at least ensure that "a" in arr doesn't raise unexpectedly.

Short-term, can we fix this without taking a stance on the NA topic?

My intuition is that obj in pd.array(arr) should match obj in pd.Index(arr) and obj in pd.Series(arr). Is this controversial?

agree with @jbrockmendel

I think having NaN != itself is different from a membership test. We do not want special cases.

pandas/tests/arrays/categorical/test_operators.py

pandas/tests/arrays/string_/test_string.py

jorisvandenbossche · 2020-11-24T16:46:51Z

Short-term, can we fix this without taking a stance on the NA topic?

How is fixing this not taking a stance? ;)
Because to fix it we need to choose a certain behaviour regardless, which then de facto becomes the __contains__ behaviour for pd.NA.

The obj in pd.Index(arr) example from @jbrockmendel is a good comparison, because it basically does an engine/hashtable lookup operation under the hood, and so I suppose it will be consistent with indexing (series.loc[pd.NA] if the index has missing values).

So that basically gives two possible interpretations / implementations of __contains__:

equality-based: ((arr == obj).any())
lookup-based (eg pd.NA in dict.fromkeys(arr), or our indexing engine lookup)

While numpy uses the first one, it seems to make sense to use the second option for membership test like __contains__.
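The difference between the two interpretations can be sketched in plain Python. This is only an analogy: it uses a Python float NaN rather than pd.NA (NaN's == returns a plain False instead of NA), and a dict stands in for the indexing engine. The helper names are hypothetical. The dict lookup finds the NaN because key lookup checks identity before equality.

```python
nan = float("nan")
values = [1.0, nan, 3.0]


def contains_by_equality(item, values):
    """Equality-based membership: the analogue of (arr == obj).any()."""
    return any(value == item for value in values)


def contains_by_lookup(item, values):
    """Lookup-based membership: the analogue of obj in dict.fromkeys(arr)."""
    return item in dict.fromkeys(values)


# An ordinary value is found either way.
print(contains_by_equality(3.0, values), contains_by_lookup(3.0, values))
# NaN is never equal to itself, so only the lookup finds it.
print(contains_by_equality(nan, values), contains_by_lookup(nan, values))
```

Note that the dict analogy only finds the very same NaN object; a separately created float("nan") would not match, so this says nothing about how pandas' own hashtables treat distinct NaN values.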

But, I think there is also something to say for using the "unknown value" interpretation of pd.NA. And then we don't know if some "unknown value" is present in the array, and we shouldn't return True or False.

jorisvandenbossche · 2020-11-24T16:54:52Z

pandas/tests/arrays/string_/test_string.py

+
+def test_contains():
+    # GH-37867
+    arr = pd.arrays.StringArray(np.array(["a", "b"], dtype=object))


Suggested change

arr = pd.arrays.StringArray(np.array(["a", "b"], dtype=object))

arr = pd.array(["a", "b"], dtype="string")

is a bit shorter to construct those

jorisvandenbossche · 2020-11-24T19:40:28Z

@topper-123 since you are adding a method on the base extension array class, I think we should also add a test for this in the generic base extension tests. There are fixtures for data with/without missing values and for the na_value, so it should be possible to test this in general.

topper-123 · 2020-11-24T19:47:22Z

@topper-123 since you are adding a method on the base extension array class, I think we should also add a test for this in the generic base extension tests. There are fixtures for data with/without missing values and for the na_value, so it should be possible to test this in general.

I've looked, but can't find the base extension tests. Could you say where they're located?

jorisvandenbossche · 2020-11-24T20:11:06Z

In pandas/tests/extension/base/, and then in one of those files. Not sure which file is best fitting, maybe interface.py, or methods.py

jbrockmendel · 2020-11-25T02:38:56Z

Short-term, can we fix this without taking a stance on the NA topic?

How is fixing this not taking a stance? ;)

I was specifically referring to "This would at least ensure that "a" in arr doesn't raise unexpectedly.", which I think we can ensure without taking a stance on the behavior of pd.NA in arr.

jorisvandenbossche · 2020-11-25T12:41:57Z

I was specifically referring to "This would at least ensure that "a" in arr doesn't raise unexpectedly."

Ah, yes, that's something we certainly agree on, I think: this at least shouldn't raise!
(but to fix that we still need to choose a behaviour for pd.NA in .. as well)

Now, I am fine with following our behaviour for Index containment / indexing, which is to treat NaN, pd.NA, etc. as equal to themselves (in contrast to == equality).

topper-123 · 2020-11-25T19:39:20Z

I've updated the PR.

For comparisons with nan-likes, it now returns True if the na value is arr.dtype.na_value, else False.
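As an illustration of this rule, here is a toy sketch in plain Python. It is not the code merged in this PR: FakeNA, is_nan_like and ToyArray are hypothetical stand-ins for pd.NA, isna and an ExtensionArray, and the class attribute na_value plays the role of arr.dtype.na_value.

```python
import math


class FakeNA:
    """Hypothetical stand-in for pd.NA (not the pandas implementation)."""

    def __eq__(self, other):
        return self  # comparisons propagate NA

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


NA = FakeNA()


def is_nan_like(obj):
    """Rough, hypothetical analogue of isna() for scalars."""
    return obj is None or obj is NA or (isinstance(obj, float) and math.isnan(obj))


class ToyArray:
    na_value = NA  # plays the role of arr.dtype.na_value

    def __init__(self, values):
        self._values = list(values)

    def isna(self):
        return [value is self.na_value for value in self._values]

    def __contains__(self, item):
        if is_nan_like(item):
            # Only the array's own NA sentinel can be "contained".
            return item is self.na_value and any(self.isna())
        # Skip NA elements so that comparing against them never raises.
        return any(
            value is not self.na_value and value == item
            for value in self._values
        )


arr = ToyArray([NA, "a"])
print("a" in arr, NA in arr, None in arr, float("nan") in arr)
```

With this rule, "a" in arr no longer depends on where the NA sits, and nan-likes other than the array's own na_value are reported as absent.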

topper-123 · 2020-11-26T07:06:48Z

The Travis failure looks unrelated. I'll run the PR again after comments.

jreback

lgtm, some comments.

jreback · 2020-11-26T15:57:37Z

pandas/core/arrays/base.py

+        # comparisons of any item to pd.NA always return pd.NA, so e.g. "a" in [pd.NA]
+        # would raise a TypeError. The implementation below works around that.
+        if item is self.dtype.na_value:
+            return isna(self).any() if self._can_hold_na else False


can you use self.isna()

pandas/tests/extension/base/interface.py

pandas/core/arrays/base.py

jreback · 2020-11-26T18:02:59Z

pandas/tests/extension/base/interface.py

+        # the settled on rule is: `nan_like in arr` is True if nan_like is
+        # arr.dtype.na_value and arr.isna().any() is True. Else the check returns False.
+
+        for this_data in [data, data_missing]:


see my comments below on how to actually parameterize this

I don't think it's possible to parametrize fixtures? Do you have an example or hint on how that's done?

grep for request.getfixturevalue in tests; we just had a use of this

Couldn't get it to work, I'll try it again tonight.

Sorry, but I would simply leave it as is; using getfixturevalue only makes it way more complicated than it needs to be

pandas/tests/extension/base/interface.py

Co-authored-by: Joris Van den Bossche <[email protected]>

topper-123 · 2020-11-28T09:34:43Z

pandas/tests/extension/arrow/test_bool.py

@@ -50,6 +50,10 @@ def test_view(self, data):
        # __setitem__ does not work, so we only have a smoke-test
        data.view()

+    @pytest.mark.xfail(raises=AssertionError, reason="Not implemented yet")
+    def test_contains(self, data, data_missing):
+        super().test_contains(data, data_missing)


The Arrow arrays have Arrow Nulls as the dtype.na_value, while data_missing[0] gives None.

Maybe @TomAugspurger could look into that as part of his Arrow work?

The Arrow arrays have Arrow Nulls as the dtype.na_value, while data_missing[0] gives None.

It's fine to simply skip it as you did. These are only test EAs that were needed for certain specific tests, so it's no problem that they don't fully work otherwise.

The actual public arrays using Arrow under the hood (e.g. the new ArrowStringArray) have a different implementation and are fully tested.

Should we also base these test arrays on pd.NA? Otherwise we can use the same code as in the string array to replace the scalar return value of None with na_value.

topper-123 · 2020-11-28T12:46:01Z

The Travis failure looks unrelated and I can't reproduce it on my computer. So IMO this PR passes currently, but I could make another run if there's agreement on the content.

jorisvandenbossche · 2020-11-28T14:52:13Z

Yes, don't worry about the travis failure on this PR

pandas/core/arrays/categorical.py

jorisvandenbossche · 2020-11-28T15:04:07Z

pandas/tests/extension/test_categorical.py

+    def test_contains(self, data, data_missing):
+        # GH-37867
+        # na value handling in Categorical.__contains__ is deprecated.
+        # See base.BaseInterFaceTests.test_contains for more details.


This duplicates what you have in tests/arrays/categorical/test_operators.py ?

Unfortunately not: Categorical already had a __contains__ method and it's more permissive than the new one. So, in this file we have (below) assert na_value_type in data_missing, while the base tests method is assert na_value_type not in data_missing (notice the not).

NA values are also more complicated in categoricals, because in some cases we want to accept pd.NaT and in other cases not. I'd like to take it in another round (or let it slide).

To be clear, I was not referring to the base tests that this is overriding, but the original test you added to tests/arrays/categorical/test_operators.py which also tests this more permissive behaviour?

Yeah. I'll delete the ones in tests/arrays/categorical/test_operators.py.

jreback · 2020-11-28T17:23:54Z

pandas/core/arrays/numpy_.py

@@ -51,6 +51,13 @@ def numpy_dtype(self) -> np.dtype:
        """
        return self._dtype

+    @property
+    def na_value(self) -> object:
+        if issubclass(self.type, np.floating):


isn't this always nan?

The problem was that np.float64("nan") is not np.nan. However, I've fixed this, so this method is actually not necessary.

pandas/tests/extension/decimal/array.py

jreback · 2020-11-28T17:24:39Z

pandas/tests/extension/json/test_json.py

@@ -143,6 +143,11 @@ def test_custom_asserts(self):
        with pytest.raises(AssertionError, match=msg):
            self.assert_frame_equal(a.to_frame(), b.to_frame())

+    @pytest.mark.xfail(reason="comparison method not implemented on JSONArray")


is there an issue number for this?

The issue number is the first line in the method?

in the xfail pls

jreback

looks good and fine merging like this, but should change to use the nulls_fixture which is more comprehensive. if you can do this today can get this in

jreback · 2020-11-29T16:05:14Z

pandas/tests/extension/base/interface.py

+        assert na_value in data_missing
+        assert na_value not in data
+
+        # the data can never contain other nan-likes than na_value


might consider actually using the nulls_fixture as this is more comprehensive (sure it will run the other checks multiple times but no big deal).

I am personally -1 on using a fixture for this (we could move the list of nulls that is currently used for defining the fixture to a constant in _testing.py and use that constant in both places, though)

in this case?

we already have a nulls fixture and use it in a great many places

we need to be comprehensive and general in testing - specific cases are ok sometimes

ok here's the fixture contents

@pytest.fixture(params=[None, np.nan, pd.NaT, float("nan"), pd.NA], ids=str)

so if you add float('nan') then this should cover

I like the generality of using the nulls_fixture too. I've added it in the newest commit.

jreback · 2020-11-29T16:05:39Z

pandas/tests/extension/test_categorical.py

+        assert na_value in data_missing
+        assert na_value not in data
+
+        # the data can never contain other nan-likes than na_value


same comment here.

jreback · 2020-11-29T19:11:24Z

lgtm. thanks @topper-123

topper-123 force-pushed the ExtensionArray.__contains__ branch from 589d0d3 to 625aa1f Compare November 15, 2020 17:23

jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves labels Nov 15, 2020

jreback added this to the 1.2 milestone Nov 15, 2020

jbrockmendel reviewed Nov 15, 2020

jbrockmendel reviewed Nov 15, 2020

jorisvandenbossche requested changes Nov 17, 2020

jreback requested changes Nov 24, 2020

jorisvandenbossche changed the title ~~BUG: membership checks on ExtensionArray containing NA values~~ API: membership checks on ExtensionArray containing NA values Nov 24, 2020

jorisvandenbossche reviewed Nov 24, 2020

topper-123 force-pushed the ExtensionArray.__contains__ branch from 7f423a6 to 547388c Compare November 24, 2020 18:50

topper-123 force-pushed the ExtensionArray.__contains__ branch from c0510e0 to 122a6fe Compare November 25, 2020 22:54

jreback requested changes Nov 26, 2020

minor changes

fdb9deb

jreback requested changes Nov 26, 2020

minor issues

52e2b43

jorisvandenbossche reviewed Nov 27, 2020

Update pandas/tests/extension/base/interface.py

f21890e

Co-authored-by: Joris Van den Bossche <[email protected]>

jorisvandenbossche mentioned this pull request Nov 27, 2020

Follow-up on basic FloatingArray implementation #38110

Open

10 tasks

topper-123 added 2 commits November 28, 2020 07:51

Allow for na values that are of same type as the data

6f633c7

cleanups

d8bdb2e

topper-123 commented Nov 28, 2020

jorisvandenbossche reviewed Nov 28, 2020

jreback requested changes Nov 28, 2020

topper-123 added 2 commits November 28, 2020 18:04

Fixes

4e4dbc4

remove text in categorical.py

a1583e7

topper-123 force-pushed the ExtensionArray.__contains__ branch from 5b6e12b to a1583e7 Compare November 28, 2020 18:05

topper-123 added 4 commits November 28, 2020 18:12

doc fix

3c2c2b0

add gh number

237fe45

linting

37219c3

clean tests

c4a6c36

jreback requested changes Nov 29, 2020

use nulls_fixture

245c99a

jreback approved these changes Nov 29, 2020

jreback merged commit 47d0da6 into pandas-dev:master Nov 29, 2020

topper-123 deleted the ExtensionArray.__contains__ branch November 29, 2020 19:50

martinfleis mentioned this pull request Dec 14, 2020

TST: pandas TestInterface.test_contains fails on fixture 'nulls_fixture' not found geopandas/geopandas#1735

Closed

jorisvandenbossche mentioned this pull request Dec 15, 2020

TST: don't use global fixture in the base extension tests #38494

Merged

jorisvandenbossche mentioned this pull request Jan 10, 2021

fix series.isin slow issue with Dtype IntegerArray #38379

Merged

5 tasks
