REGR: isin with nullable types with missing values raising #42473

mzeitlin11 · 2021-07-09T22:12:40Z

closes CI: series_methods.IsInLongSeriesLookUpDominates.time_isin fails with NumPy 1.20+ #39844
closes BUG: isin() with missing values does not work in 1.3.0 with extension dtypes #42405
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

It didn't look like a consensus was reached on NA propagation in #39844, so this should maintain the 1.2.x behavior of not propagating and just fix the regression.

mzeitlin11 · 2021-07-09T22:13:39Z

pandas/_libs/lib.pyx

+        if arr[i] is C_NA:
+            return True
+
+    return False


This may be overkill, but fits with some other cython methods like has_infs and would hopefully be useful in other locations

could live in _libs.missing?

mzeitlin11 · 2021-07-09T22:15:01Z

pandas/core/arrays/masked.py

+
+            # For now, NA does not propagate so set result according to presence of NA,
+            # see https://github.com/pandas-dev/pandas/pull/38379 for some discussion
+            result[self._mask] = values_have_NA


This line should be equivalent to the deleted if/else AFAICT, but I think it makes the logic easier to follow
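For readers following along, here is a rough pure-Python model of what that masked assignment does. It is only a sketch: `masked_isin` and its arguments are hypothetical names, plain lists stand in for the data and mask arrays, and `values_have_na` is passed in rather than computed.

```python
def masked_isin(data, mask, values, values_have_na):
    """Toy model of isin on a masked array (hypothetical helper).

    data: underlying values; entries under the mask are placeholders
    mask: True where the element is missing
    values_have_na: whether the lookup values contain the NA sentinel
    """
    lookup = set(values)
    # Membership test on the raw data, ignoring the mask for now.
    result = [x in lookup for x in data]
    # Counterpart of `result[self._mask] = values_have_NA`:
    # every masked position gets the same answer.
    return [values_have_na if m else r for r, m in zip(result, mask)]


data = [1, 1, 0]             # the 0 is only a placeholder under the mask
mask = [False, False, True]

# The placeholder 0 must not count as a match when NA is not looked up.
print(masked_isin(data, mask, [0], values_have_na=False))  # [False, False, False]
# When NA is among the lookup values, the masked slot becomes True.
print(masked_isin(data, mask, [1], values_have_na=True))   # [True, True, True]
```

The point of the single assignment is that the masked slots never depend on whatever placeholder happens to sit in the underlying data.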

jbrockmendel · 2021-07-10T02:17:40Z

asv_bench/benchmarks/algos/isin.py

-    from pandas.compat import np_version_under1p20
-except ImportError:
-    from pandas.compat.numpy import _np_version_under1p20 as np_version_under1p20
+from pandas.core.dtypes.cast import is_extension_array_dtype


get this from dtypes.common

jbrockmendel · 2021-07-10T02:18:41Z

pandas/_libs/lib.pyx

@@ -538,6 +538,24 @@ def has_infs(floating[:] arr) -> bool:
    return ret


+@cython.wraparound(False)
+@cython.boundscheck(False)
+def has_NA(ndarray arr) -> bool:


ndarray[object]?

When I try that (or object[:]), I end up getting errors like ValueError: Buffer dtype mismatch, expected 'Python object' but got 'double'

yah, this should only be called on object-dtype, in other cases we already know the result is False

Ah got it thanks, have changed
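To make the object-dtype point concrete, a minimal pure-Python sketch of the guard might look like the following. This is not the Cython routine from the diff: `NA` is a hypothetical stand-in for pd.NA, and `dtype_kind` is assumed to be a NumPy-style one-letter kind code supplied by the caller.

```python
NA = object()  # hypothetical stand-in for pd.NA


def has_na(values, dtype_kind):
    """Return True if the NA sentinel appears in values.

    dtype_kind uses NumPy's one-letter codes ("O" = object, "f" = float).
    Only object-dtype data can hold the sentinel, so every other kind
    is answered without scanning.
    """
    if dtype_kind != "O":
        return False
    for v in values:
        if v is NA:  # identity check, so None and nan do not match
            return True
    return False


print(has_na([1.0, float("nan")], "f"))  # False
print(has_na(["a", NA], "O"))            # True
print(has_na(["a", None], "O"))          # False
```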

jbrockmendel · 2021-07-11T17:24:41Z

asv_bench/benchmarks/algos/isin.py

-        self.values = np.arange(MaxNumber).astype(dtype)
+
+        if is_extension_array_dtype(dtype):
+            vals_dtype = dtype.type


should this be more specific to Int64Dtype/Float64Dtype? if we had e.g. PeriodDtype then the astype on L308 would break

Sounds good, have limited to those to make clearer what can pass through here in current form

jreback · 2021-07-12T00:41:01Z

asv_bench/benchmarks/algos/isin.py

+            "float64",
+            "float32",
+            "object",
+            Int64Dtype(),


this is very odd, why doesn't "Int64" work here?

This was so the type attribute could be accessed to get the underlying numpy compatible type. I'll try to reorganize to avoid this

jreback · 2021-07-12T00:41:15Z

asv_bench/benchmarks/algos/isin.py

@@ -297,6 +298,10 @@ def setup(self, dtype, MaxNumber, series_type):
            array = np.arange(N) + MaxNumber

        self.series = Series(array).astype(dtype)
+
+        if isinstance(dtype, (Int64Dtype, Float64Dtype)):


see above, this is a problem if you need to do this
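One possible way to keep the benchmark parameters as plain strings, sketched here as an assumption rather than as what the PR ended up doing, is to derive the NumPy-compatible name from the nullable dtype's string alias. `values_dtype_name` is a hypothetical helper, and it only covers the two nullable names discussed in this thread.

```python
def values_dtype_name(dtype):
    """Return the NumPy dtype name to use for the lookup values.

    Assumes dtype is a string parameter. "Int64" and "Float64" are the
    nullable aliases; their NumPy counterparts are the lower-cased names.
    """
    nullable = {"Int64", "Float64"}  # the only nullable names handled here
    return dtype.lower() if dtype in nullable else dtype


for name in ["int64", "float32", "object", "Int64", "Float64"]:
    print(name, "->", values_dtype_name(name))
```

With something like this, the params list can hold "Int64" instead of Int64Dtype(), and no isinstance check on the dtype object is needed.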

jreback · 2021-07-12T00:43:07Z

pandas/_libs/missing.pyx

@@ -334,6 +334,22 @@ def isnaobj2d_old(arr: ndarray) -> ndarray:
    return result.view(np.bool_)


+@cython.wraparound(False)


this is the same as: cpdef ndarray[uint8_t] isnaobj(ndarray arr) no?

it might be a very tiny bit faster but likely not worth it

The issue here was that we need to be able to explicitly check for NA (and not other missing values) to maintain the existing behavior

we actually test this? i think we accept multiple types of Nulls here (e.g. None is accepted as well as np.nan for strings). I would just use our existing routine unless you have tests that show this is not sufficient.

This is not tested (hence the regression :)

On 1.2.x the behavior is

    In [1]: import pandas as pd

    In [2]: import numpy as np

    In [3]: ser = pd.Series([1, 1, pd.NA], dtype="Int64")

    In [4]: ser.isin([pd.NA])
    Out[4]:
    0    False
    1    False
    2     True
    dtype: bool

    In [5]: ser.isin([np.nan])
    Out[5]:
    0    False
    1    False
    2    False
    dtype: bool

This isin behavior singling out pd.NA also matches strings:

    In [7]: ser = pd.Series(["a", "b", None], dtype="string")

    In [8]: ser
    Out[8]:
    0       a
    1       b
    2    <NA>
    dtype: string

    In [9]: ser.isin([np.nan])
    Out[9]:
    0    False
    1    False
    2    False
    dtype: bool

    In [10]: ser.isin([pd.NA])
    Out[10]:
    0    False
    1    False
    2     True
    dtype: bool

So this matches existing behavior and how StringDtype handles NA for isin, even if the behavior itself may be questionable

did you add an explicit test for this?

The added test has parameterizations which hit this explicitly.

also -1 on the expansion of yet another null checker. if we want to keep the existing behavior then do this via a flag to isnaobj.
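As a rough illustration of the flag idea, the sketch below shows a single checker that either accepts any null-like value or only the NA sentinel. It does not reflect isnaobj's real signature: `has_null`, the `na_only` flag, and the `NA` object are all hypothetical.

```python
import math

NA = object()  # hypothetical stand-in for pd.NA


def has_null(values, na_only=False):
    """Return True if values contains a missing value.

    With na_only=False, None and float nan count as missing too.
    With na_only=True, only the NA sentinel counts.
    """
    for v in values:
        if v is NA:
            return True
        if not na_only and (
            v is None or (isinstance(v, float) and math.isnan(v))
        ):
            return True
    return False


print(has_null([1, float("nan")]))                # True
print(has_null([1, float("nan")], na_only=True))  # False
print(has_null([1, NA], na_only=True))            # True
```

The second call is the case that matters for the 1.2.x behavior shown above: np.nan in the lookup values should not match a masked element.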

If we're uncertain about this behavior, could also just turn the added cython routine into something similar done in python space just to fix the regression. Then that could easily be removed for 1.4 if we decide to change the behavior.

sure. this should match anything we allow for nulls in the _from_sequence which is what isnaobj does

i havent looked at this closely, but the canonical behavior belongs in is_valid_na_for_dtype

I think we're going to at some point need a vectorized is_matching_na, which this would be a special case of

i havent looked at this closely, but the canonical behavior belongs in is_valid_na_for_dtype

Not sure if along the lines of what you were thinking, but pushed a different solution just checking for presence of a valid missing value using dtype.na_value

mzeitlin11 · 2021-07-12T04:14:10Z

I have fixed the benchmark issue for nullable types in a different way. If it still looks a bit kludgy, let me know and we can just leave that out of this PR

asv_bench/benchmarks/algos/isin.py

jreback · 2021-07-13T21:19:16Z

also rebase

jreback · 2021-07-14T18:19:13Z

@meeseeksdev backport 1.3.x

jreback · 2021-07-14T18:20:06Z

this is going to need a manual backport, cc @mzeitlin11

mzeitlin11 added 3 commits July 9, 2021 16:44

Push for testing asv fix

d9132d7

Add tests and whatsnew

1902d7a

Comment fixups

d2b988d

mzeitlin11 added the isin, NA - MaskedArrays, and Regression labels Jul 9, 2021

mzeitlin11 added this to the 1.3.1 milestone Jul 9, 2021

mzeitlin11 added 3 commits July 9, 2021 19:15

Fix benchmark

e60c0d4

Fix benchmark

e902218

Fix benchmark

814f978

jbrockmendel reviewed Jul 10, 2021

mzeitlin11 added 2 commits July 9, 2021 23:30

Move has_na, change import

0a9b4e6

Only check NA for object

6a68490

jbrockmendel reviewed Jul 11, 2021

mzeitlin11 and others added 2 commits July 11, 2021 15:22

Update benchmark to make allowed clearer

44bcd67

Merge branch 'master' into regr_isin_extension_types

821c6b6

jreback requested changes Jul 12, 2021

mzeitlin11 added 3 commits July 11, 2021 22:39

Merge remote-tracking branch 'upstream/master' into regr_isin_extensi…

893343f

…on_types

Simpler benchmark fix

e7b66b7

Simpler benchmark fix

3cfe98e

jreback reviewed Jul 12, 2021

mzeitlin11 added 2 commits July 13, 2021 20:54

Merge remote-tracking branch 'upstream/master' into regr_isin_extensi…

ec61ab1

…on_types

Handle in python space

17a50b9

jreback approved these changes Jul 14, 2021

jreback merged commit f8855eb into pandas-dev:master Jul 14, 2021

lumberbot-app bot added the Still Needs Manual Backport label Jul 14, 2021

mzeitlin11 deleted the regr_isin_extension_types branch July 14, 2021 18:20

mzeitlin11 added a commit to mzeitlin11/pandas that referenced this pull request Jul 15, 2021

Backport PR pandas-dev#42473: REGR: isin with nullable types with mis…

608cab5

…sing values raising

mzeitlin11 mentioned this pull request Jul 15, 2021

API: ExtensionArray.isin treatment of missing values #42545

Closed

jreback pushed a commit that referenced this pull request Jul 15, 2021

Backport PR #42473: REGR: isin with nullable types with missing value…

c8d510e

…s raising (#42543)

simonjayhawkins mentioned this pull request Jul 16, 2021

Backport PR #42473: REGR: isin with nullable types with missing values raising #42543

Merged

simonjayhawkins removed the Still Needs Manual Backport label Jul 16, 2021

feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021

REGR: isin with nullable types with missing values raising (pandas-de…

37de105

…v#42473)
