Skip to content

REGR: isin with nullable types with missing values raising #42473

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jul 14, 2021
Merged
18 changes: 3 additions & 15 deletions asv_bench/benchmarks/algos/isin.py
Original file line number Diff line number Diff line change
@@ -1,10 +1,5 @@
import numpy as np

try:
from pandas.compat import np_version_under1p20
except ImportError:
from pandas.compat.numpy import _np_version_under1p20 as np_version_under1p20

from pandas import (
Categorical,
NaT,
Expand Down Expand Up @@ -283,10 +278,6 @@ class IsInLongSeriesLookUpDominates:
def setup(self, dtype, MaxNumber, series_type):
N = 10 ** 7

# https://github.com/pandas-dev/pandas/issues/39844
if not np_version_under1p20 and dtype in ("Int64", "Float64"):
raise NotImplementedError

if series_type == "random_hits":
array = np.random.randint(0, MaxNumber, N)
if series_type == "random_misses":
Expand All @@ -297,7 +288,8 @@ def setup(self, dtype, MaxNumber, series_type):
array = np.arange(N) + MaxNumber

self.series = Series(array).astype(dtype)
self.values = np.arange(MaxNumber).astype(dtype)

self.values = np.arange(MaxNumber).astype(dtype.lower())

def time_isin(self, dtypes, MaxNumber, series_type):
self.series.isin(self.values)
Expand All @@ -313,16 +305,12 @@ class IsInLongSeriesValuesDominate:
def setup(self, dtype, series_type):
N = 10 ** 7

# https://github.com/pandas-dev/pandas/issues/39844
if not np_version_under1p20 and dtype in ("Int64", "Float64"):
raise NotImplementedError

if series_type == "random":
vals = np.random.randint(0, 10 * N, N)
if series_type == "monotone":
vals = np.arange(N)

self.values = vals.astype(dtype)
self.values = vals.astype(dtype.lower())
M = 10 ** 6 + 1
self.series = Series(np.arange(M)).astype(dtype)

Expand Down
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.1.rst
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ Fixed regressions
- Performance regression in constructing a :class:`DataFrame` from a dictionary of dictionaries (:issue:`42338`)
- Fixed regression in :meth:`DataFrame.agg` dropping values when the DataFrame had an Extension Array dtype, a duplicate index, and ``axis=1`` (:issue:`42380`)
- Fixed regression in indexing with a ``list`` subclass incorrectly raising ``TypeError`` (:issue:`42433`, :issue:42461`)
- Fixed regression in :meth:`DataFrame.isin` and :meth:`Series.isin` raising ``TypeError`` with nullable data containing at least one missing value (:issue:`42405`)
-

.. ---------------------------------------------------------------------------
Expand Down
16 changes: 16 additions & 0 deletions pandas/_libs/missing.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -334,6 +334,22 @@ def isnaobj2d_old(arr: ndarray) -> ndarray:
return result.view(np.bool_)


@cython.wraparound(False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the same as: cpdef ndarray[uint8_t] isnaobj(ndarray arr) no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its might be a very tiny bit faster but likely not worth it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here was that we need to be able to explicitly check for NA (and not other missing values) to maintain the existing behavior

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we actually test this? i think we accept multiple types of Nulls here (e.g. None is accepted as well as np.nan for strings). I would just use our existing routine unless you have tests that show this is not sufficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not tested (hence the regression :)

On 1.2.x the behavior is

In [1]: import pandas as pd
  
In [2]: import numpy as np

In [3]: ser = pd.Series([1, 1, pd.NA], dtype="Int64")

In [4]: ser.isin([pd.NA])
Out[4]: 
0    False
1    False
2     True
dtype: bool

In [5]: ser.isin([np.nan])
Out[5]: 
0    False
1    False
2    False
dtype: bool

This isin behavior singling out pd.NA also matches strings:

In [7]: ser = pd.Series(["a", "b", None], dtype="string")

In [8]: ser
Out[8]: 
0       a
1       b
2    <NA>
dtype: string

In [9]: ser.isin([np.nan])
Out[9]: 
0    False
1    False
2    False
dtype: bool

In [10]: ser.isin([pd.NA])
Out[10]: 
0    False
1    False
2     True
dtype: bool

So this matches existing behavior and how StringDtype handles NA for isin, even if the behavior itself may be questionable

Copy link
Member Author

@mzeitlin11 mzeitlin11 Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you add an explict test for this?

The added test has parameterizations which hit this explicitly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also -1 on the expansion of yet another null checker. if we want to keep the existing then do this via a flag to isnaobj.

If we're uncertain about this behavior, could also just turn the added cython routine into something similar done in python space just to fix the regression. Then that could easily be removed for 1.4 if we decide to change the behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. this should match anything we allow for nulls in the _from_sequence which is what isnaobj does

i havent looked at this closely, but the canonical behavior belongs in is_valid_na_for_dtype

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're going to at some point need a vectorized is_matching_na, which this would be a special case of

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i havent looked at this closely, but the canonical behavior belongs in is_valid_na_for_dtype

Not sure if along the lines of what you were thinking, but pushed a different solution just checking for presence of a valid missing value using dtype.na_value

@cython.boundscheck(False)
def has_NA(ndarray[object, ndim=1] arr) -> bool:
"""
Return True if NA present in arr, False otherwise
"""
cdef:
Py_ssize_t i

for i in range(len(arr)):
if arr[i] is C_NA:
return True

return False


def isposinf_scalar(val: object) -> bool:
return util.is_float_object(val) and val == INF

Expand Down
18 changes: 13 additions & 5 deletions pandas/core/arrays/masked.py
Original file line number Diff line number Diff line change
Expand Up @@ -403,12 +403,20 @@ def isin(self, values) -> BooleanArray: # type: ignore[override]

from pandas.core.arrays import BooleanArray

result = isin(self._data, values)
# algorithms.isin will eventually convert values to an ndarray, so no extra
# cost to doing it here first
values_arr = np.asarray(values)
result = isin(self._data, values_arr)

if self._hasna:
if libmissing.NA in values:
result += self._mask
else:
result *= np.invert(self._mask)
values_have_NA = is_object_dtype(values_arr.dtype) and libmissing.has_NA(
values_arr
)

# For now, NA does not propagate so set result according to presence of NA,
# see https://github.com/pandas-dev/pandas/pull/38379 for some discussion
result[self._mask] = values_have_NA
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line should be equivalent to the deleted if/else AFAICT, but I think makes logic easier to follow


mask = np.zeros_like(self, dtype=bool)
return BooleanArray(result, mask, copy=False)

Expand Down
21 changes: 21 additions & 0 deletions pandas/tests/series/methods/test_isin.py
Original file line number Diff line number Diff line change
Expand Up @@ -156,6 +156,27 @@ def test_isin_float_in_int_series(self, values):
expected = Series([True, False])
tm.assert_series_equal(result, expected)

@pytest.mark.parametrize("dtype", ["boolean", "Int64", "Float64"])
@pytest.mark.parametrize(
"data,values,expected",
[
([0, 1, 0], [1], [False, True, False]),
([0, 1, 0], [1, pd.NA], [False, True, False]),
([0, pd.NA, 0], [1, 0], [True, False, True]),
([0, 1, pd.NA], [1, pd.NA], [False, True, True]),
([0, 1, pd.NA], [1, np.nan], [False, True, False]),
([0, pd.NA, pd.NA], [np.nan, pd.NaT, None], [False, False, False]),
],
)
def test_isin_masked_types(self, dtype, data, values, expected):
# GH#42405
ser = Series(data, dtype=dtype)

result = ser.isin(values)
expected = Series(expected, dtype="boolean")

tm.assert_series_equal(result, expected)


@pytest.mark.slow
def test_isin_large_series_mixed_dtypes_and_nan():
Expand Down