Commit 1d33e4c

BUG: Fixed TypeError for Series.isin() when large series and values contains NA (pandas-dev#60678) (pandas-dev#60736)
* BUG: Fixed TypeError for Series.isin() when large series and values contains NA (pandas-dev#60678)
* Add entry to whatsnew/v3.0.0.rst for bug fixing
* Replaced np.vectorize() with any() for minor performance improvement and add new test cases
* Fixed failed pre-commit.ci hooks: formatting errors in algorithms.py, inconsistent-namespace-usage in test_isin.py, sorted whatsnew entry
* Combined redundant if-statements to improve readability and performance
1 parent fef01c5 commit 1d33e4c

3 files changed: +31 -0 lines changed

doc/source/whatsnew/v3.0.0.rst (+1)

@@ -804,6 +804,7 @@ Other
 - Bug in :meth:`Index.sort_values` when passing a key function that turns values into tuples, e.g. ``key=natsort.natsort_key``, would raise ``TypeError`` (:issue:`56081`)
 - Bug in :meth:`Series.diff` allowing non-integer values for the ``periods`` argument. (:issue:`56607`)
 - Bug in :meth:`Series.dt` methods in :class:`ArrowDtype` that were returning incorrect values. (:issue:`57355`)
+- Bug in :meth:`Series.isin` raising ``TypeError`` when series is large (>10**6) and ``values`` contains NA (:issue:`60678`)
 - Bug in :meth:`Series.rank` that doesn't preserve missing values for nullable integers when ``na_option='keep'``. (:issue:`56976`)
 - Bug in :meth:`Series.replace` and :meth:`DataFrame.replace` inconsistently replacing matching instances when ``regex=True`` and missing values are present. (:issue:`56599`)
 - Bug in :meth:`Series.replace` and :meth:`DataFrame.replace` throwing ``ValueError`` when ``regex=True`` and all NA values. (:issue:`60688`)

pandas/core/algorithms.py (+6)

@@ -23,6 +23,7 @@
     iNaT,
     lib,
 )
+from pandas._libs.missing import NA
 from pandas._typing import (
     AnyArrayLike,
     ArrayLike,
@@ -544,10 +545,15 @@ def isin(comps: ListLike, values: ListLike) -> npt.NDArray[np.bool_]:
     # Ensure np.isin doesn't get object types or it *may* throw an exception
     # Albeit hashmap has O(1) look-up (vs. O(logn) in sorted array),
     # isin is faster for small sizes
+
+    # GH60678
+    # Ensure values don't contain <NA>, otherwise it throws exception with np.in1d
+
     if (
         len(comps_array) > _MINIMUM_COMP_ARR_LEN
         and len(values) <= 26
         and comps_array.dtype != object
+        and not any(v is NA for v in values)
     ):
         # If the values include nan we need to check for nan explicitly
         # since np.nan it not equal to np.nan
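
The added condition scans `values` with an identity test (`v is NA`). The sketch below is not pandas code. It uses a hypothetical stand-in class, `_NAType`, that copies two behaviours of `pd.NA` (comparisons return NA, and coercion to `bool` raises `TypeError`) to illustrate why an identity scan is safe for such a sentinel while an equality-based membership test may not be. The helper names `contains_na` and `membership_raises` are assumptions made for this illustration.

```python
class _NAType:
    """Hypothetical stand-in that mimics how pd.NA behaves in comparisons."""

    def __eq__(self, other):
        # Comparing anything with NA yields NA, not True or False.
        return self

    def __hash__(self):
        return id(self)

    def __bool__(self):
        # Like pd.NA, refuse to be coerced to a boolean.
        raise TypeError("boolean value of NA is ambiguous")

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def contains_na(values):
    """Identity-based scan, the same shape as the check added in the diff."""
    return any(v is NA for v in values)


def membership_raises(values):
    """Return True if the equality-based test `NA in values` raises TypeError."""
    try:
        NA in values
    except TypeError:
        return True
    return False


print(contains_na([1, NA]))        # identity scan never calls __eq__ or __bool__
print(membership_raises([1, NA]))  # `1 == NA` yields NA, whose truth value is undefined
```

With this stand-in, the identity scan never evaluates the truth value of a comparison result, so it works regardless of where NA sits in `values`. The equality-based test only succeeds when NA happens to be found by identity before any other element is compared.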

pandas/tests/series/methods/test_isin.py (+24)

@@ -211,6 +211,30 @@ def test_isin_large_series_mixed_dtypes_and_nan(monkeypatch):
     tm.assert_series_equal(result, expected)


+@pytest.mark.parametrize(
+    "dtype, data, values, expected",
+    [
+        ("boolean", [pd.NA, False, True], [False, pd.NA], [True, True, False]),
+        ("Int64", [pd.NA, 2, 1], [1, pd.NA], [True, False, True]),
+        ("boolean", [pd.NA, False, True], [pd.NA, True, "a", 20], [True, False, True]),
+        ("boolean", [pd.NA, False, True], [], [False, False, False]),
+        ("Float64", [20.0, 30.0, pd.NA], [pd.NA], [False, False, True]),
+    ],
+)
+def test_isin_large_series_and_pdNA(dtype, data, values, expected, monkeypatch):
+    # https://github.com/pandas-dev/pandas/issues/60678
+    # combination of large series (> _MINIMUM_COMP_ARR_LEN elements) and
+    # values contains pdNA
+    min_isin_comp = 2
+    ser = Series(data, dtype=dtype)
+    expected = Series(expected, dtype="boolean")
+
+    with monkeypatch.context() as m:
+        m.setattr(algorithms, "_MINIMUM_COMP_ARR_LEN", min_isin_comp)
+        result = ser.isin(values)
+    tm.assert_series_equal(result, expected)
+
+
 def test_isin_complex_numbers():
     # GH 17927
     array = [0, 1j, 1j, 1, 1 + 1j, 1 + 2j, 1 + 1j]
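
The new test reaches the large-series branch with only three elements by temporarily lowering `_MINIMUM_COMP_ARR_LEN` to 2. A minimal sketch of that pattern follows, under stated assumptions: it swaps pytest's `monkeypatch` for the standard library's `unittest.mock.patch.object`, and `algo`, `MIN_LEN`, `chosen_path`, and `path_with_lowered_threshold` are hypothetical names standing in for the pandas module, its threshold, and the size-gated function. The default of 1,000,000 mirrors the ">10**6" figure in the whatsnew entry.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for a module that holds a size threshold.
algo = SimpleNamespace(MIN_LEN=1_000_000)


def chosen_path(comps):
    """Report which branch a size-gated function would take for `comps`."""
    return "large" if len(comps) > algo.MIN_LEN else "small"


def path_with_lowered_threshold(comps, threshold=2):
    """Evaluate `chosen_path` while the threshold is temporarily patched."""
    with mock.patch.object(algo, "MIN_LEN", threshold):
        return chosen_path(comps)


data = [None, False, True]
print(chosen_path(data))                  # threshold untouched
print(path_with_lowered_threshold(data))  # threshold patched to 2
print(algo.MIN_LEN)                       # original value restored on exit
```

Patching the threshold keeps the test data tiny and fast while still exercising the code path that would otherwise need more than a million elements, and the context manager restores the original value once the block exits.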
