BUG: Fixed TypeError for Series.isin() when large series and values contains NA (#60678) #60736

Merged 6 commits on Jan 22, 2025
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -804,6 +804,7 @@ Other
- Bug in :meth:`Index.sort_values` when passing a key function that turns values into tuples, e.g. ``key=natsort.natsort_key``, would raise ``TypeError`` (:issue:`56081`)
- Bug in :meth:`Series.diff` allowing non-integer values for the ``periods`` argument. (:issue:`56607`)
- Bug in :meth:`Series.dt` methods in :class:`ArrowDtype` that were returning incorrect values. (:issue:`57355`)
- Bug in :meth:`Series.isin` raising ``TypeError`` when series is large (>10**6) and ``values`` contains NA (:issue:`60678`)
- Bug in :meth:`Series.rank` that doesn't preserve missing values for nullable integers when ``na_option='keep'``. (:issue:`56976`)
- Bug in :meth:`Series.replace` and :meth:`DataFrame.replace` inconsistently replacing matching instances when ``regex=True`` and missing values are present. (:issue:`56599`)
- Bug in :meth:`Series.replace` and :meth:`DataFrame.replace` throwing ``ValueError`` when ``regex=True`` and all NA values. (:issue:`60688`)
6 changes: 6 additions & 0 deletions pandas/core/algorithms.py
@@ -23,6 +23,7 @@
    iNaT,
    lib,
)
from pandas._libs.missing import NA
from pandas._typing import (
    AnyArrayLike,
    ArrayLike,
@@ -544,10 +545,15 @@ def isin(comps: ListLike, values: ListLike) -> npt.NDArray[np.bool_]:
    # Ensure np.isin doesn't get object types or it *may* throw an exception
    # Albeit hashmap has O(1) look-up (vs. O(logn) in sorted array),
    # isin is faster for small sizes

    # GH60678
    # Ensure values don't contain <NA>, otherwise it throws exception with np.in1d

    if (
        len(comps_array) > _MINIMUM_COMP_ARR_LEN
        and len(values) <= 26
        and comps_array.dtype != object
        and not any(v is NA for v in values)
    ):
        # If the values include nan we need to check for nan explicitly
        # since np.nan is not equal to np.nan
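The new guard tests each element with `is` rather than `==`. The sketch below illustrates why that matters, using a hypothetical stand-in class (`_FakeNA`, not part of pandas) that mimics two behaviours of `pandas.NA`: comparisons return the missing value itself, and taking its truth value raises `TypeError`. The helper names are also made up for illustration.

```python
class _FakeNA:
    """Hypothetical stand-in that mimics two behaviours of pandas.NA."""

    def __eq__(self, other):
        # Comparisons propagate the missing value instead of returning a bool.
        return self

    def __bool__(self):
        # The truth value of a missing value is ambiguous.
        raise TypeError("boolean value of NA is ambiguous")

    # Defining __eq__ disables hashing by default; restore identity hashing.
    __hash__ = object.__hash__


FAKE_NA = _FakeNA()


def contains_na(values):
    """Identity check, the same shape as the guard added in the diff."""
    return any(v is FAKE_NA for v in values)


def contains_na_by_equality(values):
    """Naive variant: any() must take bool() of each comparison result."""
    return any(v == FAKE_NA for v in values)


print(contains_na([1, FAKE_NA]))  # identity check never calls __eq__ or __bool__

try:
    contains_na_by_equality([1, FAKE_NA])
except TypeError as exc:
    print("equality-based check failed:", exc)
```

An identity check is safe here because NA is a singleton, so the guard can run on arbitrary `values` without triggering the very exception it is meant to avoid.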
24 changes: 24 additions & 0 deletions pandas/tests/series/methods/test_isin.py
@@ -211,6 +211,30 @@ def test_isin_large_series_mixed_dtypes_and_nan(monkeypatch):
    tm.assert_series_equal(result, expected)


@pytest.mark.parametrize(
    "dtype, data, values, expected",
    [
        ("boolean", [pd.NA, False, True], [False, pd.NA], [True, True, False]),
        ("Int64", [pd.NA, 2, 1], [1, pd.NA], [True, False, True]),
        ("boolean", [pd.NA, False, True], [pd.NA, True, "a", 20], [True, False, True]),
        ("boolean", [pd.NA, False, True], [], [False, False, False]),
        ("Float64", [20.0, 30.0, pd.NA], [pd.NA], [False, False, True]),
    ],
)
def test_isin_large_series_and_pdNA(dtype, data, values, expected, monkeypatch):
    # https://github.com/pandas-dev/pandas/issues/60678
    # combination of large series (> _MINIMUM_COMP_ARR_LEN elements) and
    # values contains pdNA
    min_isin_comp = 2
    ser = Series(data, dtype=dtype)
    expected = Series(expected, dtype="boolean")

    with monkeypatch.context() as m:
        m.setattr(algorithms, "_MINIMUM_COMP_ARR_LEN", min_isin_comp)
        result = ser.isin(values)
    tm.assert_series_equal(result, expected)


def test_isin_complex_numbers():
    # GH 17927
    array = [0, 1j, 1j, 1, 1 + 1j, 1 + 2j, 1 + 1j]
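As a rough usage sketch, the behaviour the new test pins down can be seen on a small nullable-integer Series, reusing the `Int64` case from the parametrized data above. This does not reproduce the original failure, which according to the changelog entry only occurred for series larger than 10**6 elements; it only shows the result the fixed code path is expected to match.

```python
import pandas as pd

# Same data as the "Int64" case in the parametrized test.
ser = pd.Series([pd.NA, 2, 1], dtype="Int64")

# pd.NA in `values` matches the missing entry; 1 matches the last entry.
result = ser.isin([1, pd.NA])

print(result.tolist())
print(result.dtype)
```

The test in the diff avoids building a million-element Series by monkeypatching `_MINIMUM_COMP_ARR_LEN` down to 2, so the large-series branch is exercised with three-element inputs.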