Commit 9992d9c

Morgan243 authored and alanbato committed
BUG: Large object array isin
closes pandas-dev#16012

Author: Morgan Stuart <[email protected]>

Closes pandas-dev#16969 from Morgan243/large_array_isin and squashes the following commits:

31cb4b3 [Morgan Stuart] Removed unneeded details from whatsnew description
4b59745 [Morgan Stuart] Linting errors; additional test clarification
186607b [Morgan Stuart] BUG pandas-dev#16012 - fix isin for large object arrays
1 parent 5fee148 commit 9992d9c

3 files changed, +19 -3 lines changed
doc/source/whatsnew/v0.21.0.txt

+3 -2

@@ -178,8 +178,6 @@ Performance Improvements
 Bug Fixes
 ~~~~~~~~~

-- Fixes regression in 0.20, :func:`Series.aggregate` and :func:`DataFrame.aggregate` allow dictionaries as return values again (:issue:`16741`)
-- Fixes bug where indexing with ``np.inf`` caused an ``OverflowError`` to be raised (:issue:`16957`)

 Conversion
 ^^^^^^^^^^
@@ -193,6 +191,7 @@ Indexing
 - Fixes regression in 0.20.3 when indexing with a string on a ``TimedeltaIndex`` (:issue:`16896`).
 - Fixed :func:`TimedeltaIndex.get_loc` handling of ``np.timedelta64`` inputs (:issue:`16909`).
 - Fix :func:`MultiIndex.sort_index` ordering when ``ascending`` argument is a list, but not all levels are specified, or are in a different order (:issue:`16934`).
+- Fixes bug where indexing with ``np.inf`` caused an ``OverflowError`` to be raised (:issue:`16957`)

 I/O
 ^^^
@@ -222,6 +221,8 @@ Sparse
 Reshaping
 ^^^^^^^^^
 - Joining/Merging with a non unique ``PeriodIndex`` raised a TypeError (:issue:`16871`)
+- Bug when using :func:`isin` on a large object series and large comparison array (:issue:`16012`)
+- Fixes regression from 0.20, :func:`Series.aggregate` and :func:`DataFrame.aggregate` allow dictionaries as return values again (:issue:`16741`)


 Numeric

pandas/core/algorithms.py

+4 -1

@@ -402,7 +402,10 @@ def isin(comps, values):
     # work-around for numpy < 1.8 and comparisions on py3
     # faster for larger cases to use np.in1d
     f = lambda x, y: htable.ismember_object(x, values)
-    if (_np_version_under1p8 and compat.PY3) or len(comps) > 1000000:
+    # GH16012
+    # Ensure np.in1d doesn't get object types or it *may* throw an exception
+    if ((_np_version_under1p8 and compat.PY3) or len(comps) > 1000000 and
+       not is_object_dtype(comps)):
         f = lambda x, y: np.in1d(x, y)
     elif is_integer_dtype(comps):
         try:
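
The new condition in the hunk above can be read as a small decision function. The sketch below is only an illustration: the function and parameter names are hypothetical stand-ins for `_np_version_under1p8`, `compat.PY3`, `len(comps)` and `is_object_dtype(comps)`, and it is not the pandas code itself. It also spells out how Python groups the expression: `and` binds tighter than `or`, so the object-dtype check guards only the length branch.

```python
def uses_in1d(np_under_1p8, py3, n_comps, is_object):
    """Hypothetical mirror of the post-fix condition, written as in the diff."""
    return (np_under_1p8 and py3) or n_comps > 1000000 and not is_object


def uses_in1d_explicit(np_under_1p8, py3, n_comps, is_object):
    """Same condition with the implicit grouping made explicit."""
    return (np_under_1p8 and py3) or (n_comps > 1000000 and not is_object)


# Large object input: no longer routed to np.in1d after the fix.
large_object = uses_in1d(False, True, 2000000, True)

# Large non-object input: still takes the np.in1d path.
large_numeric = uses_in1d(False, True, 2000000, False)

# Old-numpy-on-py3 branch: taken regardless of dtype, given the grouping.
old_numpy_object = uses_in1d(True, True, 10, True)
```

Under this reading, the fix covers the case reported in GH16012 (more than 1,000,000 elements of object dtype), while the numpy < 1.8 branch behaves as before.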

pandas/tests/series/test_analytics.py

+12

@@ -1092,6 +1092,18 @@ def test_isin(self):
         expected = Series([True, False, True, False, False, False, True, True])
         assert_series_equal(result, expected)

+        # GH: 16012
+        # This specific issue has to have a series over 1e6 in len, but the
+        # comparison array (in_list) must be large enough so that numpy doesn't
+        # do a manual masking trick that will avoid this issue altogether
+        s = Series(list('abcdefghijk' * 10 ** 5))
+        # If numpy doesn't do the manual comparison/mask, these
+        # unorderable mixed types are what cause the exception in numpy
+        in_list = [-1, 'a', 'b', 'G', 'Y', 'Z', 'E',
+                   'K', 'E', 'S', 'I', 'R', 'R'] * 6
+
+        assert s.isin(in_list).sum() == 200000
+
     def test_isin_with_string_scalar(self):
         # GH4763
         s = Series(['A', 'B', 'C', 'a', 'B', 'B', 'A', 'C'])
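
The test comments attribute the exception to "unorderable mixed types". The following is a pure-Python analogy of that point, not numpy's or pandas' implementation: a membership check that relies on sorting fails on Python 3 when integers and strings are mixed, while a hash-based check needs only equality and hashing. The data is a scaled-down copy of the test input (10 repetitions instead of 10 ** 5), and the variable names are illustrative.

```python
# Scaled-down stand-ins for the Series and comparison list used in the test.
comps = list('abcdefghijk' * 10)
values = [-1, 'a', 'b', 'G', 'Y', 'Z', 'E',
          'K', 'E', 'S', 'I', 'R', 'R'] * 6

# A sort-based approach needs a total order; int and str have none on Python 3.
try:
    sorted(comps + values)
    sort_failed = False
except TypeError:
    sort_failed = True

# A hash-based approach only needs __hash__ and __eq__, so mixed types are fine.
lookup = set(values)
matches = sum(item in lookup for item in comps)
```

Only the lowercase `'a'` and `'b'` appear in both sequences, so the match count is twice the number of repetitions. That is the same reasoning behind the expected `200000` in the test, where the string is repeated `10 ** 5` times.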
