Commit b431f85

evanpw authored and jreback committed
BUG: Spurious matches in DataFrame.duplicated when keep=False, pandas-dev#11864
1 parent 6132df0 commit b431f85

3 files changed, 11 additions and 1 deletion


doc/source/whatsnew/v0.18.0.txt

+2
@@ -468,9 +468,11 @@ Bug Fixes
 - Bug in ``to_numeric`` where it does not raise if input is more than one dimension (:issue:`11776`)
 
 - Bug in parsing timezone offset strings with non-zero minutes (:issue:`11708`)
+
 - Bug in ``df.plot`` using incorrect colors for bar plots under matplotlib 1.5+ (:issue:`11614`)
 - Bug in the ``groupby`` ``plot`` method when using keyword arguments (:issue:`11805`).
 
+- Bug in ``DataFrame.duplicated`` and ``drop_duplicates`` causing spurious matches when setting ``keep=False`` (:issue:`11864`)
 
 - Bug in ``.loc`` result with duplicated key may have ``Index`` with incorrect dtype (:issue:`11497`)
 - Bug in ``pd.rolling_median`` where memory allocation failed even with sufficient memory (:issue:`11696`)

pandas/hashtable.pyx

+2 -1
@@ -1067,7 +1067,8 @@ def mode_int64(int64_t[:] values):
 @cython.boundscheck(False)
 def duplicated_int64(ndarray[int64_t, ndim=1] values, object keep='first'):
     cdef:
-        int ret = 0, value, k
+        int ret = 0, k
+        int64_t value
         Py_ssize_t i, n = len(values)
         kh_int64_t * table = kh_init_int64()
         ndarray[uint8_t, ndim=1, cast=True] out = np.empty(n, dtype='bool')
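
Judging from the diff, the fix is that `value` is now declared as `int64_t` instead of a plain C `int`. The sketch below is a pure-Python illustration, not the pandas implementation, of why a 32-bit temporary could plausibly produce spurious matches: two distinct 64-bit keys that share their low 32 bits become indistinguishable once narrowed. The helper names `to_int32` and `duplicated_keep_false` are hypothetical, and the assumption that a C `int` is 32 bits wide holds on common platforms but is not guaranteed by the C standard.

```python
def to_int32(v):
    """Wrap a Python int to a signed 32-bit value (assumed width of C int)."""
    v &= 0xFFFFFFFF
    return v - 0x100000000 if v >= 0x80000000 else v


def duplicated_keep_false(keys, narrow=False):
    """Flag every key that occurs more than once (keep=False semantics).

    With narrow=True each key is first squeezed through a 32-bit
    temporary, mimicking the declaration that the commit removes.
    """
    first_seen = {}            # key -> index of its first occurrence
    out = [False] * len(keys)
    for i, key in enumerate(keys):
        value = to_int32(key) if narrow else key
        if value in first_seen:
            out[first_seen[value]] = True   # mark the earlier occurrence
            out[i] = True                   # mark the current one
        else:
            first_seen[value] = i
    return out


keys = [0, 2**32]  # distinct as 64-bit values, identical in the low 32 bits
print(duplicated_keep_false(keys, narrow=True))   # [True, True]
print(duplicated_keep_false(keys, narrow=False))  # [False, False]
```

With the narrowed temporary both keys are reported as duplicates of each other; with the full-width value neither is.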

pandas/tests/test_frame.py

+7
@@ -8532,6 +8532,13 @@ def test_drop_duplicates(self):
         df = pd.DataFrame([[-x, x], [x, x + 4]])
         assert_frame_equal(df.drop_duplicates(), df)
 
+        # GH 11864
+        df = pd.DataFrame([i] * 9 for i in range(16))
+        df = df.append([[1] + [0] * 8], ignore_index=True)
+
+        for keep in ['first', 'last', False]:
+            assert_equal(df.duplicated(keep=keep).sum(), 0)
+
     def test_drop_duplicates_for_take_all(self):
         df = DataFrame({'AAA': ['foo', 'bar', 'baz', 'bar',
                                 'foo', 'bar', 'qux', 'foo'],
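
The shape of the new test frame (9 columns with 16 distinct values each, plus one extra row `[1, 0, ..., 0]`) looks chosen so that the combined row key exceeds 32 bits. The following sketch makes that arithmetic concrete. It assumes, purely for illustration, that each row is packed into a single integer as a base-16 number with the first column most significant; the real pandas key layout may differ, and `pack_row` is a hypothetical helper.

```python
N_LEVELS = 16  # distinct values per column in the test frame
N_COLS = 9     # number of columns in the test frame


def pack_row(row):
    """Pack per-column codes into one integer (assumed mixed-radix layout)."""
    key = 0
    for code in row:
        key = key * N_LEVELS + code
    return key


# Same rows as the test: [i] * 9 for i in range(16), then [1, 0, ..., 0].
rows = [[i] * N_COLS for i in range(16)] + [[1] + [0] * 8]
keys = [pack_row(r) for r in rows]
low32 = [k & 0xFFFFFFFF for k in keys]  # what survives a 32-bit truncation

print(len(set(keys)))      # 17: every row is unique at full width
print(keys[-1] == 2**32)   # True: the extra row packs to exactly 2**32
print(len(set(low32)))     # 16: one collision after truncation
print(low32[0], low32[-1]) # 0 0: the extra row collides with the all-zero row
```

Under this assumed layout, no row is a genuine duplicate, yet the appended row and the all-zero row share their low 32 bits, which matches the test's expectation that `duplicated` reports zero duplicates for every `keep` option once the key is handled at full 64-bit width.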
