DataFrame.duplicated detects duplicates when none exist #11436

welchr · 2015-10-27T01:13:04Z

Hello,

I'm running into what I think is a bug in DataFrame.duplicated where it detects duplicates, but the data frame does not actually have any duplicated rows. It seems to only happen with integer columns, and somewhat large datasets (>600,000 rows).

I created a test data set to show the issue:

df = pd.read_table(
  "https://www.dropbox.com/s/vkw8bzxp290jitz/test.tab?raw=1",
  dtype = {"chrom" : "int64","pos" : "int64"}
)

If you ask for duplicates, it will detect them:

df.duplicated().any() # returns True

However, there are no duplicates:

In [5]: from collections import Counter

In [6]: counter = Counter(zip(df.chrom,df.pos))

In [7]: counter.most_common(5)
Out[7]:
[((0, 13704091), 1),
 ((0, 201539008), 1),
 ((0, 8573433), 1),
 ((0, 127434927), 1),
 ((0, 247829766), 1)]

If I convert one of the columns to float, and then ask for duplicates, it is correct:

df.loc[:,"pos"] = df.pos.astype("float64")
df.duplicated().any() # returns False

Strangely, converting the first column chrom to float or string does not seem to matter.

I had a difficult time in constructing this data frame to illustrate the example. It seems to only occur:

With at least two columns
One column (integer, string, float) that has a small number of unique values
One column (must be integer) that has a wide range of values, but in many instances, the values are close to each other
The dataset must be somewhat large, but it's hard to pin down. Somewhere on the order of > 500,000 rows, it seems.

From looking quickly at the DataFrame.duplicated code, it looks like it is using a hash table of some kind, and using integer columns differently than other columns - perhaps it's ending up with collisions?

Apologies if I'm missing something obvious here. Please let me know if I can be of any help in investigating further. My pandas version information is below.

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.22.1
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 2.0.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.7
pymysql: None
psycopg2: None

The text was updated successfully, but these errors were encountered:

jreback · 2015-10-27T01:16:05Z

thanks for the report, a dupe of: #11376

this was already fixed here: #11403

and will be in forthcoming 0.17.1 (it's in master now)

welchr · 2015-10-27T01:17:47Z

Awesome, many thanks!

dougomania · 2020-05-04T05:04:10Z

Having the same issue

sales_cycle[sales_cycle.duplicated(['phone_number_r'])]

In that column, it says these are duplicated
4888726221,2032240454

These are fake numbers using faker

jreback closed this as completed Oct 27, 2015

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Duplicate Report Duplicate issue or pull request labels Oct 27, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataFrame.duplicated detects duplicates when none exist #11436

DataFrame.duplicated detects duplicates when none exist #11436

welchr commented Oct 27, 2015

jreback commented Oct 27, 2015

welchr commented Oct 27, 2015

dougomania commented May 4, 2020 •

edited

Loading

DataFrame.duplicated detects duplicates when none exist #11436

DataFrame.duplicated detects duplicates when none exist #11436

Comments

welchr commented Oct 27, 2015

jreback commented Oct 27, 2015

welchr commented Oct 27, 2015

dougomania commented May 4, 2020 • edited Loading

dougomania commented May 4, 2020 •

edited

Loading