DataFrame.duplicated detects duplicates when none exist #11436
Labels: Duplicate Report, Reshaping
Hello,
I'm running into what I think is a bug in DataFrame.duplicated where it detects duplicates, but the data frame does not actually have any duplicated rows. It seems to only happen with integer columns, and somewhat large datasets (>600,000 rows).
I created a test data set to show the issue:
If you ask for duplicates, it will detect them:
However, there are no duplicates:
If I convert one of the columns to float, and then ask for duplicates, it is correct:
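The check described above (ask for duplicates, confirm independently that none exist, then retry with one column cast to float) can be sketched roughly as follows. This is not the original test data set: the data here is small, synthetic stand-in data, and the column names `start` and `end` are assumptions for illustration (only `chrom` comes from the report). It shows the shape of the comparison, not a reproduction of the problem.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: every (chrom, start, end) row is unique.
# Only `chrom` is a column name from the report; `start`/`end` are assumed.
n = 1_000
df = pd.DataFrame({
    "chrom": np.repeat(np.arange(1, 11), n // 10),
    "start": np.arange(n),
    "end": np.arange(n) + 100,
})

# What DataFrame.duplicated reports for the all-integer frame.
n_flagged = int(df.duplicated().sum())

# Independent check: count rows per distinct combination of values.
group_sizes = df.groupby(list(df.columns)).size()
n_truly_duplicated = int((group_sizes > 1).sum())

# The same question after casting one integer column to float.
df_float = df.assign(start=df["start"].astype(float))
n_flagged_float = int(df_float.duplicated().sum())

print(n_flagged, n_truly_duplicated, n_flagged_float)
```

If the three counts disagree on a given data set, that is the discrepancy being reported here.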
Strangely, converting the first column `chrom` to float or string does not seem to matter.

I had a difficult time constructing this data frame to illustrate the example. It seems to only occur:
From a quick look at the `DataFrame.duplicated` code, it appears to use a hash table of some kind and to handle integer columns differently from other columns; perhaps it is ending up with collisions?

Apologies if I'm missing something obvious here. Please let me know if I can be of any help in investigating further. My pandas version information is below.
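As a toy illustration of the collision guess above, here is how packing several integer columns into a single int64 key could, in principle, map distinct rows to the same key once the arithmetic overflows. The packing scheme `key = a * size_b + b` and the sizes used are assumptions chosen to make the effect visible; this is not taken from the pandas source and may not be what `DataFrame.duplicated` does.

```python
import numpy as np

# Toy scheme (an assumption, not pandas' code): pack two integer columns
# into one int64 key as key = a * size_b + b.
size_b = np.int64(2**33)
a = np.array([0, 2**31], dtype=np.int64)
b = np.array([5, 5], dtype=np.int64)

# 2**31 * 2**33 == 2**64, which wraps around to 0 in int64 array arithmetic,
# so both rows end up with the same key.
keys = a * size_b + b

rows_distinct = (a[0], b[0]) != (a[1], b[1])
keys_collide = keys[0] == keys[1]
print(rows_distinct, keys_collide)
```

Two rows that differ in the first column end up with identical keys, so a duplicate check based only on such keys would flag one of them.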