DataFrame.duplicated detects duplicates when none exist #11436

Closed
welchr opened this issue Oct 27, 2015 · 3 comments
Labels: Duplicate Report (duplicate issue or pull request), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

welchr commented Oct 27, 2015

Hello,

I'm running into what I think is a bug in DataFrame.duplicated where it detects duplicates, but the data frame does not actually have any duplicated rows. It seems to only happen with integer columns, and somewhat large datasets (>600,000 rows).

I created a test data set to show the issue:

import pandas as pd

# Load the tab-separated test file, forcing both columns to 64-bit integers
df = pd.read_table(
    "https://www.dropbox.com/s/vkw8bzxp290jitz/test.tab?raw=1",
    dtype={"chrom": "int64", "pos": "int64"},
)

If you ask for duplicates, it will detect them:

df.duplicated().any() # returns True

However, there are no duplicates:

In [5]: from collections import Counter

In [6]: counter = Counter(zip(df.chrom,df.pos))

In [7]: counter.most_common(5)
Out[7]:
[((0, 13704091), 1),
 ((0, 201539008), 1),
 ((0, 8573433), 1),
 ((0, 127434927), 1),
 ((0, 247829766), 1)]

If I convert one of the columns to float, and then ask for duplicates, it is correct:

df["pos"] = df["pos"].astype("float64")  # replace the column so the dtype actually changes
df.duplicated().any() # returns False

Strangely, converting the first column chrom to float or string does not seem to matter.

I had a difficult time constructing a data frame that reproduces the problem. It seems to occur only:

  • With at least two columns
  • One column (integer, string, float) that has a small number of unique values
  • One column (must be integer) that has a wide range of values, but in many instances, the values are close to each other
  • The dataset must be somewhat large, but it's hard to pin down. Somewhere on the order of > 500,000 rows, it seems.
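The conditions listed above can be turned into a synthetic frame and cross-checked against a check that does not use `DataFrame.duplicated`. This is only a sketch: the row count, the number of `chrom` values, the gap sizes, and the helper name `has_duplicate_rows` are assumptions chosen to match the description, and it is not guaranteed to trigger the bug on 0.17.0.

```python
import numpy as np
import pandas as pd


def has_duplicate_rows(df, columns):
    """Independent check: compare the number of rows with the number of distinct row tuples."""
    rows = list(zip(*(df[c].tolist() for c in columns)))
    return len(set(rows)) != len(rows)


rng = np.random.RandomState(0)
n_rows = 700000  # assumed size, above the thresholds mentioned in the report

# Column with a small number of unique values
chrom = rng.randint(0, 23, size=n_rows).astype("int64")

# Integer column with a wide range whose values are close together.
# Every gap is at least 1, so the cumulative sum is strictly increasing (all values unique).
pos = np.cumsum(rng.randint(1, 500, size=n_rows)).astype("int64")

synthetic = pd.DataFrame({"chrom": chrom, "pos": pos})

expected = has_duplicate_rows(synthetic, ["chrom", "pos"])
reported = bool(synthetic.duplicated().any())
print("independent check:", expected, "| DataFrame.duplicated:", reported)
```

If the two values disagree, the frame is a candidate reproduction; on a release that includes the fix they should agree.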

From a quick look at the DataFrame.duplicated code, it seems to use a hash table of some kind and to treat integer columns differently from other columns. Perhaps it's ending up with collisions?
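To make the collision idea concrete, here is a generic illustration of how two distinct rows can end up with the same combined key when per-column integers are packed into one 64-bit value. The packing scheme, the `pack_keys` helper, and the `span_b` value are all hypothetical; this is not the pandas implementation, and the thread does not establish that this is the actual cause.

```python
import numpy as np


def pack_keys(a, b, span_b):
    """Hypothetical packing: key = a * span_b + b, computed in int64 (not pandas code)."""
    a = np.asarray(a, dtype=np.int64)
    b = np.asarray(b, dtype=np.int64)
    # int64 array arithmetic wraps around silently on overflow
    return a * np.int64(span_b) + b


a = [0, 4]
b = [5, 5]
span_b = 2**62  # assumed, deliberately huge value range for column b

keys = pack_keys(a, b, span_b)
rows = list(zip(a, b))

distinct_rows = len(set(rows))           # rows compared as tuples
distinct_keys = len(set(keys.tolist()))  # rows compared by packed key
print("distinct rows:", distinct_rows, "| distinct keys:", distinct_keys)
```

Here `4 * 2**62` equals `2**64`, which wraps to 0 in 64-bit arithmetic, so both rows receive the same key even though the rows differ.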

Apologies if I'm missing something obvious here. Please let me know if I can be of any help in investigating further. My pandas version information is below.

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.22.1
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 2.0.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.7
pymysql: None
psycopg2: None
jreback (Contributor) commented Oct 27, 2015

thanks for the report, a dupe of: #11376

this was already fixed here: #11403

and will be in forthcoming 0.17.1 (it's in master now)
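A small sketch for checking whether an installed release is at or after 0.17.1, the version named above as carrying the fix. The helper names `release_tuple` and `includes_fix` are made up for this example, and the check says nothing about whether releases before 0.17.0 had the bug.

```python
import re

FIXED_IN = (0, 17, 1)  # release named in the comment above


def release_tuple(version):
    """Leading numeric part of a version string, e.g. '0.17.1' -> (0, 17, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % (version,))
    return tuple(int(part) for part in match.group(0).split("."))


def includes_fix(version):
    """True if the version is at or after the release that carries the fix."""
    return release_tuple(version) >= FIXED_IN


print(includes_fix("0.17.0"), includes_fix("0.17.1"))
```

In practice the argument would be `pd.__version__`; on a release without the fix, the float cast shown earlier in the thread is the workaround the reporter found.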

jreback closed this as completed Oct 27, 2015
jreback added the Reshaping and Duplicate Report labels Oct 27, 2015
welchr (Author) commented Oct 27, 2015

Awesome, many thanks!


dougomania commented May 4, 2020

I'm having the same issue.

sales_cycle[sales_cycle.duplicated(['phone_number_r'])]

It reports these values in that column as duplicated:
4888726221, 2032240454

These are fake numbers generated with Faker.
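One way to verify a report like this is to list every member of each duplicate group rather than only the later occurrences. By default `duplicated` uses `keep="first"`, so the first occurrence of a repeated value is not flagged and does not appear in the filtered output. The data below is made up, since the real `sales_cycle` frame is not shown, and the sketch does not diagnose this particular case.

```python
import pandas as pd

# Made-up stand-in for the sales_cycle frame
sales_cycle = pd.DataFrame(
    {"phone_number_r": [4888726221, 2032240454, 4888726221, 5550000000]}
)

# Default keep="first": only the second and later occurrences are flagged
later_only = sales_cycle.duplicated(["phone_number_r"])

# keep=False flags every member of a duplicate group,
# so the matching rows can be inspected side by side
all_members = sales_cycle.duplicated(["phone_number_r"], keep=False)
groups = sales_cycle[all_members].sort_values("phone_number_r")

# Independent count per value, without using duplicated()
counts = sales_cycle["phone_number_r"].value_counts()
repeated = counts[counts > 1]

print(groups)
print(repeated)
```

If `repeated` is empty while `duplicated` still flags rows, that would point to a genuine false positive worth reporting with the pandas version and column dtype.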
