DataFrame .duplicated() / .drop_duplicates() flagging unique rows as duplicated in 0.17.0 #11668

Closed
bijanhoule opened this issue Nov 20, 2015 · 5 comments

@bijanhoule

DataFrame.duplicated() and .drop_duplicates() are flagging rows as duplicates when they are in fact distinct.

This was the smallest dataset I could make to recreate the issue, but I've seen this issue on DataFrames of any size:

>>> import pandas as pd

>>> json_data = '{"Col1":{"0":"S2#OaGwWII","1":")A9$rw3W_I","2":"2Ra+*_RWII","3":"2RA`4kRWII","4":"2R=K*_RWII"},' \
            '"Col2":{"0":141105144406,"1":141107294517,"2":141106133624,"3":141108219194,"4":141106133614}}'
>>> df = pd.read_json(json_data)
>>> print(df)

         Col1          Col2
0  S2#OaGwWII  141105144406
1  )A9$rw3W_I  141107294517
2  2Ra+*_RWII  141106133624
3  2RA`4kRWII  141108219194
4  2R=K*_RWII  141106133614

>>> df.duplicated(keep=False)

0    False
1    False
2     True
3    False
4     True
dtype: bool

It also seems to depend on row order / column order; this behavior can be changed by shuffling / sampling rows or columns, e.g.:

>>> df[['Col2', 'Col1']].duplicated(keep=False)
0    False
1    False
2    False
3    False
4    False
dtype: bool

I only see this behavior on 0.17.0, while 0.16.2 is fine. More details about each environment are below:

pandas 0.17.0 / python 3.4.3 (failing)

>>> pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-400.1.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.17.0
nose: 1.3.4
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.2.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.4.0
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: 0.6.6.None
psycopg2: 2.6 (dt dec pq3 ext)

pandas 0.17.0 / python 3.5 (failing)

>>> pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-348.18.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.0
openpyxl: 1.8.5
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: 0.6.7.None
psycopg2: None

pandas 0.16.2 / python 3.5 (passing)

>>> pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-348.18.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C

pandas: 0.16.2
nose: None
Cython: None
numpy: 1.10.1
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
@bijanhoule
Author

Here's an easier to copy-paste version:

import pandas as pd

json_data = '{"Col1":{"0":"S2#OaGwWII","1":")A9$rw3W_I","2":"2Ra+*_RWII","3":"2RA`4kRWII","4":"2R=K*_RWII"},' \
            '"Col2":{"0":141105144406,"1":141107294517,"2":141106133624,"3":141108219194,"4":141106133614}}'
df = pd.read_json(json_data)
assert not df.duplicated().any()
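
One way to cross-check the result without relying on the hashing path inside duplicated() is to count rows as plain Python tuples. This is only a minimal sketch: the helper name duplicated_reference is hypothetical (it is not part of pandas), it is meant for small frames, and rows containing NaN may not compare equal in it the way pandas treats them.

```python
from collections import Counter

import pandas as pd

# Same data as the report, built directly instead of via read_json.
df = pd.DataFrame({
    "Col1": ["S2#OaGwWII", ")A9$rw3W_I", "2Ra+*_RWII", "2RA`4kRWII", "2R=K*_RWII"],
    "Col2": [141105144406, 141107294517, 141106133624, 141108219194, 141106133614],
})


def duplicated_reference(frame):
    """Hypothetical helper: flag every row that occurs more than once.

    Mirrors the intent of duplicated(keep=False) using plain tuple
    equality, so it does not depend on pandas' internal row hashing.
    """
    rows = list(frame.itertuples(index=False, name=None))
    counts = Counter(rows)
    return pd.Series([counts[row] > 1 for row in rows], index=frame.index)


reference = duplicated_reference(df)
print(reference)
# On an affected version, this comparison would show the mismatched rows.
print(df.duplicated(keep=False) == reference)
```

If the two results disagree on any row, that points at pandas rather than at the data.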

@jreback
Contributor

jreback commented Nov 20, 2015

This was fixed in #11376.

It's in 0.17.1, which is being released today (not on PyPI just yet); you can get it via conda, though:

conda install pandas -c pandas

@jreback jreback closed this as completed Nov 20, 2015
@jreback jreback added the Duplicate Report Duplicate issue or pull request label Nov 20, 2015
@bijanhoule
Author

Great, thanks!

@capelastegui

I have found that the same problem still happens in 0.17.1 when using large DataFrames with duplicated(keep=False).

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': pd.Series(range(1, 100000)),
                   'b': pd.Series(range(10, 1000000)),
                   # list(...) so the repetition also works on Python 3
                   'c': pd.Series(3 * list(range(2, 200000, 2)))})
df.head()

np.sum(df.duplicated())

Out[]: 0

np.sum(df.duplicated(keep=False))

Out[]: 110

Changing column order also results in different behavior.

np.sum(df[['c','b','a']].duplicated(keep=False))

Out[]: 2138

>>> pd.util.print_versions.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.17.1
nose: None
pip: 7.1.2
setuptools: 18.5
Cython: None
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: 4.0.0
sphinx: 1.3.3
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
Jinja2: None

The examples provided by bijanhoule now work correctly, though.
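
The mismatch above (0 rows flagged by duplicated(), but 110 by duplicated(keep=False)) suggests a simple consistency check: a row flagged with keep=False should be exactly a row flagged by keep='first' or by keep='last'. The sketch below assumes a pandas version that accepts the keep argument (0.17.0 or later); the function name keep_false_is_consistent and the small frame are illustrative only.

```python
import pandas as pd


def keep_false_is_consistent(frame):
    """Hypothetical helper: check duplicated(keep=False) against the
    union of duplicated(keep='first') and duplicated(keep='last')."""
    first = frame.duplicated(keep="first")
    last = frame.duplicated(keep="last")
    both = frame.duplicated(keep=False)
    return bool((both == (first | last)).all())


# Small illustrative frame: rows 0 and 2 are identical.
small = pd.DataFrame({"a": [1, 2, 1, 3], "b": ["x", "y", "x", "z"]})
print(keep_false_is_consistent(small))
```

Running the same function on the large frame from this comment should return False on an affected version, since keep=False flags rows that neither of the other two modes does.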

@jreback
Contributor

jreback commented Dec 18, 2015

@bijanhoule can you open a separate issue for this with your example, and cross-reference this one?
