DataFrame .duplicated() / .drop_duplicates() flagging unique rows as duplicated in 0.17.0 #11668

bijanhoule · 2015-11-20T18:11:39Z

DataFrame.duplicated() and .drop_duplicates() are flagging rows as duplicates when they are in fact distinct.

This was the smallest dataset I could make to recreate the issue, but I've seen this issue on DataFrames of any size:

>>> import pandas as pd

>>> json_data = '{"Col1":{"0":"S2#OaGwWII","1":")A9$rw3W_I","2":"2Ra+*_RWII","3":"2RA`4kRWII","4":"2R=K*_RWII"},' \
            '"Col2":{"0":141105144406,"1":141107294517,"2":141106133624,"3":141108219194,"4":141106133614}}'
>>> df = pd.read_json(json_data)
>>> print(df)

         Col1          Col2
0  S2#OaGwWII  141105144406
1  )A9$rw3W_I  141107294517
2  2Ra+*_RWII  141106133624
3  2RA`4kRWII  141108219194
4  2R=K*_RWII  141106133614

>>> df.duplicated(keep=False)

0    False
1    False
2     True
3    False
4     True
dtype: bool

The result also seems to depend on row order and column order: shuffling or sampling rows, or reordering columns, changes which rows are flagged, e.g.:

>>> df[['Col2', 'Col1']].duplicated(keep=False)
0    False
1    False
2    False
3    False
4    False
dtype: bool
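
One way to confirm that the rows really are distinct, without relying on pandas' own row hashing, is to compare whole rows as plain Python tuples. The following is only a minimal sketch: the helper name reference_duplicated is hypothetical (it is not part of pandas), and it assumes the rows are passed in as an iterable of tuples, which for a DataFrame could come from df.itertuples(index=False).

```python
from collections import Counter


def reference_duplicated(rows, keep="first"):
    """Return a list of bools marking duplicate rows.

    Mirrors the documented meaning of keep='first', keep='last' and
    keep=False, but uses plain tuple equality instead of pandas internals.
    """
    rows = [tuple(r) for r in rows]
    if keep is False:
        # Mark every member of any group that occurs more than once.
        counts = Counter(rows)
        return [counts[r] > 1 for r in rows]
    if keep == "last":
        # Same as keep='first' applied to the reversed sequence.
        return reference_duplicated(rows[::-1], keep="first")[::-1]
    if keep != "first":
        raise ValueError("keep must be 'first', 'last' or False")
    seen = set()
    flags = []
    for r in rows:
        flags.append(r in seen)  # True only for later occurrences
        seen.add(r)
    return flags


# The five rows from the report above.
rows = [
    ("S2#OaGwWII", 141105144406),
    (")A9$rw3W_I", 141107294517),
    ("2Ra+*_RWII", 141106133624),
    ("2RA`4kRWII", 141108219194),
    ("2R=K*_RWII", 141106133614),
]
flags = reference_duplicated(rows, keep=False)

# Swapping the column order must not change which rows are flagged.
swapped_flags = reference_duplicated([(b, a) for a, b in rows], keep=False)

print(flags)
print(flags == swapped_flags)
```

Comparing list(df.duplicated(keep=False)) against this helper's output would show exactly which rows pandas flags that tuple equality does not. The sketch uses ordinary Python equality, so missing values (NaN) may be handled differently than in pandas.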

I only see this behavior on 0.17.0, while 0.16.2 is fine. More details about each environment are below:

pandas 0.17.0 / python 3.4.3 (failing)

>>> pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-400.1.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.17.0
nose: 1.3.4
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.2.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.4.0
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: 0.6.6.None
psycopg2: 2.6 (dt dec pq3 ext)

pandas 0.17.0 / python 3.5 (failing)

>>> pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-348.18.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.0
openpyxl: 1.8.5
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: 0.6.7.None
psycopg2: None

pandas 0.16.2 / python 3.5 (passing)

>>> pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-348.18.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C

pandas: 0.16.2
nose: None
Cython: None
numpy: 1.10.1
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None


bijanhoule · 2015-11-20T18:15:34Z

Here's an easier to copy-paste version:

import pandas as pd

json_data = '{"Col1":{"0":"S2#OaGwWII","1":")A9$rw3W_I","2":"2Ra+*_RWII","3":"2RA`4kRWII","4":"2R=K*_RWII"},' \
            '"Col2":{"0":141105144406,"1":141107294517,"2":141106133624,"3":141108219194,"4":141106133614}}'
df = pd.read_json(json_data)
assert not df.duplicated().any()

jreback · 2015-11-20T18:17:35Z

This was fixed in #11376.

It's in 0.17.1, which is being released today (not on PyPI just yet); you can get it via conda, though:

conda install pandas -c pandas

bijanhoule · 2015-11-20T18:32:18Z

Great, thanks!

capelastegui · 2015-12-18T17:19:45Z

I have found that the same problem still happens in 0.17.1 when using large DataFrames and duplicated(keep=False).

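import numpy as np
import pandas as pd

# The three Series have different lengths, so the shorter columns are padded with NaN.
df = pd.DataFrame({'a': pd.Series(range(1, 100000)),
                   'b': pd.Series(range(10, 1000000)),
                   # list(...) keeps this runnable on Python 3, where 3 * range(...) raises TypeError
                   'c': pd.Series(3 * list(range(2, 200000, 2)))})
df.head()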

np.sum(df.duplicated())

Out[]: 0

np.sum(df.duplicated(keep=False))

Out[]: 110

Changing column order also results in different behavior.

np.sum(df[['c','b','a']].duplicated(keep=False))

Out[]: 2138
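
The counts above cannot all be right: if the default duplicated() flags no rows, then duplicated(keep=False) must flag none either, and column order should not matter. A rough sketch of that consistency check follows. The function name check_duplicated_consistency is hypothetical, and it assumes the two flag sequences come from df.duplicated() and df.duplicated(keep=False) on the same frame; the flag lists in the demo are synthetic, not taken from the report.

```python
def check_duplicated_consistency(first_flags, all_flags):
    """Check invariants between keep='first' flags and keep=False flags.

    Returns a list of human-readable problems; an empty list means the
    two results are consistent with each other.
    """
    first_flags = [bool(f) for f in first_flags]
    all_flags = [bool(f) for f in all_flags]
    if len(first_flags) != len(all_flags):
        return ["flag sequences differ in length"]

    problems = []
    # A row flagged as a later occurrence belongs to a duplicate group,
    # so it must also be flagged when the whole group is marked.
    for i, (f, a) in enumerate(zip(first_flags, all_flags)):
        if f and not a:
            problems.append("row %d flagged by keep='first' but not by keep=False" % i)

    n_first, n_all = sum(first_flags), sum(all_flags)
    # A duplicate group of size k adds k - 1 to n_first and k to n_all.
    # Hence n_first == 0 forces n_all == 0, and otherwise
    # n_first + 1 <= n_all <= 2 * n_first.
    if n_first == 0 and n_all != 0:
        problems.append("keep='first' flags 0 rows but keep=False flags %d" % n_all)
    elif n_first > 0 and not (n_first + 1 <= n_all <= 2 * n_first):
        problems.append("counts out of range: first=%d, all=%d" % (n_first, n_all))
    return problems


# Synthetic flags shaped like the report: nothing flagged by default,
# yet some rows flagged with keep=False.
inconsistent = check_duplicated_consistency(
    [False, False, False, False], [False, True, False, True])

# Synthetic flags for one genuine duplicate pair in rows 0 and 1.
consistent = check_duplicated_consistency(
    [False, True, False], [True, True, False])

print(inconsistent)
print(consistent)
```

Running the same check on df[['c', 'b', 'a']] and comparing the counts with those for the original column order would cover the column-order symptom as well.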

>>> pd.util.print_versions.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.17.1
nose: None
pip: 7.1.2
setuptools: 18.5
Cython: None
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: 4.0.0
sphinx: 1.3.3
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
Jinja2: None

The examples provided by bijanhoule now work correctly, though.

jreback · 2015-12-18T18:14:37Z

@bijanhoule can you make a separate issue of this, with your example and xref this one.

jreback closed this as completed Nov 20, 2015

jreback added the Duplicate Report label Nov 20, 2015

capelastegui mentioned this issue Dec 18, 2015

DataFrame .duplicated() / .drop_duplicates() flagging unique rows as duplicated in 0.17.1 #11864

Closed
