DataFrame .duplicated() / .drop_duplicates() flagging unique rows as duplicated in 0.17.0 #11668
Comments
Here's an easier to copy-paste version:
import pandas as pd
json_data = '{"Col1":{"0":"S2#OaGwWII","1":")A9$rw3W_I","2":"2Ra+*_RWII","3":"2RA`4kRWII","4":"2R=K*_RWII"},' \
'"Col2":{"0":141105144406,"1":141107294517,"2":141106133624,"3":141108219194,"4":141106133614}}'
df = pd.read_json(json_data)
assert not df.duplicated().any()
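For anyone re-running this on a recent pandas: passing a literal JSON string straight to `pd.read_json` is deprecated there, so a minimal sketch of the same check wraps the string in `io.StringIO`. The extra `drop_duplicates()` assertion is an addition for illustration, not part of the snippet above.

```python
import io

import pandas as pd

json_data = (
    '{"Col1":{"0":"S2#OaGwWII","1":")A9$rw3W_I","2":"2Ra+*_RWII",'
    '"3":"2RA`4kRWII","4":"2R=K*_RWII"},'
    '"Col2":{"0":141105144406,"1":141107294517,"2":141106133624,'
    '"3":141108219194,"4":141106133614}}'
)

# Wrap the literal in a file-like object instead of passing the raw string.
df = pd.read_json(io.StringIO(json_data))

# All five rows are distinct, so nothing should be flagged or dropped.
assert not df.duplicated().any()
assert len(df.drop_duplicates()) == len(df)
```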
This was fixed in #11376. It's in 0.17.1, which is being released today (not on PyPI just yet); you can get it via conda, though.
Great, thanks!
I have found that the same problem still happens in 0.17.1 when using large DataFrames and duplicated(keep=False).
Out[]: 0
Out[]: 110
Changing the column order also results in different behavior.
Out[]: 2138
The examples provided by bijanhoule now work correctly, though.
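The inputs that produced the counts above are not shown in this comment. As a rough sketch of how such a report could be cross-checked, the snippet below compares `duplicated(keep=False)` against group sizes from `groupby`, and also against the same frame with its columns swapped. The random frame, its size, and the column names `a` and `b` are assumptions for illustration, not the commenter's data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a large frame with many genuinely repeated (a, b) pairs.
rng = np.random.RandomState(0)
n = 100000
df = pd.DataFrame({
    "a": rng.randint(0, 1000, size=n),
    "b": rng.randint(0, 1000, size=n),
})

# Reference answer: with keep=False, a row is a duplicate exactly when
# its (a, b) group contains more than one row.
group_sizes = df.groupby(["a", "b"])["a"].transform("size")
expected = group_sizes > 1

actual = df.duplicated(keep=False)
swapped = df[["b", "a"]].duplicated(keep=False)

# Number of rows where duplicated() disagrees with the groupby reference.
mismatches = int((expected != actual).sum())
# Number of rows whose flag changes when the column order changes.
order_mismatches = int((actual != swapped).sum())
print(mismatches, order_mismatches)
```

On a release where the behavior is correct, both counts should be zero; a nonzero value in either would point to the kind of discrepancy described above.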
@bijanhoule can you open a separate issue for this with your example, and cross-reference this one?
DataFrame.duplicated() and .drop_duplicates() are flagging rows as duplicates when they are in fact distinct.
This was the smallest dataset I could make to recreate the issue, but I've seen this issue on DataFrames of any size:
It also seems to depend on row order / column order; this behavior can be changed by shuffling / sampling rows or columns, e.g.:
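The original snippet is not reproduced here. A minimal sketch of an order-sensitivity check, assuming the five-row data from the copy-paste comment in this thread, could look like the following; the helper name `duplicate_counts` and the number of shuffles are hypothetical.

```python
import pandas as pd

# Data taken from the copy-paste reproduction in this thread.
df = pd.DataFrame({
    "Col1": ["S2#OaGwWII", ")A9$rw3W_I", "2Ra+*_RWII", "2RA`4kRWII", "2R=K*_RWII"],
    "Col2": [141105144406, 141107294517, 141106133624, 141108219194, 141106133614],
})


def duplicate_counts(frame, n_shuffles=5):
    """Count flagged rows for the frame as given, for several row
    shuffles, and for the frame with its column order reversed."""
    counts = [int(frame.duplicated().sum())]
    for seed in range(n_shuffles):
        # sample(frac=1) returns every row in a shuffled order.
        shuffled = frame.sample(frac=1, random_state=seed)
        counts.append(int(shuffled.duplicated().sum()))
    reversed_cols = frame[frame.columns[::-1]]
    counts.append(int(reversed_cols.duplicated().sum()))
    return counts


counts = duplicate_counts(df)
print(counts)
```

Since every row is distinct, each count should be zero regardless of ordering; counts that vary between orderings would match the behavior reported for 0.17.0.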
I only see this behavior on 0.17.0, while 0.16.2 is fine. More details about each environment are below:
pandas 0.17.0 / python 3.4.3 (failing)
pandas 0.17.0 / python 3.5 (failing)
pandas 0.16.2 / python 3.5 (passing)