df.drop_duplicates() not working as expected #32993
Comments
Can you investigate where things are going wrong? Likely somewhere in our factorization / hashtable code, though I'm not sure.
I am interested in checking the issue this weekend.
One thing I should note -- it looks like it wasn't all non-printing characters, but I believe this is specific to \x00.
The problem also happens for the
...but for strings that contain the special character \x00 the function only returns one entry (\x01 and \x01 work, though).
This confirms that the problem probably happens in the hashtable code.
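A minimal way to check this (a sketch added here, not from the original comment; the specific strings are made up) is to call pd.factorize directly on strings that differ only after an embedded \x00. On an affected version (around pandas 1.0.3) the two distinct strings would reportedly come back with the same code:

```python
import pandas as pd

# Two strings that differ only after an embedded null byte.
s = pd.Series(["foo\x00bar", "foo\x00baz"])

# On affected versions the string hashtable reportedly treats these as equal,
# so the codes come back as [0, 0] instead of the expected [0, 1].
codes, uniques = pd.factorize(s)
print(codes)
print(uniques)

# drop_duplicates() goes through the same machinery, so it would
# keep only one of the two rows on an affected version.
print(s.drop_duplicates())
```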
take
I am facing the same issue. A dataframe of 682 rows and 29 columns, with all column data types being either string or int64, drops two unique rows. These two rows are not completely unique: some of their columns are unique while other columns match, but overall the rows are unique and should not be dropped. Another weird thing is that if I check these two rows separately in a dataframe, drop_duplicates does not drop them but retains them. This is really bad because it could potentially drop many more rows for data sizes in the millions. From what I have debugged so far, the problem lies in pandas.core.algorithms.factorize, which is used by the duplicated function in frame.py.
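One way to narrow this down (a sketch with hypothetical column names and values, not taken from the report above) is to compare the factorize-based result of DataFrame.duplicated against a pure-Python comparison on row tuples, which does not go through the string hashtable:

```python
import pandas as pd

# Hypothetical frame: the first two rows differ only after an embedded null byte.
df = pd.DataFrame({
    "key": ["a\x00x", "a\x00y", "b"],
    "val": [1, 1, 2],
})

# Fast path: this is what drop_duplicates() uses internally.
fast = df.duplicated()

# Cross-check: Python-level equality on row tuples does not truncate
# strings at null bytes.
slow = df.apply(tuple, axis=1).duplicated()

# Rows flagged only by the fast path are the spurious "duplicates".
print(df[fast & ~slow])
```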
@aadishms, just curious, any updates?
Thank you to everyone working on Pandas. It's a great library and tool. Now that that's out of the way, I just wanted to confirm that this method still drops rows that are NOT duplicates. It was one of the hardest bugs to pinpoint. Even after looking at my data, I still don't understand why this method would think the rows it dropped were dupes. If any project member is interested in looking at my data, ping me.
Seems like a duplicate of #34551, and that thread is closer to pinpointing a potential fix, so closing in favor of that issue.
Code Sample, a copy-pastable example
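The original copy-pastable sample is not preserved in this thread; the following is a hedged reconstruction based on the problem description below (all strings unique, some pairs differing only after an embedded \x00), so the data itself is hypothetical:

```python
import pandas as pd

# All eight strings are unique, but rows 1, 4, and 6 differ from rows
# 0, 3, and 5 only after an embedded null byte (\x00).
strings = [
    "foo\x00a",   # 0
    "foo\x00b",   # 1
    "bar",        # 2
    "baz\x00a",   # 3
    "baz\x00b",   # 4
    "qux\x00a",   # 5
    "qux\x00b",   # 6
    "quux",       # 7
]

df = pd.DataFrame({"s": strings})

# On pandas 1.0.3 this reportedly keeps only rows 0, 2, 3, 5, and 7,
# even though every row is unique.
print(df.drop_duplicates())
```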
Problem description
If you try running the code above, you see only rows 0, 2, 3, 5, and 7 are retained. However, the actual strings are all unique, and I would have expected drop_duplicates() to retain all rows. It's looking like drop_duplicates() only compares strings up to non-printing characters, then everything else gets ignored in the comparisons.

Also, to note, I discovered this bug since I originally had these strings in a set and I didn't expect drop_duplicates() to do anything, but it did.

I've verified on pandas.__version__ == '1.0.3' (installed via anaconda on linux in a new test environment), and I looked for other existing issues, but didn't find anything that seemed to match (although #11376 seems to be close / in the same vein).

Expected Output
(github markdown is highlighting some of the rows red, for some reason, but this is unrelated to what should be shown.)
Output of pd.show_versions() (ignoring rows with "None")