drop_duplicates destroys non-duplicated data under 0.17 #11376
Comments
pls
cc @sinhrks
In the same area, and maybe connected(?), I ran into a Python 2 case where the .duplicated() method applied to a DataFrame returned rows that were NOT duplicates. The example is too large to paste here, but I wanted to mention it in case it's the same bug under the hood.
Able to reproduce in master (PY3).
I see the loss of 6,6 under both 2.7.6 and 3.5.0 with 0.17.0; that is, I see no 2 vs. 3 difference in this example. (My notebook has a somewhat rare 32-bit Python Linux environment, so problems sometimes manifest differently.)
I'm not convinced that the comment
Anyway, one route to a quick fix would be to push everything along the factorize branch, IIUC.
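For illustration, a minimal sketch of what "take the factorize path everywhere" could mean (a hypothetical helper, not the actual pandas internals): factorize maps the values to dense codes 0..n-1, so a table sized by the number of uniques is always safe to index with those codes.

import numpy as np
import pandas as pd

def duplicated_via_factorize(values):
    # Hypothetical helper: factorize returns dense codes 0..n-1 plus the uniques.
    codes, uniques = pd.factorize(values)
    seen = np.zeros(len(uniques), dtype=bool)
    out = np.empty(len(values), dtype=bool)
    for i, c in enumerate(codes):
        out[i] = seen[c]   # True once this code has been seen before
        seen[c] = True
    return out

print(duplicated_via_factorize(np.array([5, 8, 11, 5])))
# [False False False  True]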
I can't track it down far enough yet, but the part below looks like it breaks the factorized labels. @behzadnouri Any idea?
Just a 'me too'. I'm running into this problem under Python 2.7.10 now that I've upgraded to pandas 0.17. I can also reproduce it using the example given in the StackOverflow post: it drops the duplicated 3,5 row (correct) but also the 6,6 row (incorrect).
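The frame from the StackOverflow post is not reproduced in this extract; a minimal hypothetical frame with the same pattern (one genuine 3,5 duplicate plus a unique 6,6 row) looks like this:

import pandas as pd

# Hypothetical data, not the exact frame from the post.
df = pd.DataFrame({'a': [1, 2, 3, 3, 6], 'b': [2, 4, 5, 5, 6]})

print(df.drop_duplicates())
# Expected: only the repeated (3, 5) row is removed and (6, 6) is kept;
# the report is that the affected 0.17.0 builds drop (6, 6) as well.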
If all of the integers are non-negative, we can index directly with integers, but the shape should be the largest integer present + 1, rather than the number of unique integers.
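To illustrate the sizing point (hypothetical code, not the pandas implementation): with sparse non-negative values such as 5, 8, 11, a table sized by the number of unique values is too small to be indexed by the values themselves.

import numpy as np

values = np.array([5, 8, 11, 5])

# Too small for direct indexing: only 3 slots, but the values go up to 11.
# table = np.zeros(len(np.unique(values)), dtype=bool)

# Correct size for direct indexing with the raw values:
table = np.zeros(values.max() + 1, dtype=bool)
table[values[0]] = True
print(table[5])  # True: value 5 has been marked as seen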
This is broken by #10917:
>>> a
array([ 5, 8, 11])
>>> unique1d(a)
array([ 5, 8, 11])
>>> factorize(a)[0]
array([0, 1, 2])
👍 on this. I'm seeing a similar problem; the following should return an empty data frame:
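The commenter's frame is not included in this extract; a hypothetical frame with no repeated rows shows what "should return an empty data frame" means:

import pandas as pd

# Hypothetical data: no row appears twice.
df = pd.DataFrame({'a': [5, 8, 11], 'b': [1, 2, 3]})

# Selecting the rows flagged as duplicates should give an empty frame;
# the report is that the buggy duplicated() can flag unique rows.
print(df[df.duplicated()])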
Closed by #11403.
The drop_duplicates() function in Python 3 is broken. Take the following example snippet: when run under Python 2 the results are correct, but when run under Python 3, pandas removes 6,6 from the frame, which is a completely unique row. When using this function with large CSV files, it causes thousands of lines of unique data loss. See:
http://stackoverflow.com/questions/33224356/why-is-pandas-dropping-unique-rows