drop_duplicates() is dropping more than just duplicates in 0.17.0 #11512

Closed
andersonjacob opened this issue Nov 3, 2015 · 1 comment
Labels
Duplicate Report (duplicate issue or pull request), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@andersonjacob

When I upgraded from 0.16.2 to 0.17.0, I was met with a nasty surprise when dropping duplicates. DataFrame.drop_duplicates() no longer behaves the way it did in the previous version. I have a DataFrame df on which I run:

test_ids = df['test_id'].unique()
print('N test ids: {}'.format(test_ids.shape))
print('N tests: {}'.format(df[['test_id', <some other columns>]].drop_duplicates().shape))

the output is:

N test ids: (341334,)
N tests: (237426, 10)

when I run the same in 0.16.2 the output is:

N test ids: (341334,)
N tests: (341334, 10)

I don't think it should be possible to get fewer rows than the number of unique values in one of the selected columns.
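The expectation above can be written as a small sanity check. This is only a sketch on made-up data: the column name other and the random values are hypothetical stand-ins for the real frame, and such a small example is not claimed to reproduce the 0.17.0 behaviour. It compares drop_duplicates() against an independent count of distinct row tuples.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the real frame from the report is not available.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "test_id": rng.randint(0, 1000, size=5000),
    "other": rng.randint(0, 5, size=5000),  # placeholder for the other columns
})

cols = ["test_id", "other"]

# Number of distinct values in the single column.
n_unique_ids = df["test_id"].nunique()

# Number of rows left after deduplicating on the selected columns.
n_unique_rows = len(df[cols].drop_duplicates())

# Independent cross-check that does not rely on drop_duplicates().
n_unique_tuples = len(set(map(tuple, df[cols].values)))

# Deduplicating a column set that contains test_id can never
# produce fewer rows than there are distinct test_id values.
assert n_unique_rows >= n_unique_ids
assert n_unique_rows == n_unique_tuples
```

If either assertion fails on the real data, drop_duplicates() is removing rows that are not duplicates.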

@TomAugspurger
Contributor

Sorry about that.

Probably a dupe of #11376, which was fixed in #11403 and is now in master.

If you're able, could you build master and check whether it's fixed for you? Thanks.

@TomAugspurger added the Duplicate Report and Reshaping labels Nov 3, 2015