
Method/option for deduplicating index, a la drop_duplicates #2825


Closed
wesm opened this issue Feb 9, 2013 · 10 comments

@wesm
Member

wesm commented Feb 9, 2013

see e.g. #2763

@patricktokeeffe
Contributor

The workarounds I've found for emulating DataFrame.drop_duplicates() on the index all use df.groupby(level=0) with .first() or .last(). Though easily rationalized, this is not intuitive. Also, IIUC, this does not perform the operation in place, which could be a problem for very large data sets.

Maybe drop_duplicates could have a new boolean parameter that shortcuts like so:

DataFrame.drop_duplicates(use_index=True) equals df.groupby(level=0).first()
DataFrame.drop_duplicates(use_index=True, take_last=True) equals df.groupby(level=0).last()

Of course the inplace parameter should still work. And if col is provided while use_index is True, then the index should be treated as just another column.

I'm wholly ignorant of what implementation barriers exist; this is just speculation about how it could be done.
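
For illustration, here is a minimal sketch of the groupby workaround described above (the data and column names are made up, and use_index/take_last are only the proposed parameters, not anything that exists in pandas):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 2),
                  index=list('aabbcc'), columns=['x', 'y'])

# keep the first row for each index label
# (roughly what drop_duplicates(use_index=True) would do)
first = df.groupby(level=0).first()

# keep the last row for each index label
# (roughly what drop_duplicates(use_index=True, take_last=True) would do)
last = df.groupby(level=0).last()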

@sinhrks
Member

sinhrks commented Aug 11, 2015

Now we can do this as shown below. We should document it somewhere.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 5), index=list('abcdeafagh'))
df[df.index.duplicated()]
#           0         1         2         3         4
# a -0.224246 -0.693000  0.424868 -0.384844  0.512322
# a -0.779163 -0.556674  0.102416 -0.761115  0.438872

df[~df.index.duplicated()]
#           0         1         2         3         4
# a  0.918487  0.438925  0.761677 -1.299548 -0.277456
# b  1.251101 -0.361700  3.210220 -0.778130 -0.351071
# c  0.768538  0.718754  0.565361  0.532387  0.797216
# d -0.147877  0.468789 -0.486800 -0.634428  0.502015
# e  0.331038  0.301599 -2.451796 -1.350511 -1.629918
# f -1.723805  0.714659 -0.219288  1.286036 -0.034969
# g -0.448491  0.874354  0.686063 -0.491885 -1.652858
# h -0.437542  1.518854  0.361197  2.423748 -0.425337

df[~df.index.duplicated(keep=False)]
#           0         1         2         3         4
# b  1.251101 -0.361700  3.210220 -0.778130 -0.351071
# c  0.768538  0.718754  0.565361  0.532387  0.797216
# d -0.147877  0.468789 -0.486800 -0.634428  0.502015
# e  0.331038  0.301599 -2.451796 -1.350511 -1.629918
# f -1.723805  0.714659 -0.219288  1.286036 -0.034969
# g -0.448491  0.874354  0.686063 -0.491885 -1.652858
# h -0.437542  1.518854  0.361197  2.423748 -0.425337
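
For completeness, a small sketch of keeping the last occurrence instead of the first (this assumes the keep argument added to Index.duplicated in 0.17 via #10236):

df[~df.index.duplicated(keep='last')]
# same result shape as above, but for label 'a' the last row is retained
# instead of the first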

@bilderbuchi

df[df.index.duplicated()]

Seeing how the index has 'a' occurring 3 times, I have to say I would expect this statement to give 3 results.

@patricktokeeffe
Contributor

I hadn't noticed, but I totally agree. Labeling the two later occurrences as duplicates is arbitrary.


@sinhrks
Member

sinhrks commented Aug 11, 2015

@bilderbuchi You can specify the behavior with the keep option; see #10236.

df[df.index.duplicated(keep=False)]
#           0         1         2         3         4
# a  0.791988 -0.127854  0.308921 -0.360801 -0.202838
# a -1.076777  1.027121  0.665178  0.115625  0.381496
# a  0.478709 -0.487886 -1.643373 -1.430937 -1.386072

df[~df.index.duplicated(keep=False)]
#           0         1         2         3         4
# b -1.773957  1.150821 -1.223845 -1.463176 -0.832168
# c -0.934962 -0.002800  0.102012  2.207798 -0.528325
# d  1.061423  1.362033  2.487682 -0.709634 -0.683065
# e  0.166217  0.156525 -0.535327  1.128811  0.120434
# f -0.542128  0.108480  0.572719  1.358334  0.159375
# g  0.264287 -0.188011  1.264308 -0.371283  0.089278
# h  0.059987 -0.604473 -1.010755  1.112758  0.865414

@bilderbuchi

Ah, interesting, thanks for the clarification.
Yeah, that definitely should be documented somewhere, IMO.
Also, after my initial reading I went through the discussion in #10236. I also feel that a kwarg switching between a string and a bool is really weird (I'd have gone for 'none'), but I guess that ship has sailed. Maybe that's also worth documenting clearly.
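
For reference, a quick sketch of how the three accepted keep values behave on a tiny illustrative index:

import pandas as pd

idx = pd.Index(['a', 'a', 'a', 'b'])
idx.duplicated(keep='first')  # array([False,  True,  True, False])
idx.duplicated(keep='last')   # array([ True,  True, False, False])
idx.duplicated(keep=False)    # array([ True,  True,  True, False])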

@sinhrks
Member

sinhrks commented Aug 11, 2015

Thanks for confirming. #10236 is included in v0.17, which is not yet released, so we can discuss it if you have better alternatives.

@bilderbuchi

Thanks. Well, as I said, I'd have gone for a 'none' string instead of the boolean False; as an intermediate Python user I find neither more Pythonic than the other. I'm just a casual user of pandas, though, and I'm sure @jreback has more investment in pandas, so I guess it fits better with the general pandas API or some such...

@jreback
Contributor

jreback commented Aug 12, 2015

@bilderbuchi having both a string and a boolean possible in a keyword is no big deal.

'none', however, is very confusing.

In any event, @sinhrks put up a nice doc about how to do this.

@jreback jreback modified the milestones: 0.17.0, Someday Aug 18, 2015