
Method/option for deduplicating index, a la drop_duplicates #2825


Closed
wesm opened this issue Feb 9, 2013 · 10 comments

@wesm
Member

wesm commented Feb 9, 2013

see e.g. #2763

@patricktokeeffe
Contributor

The workarounds I've found for emulating DataFrame.drop_duplicates() on the index all use df.groupby(level=0) with .first() or .last(). Though easily rationalized, this is not intuitive. Also, IIUC, this does not perform the operation in place, which could be a problem for very large data sets.

Maybe drop_duplicates could have a new boolean parameter that shortcuts like so:

DataFrame.drop_duplicates(use_index=True) equals df.groupby(level=0).first()
DataFrame.drop_duplicates(use_index=True, take_last=True) equals df.groupby(level=0).last()

Of course the inplace parameter should still work. And if col is provided while use_index is True, then the index should be treated as just another column.

I'm wholly ignorant of what implementation barriers exist; this is just speculation about how it could be done.
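
For illustration, here is a minimal sketch of the groupby workaround described above (the data and column names are made up, and use_index/take_last are only the proposed parameters, not anything that exists in pandas):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 2),
                  index=list('aabbcc'), columns=['x', 'y'])

# keep the first row for each index label
# (roughly what drop_duplicates(use_index=True) would do)
first = df.groupby(level=0).first()

# keep the last row for each index label
# (roughly what drop_duplicates(use_index=True, take_last=True) would do)
last = df.groupby(level=0).last()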

@sinhrks
Member

sinhrks commented Aug 11, 2015

Now we can do this as shown below. We should document it somewhere.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 5), index=list('abcdeafagh'))
df[df.index.duplicated()]
#           0         1         2         3         4
# a -0.224246 -0.693000  0.424868 -0.384844  0.512322
# a -0.779163 -0.556674  0.102416 -0.761115  0.438872

df[~df.index.duplicated()]
#           0         1         2         3         4
# a  0.918487  0.438925  0.761677 -1.299548 -0.277456
# b  1.251101 -0.361700  3.210220 -0.778130 -0.351071
# c  0.768538  0.718754  0.565361  0.532387  0.797216
# d -0.147877  0.468789 -0.486800 -0.634428  0.502015
# e  0.331038  0.301599 -2.451796 -1.350511 -1.629918
# f -1.723805  0.714659 -0.219288  1.286036 -0.034969
# g -0.448491  0.874354  0.686063 -0.491885 -1.652858
# h -0.437542  1.518854  0.361197  2.423748 -0.425337

df[~df.index.duplicated(keep=False)]
#           0         1         2         3         4
# b  1.251101 -0.361700  3.210220 -0.778130 -0.351071
# c  0.768538  0.718754  0.565361  0.532387  0.797216
# d -0.147877  0.468789 -0.486800 -0.634428  0.502015
# e  0.331038  0.301599 -2.451796 -1.350511 -1.629918
# f -1.723805  0.714659 -0.219288  1.286036 -0.034969
# g -0.448491  0.874354  0.686063 -0.491885 -1.652858
# h -0.437542  1.518854  0.361197  2.423748 -0.425337
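
For completeness, a small sketch of keeping the last occurrence instead of the first (this assumes the keep argument added to Index.duplicated in 0.17 via #10236):

df[~df.index.duplicated(keep='last')]
# same result shape as above, but for label 'a' the last row is retained
# instead of the first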

@bilderbuchi

df[df.index.duplicated()]

Seeing how the index has 'a' occurring 3 times, I have to say I would expect this statement to give 3 results.

@patricktokeeffe
Contributor

I hadn't noticed, but I totally agree. Labeling the two later occurrences as duplicates is arbitrary.


@sinhrks
Member

sinhrks commented Aug 11, 2015

@bilderbuchi You can specify the behavior with the keep option; see #10236.

df[df.index.duplicated(keep=False)]
#           0         1         2         3         4
# a  0.791988 -0.127854  0.308921 -0.360801 -0.202838
# a -1.076777  1.027121  0.665178  0.115625  0.381496
# a  0.478709 -0.487886 -1.643373 -1.430937 -1.386072

df[~df.index.duplicated(keep=False)]
#           0         1         2         3         4
# b -1.773957  1.150821 -1.223845 -1.463176 -0.832168
# c -0.934962 -0.002800  0.102012  2.207798 -0.528325
# d  1.061423  1.362033  2.487682 -0.709634 -0.683065
# e  0.166217  0.156525 -0.535327  1.128811  0.120434
# f -0.542128  0.108480  0.572719  1.358334  0.159375
# g  0.264287 -0.188011  1.264308 -0.371283  0.089278
# h  0.059987 -0.604473 -1.010755  1.112758  0.865414

@bilderbuchi

Ah, interesting, thanks for the clarification.
Yeah, that definitely should be documented somewhere, IMO.
Also, after my initial reading I went through the discussion in #10236. I also feel that a kwarg switching between a string and a bool is really weird (I'd have gone for 'none'), but I guess that ship has sailed. Maybe that's also worth documenting clearly.
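
For reference, a quick sketch of how the three accepted keep values behave on a tiny illustrative index:

import pandas as pd

idx = pd.Index(['a', 'a', 'a', 'b'])
idx.duplicated(keep='first')  # array([False,  True,  True, False])
idx.duplicated(keep='last')   # array([ True,  True, False, False])
idx.duplicated(keep=False)    # array([ True,  True,  True, False])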

@sinhrks
Member

sinhrks commented Aug 11, 2015

Thanks for confirming. #10236 is included in v0.17, which is not yet released, so we can discuss it if you have better alternatives.

@bilderbuchi

Thanks. Well, as I said, I'd have gone for a 'none' string instead of the boolean False; as an intermediate Python user I find neither more Pythonic than the other. I'm just a casual user of pandas, though, and I'm sure @jreback has more investment in pandas, so I guess it fits better with the general pandas API or some such...

@jreback
Contributor

jreback commented Aug 12, 2015

@bilderbuchi having both a string and a boolean possible in a keyword is no big deal.

'none', however, is very confusing.

In any event, @sinhrks put up a nice doc about how to do this.

@jreback jreback modified the milestones: 0.17.0, Someday Aug 18, 2015