
ENH: drop_duplicates(consecutive=True) to drop only consecutive duplicates #10540


Open
bwillers opened this issue Jul 10, 2015 · 5 comments
Labels: API Design, duplicated/drop_duplicates, Enhancement

Comments

@bwillers
Contributor

DataFrame.drop_duplicates can be useful to 'sparsify' a frame, requiring less
memory/storage. However, it doesn't handle the case where a value later reverts
to an earlier value. For example:

In [2]: df = pd.DataFrame(index=pd.date_range('20020101', periods=5, freq='D'),
                          data={'poll_support': [0.3, 0.4, 0.4, 0.4, 0.3]})

In [3]: df
Out[3]:
            poll_support
2002-01-01           0.3
2002-01-02           0.4
2002-01-03           0.4
2002-01-04           0.4
2002-01-05           0.3

In [4]: df.drop_duplicates()
Out[4]:
            poll_support
2002-01-01           0.3
2002-01-02           0.4

It would be ideal to be able to do something like:

In [4]: df.drop_duplicates(consecutive=True)
Out[4]:
            poll_support
2002-01-01           0.3
2002-01-02           0.4
2002-01-05           0.3

This should also be a much faster operation, since you only have to compare each
row with its successor, rather than with all other rows.

You can achieve something like this with some shift trickery:

In [5]: s1 = df.shift(1)

In [6]: different = (s1 != df) & (s1.notnull() | df.notnull())

In [7]: df.drop(df.index[~different.any(axis=1)], axis=0)
Out[7]:
            poll_support
2002-01-01           0.3
2002-01-02           0.4
2002-01-05           0.3

But this is somewhat cumbersome, and allocating the intermediate shifted
frame can be slow (particularly if done via a groupby with a lot of groups).
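
Wrapped up as a helper, the same shift trickery looks like this (a sketch only; drop_consecutive_duplicates and its subset parameter are made-up names, not an existing pandas API):

def drop_consecutive_duplicates(df, subset=None):
    # Keep a row when it differs from the immediately preceding row in any
    # compared column; the notnull() term makes NaN == NaN count as equal.
    cols = df if subset is None else df[subset]
    prev = cols.shift(1)
    changed = (cols != prev) & (cols.notnull() | prev.notnull())
    return df[changed.any(axis=1)]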

@bashtage
Contributor

For numeric data it is probably fastest to use np.diff(x, axis=0) == 0 to find the dupes.
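
For example (a sketch assuming a single numeric column; np.diff compares adjacent rows, and the prepended True keeps the first row, which has no predecessor):

import numpy as np

vals = df['poll_support'].values
keep = np.concatenate(([True], np.diff(vals) != 0))
df[keep]
#             poll_support
# 2002-01-01           0.3
# 2002-01-02           0.4
# 2002-01-05           0.3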

@jorisvandenbossche
Member

I think this is indeed a very useful feature.
But we should think a bit about the API, as there could also be some kind of groupby that does this (where this drop_duplicates would then be a consecutive_groupby().first()). See #5494, #7387
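
That hypothetical consecutive_groupby can already be emulated by labelling runs of equal rows with a cumsum of change points, e.g.:

runs = (df != df.shift()).any(axis=1).cumsum()
df.groupby(runs).first()
# note: the result is indexed by run label rather than the original dates;
# .head(1) instead of .first() would preserve the original index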

@kawochen
Contributor

You could find the duplicates using itertools.groupby as well, or group by something like [(i, x) for i, g in groupby(df['a']) for x in g], but it's pretty ugly and doesn't handle nan well.
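
A corrected sketch of that idea, using enumerate to give each run a distinct label (still ugly, and NaN values still fragment into their own runs because nan != nan):

from itertools import groupby

labels = [i for i, (_, run) in enumerate(groupby(df['poll_support'])) for _ in run]
df.groupby(labels).head(1)
#             poll_support
# 2002-01-01           0.3
# 2002-01-02           0.4
# 2002-01-05           0.3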

@sinhrks
Member

sinhrks commented Jul 19, 2016

You can do:

df[df['poll_support'] != df['poll_support'].shift(1)]
#             poll_support
# 2002-01-01           0.3
# 2002-01-02           0.4
# 2002-01-05           0.3

df.groupby((df['poll_support'] != df['poll_support'].shift(1)).cumsum()).count()
#               poll_support
# poll_support              
# 1                        1
# 2                        3
# 3                        1

I'd prefer this to be in the cookbook rather than a new method / option, as users may want something more flexible, e.g. dropping only runs of 3 consecutive values.
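
For instance, the same run labels support a length threshold; a sketch of that hypothetical "collapse only runs of 3 or more" policy:

runs = (df['poll_support'] != df['poll_support'].shift()).cumsum()
sizes = df.groupby(runs)['poll_support'].transform('size')
df[(sizes < 3) | ~runs.duplicated()]
# keeps every row of runs shorter than 3, and only the first row of longer
# runs; with this df only the 0.4 run is long enough to be collapsed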

@giuliobeseghi

Any updates on this?

I think it's a good idea; I'm not sure whether it makes more sense to extend drop_duplicates or to create a new method (e.g. drop_consecutive_duplicates).

@simonjayhawkins simonjayhawkins added the duplicated duplicated, drop_duplicates label Jun 10, 2022
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022