
ENH: drop_duplicates(consecutive=True) to drop only consecutive duplicates #10540


Open
bwillers opened this issue Jul 10, 2015 · 5 comments
Labels: API Design, duplicated/drop_duplicates, Enhancement

Comments

@bwillers
Contributor

DataFrame.drop_duplicates can be useful to 'sparsify' a frame, requiring less
memory/storage. However, it doesn't handle the case where a value later reverts
to an earlier value. For example:

In [2]: df = pd.DataFrame(index=pd.date_range('20020101', periods=5, freq='D'),
                          data={'poll_support': [0.3, 0.4, 0.4, 0.4, 0.3]})

In [3]: df
Out[3]:
            poll_support
2002-01-01           0.3
2002-01-02           0.4
2002-01-03           0.4
2002-01-04           0.4
2002-01-05           0.3

In [4]: df.drop_duplicates()
Out[4]:
            poll_support
2002-01-01           0.3
2002-01-02           0.4

It would be ideal to be able to do something like:

In [4]: df.drop_duplicates(consecutive=True)
Out[4]:
            poll_support
2002-01-01           0.3
2002-01-02           0.4
2002-01-05           0.3

This should also be a much faster operation, since you only have to compare each
row with its successor, rather than with all other rows.

You can achieve something like this with some shift trickery:

In [5]: s1 = df.shift(1)

In [6]: different = (s1 != df) & (s1.notnull() | df.notnull())

In [7]: df.drop(df.index[~different.any(axis=1)], axis=0)
Out[7]:
            poll_support
2002-01-01           0.3
2002-01-02           0.4
2002-01-05           0.3

But this is somewhat cumbersome, and allocating the intermediate shifted
frame can be slow (particularly if done via a groupby with a lot of groups).
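
Wrapped up as a helper, the same shift trickery looks like this (a sketch only; drop_consecutive_duplicates and its subset parameter are made-up names, not an existing pandas API):

def drop_consecutive_duplicates(df, subset=None):
    # Keep a row when it differs from the immediately preceding row in any
    # compared column; the notnull() term makes NaN == NaN count as equal.
    cols = df if subset is None else df[subset]
    prev = cols.shift(1)
    changed = (cols != prev) & (cols.notnull() | prev.notnull())
    return df[changed.any(axis=1)]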

@bashtage
Contributor

For numeric data it is probably fastest to use np.diff(x, axis=0) == 0 to find the dupes.
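
For example (a sketch assuming a single numeric column; np.diff compares adjacent rows, and the prepended True keeps the first row, which has no predecessor):

import numpy as np

vals = df['poll_support'].values
keep = np.concatenate(([True], np.diff(vals) != 0))
df[keep]
#             poll_support
# 2002-01-01           0.3
# 2002-01-02           0.4
# 2002-01-05           0.3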

@jorisvandenbossche
Member

I think this is indeed a very useful feature.
But we should think a bit about the API, as there could also be some kind of groupby that does this (where this drop_duplicates would then be a consecutive_groupby().first()). See #5494, #7387
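
That hypothetical consecutive_groupby can already be emulated by labelling runs of equal rows with a cumsum of change points, e.g.:

runs = (df != df.shift()).any(axis=1).cumsum()
df.groupby(runs).first()
# note: the result is indexed by run label rather than the original dates;
# .head(1) instead of .first() would preserve the original index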

@kawochen
Contributor

You could find the duplicates using itertools.groupby as well, or group by something like [(i, x) for i, g in groupby(df['a']) for x in g], but it's pretty ugly and doesn't handle nan well.
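
A corrected sketch of that idea, using enumerate to give each run a distinct label (still ugly, and NaN values still fragment into their own runs because nan != nan):

from itertools import groupby

labels = [i for i, (_, run) in enumerate(groupby(df['poll_support'])) for _ in run]
df.groupby(labels).head(1)
#             poll_support
# 2002-01-01           0.3
# 2002-01-02           0.4
# 2002-01-05           0.3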

@sinhrks
Member

sinhrks commented Jul 19, 2016

You can do:

df[df['poll_support'] != df['poll_support'].shift(1)]
#             poll_support
# 2002-01-01           0.3
# 2002-01-02           0.4
# 2002-01-05           0.3

df.groupby((df['poll_support'] != df['poll_support'].shift(1)).cumsum()).count()
#               poll_support
# poll_support              
# 1                        1
# 2                        3
# 3                        1

I'd prefer this to be in the cookbook rather than a new method / option, as users may want something more flexible, e.g. dropping only runs of 3 consecutive values.
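
For instance, the same run labels support a length threshold; a sketch of that hypothetical "collapse only runs of 3 or more" policy:

runs = (df['poll_support'] != df['poll_support'].shift()).cumsum()
sizes = df.groupby(runs)['poll_support'].transform('size')
df[(sizes < 3) | ~runs.duplicated()]
# keeps every row of runs shorter than 3, and only the first row of longer
# runs; with this df only the 0.4 run is long enough to be collapsed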

@giuliobeseghi

Any updates on this?

I think it's a good idea; I'm not sure whether it makes more sense to extend drop_duplicates or to create a new method (e.g. drop_consecutive_duplicates).

@simonjayhawkins simonjayhawkins added the duplicated duplicated, drop_duplicates label Jun 10, 2022
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022