Commit bc45bca

Merge pull request #10810 from sinhrks/dup_doc

DOC: Updated drop_duplicates doc

2 parents: a4875d6 + 7a9268d

1 file changed: doc/source/indexing.rst (+30 −15)
@@ -1180,28 +1180,43 @@ takes as an argument the columns to use to identify duplicated rows.
 By default, the first observed row of a duplicate set is considered unique, but
 each method has a ``keep`` parameter to specify targets to be kept.
+
+- ``keep='first'`` (default): mark / drop duplicates except for the first occurrence.
+- ``keep='last'``: mark / drop duplicates except for the last occurrence.
+- ``keep=False``: mark / drop all duplicates.
 
 .. ipython:: python
 
-   df2 = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
-                       'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
-                       'c' : np.random.randn(7)})
-   df2.duplicated(['a','b'])
-   df2.duplicated(['a','b'], keep='last')
-   df2.duplicated(['a','b'], keep=False)
-   df2.drop_duplicates(['a','b'])
-   df2.drop_duplicates(['a','b'], keep='last')
-   df2.drop_duplicates(['a','b'], keep=False)
+   df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
+                       'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
+                       'c': np.random.randn(7)})
+   df2
+   df2.duplicated('a')
+   df2.duplicated('a', keep='last')
+   df2.duplicated('a', keep=False)
+   df2.drop_duplicates('a')
+   df2.drop_duplicates('a', keep='last')
+   df2.drop_duplicates('a', keep=False)
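The ``keep`` semantics this diff documents can be checked in plain pandas. A minimal runnable sketch, using the same frame as the updated docs (the inline expected outputs below reflect how ``duplicated`` behaves in pandas 0.17+, where ``keep`` replaced the older ``take_last`` parameter):

```python
import numpy as np
import pandas as pd

# Same frame as in the updated docs: 'one' and 'two' repeat in column 'a'.
df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
                    'c': np.random.randn(7)})

# keep='first' (default): only repeats *after* the first occurrence are flagged.
first = df2.duplicated('a')
# keep='last': only repeats *before* the last occurrence are flagged.
last = df2.duplicated('a', keep='last')
# keep=False: every member of a duplicate set is flagged.
none = df2.duplicated('a', keep=False)

print(first.tolist())  # [False, True, False, True, True, False, False]
print(last.tolist())   # [True, False, True, True, False, False, False]
print(none.tolist())   # [True, True, True, True, True, False, False]

# drop_duplicates keeps exactly the rows that duplicated() did NOT flag.
assert df2.drop_duplicates('a').equals(df2[~first])
```

Note that each ``keep`` option flags a different subset, so the three ``drop_duplicates`` calls in the hunk return 5, 5, and 2 rows respectively.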
 
-An alternative way to drop duplicates on the index is ``.groupby(level=0)`` combined with ``first()`` or ``last()``.
+You can also pass a list of columns to identify duplicated rows.
 
 .. ipython:: python
 
-   df3 = df2.set_index('b')
-   df3
-   df3.groupby(level=0).first()
+   df2.duplicated(['a', 'b'])
+   df2.drop_duplicates(['a', 'b'])
+
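With a list of columns, a row counts as a duplicate only when the whole tuple of values repeats, which is why the multi-column calls added here flag fewer rows than the single-column ones. A small sketch with the same frame:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
                    'c': np.random.randn(7)})

# Only the ('a', 'b') pair matters now: just ('two', 'x') occurs twice,
# at positions 2 and 4, so only position 4 is flagged by default.
dup = df2.duplicated(['a', 'b'])
print(dup.tolist())  # [False, False, False, False, True, False, False]

# drop_duplicates with the same subset removes the second ('two', 'x') row.
assert len(df2.drop_duplicates(['a', 'b'])) == 6
```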
+To drop duplicates by index value, use ``Index.duplicated`` then perform slicing.
+The same options are available in the ``keep`` parameter.
 
-   # a bit more verbose
-   df3.reset_index().drop_duplicates(subset='b', keep='first').set_index('b')
+.. ipython:: python
+
+   df3 = pd.DataFrame({'a': np.arange(6),
+                       'b': np.random.randn(6)},
+                      index=['a', 'a', 'b', 'c', 'b', 'a'])
+   df3
+   df3.index.duplicated()
+   df3[~df3.index.duplicated()]
+   df3[~df3.index.duplicated(keep='last')]
+   df3[~df3.index.duplicated(keep=False)]
 
 
 .. _indexing.dictionarylike:
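The ``Index.duplicated`` pattern this commit adds (mask the labels, then slice with the inverted mask) can be exercised directly. A minimal sketch, using the same frame as the new docs:

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({'a': np.arange(6),
                    'b': np.random.randn(6)},
                   index=['a', 'a', 'b', 'c', 'b', 'a'])

# Index.duplicated works like Series.duplicated, but on the axis labels.
mask = df3.index.duplicated()          # keep='first' by default
print(mask.tolist())  # [False, True, False, False, True, True]

# Inverting the mask keeps one row per label (the first occurrence here).
deduped = df3[~mask]
print(list(deduped.index))  # ['a', 'b', 'c']

# keep=False drops every label that appears more than once.
only_unique = df3[~df3.index.duplicated(keep=False)]
print(list(only_unique.index))  # ['c']
```

This replaces the more verbose ``reset_index().drop_duplicates(...).set_index(...)`` round-trip that the old text suggested, which is presumably why the commit rewrote that passage.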
