@@ -1180,28 +1180,43 @@ takes as an argument the columns to use to identify duplicated rows.
By default, the first observed row of a duplicate set is considered unique, but
each method has a ``keep`` parameter to specify targets to be kept.

+ - ``keep='first'`` (default): mark / drop duplicates except for the first occurrence.
+ - ``keep='last'``: mark / drop duplicates except for the last occurrence.
+ - ``keep=False``: mark / drop all duplicates.
+
.. ipython:: python

-    df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
-                        'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
-                        'c': np.random.randn(7)})
-    df2.duplicated(['a','b'])
-    df2.duplicated(['a','b'], keep='last')
-    df2.duplicated(['a','b'], keep=False)
-    df2.drop_duplicates(['a','b'])
-    df2.drop_duplicates(['a','b'], keep='last')
-    df2.drop_duplicates(['a','b'], keep=False)
+    df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
+                        'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
+                        'c': np.random.randn(7)})
+    df2
+    df2.duplicated('a')
+    df2.duplicated('a', keep='last')
+    df2.duplicated('a', keep=False)
+    df2.drop_duplicates('a')
+    df2.drop_duplicates('a', keep='last')
+    df2.drop_duplicates('a', keep=False)

- An alternative way to drop duplicates on the index is ``.groupby(level=0)`` combined with ``first()`` or ``last()``.
+ You can also pass a list of columns to identify duplicates.

.. ipython:: python

-    df3 = df2.set_index('b')
-    df3
-    df3.groupby(level=0).first()
+    df2.duplicated(['a', 'b'])
+    df2.drop_duplicates(['a', 'b'])
+
+ To drop duplicates by index value, use ``Index.duplicated`` and then slice with the result.
+ The same options are available for the ``keep`` parameter.

-    # a bit more verbose
-    df3.reset_index().drop_duplicates(subset='b', keep='first').set_index('b')
+ .. ipython:: python
+
+    df3 = pd.DataFrame({'a': np.arange(6),
+                        'b': np.random.randn(6)},
+                       index=['a', 'a', 'b', 'c', 'b', 'a'])
+    df3
+    df3.index.duplicated()
+    df3[~df3.index.duplicated()]
+    df3[~df3.index.duplicated(keep='last')]
+    df3[~df3.index.duplicated(keep=False)]


.. _indexing.dictionarylike:

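For reference when reviewing the new examples, here is a minimal sketch of what the three ``keep`` options should produce. It assumes a recent pandas and swaps the random ``c`` column for fixed integers so the results are reproducible. The variable names (`df`, `df_idx`, and the mask/result names) are illustrative only and are not part of the diff.

```python
import pandas as pd

# Same 'a' / 'b' values as the new df2 in the diff; 'c' is fixed (assumption)
# instead of np.random.randn so the results are deterministic.
df = pd.DataFrame({
    "a": ["one", "one", "two", "two", "two", "three", "four"],
    "b": ["x", "y", "x", "y", "x", "x", "x"],
    "c": range(7),
})

# Boolean masks: True marks a row that is treated as a duplicate.
first_mask = df.duplicated(subset="a")               # keep='first' is the default
last_mask = df.duplicated(subset="a", keep="last")
all_mask = df.duplicated(subset="a", keep=False)

# drop_duplicates removes exactly the rows that the matching mask marks.
kept_first = df.drop_duplicates(subset="a")
kept_last = df.drop_duplicates(subset="a", keep="last")
kept_none = df.drop_duplicates(subset="a", keep=False)

# A list of columns identifies duplicates on the combination of values.
pair_mask = df.duplicated(subset=["a", "b"])

# Index-based de-duplication: build a mask with Index.duplicated, then slice.
df_idx = pd.DataFrame({"a": range(6)}, index=["a", "a", "b", "c", "b", "a"])
unique_first = df_idx[~df_idx.index.duplicated()]
unique_last = df_idx[~df_idx.index.duplicated(keep="last")]
unique_only = df_idx[~df_idx.index.duplicated(keep=False)]
```

With ``keep=False`` every member of a duplicate set is marked, so only labels that occur exactly once survive the slice (here just ``'c'`` for the index case).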