Commit bc45bca

Merge pull request #10810 from sinhrks/dup_doc

DOC: Updated drop_duplicates doc

2 parents: a4875d6 + 7a9268d

1 file changed: doc/source/indexing.rst (+30 −15)
@@ -1180,28 +1180,43 @@ takes as an argument the columns to use to identify duplicated rows.
 By default, the first observed row of a duplicate set is considered unique, but
 each method has a ``keep`` parameter to specify targets to be kept.
+
+- ``keep='first'`` (default): mark / drop duplicates except for the first occurrence.
+- ``keep='last'``: mark / drop duplicates except for the last occurrence.
+- ``keep=False``: mark / drop all duplicates.
 
 .. ipython:: python
 
-   df2 = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
-                       'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
-                       'c' : np.random.randn(7)})
-   df2.duplicated(['a','b'])
-   df2.duplicated(['a','b'], keep='last')
-   df2.duplicated(['a','b'], keep=False)
-   df2.drop_duplicates(['a','b'])
-   df2.drop_duplicates(['a','b'], keep='last')
-   df2.drop_duplicates(['a','b'], keep=False)
+   df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
+                       'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
+                       'c': np.random.randn(7)})
+   df2
+   df2.duplicated('a')
+   df2.duplicated('a', keep='last')
+   df2.duplicated('a', keep=False)
+   df2.drop_duplicates('a')
+   df2.drop_duplicates('a', keep='last')
+   df2.drop_duplicates('a', keep=False)
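The ``keep`` semantics this diff documents can be checked in plain pandas. A minimal runnable sketch, using the same frame as the updated docs (the inline expected outputs below reflect how ``duplicated`` behaves in pandas 0.17+, where ``keep`` replaced the older ``take_last`` parameter):

```python
import numpy as np
import pandas as pd

# Same frame as in the updated docs: 'one' and 'two' repeat in column 'a'.
df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
                    'c': np.random.randn(7)})

# keep='first' (default): only repeats *after* the first occurrence are flagged.
first = df2.duplicated('a')
# keep='last': only repeats *before* the last occurrence are flagged.
last = df2.duplicated('a', keep='last')
# keep=False: every member of a duplicate set is flagged.
none = df2.duplicated('a', keep=False)

print(first.tolist())  # [False, True, False, True, True, False, False]
print(last.tolist())   # [True, False, True, True, False, False, False]
print(none.tolist())   # [True, True, True, True, True, False, False]

# drop_duplicates keeps exactly the rows that duplicated() did NOT flag.
assert df2.drop_duplicates('a').equals(df2[~first])
```

Note that each ``keep`` option flags a different subset, so the three ``drop_duplicates`` calls in the hunk return 5, 5, and 2 rows respectively.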
 
-An alternative way to drop duplicates on the index is ``.groupby(level=0)`` combined with ``first()`` or ``last()``.
+You can also pass a list of columns to identify duplicated rows.
 
 .. ipython:: python
 
-   df3 = df2.set_index('b')
-   df3
-   df3.groupby(level=0).first()
+   df2.duplicated(['a', 'b'])
+   df2.drop_duplicates(['a', 'b'])
+
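With a list of columns, a row counts as a duplicate only when the whole tuple of values repeats, which is why the multi-column calls added here flag fewer rows than the single-column ones. A small sketch with the same frame:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
                    'c': np.random.randn(7)})

# Only the ('a', 'b') pair matters now: just ('two', 'x') occurs twice,
# at positions 2 and 4, so only position 4 is flagged by default.
dup = df2.duplicated(['a', 'b'])
print(dup.tolist())  # [False, False, False, False, True, False, False]

# drop_duplicates with the same subset removes the second ('two', 'x') row.
assert len(df2.drop_duplicates(['a', 'b'])) == 6
```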
+To drop duplicates by index value, use ``Index.duplicated`` then perform slicing.
+The same options are available in the ``keep`` parameter.
 
-   # a bit more verbose
-   df3.reset_index().drop_duplicates(subset='b', keep='first').set_index('b')
+.. ipython:: python
+
+   df3 = pd.DataFrame({'a': np.arange(6),
+                       'b': np.random.randn(6)},
+                      index=['a', 'a', 'b', 'c', 'b', 'a'])
+   df3
+   df3.index.duplicated()
+   df3[~df3.index.duplicated()]
+   df3[~df3.index.duplicated(keep='last')]
+   df3[~df3.index.duplicated(keep=False)]
 
 
 .. _indexing.dictionarylike:
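The ``Index.duplicated`` pattern this commit adds (mask the labels, then slice with the inverted mask) can be exercised directly. A minimal sketch, using the same frame as the new docs:

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({'a': np.arange(6),
                    'b': np.random.randn(6)},
                   index=['a', 'a', 'b', 'c', 'b', 'a'])

# Index.duplicated works like Series.duplicated, but on the axis labels.
mask = df3.index.duplicated()          # keep='first' by default
print(mask.tolist())  # [False, True, False, False, True, True]

# Inverting the mask keeps one row per label (the first occurrence here).
deduped = df3[~mask]
print(list(deduped.index))  # ['a', 'b', 'c']

# keep=False drops every label that appears more than once.
only_unique = df3[~df3.index.duplicated(keep=False)]
print(list(only_unique.index))  # ['c']
```

This replaces the more verbose ``reset_index().drop_duplicates(...).set_index(...)`` round-trip that the old text suggested, which is presumably why the commit rewrote that passage.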
