|
| 1 | +.. _duplicates: |
| 2 | + |
| 3 | +**************** |
| 4 | +Duplicate Labels |
| 5 | +**************** |
| 6 | + |
| 7 | +:class:`Index` objects are not required to be unique; you can have duplicate row |
| 8 | +or column labels. This may be a bit confusing at first. If you're familiar with |
| 9 | +SQL, you know that row labels are similar to a primary key on a table, and you |
| 10 | +would never want duplicates in a SQL table. But one of pandas' roles is to clean |
| 11 | +messy, real-world data before it goes to some downstream system. And real-world |
| 12 | +data has duplicates, even in fields that are supposed to be unique. |
| 13 | + |
| 14 | +This section describes how duplicate labels change the behavior of certain |
| 15 | +operations, and how to prevent duplicates from arising during operations, or to |
| 16 | +detect them if they do. |
| 17 | + |
| 18 | +.. ipython:: python |
| 19 | +
|
| 20 | + import pandas as pd |
| 21 | + import numpy as np |
| 22 | +
|
| 23 | +Consequences of Duplicate Labels |
| 24 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 25 | + |
| 26 | +Some pandas methods (:meth:`Series.reindex`, for example) don't work when |
| 27 | +duplicates are present. The output can't be determined, so pandas raises an exception. |
| 28 | + |
| 29 | +.. ipython:: python |
| 30 | + :okexcept: |
| 31 | +
|
| 32 | + s1 = pd.Series([0, 1, 2], index=['a', 'b', 'b']) |
| 33 | + s1.reindex(['a', 'b', 'c']) |
| 34 | +
|
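As a minimal sketch of one way around the error above, the repeated labels can be dropped before reindexing. The variable names ``deduped`` and ``result`` are illustrative only, and keeping the first occurrence of each label is an assumption about what the caller wants.

```python
import pandas as pd

s = pd.Series([0, 1, 2], index=["a", "b", "b"])

# Reindexing with duplicate labels present raises a ValueError.
try:
    s.reindex(["a", "b", "c"])
except ValueError as err:
    print(f"reindex failed: {err}")

# Keep only the first occurrence of each label, then reindex.
deduped = s[~s.index.duplicated()]
result = deduped.reindex(["a", "b", "c"])
print(result)
```

The label ``'c'`` is absent from the original data, so it is filled with ``NaN`` in ``result``.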
| 35 | +Other methods, like indexing, can give very surprising results. Typically |
| 36 | +indexing with a scalar will *reduce dimensionality*. Indexing a ``DataFrame`` |
| 37 | +with a scalar will return a ``Series``. Indexing a ``Series`` with a scalar will |
| 38 | +return a scalar. But with duplicates, this isn't the case. |
| 39 | + |
| 40 | +.. ipython:: python |
| 41 | +
|
| 42 | + df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'A', 'B']) |
| 43 | + df1 |
| 44 | +
|
| 45 | +We have duplicates in the columns. If we select ``'B'``, we get back a ``Series``: |
| 46 | + |
| 47 | +.. ipython:: python |
| 48 | +
|
| 49 | + df1['B'] # a Series |
| 50 | +
|
| 51 | +But selecting ``'A'`` returns a ``DataFrame``: |
| 52 | + |
| 53 | + |
| 54 | +.. ipython:: python |
| 55 | +
|
| 56 | + df1['A'] # a DataFrame |
| 57 | +
|
| 58 | +This applies to row labels as well: |
| 59 | + |
| 60 | +.. ipython:: python |
| 61 | +
|
| 62 | + df2 = pd.DataFrame({"A": [0, 1, 2]}, index=['a', 'a', 'b']) |
| 63 | + df2 |
| 64 | + df2.loc['b', 'A'] # a scalar |
| 65 | +
|
| 66 | + df2.loc['a', 'A'] # a Series |
| 67 | +
|
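When downstream code needs a predictable return type, one option is to index with a list of labels rather than a scalar, since a list-like indexer does not reduce dimensionality. The sketch below rebuilds the frame from the example above; the names ``unique_hit`` and ``dup_hit`` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# A list of labels always yields a Series here, whether or not
# the label is repeated in the index.
unique_hit = df.loc[["b"], "A"]  # Series with one element
dup_hit = df.loc[["a"], "A"]     # Series with two elements

print(type(unique_hit).__name__, len(unique_hit))
print(type(dup_hit).__name__, len(dup_hit))
```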
| 68 | +Duplicate Label Detection |
| 69 | +~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 70 | + |
| 71 | +You can check whether an :class:`Index` (storing the row or column labels) is |
| 72 | +unique with :attr:`Index.is_unique`: |
| 73 | + |
| 74 | +.. ipython:: python |
| 75 | +
|
| 76 | + df2 |
| 77 | + df2.index.is_unique |
| 78 | + df2.columns.is_unique |
| 79 | +
|
| 80 | +.. note:: |
| 81 | + |
| 82 | + Checking whether an index is unique is somewhat expensive for large datasets. |
| 83 | + However, pandas caches this result, so re-checking the same index is very fast. |
| 84 | + |
| 85 | +:meth:`Index.duplicated` will return a boolean ndarray indicating whether each |
| 86 | +label is a repeat of an earlier one (by default, the first occurrence is not marked). |
| 87 | + |
| 88 | +.. ipython:: python |
| 89 | +
|
| 90 | + df2.index.duplicated() |
| 91 | +
|
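To see which labels are repeated, rather than only where the repeats occur, the boolean mask can be applied to the index itself. ``find_duplicate_labels`` is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd


def find_duplicate_labels(obj):
    """Return the row labels that occur more than once (hypothetical helper)."""
    idx = obj.index
    # duplicated() marks every occurrence after the first; unique()
    # collapses labels that are repeated three or more times.
    return idx[idx.duplicated()].unique()


df = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])
repeated = find_duplicate_labels(df)
print(list(repeated))
```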
| 92 | +This array can be used as a boolean filter to drop rows with repeated labels. |
| 93 | + |
| 94 | +.. ipython:: python |
| 95 | +
|
| 96 | + df2.loc[~df2.index.duplicated(), :] |
| 97 | +
|
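Which row survives the filter depends on the ``keep`` argument of ``Index.duplicated``. The sketch below compares the three accepted values on the same frame; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# keep="first" (the default): later repeats are marked as duplicates.
keep_first = df.loc[~df.index.duplicated(keep="first")]
# keep="last": earlier repeats are marked instead.
keep_last = df.loc[~df.index.duplicated(keep="last")]
# keep=False: every occurrence of a repeated label is marked.
keep_none = df.loc[~df.index.duplicated(keep=False)]

print(keep_first, keep_last, keep_none, sep="\n\n")
```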
| 98 | +If you need additional logic to handle duplicate labels, rather than just |
| 99 | +dropping the repeats, using :meth:`~DataFrame.groupby` on the index is a common |
| 100 | +trick. For example, we'll resolve duplicates by taking the average of all rows |
| 101 | +with the same label. |
| 102 | + |
| 103 | +.. ipython:: python |
| 104 | +
|
| 105 | + df2.groupby(level=0).mean() |
| 106 | +
|
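The same grouping pattern works with other reductions. As a small sketch, assuming the frame from the example above, the repeats could instead be summed or resolved by taking the first value per label.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# Group on the row labels (index level 0) and reduce each group.
summed = df.groupby(level=0).sum()
first = df.groupby(level=0).first()

print(summed)
print(first)
```

Each result has one row per distinct label, so its index is unique.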
| 107 | +.. _duplicates.disallow: |
| 108 | + |
| 109 | +Disallowing Duplicate Labels |
| 110 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 111 | + |
| 112 | +As noted above, handling duplicates is an important feature when reading in raw |
| 113 | +data. That said, you may want to avoid introducing duplicates as part of a data |
| 114 | +processing pipeline (from methods like :meth:`pandas.concat`, |
| 115 | +:meth:`~DataFrame.rename`, etc.). Both :class:`Series` and :class:`DataFrame` |
| 116 | +can be created with the argument ``allow_duplicate_labels=False`` to *disallow* |
| 117 | +duplicate labels (the default is to allow them). If there are duplicate labels, |
| 118 | +an exception will be raised. |
| 119 | + |
| 120 | +.. ipython:: python |
| 121 | + :okexcept: |
| 122 | +
|
| 123 | + pd.Series([0, 1, 2], index=['a', 'b', 'b'], allow_duplicate_labels=False) |
| 124 | +
|
| 125 | +This applies to both row and column labels for a :class:`DataFrame`: |
| 126 | + |
| 127 | +.. ipython:: python |
| 128 | + :okexcept: |
| 129 | +
|
| 130 | + pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "B", "C"], |
| 131 | + allow_duplicate_labels=False) |
| 132 | +
|
| 133 | +This attribute can be checked with :attr:`~DataFrame.allows_duplicate_labels`, |
| 134 | +which indicates whether that object can have duplicate labels. |
| 135 | + |
| 136 | +.. ipython:: python |
| 137 | +
|
| 138 | + df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=['x', 'y', 'X', 'Y'], |
| 139 | + allow_duplicate_labels=False) |
| 140 | + df |
| 141 | + df.allows_duplicate_labels |
| 142 | +
|
| 143 | +Performing an operation that introduces duplicate labels on a ``Series`` or |
| 144 | +``DataFrame`` that disallows duplicates will raise an |
| 145 | +:class:`errors.DuplicateLabelError`. |
| 146 | + |
| 147 | +.. ipython:: python |
| 148 | + :okexcept: |
| 149 | +
|
| 150 | + df.rename(str.upper) |
| 151 | +
|
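The constructor keyword used in this section may not exist in the pandas version you have installed. As a hedged sketch, assuming pandas 1.2 or later, the same restriction can be requested with ``DataFrame.set_flags(allows_duplicate_labels=False)`` and inspected through ``DataFrame.flags``. The ``raised`` variable is only there to record the outcome.

```python
import pandas as pd
from pandas.errors import DuplicateLabelError

# Assumes pandas >= 1.2, where set_flags and DuplicateLabelError exist.
df = pd.DataFrame(
    {"A": [0, 1, 2, 3]}, index=["x", "y", "X", "Y"]
).set_flags(allows_duplicate_labels=False)

print(df.flags.allows_duplicate_labels)

# Upper-casing the labels would produce 'X', 'Y', 'X', 'Y'.
try:
    df.rename(str.upper)
    raised = False
except DuplicateLabelError as err:
    raised = True
    print(f"rename rejected: {err}")
```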
| 152 | +Duplicate Label Propagation |
| 153 | +^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 154 | + |
| 155 | +In general, disallowing duplicates is "sticky". It's preserved through |
| 156 | +operations. |
| 157 | + |
| 158 | +.. ipython:: python |
| 159 | + :okexcept: |
| 160 | +
|
| 161 | + s1 = pd.Series(0, index=['a', 'b'], allow_duplicate_labels=False) |
| 162 | + s1 |
| 163 | + abs(s1).rename({"a": "b"}) |
| 164 | +
|
| 165 | +When multiple Series or DataFrames are involved in an operation, |
| 166 | +duplicates are disallowed if *any* of the inputs disallow duplicates. |
| 167 | + |
| 168 | +.. ipython:: python |
| 169 | + :okexcept: |
| 170 | +
|
| 171 | + s1 = pd.Series(0, index=['a', 'b'], allow_duplicate_labels=False) |
| 172 | + s2 = pd.Series(1, index=['b', 'c'], allow_duplicate_labels=True) |
| 173 | +
|
| 174 | + pd.concat([s1, s2]) |