|
| 1 | +.. _duplicates: |
| 2 | + |
| 3 | +**************** |
| 4 | +Duplicate Labels |
| 5 | +**************** |
| 6 | + |
| 7 | +:class:`Index` objects are not required to be unique; you can have duplicate row |
| 8 | +or column labels. This may be a bit confusing at first. If you're familiar with |
| 9 | +SQL, you know that row labels are similar to a primary key on a table, and you |
| 10 | +would never want duplicates in a SQL table. But one of pandas' roles is to clean |
| 11 | +messy, real-world data before it goes to some downstream system. And real-world |
| 12 | +data has duplicates, even in fields that are supposed to be unique. |
| 13 | + |
| 14 | +This section describes how duplicate labels change the behavior of certain |
| 15 | +operations, and how to prevent duplicates from arising during operations, or to |
| 16 | +detect them if they do. |
| 17 | + |
| 18 | +.. ipython:: python |
| 19 | +
|
| 20 | + import pandas as pd |
| 21 | + import numpy as np |
| 22 | +
|
| 23 | +Consequences of Duplicate Labels |
| 24 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 25 | + |
| 26 | +Some pandas methods (:meth:`Series.reindex` for example) just don't work with |
| 27 | +duplicates present. The output can't be determined, and so pandas raises an error. |
| 28 | + |
| 29 | +.. ipython:: python |
| 30 | + :okexcept: |
| 31 | +
|
| 32 | + s1 = pd.Series([0, 1, 2], index=['a', 'b', 'b']) |
| 33 | + s1.reindex(['a', 'b', 'c']) |
| 34 | +
|
| 35 | +Other methods, like indexing, can give very surprising results. Typically |
| 36 | +indexing with a scalar will *reduce dimensionality*. Indexing a ``DataFrame`` |
| 37 | +with a scalar will return a ``Series``. Indexing a ``Series`` with a scalar will |
| 38 | +return a scalar. But with duplicates, this isn't the case. |
| 39 | + |
| 40 | +.. ipython:: python |
| 41 | +
|
| 42 | + df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'A', 'B']) |
| 43 | + df1 |
| 44 | +
|
| 45 | +We have duplicates in the columns. If we select ``'B'``, we get back a ``Series``: |
| 46 | + |
| 47 | +.. ipython:: python |
| 48 | +
|
| 49 | + df1['B'] # a Series |
| 50 | +
|
| 51 | +But selecting ``'A'`` returns a ``DataFrame``: |
| 52 | + |
| 53 | + |
| 54 | +.. ipython:: python |
| 55 | +
|
| 56 | + df1['A'] # a DataFrame |
| 57 | +
|
| 58 | +This applies to row labels as well: |
| 59 | + |
| 60 | +.. ipython:: python |
| 61 | +
|
| 62 | + df2 = pd.DataFrame({"A": [0, 1, 2]}, index=['a', 'a', 'b']) |
| 63 | + df2 |
| 64 | + df2.loc['b', 'A'] # a scalar |
| 65 | + df2.loc['a', 'A'] # a Series |
| 66 | +
|
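As a minimal sketch (not part of the original text), one way to get a predictable result type whether or not labels repeat is to index with a *list* of labels instead of a scalar, because list indexers do not reduce dimensionality. The frames below simply mirror ``df1`` and ``df2`` from the examples above.

```python
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])
df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# A list of column labels always yields a DataFrame,
# whether the label is unique ("B") or repeated ("A").
only_b = df1[["B"]]
only_a = df1[["A"]]

# A list of row labels with a scalar column always yields a Series,
# whether the row label is unique ("b") or repeated ("a").
rows_b = df2.loc[["b"], "A"]
rows_a = df2.loc[["a"], "A"]

print(type(only_a).__name__, type(only_b).__name__)
print(type(rows_a).__name__, type(rows_b).__name__)
```

The trade-off is that the caller has to unwrap single results explicitly, but the result type no longer depends on the data.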
| 67 | +Duplicate Label Detection |
| 68 | +~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 69 | + |
| 70 | +You can check whether an :class:`Index` (storing the row or column labels) is |
| 71 | +unique with :attr:`Index.is_unique`: |
| 72 | + |
| 73 | +.. ipython:: python |
| 74 | +
|
| 75 | + df2 |
| 76 | + df2.index.is_unique |
| 77 | + df2.columns.is_unique |
| 78 | +
|
| 79 | +.. note:: |
| 80 | + |
| 81 | + Checking whether an index is unique is somewhat expensive for large datasets. |
| 82 | + pandas caches this result, so re-checking the same index is very fast. |
| 83 | + |
| 84 | +:meth:`Index.duplicated` will return a boolean ndarray indicating whether a |
| 85 | +label is repeated. |
| 86 | + |
| 87 | +.. ipython:: python |
| 88 | +
|
| 89 | + df2.index.duplicated() |
| 90 | +
|
| 91 | +This array can be used as a boolean filter to drop duplicate rows: |
| 92 | + |
| 93 | +.. ipython:: python |
| 94 | +
|
| 95 | + df2.loc[~df2.index.duplicated(), :] |
| 96 | +
|
| 97 | +If you need additional logic to handle duplicate labels, rather than just |
| 98 | +dropping the repeats, using :meth:`~DataFrame.groupby` on the index is a common |
| 99 | +trick. For example, we'll resolve duplicates by taking the average of all rows |
| 100 | +with the same label. |
| 101 | + |
| 102 | +.. ipython:: python |
| 103 | +
|
| 104 | + df2.groupby(level=0).mean() |
| 105 | +
|
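The following sketch shows a few variations on the detection and resolution steps above, under the assumption that the data looks like ``df2``. It uses the ``keep`` parameter of :meth:`Index.duplicated` to choose which occurrence is treated as the original, and a different reduction in the ``groupby``. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# keep=False flags every member of a duplicated group, not only the repeats,
# which is handy for inspecting all conflicting rows at once.
all_dupes = df.index.duplicated(keep=False)
conflicting_rows = df.loc[all_dupes]

# keep="last" treats the final occurrence as the one to retain.
last_kept = df.loc[~df.index.duplicated(keep="last")]

# Resolve duplicates by summing the rows that share a label.
summed = df.groupby(level=0).sum()

print(conflicting_rows)
print(last_kept)
print(summed)
```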
| 106 | +.. _duplicates.disallow: |
| 107 | + |
| 108 | +Disallowing Duplicate Labels |
| 109 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 110 | + |
| 111 | +.. versionadded:: 1.2.0 |
| 112 | + |
| 113 | +As noted above, handling duplicates is an important feature when reading in raw |
| 114 | +data. That said, you may want to avoid introducing duplicates as part of a data |
| 115 | +processing pipeline (from methods like :meth:`pandas.concat`, |
| 116 | +:meth:`~DataFrame.rename`, etc.). Both :class:`Series` and :class:`DataFrame` |
| 117 | +can *disallow* duplicate labels by calling ``.set_flags(allows_duplicate_labels=False)`` |
| 118 | +(the default is to allow them). If there are duplicate labels, an exception |
| 119 | +will be raised. |
| 120 | + |
| 121 | +.. ipython:: python |
| 122 | + :okexcept: |
| 123 | +
|
| 124 | + pd.Series( |
| 125 | + [0, 1, 2], |
| 126 | + index=['a', 'b', 'b'] |
| 127 | + ).set_flags(allows_duplicate_labels=False) |
| 128 | +
|
| 129 | +This applies to both row and column labels for a :class:`DataFrame`: |
| 130 | + |
| 131 | +.. ipython:: python |
| 132 | + :okexcept: |
| 133 | +
|
| 134 | + pd.DataFrame( |
| 135 | + [[0, 1, 2], [3, 4, 5]], columns=["A", "B", "B"], |
| 136 | + ).set_flags(allows_duplicate_labels=False) |
| 137 | +
|
| 138 | +This attribute can be checked or set with :attr:`~DataFrame.flags.allows_duplicate_labels`, |
| 139 | +which indicates whether that object can have duplicate labels. |
| 140 | + |
| 141 | +.. ipython:: python |
| 142 | +
|
| 143 | + df = ( |
| 144 | + pd.DataFrame({"A": [0, 1, 2, 3]}, |
| 145 | + index=['x', 'y', 'X', 'Y']) |
| 146 | + .set_flags(allows_duplicate_labels=False) |
| 147 | + ) |
| 148 | + df |
| 149 | + df.flags.allows_duplicate_labels |
| 150 | +
|
| 151 | +:meth:`DataFrame.set_flags` can be used to return a new ``DataFrame`` with attributes |
| 152 | +like ``allows_duplicate_labels`` set to some value: |
| 153 | + |
| 154 | +.. ipython:: python |
| 155 | +
|
| 156 | + df2 = df.set_flags(allows_duplicate_labels=True) |
| 157 | + df2.flags.allows_duplicate_labels |
| 158 | +
|
| 159 | +The new ``DataFrame`` returned is a view on the same data as the old ``DataFrame``. |
| 160 | +Alternatively, the property can be set directly on the same object: |
| 161 | + |
| 162 | + |
| 163 | +.. ipython:: python |
| 164 | +
|
| 165 | + df2.flags.allows_duplicate_labels = False |
| 166 | + df2.flags.allows_duplicate_labels |
| 167 | +
|
| 168 | +When processing raw, messy data, you might initially read in the data |
| 169 | +(which potentially has duplicate labels), deduplicate, and then disallow duplicates |
| 170 | +going forward, to ensure that your data pipeline doesn't introduce duplicates. |
| 171 | + |
| 172 | + |
| 173 | +.. code-block:: python |
| 174 | +
|
| 175 | + >>> raw = pd.read_csv("...") |
| 176 | + >>> deduplicated = raw.groupby(level=0).first() # remove duplicates |
| 177 | + >>> deduplicated.flags.allows_duplicate_labels = False # disallow going forward |
| 178 | +
|
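A runnable sketch of the same read, deduplicate, disallow pattern follows. Because the file path above is elided, this version reads from an in-memory string; the CSV contents and the ``key`` / ``value`` column names are hypothetical stand-ins, not data from the original example.

```python
import io

import pandas as pd

# Hypothetical raw data in which the label "a" appears twice.
csv_text = "key,value\na,1\na,2\nb,3\n"

raw = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Keep the first row for each label, then disallow duplicates going forward.
deduplicated = raw.groupby(level=0).first()
deduplicated = deduplicated.set_flags(allows_duplicate_labels=False)

# A later step that would reintroduce a duplicate label now fails loudly.
try:
    deduplicated.rename(index={"b": "a"})
    raised = False
except pd.errors.DuplicateLabelError:
    raised = True

print(deduplicated)
print("rename raised:", raised)
```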
| 179 | +Setting ``allows_duplicate_labels=False`` on a ``Series`` or ``DataFrame`` with duplicate |
| 180 | +labels, or performing an operation that introduces duplicate labels on a ``Series`` or |
| 181 | +``DataFrame`` that disallows duplicates, will raise an |
| 182 | +:class:`errors.DuplicateLabelError`. |
| 183 | + |
| 184 | +.. ipython:: python |
| 185 | + :okexcept: |
| 186 | +
|
| 187 | + df.rename(str.upper) |
| 188 | +
|
| 189 | +This error message contains the labels that are duplicated, and the numeric positions |
| 190 | +of all the duplicates (including the "original") in the ``Series`` or ``DataFrame``. |
| 191 | + |
| 192 | +Duplicate Label Propagation |
| 193 | +^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 194 | + |
| 195 | +In general, disallowing duplicates is "sticky". It's preserved through |
| 196 | +operations. |
| 197 | + |
| 198 | +.. ipython:: python |
| 199 | + :okexcept: |
| 200 | +
|
| 201 | + s1 = pd.Series(0, index=['a', 'b']).set_flags(allows_duplicate_labels=False) |
| 202 | + s1 |
| 203 | + s1.head().rename({"a": "b"}) |
| 204 | +
|
| 205 | +.. warning:: |
| 206 | + |
| 207 | + This is an experimental feature. Currently, many methods fail to |
| 208 | + propagate the ``allows_duplicate_labels`` value. In future versions |
| 209 | + it is expected that every method taking or returning one or more |
| 210 | + DataFrame or Series objects will propagate ``allows_duplicate_labels``. |
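Since propagation is described as incomplete, a pipeline may want an explicit check at important boundaries rather than relying on the flag alone. The helper below is a hypothetical sketch, not a pandas API: ``check_unique_labels`` is an illustrative name, and it only relies on :attr:`Index.is_unique` and :meth:`Index.duplicated`.

```python
import pandas as pd


def check_unique_labels(obj):
    """Hypothetical helper: raise if any axis of a Series/DataFrame repeats a label."""
    for axis_number, axis in enumerate(obj.axes):
        if not axis.is_unique:
            repeated = axis[axis.duplicated()].unique().tolist()
            raise ValueError(
                f"axis {axis_number} has duplicate labels: {repeated}"
            )
    return obj


unique = pd.Series([0, 1], index=["a", "b"])
repeated = pd.Series([0, 1, 2], index=["a", "b", "b"])

# Passes through unchanged, so it can be used inside a method chain.
checked = unique.pipe(check_unique_labels)

try:
    repeated.pipe(check_unique_labels)
    failed = False
except ValueError as err:
    failed = True
    message = str(err)
    print(message)
```

Returning the object unchanged lets the check sit in a ``.pipe`` chain after any step whose flag propagation is in doubt.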