Commit 76eb314

Optionally disallow duplicate labels (#28394)
1 parent 81e3236 commit 76eb314

24 files changed (+1227, -10 lines changed)

doc/source/reference/frame.rst

+16
@@ -37,6 +37,7 @@ Attributes and underlying data
3737
DataFrame.shape
3838
DataFrame.memory_usage
3939
DataFrame.empty
40+
DataFrame.set_flags
4041

4142
Conversion
4243
~~~~~~~~~~
@@ -276,6 +277,21 @@ Time Series-related
276277
DataFrame.tz_convert
277278
DataFrame.tz_localize
278279

280+
.. _api.frame.flags:
281+
282+
Flags
283+
~~~~~
284+
285+
Flags refer to attributes of the pandas object. Properties of the dataset (like
286+
the date it was recorded, the URL it was accessed from, etc.) should be stored
287+
in :attr:`DataFrame.attrs`.
288+
289+
.. autosummary::
290+
:toctree: api/
291+
292+
Flags
293+
294+
279295
.. _api.frame.metadata:
280296

281297
Metadata

doc/source/reference/general_utility_functions.rst

+1
@@ -37,6 +37,7 @@ Exceptions and warnings
3737

3838
errors.AccessorRegistrationWarning
3939
errors.DtypeWarning
40+
errors.DuplicateLabelError
4041
errors.EmptyDataError
4142
errors.InvalidIndexError
4243
errors.MergeError

doc/source/reference/series.rst

+15
@@ -39,6 +39,8 @@ Attributes
3939
Series.empty
4040
Series.dtypes
4141
Series.name
42+
Series.flags
43+
Series.set_flags
4244

4345
Conversion
4446
----------
@@ -527,6 +529,19 @@ Sparse-dtype specific methods and attributes are provided under the
527529
Series.sparse.from_coo
528530
Series.sparse.to_coo
529531

532+
.. _api.series.flags:
533+
534+
Flags
535+
~~~~~
536+
537+
Flags refer to attributes of the pandas object. Properties of the dataset (like
538+
the date it was recorded, the URL it was accessed from, etc.) should be stored
539+
in :attr:`Series.attrs`.
540+
541+
.. autosummary::
542+
:toctree: api/
543+
544+
Flags
530545

531546
.. _api.series.metadata:
532547

doc/source/user_guide/duplicates.rst

+210
@@ -0,0 +1,210 @@
1+
.. _duplicates:
2+
3+
****************
4+
Duplicate Labels
5+
****************
6+
7+
:class:`Index` objects are not required to be unique; you can have duplicate row
8+
or column labels. This may be a bit confusing at first. If you're familiar with
9+
SQL, you know that row labels are similar to a primary key on a table, and you
10+
would never want duplicates in a SQL table. But one of pandas' roles is to clean
11+
messy, real-world data before it goes to some downstream system. And real-world
12+
data has duplicates, even in fields that are supposed to be unique.
13+
14+
This section describes how duplicate labels change the behavior of certain
15+
operations, and how to prevent duplicates from arising during operations, or to
16+
detect them if they do.
17+
18+
.. ipython:: python
19+
20+
import pandas as pd
21+
import numpy as np
22+
23+
Consequences of Duplicate Labels
24+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
25+
26+
Some pandas methods (:meth:`Series.reindex` for example) just don't work with
27+
duplicates present. The output can't be determined, and so pandas raises.
28+
29+
.. ipython:: python
30+
:okexcept:
31+
32+
s1 = pd.Series([0, 1, 2], index=['a', 'b', 'b'])
33+
s1.reindex(['a', 'b', 'c'])
34+
35+
Other methods, like indexing, can give very surprising results. Typically
36+
indexing with a scalar will *reduce dimensionality*. Slicing a ``DataFrame``
37+
with a scalar will return a ``Series``. Slicing a ``Series`` with a scalar will
38+
return a scalar. But with duplicates, this isn't the case.
39+
40+
.. ipython:: python
41+
42+
df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'A', 'B'])
43+
df1
44+
45+
We have duplicates in the columns. If we slice ``'B'``, we get back a ``Series``
46+
47+
.. ipython:: python
48+
49+
df1['B']  # a Series
50+
51+
But slicing ``'A'`` returns a ``DataFrame``
52+
53+
54+
.. ipython:: python
55+
56+
df1['A'] # a DataFrame
57+
58+
This applies to row labels as well
59+
60+
.. ipython:: python
61+
62+
df2 = pd.DataFrame({"A": [0, 1, 2]}, index=['a', 'a', 'b'])
63+
df2
64+
df2.loc['b', 'A'] # a scalar
65+
df2.loc['a', 'A'] # a Series
66+
67+
Duplicate Label Detection
68+
~~~~~~~~~~~~~~~~~~~~~~~~~
69+
70+
You can check whether an :class:`Index` (storing the row or column labels) is
71+
unique with :attr:`Index.is_unique`:
72+
73+
.. ipython:: python
74+
75+
df2
76+
df2.index.is_unique
77+
df2.columns.is_unique
78+
79+
.. note::
80+
81+
Checking whether an index is unique is somewhat expensive for large datasets.
82+
Pandas does cache this result, so re-checking on the same index is very fast.
83+
84+
:meth:`Index.duplicated` will return a boolean ndarray indicating whether a
85+
label is repeated.
86+
87+
.. ipython:: python
88+
89+
df2.index.duplicated()
90+
91+
This array can be used as a boolean filter to drop duplicate rows.
92+
93+
.. ipython:: python
94+
95+
df2.loc[~df2.index.duplicated(), :]
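
The filter above keeps the first occurrence of each label. As a minimal sketch, assuming the same three-row frame as ``df2``, the ``keep`` argument of :meth:`Index.duplicated` controls which occurrences are marked as duplicates:

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# Default keep="first": every occurrence after the first is marked.
first = df2.index.duplicated()
# keep="last": every occurrence before the last is marked.
last = df2.index.duplicated(keep="last")
# keep=False: all occurrences of a repeated label are marked.
every = df2.index.duplicated(keep=False)

# Drop every row whose label appears more than once.
only_unique = df2.loc[~every, :]
print(only_unique)
```

Using ``keep=False`` is useful when repeated labels should be inspected or discarded entirely rather than resolved by position.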
96+
97+
If you need additional logic to handle duplicate labels, rather than just
98+
dropping the repeats, using :meth:`~DataFrame.groupby` on the index is a common
99+
trick. For example, we'll resolve duplicates by taking the average of all rows
100+
with the same label.
101+
102+
.. ipython:: python
103+
104+
df2.groupby(level=0).mean()
105+
106+
.. _duplicates.disallow:
107+
108+
Disallowing Duplicate Labels
109+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
110+
111+
.. versionadded:: 1.2.0
112+
113+
As noted above, handling duplicates is an important feature when reading in raw
114+
data. That said, you may want to avoid introducing duplicates as part of a data
115+
processing pipeline (from methods like :meth:`pandas.concat`,
116+
:meth:`~DataFrame.rename`, etc.). Both :class:`Series` and :class:`DataFrame`
117+
*disallow* duplicate labels when ``.set_flags(allows_duplicate_labels=False)`` is called
118+
(the default is to allow them). If there are duplicate labels, an exception
119+
will be raised.
120+
121+
.. ipython:: python
122+
:okexcept:
123+
124+
pd.Series(
125+
[0, 1, 2],
126+
index=['a', 'b', 'b']
127+
).set_flags(allows_duplicate_labels=False)
128+
129+
This applies to both row and column labels for a :class:`DataFrame`
130+
131+
.. ipython:: python
132+
:okexcept:
133+
134+
pd.DataFrame(
135+
[[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"],
136+
).set_flags(allows_duplicate_labels=False)
137+
138+
This attribute can be checked or set with :attr:`~DataFrame.flags.allows_duplicate_labels`,
139+
which indicates whether that object can have duplicate labels.
140+
141+
.. ipython:: python
142+
143+
df = (
144+
pd.DataFrame({"A": [0, 1, 2, 3]},
145+
index=['x', 'y', 'X', 'Y'])
146+
.set_flags(allows_duplicate_labels=False)
147+
)
148+
df
149+
df.flags.allows_duplicate_labels
150+
151+
:meth:`DataFrame.set_flags` can be used to return a new ``DataFrame`` with attributes
152+
like ``allows_duplicate_labels`` set to some value
153+
154+
.. ipython:: python
155+
156+
df2 = df.set_flags(allows_duplicate_labels=True)
157+
df2.flags.allows_duplicate_labels
158+
159+
The new ``DataFrame`` returned is a view on the same data as the old ``DataFrame``.
160+
Alternatively, the property can be set directly on the same object.
161+
162+
163+
.. ipython:: python
164+
165+
df2.flags.allows_duplicate_labels = False
166+
df2.flags.allows_duplicate_labels
167+
168+
When processing raw, messy data you might initially read in the messy data
169+
(which potentially has duplicate labels), deduplicate, and then disallow duplicates
170+
going forward, to ensure that your data pipeline doesn't introduce duplicates.
171+
172+
173+
.. code-block:: python
174+
175+
>>> raw = pd.read_csv("...")
176+
>>> deduplicated = raw.groupby(level=0).first() # remove duplicates
177+
>>> deduplicated.flags.allows_duplicate_labels = False # disallow going forward
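
A self-contained variant of this pipeline may help when experimenting. The sketch below assumes a small in-memory CSV (the ``key`` and ``value`` column names are hypothetical) in place of a real file:

```python
import io

import pandas as pd

# Hypothetical raw data: the label "a" appears twice.
csv_text = "key,value\na,1\na,2\nb,3\n"

raw = pd.read_csv(io.StringIO(csv_text), index_col="key")

# Keep the first row for each label, then disallow duplicates going forward.
deduplicated = raw.groupby(level=0).first()
deduplicated.flags.allows_duplicate_labels = False

print(deduplicated)
print(deduplicated.flags.allows_duplicate_labels)
```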
178+
179+
Setting ``allows_duplicate_labels=False`` on a ``Series`` or ``DataFrame`` with duplicate
180+
labels or performing an operation that introduces duplicate labels on a ``Series`` or
181+
``DataFrame`` that disallows duplicates will raise an
182+
:class:`errors.DuplicateLabelError`.
183+
184+
.. ipython:: python
185+
:okexcept:
186+
187+
df.rename(str.upper)
188+
189+
This error message contains the labels that are duplicated, and the numeric positions
190+
of all the duplicates (including the "original") in the ``Series`` or ``DataFrame``.
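
In a pipeline it can be convenient to catch this error rather than let it propagate. The following is a minimal sketch, assuming the same ``df`` as above, that captures the message for logging:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [0, 1, 2, 3]}, index=["x", "y", "X", "Y"]
).set_flags(allows_duplicate_labels=False)

message = None
try:
    # Upper-casing the row labels makes "x"/"X" and "y"/"Y" collide.
    df.rename(str.upper)
except pd.errors.DuplicateLabelError as err:
    message = str(err)

print(message)
```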
191+
192+
Duplicate Label Propagation
193+
^^^^^^^^^^^^^^^^^^^^^^^^^^^
194+
195+
In general, disallowing duplicates is "sticky". It's preserved through
196+
operations.
197+
198+
.. ipython:: python
199+
:okexcept:
200+
201+
s1 = pd.Series(0, index=['a', 'b']).set_flags(allows_duplicate_labels=False)
202+
s1
203+
s1.head().rename({"a": "b"})
204+
205+
.. warning::
206+
207+
This is an experimental feature. Currently, many methods fail to
208+
propagate the ``allows_duplicate_labels`` value. In future versions
209+
it is expected that every method taking or returning one or more
210+
DataFrame or Series objects will propagate ``allows_duplicate_labels``.

doc/source/user_guide/index.rst

+1
@@ -33,6 +33,7 @@ Further information on any specific method can be obtained in the
3333
reshaping
3434
text
3535
missing_data
36+
duplicates
3637
categorical
3738
integer_na
3839
boolean

doc/source/whatsnew/v1.2.0.rst

+49
@@ -13,6 +13,53 @@ including other versions of pandas.
1313
Enhancements
1414
~~~~~~~~~~~~
1515

16+
.. _whatsnew_120.duplicate_labels:
17+
18+
Optionally disallow duplicate labels
19+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
20+
21+
:class:`Series` and :class:`DataFrame` can now be created with the ``allows_duplicate_labels=False`` flag to
22+
control whether the index or columns can contain duplicate labels (:issue:`28394`). This can be used to
23+
prevent accidental introduction of duplicate labels, which can affect downstream operations.
24+
25+
By default, duplicates continue to be allowed
26+
27+
.. ipython:: python
28+
29+
pd.Series([1, 2], index=['a', 'a'])
30+
31+
.. ipython:: python
32+
:okexcept:
33+
34+
pd.Series([1, 2], index=['a', 'a']).set_flags(allows_duplicate_labels=False)
35+
36+
Pandas will propagate the ``allows_duplicate_labels`` property through many operations.
37+
38+
.. ipython:: python
39+
:okexcept:
40+
41+
a = (
42+
pd.Series([1, 2], index=['a', 'b'])
43+
.set_flags(allows_duplicate_labels=False)
44+
)
45+
a
46+
# An operation introducing duplicates
47+
a.reindex(['a', 'b', 'a'])
48+
49+
.. warning::
50+
51+
This is an experimental feature. Currently, many methods fail to
52+
propagate the ``allows_duplicate_labels`` value. In future versions
53+
it is expected that every method taking or returning one or more
54+
DataFrame or Series objects will propagate ``allows_duplicate_labels``.
55+
56+
See :ref:`duplicates` for more.
57+
58+
The ``allows_duplicate_labels`` flag is stored in the new :attr:`DataFrame.flags`
59+
attribute. This stores global attributes that apply to the *pandas object*. This
60+
differs from :attr:`DataFrame.attrs`, which stores information that applies to
61+
the dataset.
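
The distinction can be sketched as follows. The ``source_url`` key and its value are hypothetical dataset metadata chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2]}, index=["a", "b"])

# attrs: information about the dataset itself (hypothetical example value).
df.attrs["source_url"] = "https://example.com/data.csv"

# flags: a property of the pandas object; set_flags returns a new object.
strict = df.set_flags(allows_duplicate_labels=False)

print(df.flags.allows_duplicate_labels)      # the original object is unchanged
print(strict.flags.allows_duplicate_labels)
print(df.attrs)
```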
62+
1663
Passing arguments to fsspec backends
1764
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1865

@@ -53,6 +100,8 @@ For example:
53100

54101
Other enhancements
55102
^^^^^^^^^^^^^^^^^^
103+
104+
- Added :meth:`~DataFrame.set_flags` for setting table-wide flags on a ``Series`` or ``DataFrame`` (:issue:`28394`)
56105
- :class:`Index` with object dtype supports division and multiplication (:issue:`34160`)
57106
-
58107
-

pandas/__init__.py

+1
@@ -100,6 +100,7 @@
100100
to_datetime,
101101
to_timedelta,
102102
# misc
103+
Flags,
103104
Grouper,
104105
factorize,
105106
unique,
