Commit 3915847

API: unique labels

1 parent 6c898e6 commit 3915847
File tree

14 files changed: +680 -20 lines changed

doc/source/index.rst.template (+1)

@@ -71,6 +71,7 @@ See the :ref:`overview` for more detail about what's in the library.
 * :doc:`user_guide/reshaping`
 * :doc:`user_guide/text`
 * :doc:`user_guide/missing_data`
+* :doc:`user_guide/duplicates`
 * :doc:`user_guide/categorical`
 * :doc:`user_guide/integer_na`
 * :doc:`user_guide/visualization`

doc/source/reference/frame.rst (+1)

@@ -23,6 +23,7 @@ Attributes and underlying data

    DataFrame.index
    DataFrame.columns
+   DataFrame.allows_duplicate_labels

 .. autosummary::
    :toctree: api/

doc/source/reference/series.rst (+1)

@@ -22,6 +22,7 @@ Attributes
    :toctree: api/

    Series.index
+   Series.allows_duplicate_labels

 .. autosummary::
    :toctree: api/

doc/source/user_guide/duplicates.rst (new file, +174 lines)

.. _duplicates:

****************
Duplicate Labels
****************

:class:`Index` objects are not required to be unique; you can have duplicate row
or column labels. This may be a bit confusing at first. If you're familiar with
SQL, you know that row labels are similar to a primary key on a table, and you
would never want duplicates in a SQL table. But one of pandas' roles is to clean
messy, real-world data before it goes to some downstream system. And real-world
data has duplicates, even in fields that are supposed to be unique.

This section describes how duplicate labels change the behavior of certain
operations, and how to prevent duplicates from arising during operations, or to
detect them if they do.

.. ipython:: python

   import pandas as pd
   import numpy as np

Consequences of Duplicate Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some pandas methods (:meth:`Series.reindex` for example) just don't work with
duplicates present. The output can't be determined, and so pandas raises.

.. ipython:: python
   :okexcept:

   s1 = pd.Series([0, 1, 2], index=['a', 'b', 'b'])
   s1.reindex(['a', 'b', 'c'])

Other methods, like indexing, can give very surprising results. Typically
indexing with a scalar will *reduce dimensionality*. Slicing a ``DataFrame``
with a scalar will return a ``Series``. Slicing a ``Series`` with a scalar will
return a scalar. But with duplicates, this isn't the case.

.. ipython:: python

   df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'A', 'B'])
   df1

We have duplicates in the columns. If we slice ``'B'``, we get back a ``Series``.

.. ipython:: python

   df1['B']  # a Series

But slicing ``'A'`` returns a ``DataFrame``.

.. ipython:: python

   df1['A']  # a DataFrame

This applies to row labels as well.

.. ipython:: python

   df2 = pd.DataFrame({"A": [0, 1, 2]}, index=['a', 'a', 'b'])
   df2
   df2.loc['b', 'A']  # a scalar
   df2.loc['a', 'A']  # a Series

Duplicate Label Detection
~~~~~~~~~~~~~~~~~~~~~~~~~

You can check whether an :class:`Index` (storing the row or column labels) is
unique with :attr:`Index.is_unique`:

.. ipython:: python

   df2
   df2.index.is_unique
   df2.columns.is_unique

.. note::

   Checking whether an index is unique is somewhat expensive for large
   datasets. pandas does cache this result, so re-checking the same index is
   very fast.

:meth:`Index.duplicated` will return a boolean ndarray indicating whether a
label is a repeat.

.. ipython:: python

   df2.index.duplicated()

This can be used as a boolean filter to drop duplicate rows.

.. ipython:: python

   df2.loc[~df2.index.duplicated(), :]
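Beyond the default behavior shown above, :meth:`Index.duplicated` takes a ``keep`` parameter that controls which occurrence in each group of repeats is treated as the "original". A quick sketch of the three options:

```python
import pandas as pd

idx = pd.Index(["a", "a", "b"])

# keep='first' (the default): repeats *after* the first occurrence are True.
print(list(idx.duplicated()))             # [False, True, False]

# keep='last': everything *before* the last occurrence is True.
print(list(idx.duplicated(keep="last")))  # [True, False, False]

# keep=False: every member of a duplicated group is True.
print(list(idx.duplicated(keep=False)))   # [True, True, False]
```

``keep=False`` is the right choice when you want to inspect or drop *all* rows involved in a collision, not just the later copies.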
If you need additional logic to handle duplicate labels, rather than just
dropping the repeats, using :meth:`~DataFrame.groupby` on the index is a common
trick. For example, we'll resolve duplicates by taking the average of all rows
with the same label.

.. ipython:: python

   df2.groupby(level=0).mean()
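If a pipeline should simply fail fast when repeats appear, a small guard built from :attr:`Index.is_unique` and :meth:`Index.duplicated` is enough. ``assert_unique`` below is a hypothetical helper for illustration, not a pandas API:

```python
import pandas as pd

def assert_unique(obj):
    # Hypothetical helper: pass the object through unchanged if its row
    # labels are unique, otherwise raise naming the offending labels.
    if not obj.index.is_unique:
        dupes = obj.index[obj.index.duplicated()].unique().tolist()
        raise ValueError(f"duplicate labels: {dupes}")
    return obj

clean = pd.DataFrame({"A": [0, 1]}, index=["a", "b"])
messy = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

assert_unique(clean)  # returns the frame unchanged
try:
    assert_unique(messy)
except ValueError as exc:
    print(exc)  # duplicate labels: ['a']
```

Because the helper returns its argument, it can be dropped between steps of a method chain via :meth:`~DataFrame.pipe`.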
.. _duplicates.disallow:

Disallowing Duplicate Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As noted above, handling duplicates is an important feature when reading in raw
data. That said, you may want to avoid introducing duplicates as part of a data
processing pipeline (from methods like :meth:`pandas.concat`,
:meth:`~DataFrame.rename`, etc.). Both :class:`Series` and :class:`DataFrame`
can be created with the argument ``allow_duplicate_labels=False`` to *disallow*
duplicate labels (the default is to allow them). If there are duplicate labels,
an exception will be raised.

.. ipython:: python
   :okexcept:

   pd.Series([0, 1, 2], index=['a', 'b', 'b'], allow_duplicate_labels=False)

This applies to both row and column labels for a :class:`DataFrame`.

.. ipython:: python
   :okexcept:

   pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "B", "B"],
                allow_duplicate_labels=False)

This attribute can be checked with :attr:`~DataFrame.allows_duplicate_labels`,
which indicates whether that object can have duplicate labels.

.. ipython:: python

   df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=['x', 'y', 'X', 'Y'],
                     allow_duplicate_labels=False)
   df
   df.allows_duplicate_labels

Performing an operation that introduces duplicate labels on a ``Series`` or
``DataFrame`` that disallows duplicates will raise an
:class:`errors.DuplicateLabelError`.

.. ipython:: python
   :okexcept:

   df.rename(str.upper)

Duplicate Label Propagation
^^^^^^^^^^^^^^^^^^^^^^^^^^^

In general, disallowing duplicates is "sticky": it's preserved through
operations.

.. ipython:: python
   :okexcept:

   s1 = pd.Series(0, index=['a', 'b'], allow_duplicate_labels=False)
   s1
   abs(s1).rename({"a": "b"})

When multiple Series or DataFrames are involved in an operation, duplicates
are disallowed if *any* of the inputs disallow duplicates.

.. ipython:: python
   :okexcept:

   s2 = pd.Series(0, index=['a', 'b'], allow_duplicate_labels=False)
   s3 = pd.Series(1, index=['b', 'c'], allow_duplicate_labels=True)

   pd.concat([s2, s3])
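Independently of the flag this commit introduces, released pandas already offers a narrower guard for the concatenation case: ``verify_integrity=True`` makes :meth:`pandas.concat` raise when the concatenated axis would contain duplicates. A minimal sketch:

```python
import pandas as pd

s1 = pd.Series(0, index=["a", "b"])
s2 = pd.Series(1, index=["b", "c"])

# 'b' appears in both inputs, so the result's index would be duplicated
# and concat raises a ValueError instead of silently producing it.
try:
    pd.concat([s1, s2], verify_integrity=True)
except ValueError as exc:
    print(exc)
```

Unlike ``allow_duplicate_labels``, this check applies only to the single ``concat`` call; it is not carried forward through later operations.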

doc/source/whatsnew/v1.0.0.rst (+3)

@@ -406,6 +406,9 @@ Other
 - Trying to set the ``display.precision``, ``display.max_rows`` or ``display.max_columns`` using :meth:`set_option` to anything but a ``None`` or a positive int will raise a ``ValueError`` (:issue:`23348`)
 - Using :meth:`DataFrame.replace` with overlapping keys in a nested dictionary will no longer raise, now matching the behavior of a flat dictionary (:issue:`27660`)
 - :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` now support dicts as ``compression`` argument with key ``'method'`` being the compression method and others as additional compression options when the compression method is ``'zip'``. (:issue:``)
+- Metadata is now finalized for the following methods on ``Series`` and ``DataFrame`` (:issue:``)
+  * :meth:`~DataFrame.abs`
+  * :meth:`Series.to_frame`
 - Bug in :meth:`Series.diff` where a boolean series would incorrectly raise a ``TypeError`` (:issue:`17294`)
 - :meth:`Series.append` will no longer raise a ``TypeError`` when passed a tuple of ``Series`` (:issue:`28410`)
 - Fix corrupted error message when calling ``pandas.libs._json.encode()`` on a 0d array (:issue:`18878`)

pandas/core/frame.py (+20 -3)

@@ -1,4 +1,4 @@
-"""
+""""
 DataFrame
 ---------
 An efficient 2D container for potentially mixed-type time series or other

@@ -338,6 +338,13 @@ class DataFrame(NDFrame):
     Data type to force. Only a single dtype is allowed. If None, infer.
 copy : bool, default False
     Copy data from inputs. Only affects DataFrame / 2d ndarray input.
+allow_duplicate_labels : bool, default True
+    Whether to allow duplicate row or column labels in this DataFrame.
+    By default, duplicate labels are permitted. Setting this to ``False``
+    will cause an :class:`errors.DuplicateLabelError` to be raised when
+    `index` or `columns` are not unique, or when any subsequent operation
+    on this DataFrame introduces duplicates. See :ref:`duplicates.disallow`
+    for more.

 See Also
 --------

@@ -407,6 +414,7 @@ def __init__(
     columns: Optional[Axes] = None,
     dtype: Optional[Dtype] = None,
     copy: bool = False,
+    allow_duplicate_labels: bool = True,
 ):
     if data is None:
         data = {}

@@ -497,7 +505,9 @@ def __init__(
     else:
         raise ValueError("DataFrame constructor not properly called!")

-    NDFrame.__init__(self, mgr, fastpath=True)
+    NDFrame.__init__(
+        self, mgr, fastpath=True, allow_duplicate_labels=allow_duplicate_labels
+    )

 # ----------------------------------------------------------------------

@@ -2770,6 +2780,8 @@ def _ixs(self, i: int, axis: int = 0):
     If slice passed, the resulting data will be a view.
     """
     # irow
+    # TODO: Figure out if this is the right place to finalize.
+    # Does it make sense to do here, or higher-level (like `LocationIndexer`)?
     if axis == 0:
         label = self.index[i]
         new_values = self._data.fast_xs(i)

@@ -2781,7 +2793,7 @@ def _ixs(self, i: int, axis: int = 0):
         index=self.columns,
         name=self.index[i],
         dtype=new_values.dtype,
-    )
+    ).__finalize__(self, method="ixs")
     result._set_is_copy(self, copy=copy)
     return result

@@ -2798,6 +2810,8 @@ def _ixs(self, i: int, axis: int = 0):
     if len(self.index) and not len(values):
         values = np.array([np.nan] * len(self.index), dtype=object)
     result = self._box_col_values(values, label)
+    if isinstance(result, NDFrame):
+        result.__finalize__(self, method="ixs")

     # this is a cached value, mark it so
     result._set_as_cached(label, self)

@@ -2859,6 +2873,8 @@ def __getitem__(self, key):
     if data.shape[1] == 1 and not isinstance(self.columns, ABCMultiIndex):
         data = data[key]

+    if isinstance(data, NDFrame):
+        data.__finalize__(self, method="dataframe_getitem")
     return data

 def _getitem_bool_array(self, key):

@@ -5300,6 +5316,7 @@ def _arith_op(left, right):
     with np.errstate(all="ignore"):
         res_values = _arith_op(this.values, other.values)
     new_data = dispatch_fill_zeros(func, this.values, other.values, res_values)
+    # XXX: pass them here.
     return this._construct_result(new_data)

 def _combine_match_index(self, other, func):
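The ``__finalize__`` calls threaded through the diff above are what propagate per-object metadata from inputs to results. As a rough illustration against the released public surface (``.attrs``, available since pandas 1.0, is a metadata container that ``__finalize__`` copies; this sketch is not part of the commit):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -2]})
df.attrs["source"] = "sensor-1"   # arbitrary per-object metadata

# NDFrame.__finalize__(other) copies metadata such as .attrs from
# `other` onto the result; pandas methods call it on their outputs.
out = df.abs().__finalize__(df, method="abs")
print(out.attrs)
```

On recent pandas versions ``df.abs()`` already calls ``__finalize__`` internally, so the explicit call here is redundant; it is shown only to make the mechanism visible.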
