ENH: reset_index on a MultiIndex with duplicate levels raises a ValueError #44755

Merged
merged 75 commits into from
Jan 30, 2022
Commits
88cd26f
reset_index to handle duplicate column labels
johnzangwill Dec 4, 2021
a23dbdc
Add tests
johnzangwill Dec 5, 2021
b101177
Add tests
johnzangwill Dec 5, 2021
39d9f75
Update v1.4.0.rst
johnzangwill Dec 5, 2021
c83b84b
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 5, 2021
e1a5910
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 5, 2021
e5ab5f7
Implement allow_duplicates parameter
johnzangwill Dec 6, 2021
b4828e6
Formatting
johnzangwill Dec 6, 2021
d9d60ee
Add docstrings
johnzangwill Dec 6, 2021
e1bb16f
Trigger CI
johnzangwill Dec 6, 2021
e924f93
Trigger CI
johnzangwill Dec 6, 2021
65c6fef
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 6, 2021
704fae4
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 6, 2021
2911dd7
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 7, 2021
96af74d
Update v1.4.0.rst
johnzangwill Dec 7, 2021
0e90ae9
Update test_reset_index.py
johnzangwill Dec 7, 2021
170d05f
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 8, 2021
fe23eeb
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 10, 2021
a0f0e4c
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 10, 2021
01ad538
Make separate allow_duplicates test
johnzangwill Dec 10, 2021
9da58ed
Update test_reset_index.py
johnzangwill Dec 10, 2021
9d988a0
Update test_reset_index.py
johnzangwill Dec 10, 2021
a108a70
Merge branch 'master' into reset_index-duplicate-labels
johnzangwill Dec 10, 2021
13c6dce
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 11, 2021
2b37ab3
Trigger CI
johnzangwill Dec 11, 2021
140b5d9
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 11, 2021
1bd1cbd
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 11, 2021
e27df82
Change from None to use_flag
johnzangwill Dec 12, 2021
749838c
Get rid of use_flag altogether
johnzangwill Dec 12, 2021
c7cf483
Update frame.py
johnzangwill Dec 12, 2021
7711474
Added version and improved tests
johnzangwill Dec 14, 2021
0ecf6fd
Trigger CI
johnzangwill Dec 14, 2021
0730d7b
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 14, 2021
99ea2c3
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 14, 2021
250999a
Trigger CI
johnzangwill Dec 14, 2021
090468e
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 14, 2021
14a07e1
Trigger CI
johnzangwill Dec 15, 2021
8eb8710
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 15, 2021
18341db
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 20, 2021
1337283
Trigger CI
johnzangwill Dec 20, 2021
a8dab78
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 20, 2021
95cc65b
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 20, 2021
7de074c
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 22, 2021
bb54268
allow_duplicates lib.no_default
johnzangwill Dec 24, 2021
01136ae
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 24, 2021
d21846b
Trigger CI
johnzangwill Dec 24, 2021
3ef667c
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 24, 2021
22deac4
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 26, 2021
540b307
Correct overloads
johnzangwill Dec 27, 2021
8cc81b0
Merge branch 'reset_index-duplicate-labels' of https://github.com/joh…
johnzangwill Dec 27, 2021
fc74265
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 27, 2021
a59644a
Update frame.py
johnzangwill Dec 27, 2021
b774598
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 27, 2021
67a5956
Update frame.py
johnzangwill Dec 27, 2021
e88b8e2
Merge branch 'reset_index-duplicate-labels' of https://github.com/joh…
johnzangwill Dec 27, 2021
9ad3026
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 27, 2021
1fdc39f
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Dec 29, 2021
7639464
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Jan 1, 2022
c58d992
Merge branch 'pandas-dev:master' into reset_index-duplicate-labels
johnzangwill Jan 2, 2022
e07eea6
Merge branch 'main' into reset_index-duplicate-labels
johnzangwill Jan 17, 2022
04b3a38
Removed docstring Raises
johnzangwill Jan 17, 2022
2f94170
Version to 1.5
johnzangwill Jan 17, 2022
9241f96
Merge branch 'pandas-dev:main' into reset_index-duplicate-labels
johnzangwill Jan 18, 2022
0b84426
Merge branch 'pandas-dev:main' into reset_index-duplicate-labels
johnzangwill Jan 19, 2022
c2bed8f
Merge branch 'pandas-dev:main' into reset_index-duplicate-labels
johnzangwill Jan 20, 2022
3530ca8
Merge branch 'main' into reset_index-duplicate-labels
johnzangwill Jan 23, 2022
dddae07
Merge branch 'pandas-dev:main' into reset_index-duplicate-labels
johnzangwill Jan 24, 2022
14edbdb
Update series.py
johnzangwill Jan 24, 2022
8383572
Merge branch 'reset_index-duplicate-labels' of https://github.com/joh…
johnzangwill Jan 24, 2022
37cc560
Merge branch 'pandas-dev:main' into reset_index-duplicate-labels
johnzangwill Jan 24, 2022
3625d77
Merge branch 'pandas-dev:main' into reset_index-duplicate-labels
johnzangwill Jan 24, 2022
9d0f798
Merge branch 'pandas-dev:main' into reset_index-duplicate-labels
johnzangwill Jan 25, 2022
e2fbf3e
Merge branch 'pandas-dev:main' into reset_index-duplicate-labels
johnzangwill Jan 25, 2022
0b2a9d1
Merge branch 'pandas-dev:main' into reset_index-duplicate-labels
johnzangwill Jan 26, 2022
c5618c8
Merge branch 'pandas-dev:main' into reset_index-duplicate-labels
johnzangwill Jan 28, 2022
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.5.0.rst
@@ -35,6 +35,7 @@ Other enhancements
 - :class:`StringArray` now accepts array-likes containing nan-likes (``None``, ``np.nan``) for the ``values`` parameter in its constructor in addition to strings and :attr:`pandas.NA`. (:issue:`40839`)
 - Improved the rendering of ``categories`` in :class:`CategoricalIndex` (:issue:`45218`)
 - :meth:`to_numeric` now preserves float64 arrays when downcasting would generate values not representable in float32 (:issue:`43693`)
+- :meth:`Series.reset_index` and :meth:`DataFrame.reset_index` now support the argument ``allow_duplicates`` (:issue:`44410`)
 - :meth:`.GroupBy.min` and :meth:`.GroupBy.max` now supports `Numba <https://numba.pydata.org/>`_ execution with the ``engine`` keyword (:issue:`45428`)
 -

26 changes: 23 additions & 3 deletions pandas/core/frame.py
@@ -4392,7 +4392,7 @@ def insert(
         loc: int,
         column: Hashable,
         value: Scalar | AnyArrayLike,
-        allow_duplicates: bool = False,
+        allow_duplicates: bool | lib.NoDefault = lib.no_default,
     ) -> None:
         """
         Insert column into DataFrame at specified location.
@@ -4407,7 +4407,7 @@ def insert(
         column : str, number, or hashable object
             Label of the inserted column.
         value : Scalar, Series, or array-like
-        allow_duplicates : bool, optional default False
+        allow_duplicates : bool, optional, default lib.no_default

         See Also
         --------
@@ -4439,6 +4439,8 @@ def insert(
         0   NaN   100     1      99     3
         1   5.0   100     2      99     4
         """
+        if allow_duplicates is lib.no_default:
+            allow_duplicates = False
         if allow_duplicates and not self.flags.allows_duplicate_labels:
             raise ValueError(
                 "Cannot specify 'allow_duplicates=True' when "
@@ -5581,6 +5583,7 @@ def reset_index(
         inplace: Literal[False] = ...,
         col_level: Hashable = ...,
         col_fill: Hashable = ...,
+        allow_duplicates: bool | lib.NoDefault = ...,
     ) -> DataFrame:
         ...

@@ -5592,6 +5595,7 @@ def reset_index(
         inplace: Literal[True],
         col_level: Hashable = ...,
         col_fill: Hashable = ...,
+        allow_duplicates: bool | lib.NoDefault = ...,
     ) -> None:
         ...

@@ -5603,6 +5607,7 @@ def reset_index(
         inplace: Literal[True],
         col_level: Hashable = ...,
         col_fill: Hashable = ...,
+        allow_duplicates: bool | lib.NoDefault = ...,
     ) -> None:
         ...

@@ -5614,6 +5619,7 @@ def reset_index(
         inplace: Literal[True],
         col_level: Hashable = ...,
         col_fill: Hashable = ...,
+        allow_duplicates: bool | lib.NoDefault = ...,
     ) -> None:
         ...

@@ -5624,6 +5630,7 @@ def reset_index(
         inplace: Literal[True],
         col_level: Hashable = ...,
         col_fill: Hashable = ...,
+        allow_duplicates: bool | lib.NoDefault = ...,
     ) -> None:
         ...

@@ -5635,6 +5642,7 @@ def reset_index(
         inplace: bool = ...,
         col_level: Hashable = ...,
         col_fill: Hashable = ...,
+        allow_duplicates: bool | lib.NoDefault = ...,
     ) -> DataFrame | None:
         ...

@@ -5646,6 +5654,7 @@ def reset_index(
         inplace: bool = False,
         col_level: Hashable = 0,
         col_fill: Hashable = "",
+        allow_duplicates: bool | lib.NoDefault = lib.no_default,
     ) -> DataFrame | None:
         """
         Reset the index, or a level of it.
@@ -5671,6 +5680,10 @@ def reset_index(
         col_fill : object, default ''
             If the columns have multiple levels, determines how the other
             levels are named. If None then the index name is repeated.
+        allow_duplicates : bool, optional, default lib.no_default
+            Allow duplicate column labels to be created.
+
+            .. versionadded:: 1.5.0

         Returns
         -------
@@ -5794,6 +5807,8 @@ class max type
             new_obj = self
         else:
             new_obj = self.copy()
+        if allow_duplicates is not lib.no_default:
Contributor

Shouldn't we do the same as you are doing in insert and just set this here, e.g.


if allow_duplicates is not lib.no_default:
    allow_duplicates = False
.....

simpler & more readable

Contributor Author

@johnzangwill johnzangwill Jan 24, 2022

@jreback Your suggestion has got to be a typo. That change would render the argument always False and fail my tests.

If I take out the next line, which is type-checking, then I will have to take out my test https://github.com/johnzangwill/pandas/blob/dddae0734e37ab9fa4ae4a89e68043c4631fd68c/pandas/tests/frame/methods/test_reset_index.py#L474

If I copy in the code from insert, then the lib.no_default no longer has any function.

I only used lib.no_default to avoid your original objections to this argument. I don't like it, because it makes no sense in the documentation and disguises the real default, which is False.

I would be more than happy to take it out and go back to the original much simpler allow_duplicates: bool = False, which is the case for insert in main and for Series in this PR: https://github.com/johnzangwill/pandas/blob/dddae0734e37ab9fa4ae4a89e68043c4631fd68c/pandas/core/series.py#L1369

@jreback Let me know what you want:

  1. Leave as is.
  2. Revert lib.no_default and have all the arguments allow_duplicates: bool = False

cc @rhshadrach, @phofl, @mroeschke. Comments on this?
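
For readers following the exchange above, here is a minimal pure-Python sketch of the two checks being compared. It does not use pandas; `no_default`, `resolve_as_in_insert` and `resolve_as_suggested` are hypothetical stand-ins for `lib.no_default` and the two code paths, written only to illustrate why the snippet as literally typed (`is not`) would discard an explicit `True`.

```python
from enum import Enum


class _NoDefault(Enum):
    # Hypothetical stand-in for pandas' lib.no_default sentinel.
    no_default = "NO_DEFAULT"


no_default = _NoDefault.no_default


def resolve_as_in_insert(allow_duplicates=no_default):
    # Mirrors the check this PR adds to insert: the sentinel means False.
    if allow_duplicates is no_default:
        allow_duplicates = False
    return allow_duplicates


def resolve_as_suggested(allow_duplicates=no_default):
    # The review snippet exactly as typed ("is not"): any explicit value,
    # including True, is overwritten with False.
    if allow_duplicates is not no_default:
        allow_duplicates = False
    return allow_duplicates


print(resolve_as_in_insert())       # sentinel resolved to False
print(resolve_as_in_insert(True))   # explicit True is preserved
print(resolve_as_suggested(True))   # explicit True is lost
```

With the `is not` form an explicit `allow_duplicates=True` can never take effect, which is the objection raised in the reply.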

+            allow_duplicates = validate_bool_kwarg(allow_duplicates, "allow_duplicates")

         new_index = default_index(len(new_obj))
         if level is not None:
@@ -5845,7 +5860,12 @@ class max type
                         level_values, lab, allow_fill=True, fill_value=lev._na_value
                     )

-                new_obj.insert(0, name, level_values)
+                new_obj.insert(
+                    0,
+                    name,
+                    level_values,
+                    allow_duplicates=allow_duplicates,
+                )

         new_obj.index = new_index
         if not inplace:
17 changes: 15 additions & 2 deletions pandas/core/series.py
@@ -1359,7 +1359,14 @@ def repeat(self, repeats, axis=None) -> Series:
         )

     @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "level"])
-    def reset_index(self, level=None, drop=False, name=lib.no_default, inplace=False):
+    def reset_index(
+        self,
+        level=None,
+        drop=False,
+        name=lib.no_default,
+        inplace=False,
+        allow_duplicates: bool = False,
+    ):
         """
         Generate a new DataFrame or Series with the index reset.

@@ -1381,6 +1388,10 @@ def reset_index(self, level=None, drop=False, name=lib.no_default, inplace=False
             when `drop` is True.
         inplace : bool, default False
             Modify the Series in place (do not create a new object).
+        allow_duplicates : bool, default False
+            Allow duplicate column labels to be created.
+
+            .. versionadded:: 1.5.0

         Returns
         -------
@@ -1497,7 +1508,9 @@ def reset_index(self, level=None, drop=False, name=lib.no_default, inplace=False
                     name = self.name

             df = self.to_frame(name)
-            return df.reset_index(level=level, drop=drop)
+            return df.reset_index(
+                level=level, drop=drop, allow_duplicates=allow_duplicates
+            )

     # ----------------------------------------------------------------------
     # Rendering Methods
65 changes: 54 additions & 11 deletions pandas/tests/frame/methods/test_reset_index.py
@@ -27,6 +27,12 @@
 import pandas._testing as tm


+@pytest.fixture()
+def multiindex_df():
+    levels = [["A", ""], ["B", "b"]]
+    return DataFrame([[0, 2], [1, 3]], columns=MultiIndex.from_tuples(levels))
+
+
 class TestResetIndex:
     def test_reset_index_empty_rangeindex(self):
         # GH#45230
@@ -381,33 +387,31 @@ def test_reset_index_range(self):
         )
         tm.assert_frame_equal(result, expected)

-    def test_reset_index_multiindex_columns(self):
-        levels = [["A", ""], ["B", "b"]]
-        df = DataFrame([[0, 2], [1, 3]], columns=MultiIndex.from_tuples(levels))
-        result = df[["B"]].rename_axis("A").reset_index()
-        tm.assert_frame_equal(result, df)
+    def test_reset_index_multiindex_columns(self, multiindex_df):
+        result = multiindex_df[["B"]].rename_axis("A").reset_index()
+        tm.assert_frame_equal(result, multiindex_df)

         # GH#16120: already existing column
         msg = r"cannot insert \('A', ''\), already exists"
         with pytest.raises(ValueError, match=msg):
-            df.rename_axis("A").reset_index()
+            multiindex_df.rename_axis("A").reset_index()

         # GH#16164: multiindex (tuple) full key
-        result = df.set_index([("A", "")]).reset_index()
-        tm.assert_frame_equal(result, df)
+        result = multiindex_df.set_index([("A", "")]).reset_index()
+        tm.assert_frame_equal(result, multiindex_df)

         # with additional (unnamed) index level
         idx_col = DataFrame(
             [[0], [1]], columns=MultiIndex.from_tuples([("level_0", "")])
         )
-        expected = pd.concat([idx_col, df[[("B", "b"), ("A", "")]]], axis=1)
-        result = df.set_index([("B", "b")], append=True).reset_index()
+        expected = pd.concat([idx_col, multiindex_df[[("B", "b"), ("A", "")]]], axis=1)
+        result = multiindex_df.set_index([("B", "b")], append=True).reset_index()
         tm.assert_frame_equal(result, expected)

         # with index name which is a too long tuple...
         msg = "Item must have length equal to number of levels."
         with pytest.raises(ValueError, match=msg):
-            df.rename_axis([("C", "c", "i")]).reset_index()
+            multiindex_df.rename_axis([("C", "c", "i")]).reset_index()

         # or too short...
         levels = [["A", "a", ""], ["B", "b", "i"]]
@@ -433,6 +437,45 @@ def test_reset_index_multiindex_columns(self):
         result = df2.rename_axis([("c", "ii")]).reset_index(col_level=1, col_fill="C")
         tm.assert_frame_equal(result, expected)

+    @pytest.mark.parametrize("flag", [False, True])
+    @pytest.mark.parametrize("allow_duplicates", [False, True])
+    def test_reset_index_duplicate_columns_allow(
+        self, multiindex_df, flag, allow_duplicates
+    ):
+        # GH#44755 reset_index with duplicate column labels
+        df = multiindex_df.rename_axis("A")
+        df = df.set_flags(allows_duplicate_labels=flag)
+
+        if flag and allow_duplicates:
+            result = df.reset_index(allow_duplicates=allow_duplicates)
+            levels = [["A", ""], ["A", ""], ["B", "b"]]
+            expected = DataFrame(
+                [[0, 0, 2], [1, 1, 3]], columns=MultiIndex.from_tuples(levels)
+            )
+            tm.assert_frame_equal(result, expected)
+        else:
+            if not flag and allow_duplicates:
+                msg = "Cannot specify 'allow_duplicates=True' when "
+                "'self.flags.allows_duplicate_labels' is False"
+            else:
+                msg = r"cannot insert \('A', ''\), already exists"
+            with pytest.raises(ValueError, match=msg):
+                df.reset_index(allow_duplicates=allow_duplicates)
+
+    @pytest.mark.parametrize("flag", [False, True])
+    def test_reset_index_duplicate_columns_default(self, multiindex_df, flag):
+        df = multiindex_df.rename_axis("A")
+        df = df.set_flags(allows_duplicate_labels=flag)
+
+        msg = r"cannot insert \('A', ''\), already exists"
+        with pytest.raises(ValueError, match=msg):
+            df.reset_index()
+
+    @pytest.mark.parametrize("allow_duplicates", ["bad value"])
+    def test_reset_index_allow_duplicates_check(self, multiindex_df, allow_duplicates):
+        with pytest.raises(ValueError, match="expected type bool"):
+            multiindex_df.reset_index(allow_duplicates=allow_duplicates)
+
     @pytest.mark.filterwarnings("ignore:Timestamp.freq is deprecated:FutureWarning")
     def test_reset_index_datetime(self, tz_naive_fixture):
         # GH#3950
20 changes: 20 additions & 0 deletions pandas/tests/series/methods/test_reset_index.py
@@ -186,3 +186,23 @@ def test_reset_index_dtypes_on_empty_series_with_multiindex(array, dtype):
         {"level_0": np.int64, "level_1": np.float64, "level_2": dtype, 0: object}
     )
     tm.assert_series_equal(result, expected)
+
+
+@pytest.mark.parametrize(
+    "names, expected_names",
+    [
+        (["A", "A"], ["A", "A"]),
+        (["level_1", None], ["level_1", "level_1"]),
+    ],
+)
+@pytest.mark.parametrize("allow_duplicates", [False, True])
+def test_column_name_duplicates(names, expected_names, allow_duplicates):
+    # GH#44755 reset_index with duplicate column labels
+    s = Series([1], index=MultiIndex.from_arrays([[1], [1]], names=names))
+    if allow_duplicates:
+        result = s.reset_index(allow_duplicates=True)
+        expected = DataFrame([[1, 1, 1]], columns=expected_names + [0])
+        tm.assert_frame_equal(result, expected)
+    else:
+        with pytest.raises(ValueError, match="cannot insert"):
+            s.reset_index()
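
As a rough usage sketch of the behaviour exercised by the tests above, assuming pandas 1.5 or later is installed: the frame and its labels are made up for illustration and are not taken from the PR. The index name is chosen to collide with an existing column label, so the default call is refused and the opt-in call keeps both columns.

```python
import pandas as pd

# Hypothetical frame whose index name collides with an existing column label.
df = pd.DataFrame({"A": [10, 20], "B": [1, 2]}).rename_axis("A")

# Default behaviour: inserting the index as a second "A" column is refused.
try:
    df.reset_index()
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Opting in keeps both "A" columns (assumes pandas >= 1.5).
result = df.reset_index(allow_duplicates=True)
print(list(result.columns))
```

Per the first test above, the opt-in only works when `flags.allows_duplicate_labels` is True on the frame, which is the default for a newly constructed DataFrame.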