ENH: reset_index on a MultiIndex with duplicate levels raises a ValueError #44755

johnzangwill · 2021-12-04T18:18:05Z

[x] closes BUG: reset_index on a MultiIndex with duplicate levels raises a ValueError #44410
[x] tests added / passed
[x] Ensure all linting tests pass, see here for how to run them
[x] whatsnew entry

jreback

pls add a whatsnew note as well

pandas/tests/frame/methods/test_reset_index.py

phofl · 2021-12-04T22:38:41Z

We should discuss this first, @mroeschke and I agreed that we should only add a note to the docs

johnzangwill · 2021-12-05T15:40:44Z

@phofl

> We should discuss this first, @mroeschke and I agreed that we should only add a note to the docs

By all means! This arose due to @rhshadrach's efforts to break my #44267. I found that a trivial and sensible change to frame.py fixed the problem, and also fixed #44410. I know that you decided to document rather than fix. But since the bug was a regression, and the fix seems simple, I suggest accepting my fix. It will also be a prerequisite for my #44267.

phofl · 2021-12-05T15:55:47Z

IMO this change is neither trivial nor simple. Accidentally ending up with duplicated column names is not desirable. This is a real PITA in indexing, among other things. For example:

df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

df["b"] returns a Series while df["a"] returns a DataFrame.
IMO this is more of an API change than a bugfix. Also I think you can make a point that this should be changed inside of insert, not on a higher level.
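The indexing asymmetry described above can be reproduced with a few lines. This is a minimal sketch using only the frame from the comment; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

unique_col = df["b"]     # the label matches exactly one column
duplicate_col = df["a"]  # the label matches two columns

print(type(unique_col).__name__)     # Series
print(type(duplicate_col).__name__)  # DataFrame
print(duplicate_col.shape)           # (1, 2)
```

Code that expects `df[label]` to always be a Series therefore breaks as soon as a label is duplicated.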

rhshadrach · 2021-12-05T16:17:21Z

> Accidentally ending up with duplicated column names is not desirable.
> IMO this is more of an API change than a bugfix.

Agreed on both counts. For #44267, it is the case that the user called the function with duplicate columns, the internals of groupby moved them into the index, and we want to restore them as columns. For this, it would suffice to have an internal version of reset_index where we can override allow_duplicates, i.e. this pattern:

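def foo(x):
    # Public API: the hidden argument is fixed and not exposed to users.
    return _foo(x, hidden_arg=False)

def _foo(x, hidden_arg=False):
    # Internal implementation: internal callers may pass hidden_arg=True.
    pass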

where #44267 could then call _foo(x, hidden_arg=True) (where hidden_arg is allow_duplicates).

> Also I think you can make a point that this should be changed inside of insert, not on a higher level.

I'm not sure what this means (in other words, what should be changed inside of insert?). In any case, for #44267, I don't think we should be reimplementing reset_index via calling insert directly (to be clear, I'm not suggesting you're saying that).

phofl · 2021-12-05T16:27:04Z

What I meant with respect to insert:

If we want to respect the flag, we should change the default of the allow_duplicates keyword to None or lib.no_default and figure out the correct default inside of insert instead of in the caller.

But I would prefer not changing the public behavior, so I like your idea with a private implementation better.
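A rough sketch of "figure out the default in one place" might look like the following. The helper `insert_respecting_flag` and the sentinel `_NO_DEFAULT` are hypothetical, and falling back to the frame's own flag is an assumption made for illustration, not what pandas does.

```python
import pandas as pd

_NO_DEFAULT = object()  # hypothetical stand-in for a "not passed" sentinel


def insert_respecting_flag(df, loc, column, value, allow_duplicates=_NO_DEFAULT):
    """Hypothetical helper: resolve the default once, then insert in place."""
    if allow_duplicates is _NO_DEFAULT:
        # Assumption for illustration: fall back to the frame's own flag.
        allow_duplicates = df.flags.allows_duplicate_labels
    df.insert(loc, column, value, allow_duplicates=allow_duplicates)


permissive = pd.DataFrame({"a": [1]})
insert_respecting_flag(permissive, 0, "a", [2])
print(list(permissive.columns))  # ['a', 'a']

strict = pd.DataFrame({"a": [1]}).set_flags(allows_duplicate_labels=False)
try:
    insert_respecting_flag(strict, 0, "a", [2])
except ValueError as err:
    print("refused:", err)
```

Callers then never need to compute the default themselves, which is the point being made about keeping the decision inside insert.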

johnzangwill · 2021-12-05T17:27:58Z

There is a whole chapter on this: https://pandas.pydata.org/pandas-docs/stable/user_guide/duplicates.html.
It implies that the Pandas policy is to default to allowing duplicates:

> Both Series and DataFrame disallow duplicate labels by calling .set_flags(allows_duplicate_labels=False). (the default is to allow them).

It seems to me that if you have a legal Series (with duplicates), then it is OK for reset_index() to translate it to a legal DataFrame (with duplicates).
My change just takes allow_duplicates from the frame's own allows_duplicate_labels flag.
But it is not for me to say. If you want reset_index() to have an allow_duplicates flag (overriding the frame's flag) then let me know.
I can't see anything gained by hiding it in a private method. Since #44410 is easily resolved then why not let it be resolved?

rhshadrach · 2021-12-05T17:39:10Z

> Since #44410 is easily resolved then why not let it be resolved?

I personally am not a fan of this resolution. My understanding is that setting flags.allows_duplicate_labels to False disallows duplicates in both the index and the columns. In using pandas, I very often find use in having duplicates in the index, but never want them in the columns. In other words, I want reset_index to throw an error when the result would have duplicate columns, but I am not able to set flags.allows_duplicate_labels to False.

In any case, this is an API change that will need to have a FutureWarning before it can be implemented in 2.0 if we go this route.
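The concern about the flag covering both axes can be checked directly. This is a small sketch; `can_disallow_duplicates` is a hypothetical helper written for this example.

```python
import pandas as pd
from pandas.errors import DuplicateLabelError


def can_disallow_duplicates(obj):
    """Return True if obj accepts allows_duplicate_labels=False."""
    try:
        obj.set_flags(allows_duplicate_labels=False)
    except DuplicateLabelError:
        return False
    return True


dup_index = pd.DataFrame({"v": [1, 2]}, index=["r", "r"])
dup_columns = pd.DataFrame([[1, 2]], columns=["c", "c"])
unique = pd.DataFrame({"v": [1, 2]}, index=["r", "s"])

for name, frame in [("dup_index", dup_index),
                    ("dup_columns", dup_columns),
                    ("unique", unique)]:
    print(name, can_disallow_duplicates(frame))
```

A frame with a duplicated index is rejected just like one with duplicated columns, so the flag cannot express "duplicates in the index are fine, duplicates in the columns are not".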

pep8speaks · 2021-12-06T13:31:35Z

Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-01-28 22:44:47 UTC

johnzangwill · 2021-12-06T14:01:35Z

I have added allow_duplicates to DataFrame.reset_index() and Series.reset_index(). I have reverted and added to the tests accordingly, and changed the whatsnew from bug to enhancement.
The defaults are both False, so the current (pre-PR) behaviour is unchanged: duplicates are not created, regardless of the frame flag.
#44410 still raises ValueError. But adding allow_duplicates=None or allow_duplicates=True makes reset_index() work as the OP proposed.
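The behaviour described here can be tried out as follows. This sketch assumes a pandas version that includes this PR (milestone 1.5 according to the timeline below); the frame is an illustrative example, not the exact reproducer from #44410.

```python
import pandas as pd

# A MultiIndex whose two levels share the same name.
index = pd.MultiIndex.from_tuples([(1, 2), (3, 4)], names=["x", "x"])
df = pd.DataFrame({"v": [10, 20]}, index=index)

try:
    df.reset_index()
except ValueError as err:
    print("default still raises:", err)

flat = df.reset_index(allow_duplicates=True)
print(list(flat.columns))  # ['x', 'x', 'v']
```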

jreback

looks fine one comment. ping on greenish

jreback · 2022-01-23T19:52:10Z

pandas/core/frame.py

@@ -5792,6 +5805,8 @@ class    max    type
            new_obj = self
        else:
            new_obj = self.copy()
+        if allow_duplicates is not lib.no_default:


Shouldn't we do the same as you are doing in insert and just set this here? E.g.

if allow_duplicates is not lib.no_default: allow_duplicates = False .....

Simpler and more readable.

johnzangwill · 2022-01-24

@jreback Your suggestion must contain a typo. That change would render the argument always False and fail my tests.

If I take out the next line, which is type-checking, then I will have to take out my test https://github.com/johnzangwill/pandas/blob/dddae0734e37ab9fa4ae4a89e68043c4631fd68c/pandas/tests/frame/methods/test_reset_index.py#L474

If I copy in the code from insert, then the lib.no_default no longer has any function.

I only used lib.no_default to avoid your original objections to this argument. I don't like it, because it makes no sense in the documentation and disguises the real default, which is False.

I would be more than happy to take it out and go back to the original much simpler allow_duplicates: bool = False, which is the case for insert in main and for Series in this PR: https://github.com/johnzangwill/pandas/blob/dddae0734e37ab9fa4ae4a89e68043c4631fd68c/pandas/core/series.py#L1369

@jreback Let me know what you want:

1. Leave as is.
2. Revert lib.no_default and have all the arguments allow_duplicates: bool = False

cc @rhshadrach, @phofl, @mroeschke. Comments on this?
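The options in this review thread can be compared side by side in plain Python. This is only a sketch: `_NO_DEFAULT` is a hypothetical stand-in for lib.no_default, the function names are invented for this example, and none of this is the PR's actual code.

```python
_NO_DEFAULT = object()  # hypothetical stand-in for lib.no_default


def _check_bool(value):
    """Type check in the spirit of the test mentioned above."""
    if not isinstance(value, bool):
        raise ValueError("allow_duplicates must be a bool")
    return value


def resolve_with_sentinel(allow_duplicates=_NO_DEFAULT):
    """Option 1: sentinel default; validate only what the caller passed."""
    if allow_duplicates is _NO_DEFAULT:
        return False
    return _check_bool(allow_duplicates)


def resolve_with_bool(allow_duplicates: bool = False):
    """Option 2: the real default is visible in the signature."""
    return _check_bool(allow_duplicates)


def resolve_as_suggested(allow_duplicates=_NO_DEFAULT):
    """The one-liner exactly as written in the review comment."""
    if allow_duplicates is not _NO_DEFAULT:
        allow_duplicates = False
    return allow_duplicates


print(resolve_with_sentinel(True))  # True
print(resolve_with_bool(True))      # True
print(resolve_as_suggested(True))   # False
```

Taken literally, the suggested one-liner discards an explicit True, which is the objection raised in the reply; options 1 and 2 behave identically for callers and differ only in what the signature and documentation show as the default.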

jreback

thanks @johnzangwill

reset_index to handle duplicate column labels

88cd26f

johnzangwill marked this pull request as draft December 4, 2021 18:18

jreback requested changes Dec 4, 2021

pandas/tests/frame/methods/test_reset_index.py

jreback added the MultiIndex label Dec 4, 2021

johnzangwill added 3 commits December 5, 2021 14:02

Add tests

a23dbdc

Add tests

b101177

Update v1.4.0.rst

39d9f75

johnzangwill marked this pull request as ready for review December 5, 2021 14:57

Merge branch 'pandas-dev:master' into reset_index-duplicate-labels

c83b84b

johnzangwill mentioned this pull request Dec 5, 2021

ENH: Add DataFrameGroupBy.value_counts #44267

Merged

4 tasks

johnzangwill added 2 commits December 5, 2021 19:21

Merge branch 'pandas-dev:master' into reset_index-duplicate-labels

e1a5910

Implement allow_duplicates parameter

e5ab5f7

Formatting

b4828e6

johnzangwill added 8 commits December 6, 2021 14:10

Add docstrings

d9d60ee

Trigger CI

e1bb16f

Trigger CI

e924f93

Merge branch 'pandas-dev:master' into reset_index-duplicate-labels

65c6fef

Merge branch 'pandas-dev:master' into reset_index-duplicate-labels

704fae4

Merge branch 'pandas-dev:master' into reset_index-duplicate-labels

2911dd7

Update v1.4.0.rst

96af74d

Update test_reset_index.py

0e90ae9

johnzangwill mentioned this pull request Jan 2, 2022

BUG: DataFrameGroupBy.value_counts() fails if as_index=False and there are duplicate column labels #45160

Merged

4 tasks

Merge branch 'pandas-dev:master' into reset_index-duplicate-labels

c58d992

johnzangwill mentioned this pull request Jan 7, 2022

API: Consistency between DataFrame, MultiIndex.to_frame and reset_index #45245

Open

johnzangwill added 3 commits January 17, 2022 15:45

Merge branch 'main' into reset_index-duplicate-labels

e07eea6

Removed docstring Raises

04b3a38

Version to 1.5

2f94170

johnzangwill requested review from jreback and rhshadrach January 17, 2022 16:05

johnzangwill added 2 commits January 18, 2022 14:10

Merge branch 'pandas-dev:main' into reset_index-duplicate-labels

9241f96

Merge branch 'pandas-dev:main' into reset_index-duplicate-labels

0b84426

johnzangwill mentioned this pull request Jan 20, 2022

ENH: Add allow_duplicates to MultiIndex.to_frame #45318

Merged

3 tasks

johnzangwill added 2 commits January 20, 2022 14:24

Merge branch 'pandas-dev:main' into reset_index-duplicate-labels

c2bed8f

Merge branch 'main' into reset_index-duplicate-labels

3530ca8

jreback requested changes Jan 23, 2022

johnzangwill added 3 commits January 24, 2022 09:33

Merge branch 'pandas-dev:main' into reset_index-duplicate-labels

dddae07

Update series.py

14edbdb

Merge branch 'reset_index-duplicate-labels' of https://github.com/joh…

8383572

…nzangwill/pandas into reset_index-duplicate-labels

johnzangwill requested a review from jreback January 24, 2022 10:45

johnzangwill added 6 commits January 24, 2022 15:44

Merge branch 'pandas-dev:main' into reset_index-duplicate-labels

37cc560

Merge branch 'pandas-dev:main' into reset_index-duplicate-labels

3625d77

Merge branch 'pandas-dev:main' into reset_index-duplicate-labels

9d0f798

Merge branch 'pandas-dev:main' into reset_index-duplicate-labels

e2fbf3e

Merge branch 'pandas-dev:main' into reset_index-duplicate-labels

0b2a9d1

Merge branch 'pandas-dev:main' into reset_index-duplicate-labels

c5618c8

jreback added this to the 1.5 milestone Jan 30, 2022

jreback approved these changes Jan 30, 2022

jreback merged commit fd6f23f into pandas-dev:main Jan 30, 2022

johnzangwill deleted the reset_index-duplicate-labels branch January 30, 2022 22:42

phofl pushed a commit to phofl/pandas that referenced this pull request Feb 14, 2022

ENH: reset_index on a MultiIndex with duplicate levels raises a Value…

8dab104

…Error (pandas-dev#44755)

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

ENH: reset_index on a MultiIndex with duplicate levels raises a Value…

ff5f86e

…Error (pandas-dev#44755)
