Skip to content

ENH: reset_index on a MultiIndex with duplicate levels raises a ValueError #44755

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 75 commits into from
Jan 30, 2022

Conversation

johnzangwill
Copy link
Contributor

@johnzangwill johnzangwill marked this pull request as draft December 4, 2021 18:18
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a whatsnew note as well

@phofl
Copy link
Member

phofl commented Dec 4, 2021

We should discuss this first, @mroeschke and I agreed that we should only add a note to the docs

@johnzangwill johnzangwill marked this pull request as ready for review December 5, 2021 14:57
@johnzangwill
Copy link
Contributor Author

johnzangwill commented Dec 5, 2021

@phofl

We should discuss this first, @mroeschke and I agreed that we should only add a note to the docs

By all means! This arose due to @rhshadrach's efforts to break my #44267. I found that a trivial and sensible change to frame.py fixed the problem, and also fixed #44410. I know that you decided to document rather than fix. But since the bug was a regression, and the fix seems simple, I suggest to accept my fix. It will also be a prerequisite for my #44267.

@phofl
Copy link
Member

phofl commented Dec 5, 2021

Imo this change is neither trivial nor simple. Accidentially ending up with duplictated column names is not desireable. This is a real PITA in indexing among others. For example:

df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

df["b"] returns a Series while df["a"] returns a DataFrame.
IMO this is more of an api change instead of a bugfix. Also I think you can make a point that this should be changed inside of insert, not on a higher level.

@rhshadrach
Copy link
Member

rhshadrach commented Dec 5, 2021

Accidentially ending up with duplictated column names is not desireable.
IMO this is more of an api change instead of a bugfix.

Agreed on both counts. For #44267, it is the case that the user called the function with duplicate columns, the internals of groupby moved them into the index, and we want to restore them as columns. For this, it would suffice to have an internal version of reset_index where we can override allow_duplicates, i.e. this pattern:

def foo(x):
    return _foo(x, hidden_arg=False)

def _foo(x, hidden_arg=False):
    pass

where #44267 could then call _foo(x, hidden_arg=True) (where hidden_arg is allow_duplicates).

Also I think you can make a point that this should be changed inside of insert, not on a higher level.

I'm not sure what this means (in other words, what should be changed inside of insert?). In any case, for #44267, I don't think we should be reimplementing reset_index via calling insert directly (to be clear, I'm not suggesting you're saying that).

@phofl
Copy link
Member

phofl commented Dec 5, 2021

What I meant with respect to insert:

If we want to respect the flag, we should change the keyword allow_duplicates to None or lib.NO_DEFAULT and figure out the correct default inside of insert instead of the caller.

But I would prefer not changing the public behavior, so I like your idea with a private implementation better.

@johnzangwill
Copy link
Contributor Author

There is a whole chapter on this: https://pandas.pydata.org/pandas-docs/stable/user_guide/duplicates.html.
It implies that the Pandas policy is to default to allowing duplicates:

Both Series and DataFrame disallow duplicate labels by calling .set_flags(allows_duplicate_labels=False). (the default is to allow them).

It seems to me that if you have a legal Series (with duplicates), then it is OK for reset_index() to translate it to a legal DataFrame (with duplicates).
My change just takes allow_duplicates from the frame's own allows_duplicate_labels flag.
But it is not for me to say. If you want reset_index() to have an allow_duplicates flag (over-riding the frame's flag) then let me know.
I can't see anything gained by hiding it in a private method. SInce #44410 is easily resolved then why not let it be resolved?

@rhshadrach
Copy link
Member

SInce #44410 is easily resolved then why not let it be resolved?

I personally am not a fan of this resolution. My understanding is that flags.allow_duplicate_labels will not allow duplicates in both the index and the columns. In using pandas, I very often find use in having duplicates in the index, but never want them in the columns. In other words, I want reset_index to throw an error when the result would have duplicate columns but I am not able to set flags.allow_duplicate_labels to be True.

In any case, this an API change that will need to have a FutureWarning before it can be implemented in 2.0 if we go this route.

@pep8speaks
Copy link

pep8speaks commented Dec 6, 2021

Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-01-28 22:44:47 UTC

@johnzangwill
Copy link
Contributor Author

johnzangwill commented Dec 6, 2021

I have added allow_duplicates to DataFrame.reset_index() and Series.reset_index(). I have reverted and added to the tests accordingly, and changed the whatsnew from bug to enhancement.
The defaults are both False, so the current (pre-PR) behaviour is unchanged: duplicates are not created, regardless of the frame flag.
#44410 still raises ValueError. But adding allow_duplicates=None or allow_duplicates=True makes reset_index() work as the OP proposed.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine one comment. ping on greenish

@@ -5792,6 +5805,8 @@ class max type
new_obj = self
else:
new_obj = self.copy()
if allow_duplicates is not lib.no_default:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldnt we do the same as you are doing in insert and just set this here e.g.


if allow_duplicates is not lib.no_default:
    allow_duplicates = False
.....

simpler & more readable

Copy link
Contributor Author

@johnzangwill johnzangwill Jan 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Your suggestion has got to be a typo. That change would render the argument always False and fail my tests.

If I take out the next line, which is type-checking, then I will have to take out my test https://github.com/johnzangwill/pandas/blob/dddae0734e37ab9fa4ae4a89e68043c4631fd68c/pandas/tests/frame/methods/test_reset_index.py#L474

If I copy in the code from insert, then the lib.no_default no longer has any function.

I only used lib.no_default to avoid your original objections to this argument. I don't like it, because it makes no sense in the documentation and disguises the real default, which is False.

I would be more than happy to take it out and go back to the original much simpler allow_duplicates: bool = False, which is the case for insert in main and for Series in this PR: https://github.com/johnzangwill/pandas/blob/dddae0734e37ab9fa4ae4a89e68043c4631fd68c/pandas/core/series.py#L1369

@jreback Let me know what you want:

  1. Leave as is.
  2. Revert lib.no_default and have all the arguments allow_duplicates: bool = False

cc @rhshadrach, @phofl, @mroeschke. Comments on this?

@johnzangwill johnzangwill requested a review from jreback January 24, 2022 10:45
@jreback jreback added this to the 1.5 milestone Jan 30, 2022
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @johnzangwill

@jreback jreback merged commit fd6f23f into pandas-dev:main Jan 30, 2022
@johnzangwill johnzangwill deleted the reset_index-duplicate-labels branch January 30, 2022 22:42
phofl pushed a commit to phofl/pandas that referenced this pull request Feb 14, 2022
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: reset_index on a MultiIndex with duplicate levels raises a ValueError
6 participants