ENH: Use flags.allows_duplicate_labels to define default insert behavior #45109

johnzangwill · 2021-12-29T12:13:10Z

closes BUG: reset_index on a MultiIndex with duplicate levels raises a ValueError #44410
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Alternative to #44755 to resolve #44410 and #44992 following @jreback's review requests.
The default for DataFrame.insert is now to use the flag, which defaults to allow duplicate column labels.
Comments please @rhshadrach, @phofl, @jbrockmendel

Some tests no longer raise and I have xfailed all of these these for the time being.
This is Draft and I have changed #44755 to Draft until there is agreement!

pep8speaks · 2021-12-29T12:13:14Z

Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-29 18:10:07 UTC

phofl · 2021-12-29T12:24:32Z

This is an API change which is not easy to see but has a big impact on users. We definitely can not do this without a deprecation cycle.

That said, I don't like duplicated columns in DataFrames as a user and as a maintainer. This causes lots of problems at various places. Imho we should not make this the default. I am a big fan that insert raises if somehow you try to insert the column twice.

Could you find out where the flag has an impact right now?

phofl · 2021-12-29T12:33:34Z

Even if you set the flag to False when creating your DataFrames, the information may be lost somewhere along the way, because we do not call finalize everywhere, for example:

left = pd.DataFrame({"a": [1, 2, 3]})
left.flags.allows_duplicate_labels = False

right = pd.DataFrame({"b": [1, 2, 3]})
right.flags.allows_duplicate_labels = False

result = left.merge(right, left_on="a", right_on="b")
result.flags.allows_duplicate_labels

johnzangwill · 2021-12-29T12:47:00Z

Plus, my issue #44958 shows that the flag is not completely implemented.

jbrockmendel · 2021-12-29T16:44:32Z

Plus, my issue #44958 shows that the flag is not completely implemented.

allows_duplicate_labels/flags is a neat idea, but it should either be fully implemented or deprecated, not left in a half-implemented state indefinitely.

jreback · 2021-12-29T16:48:53Z

allows_duplicate_labels/flags is a neat idea, but it should either be fully implemented or deprecated, not left in a half-implemented state indefinitely.

sure but we need to incrementally get there.

phofl · 2021-12-29T16:50:21Z

If we want to support it fully, we should make sure that the flag is not lost when calling something (e.g. finish the finalize issue) before using it more actively I think

johnzangwill · 2021-12-29T17:45:12Z

allows_duplicate_labels/flags is a neat idea

I'm not so sure. I believe that it was intended to move towards defaulting to False, changing DataFrame's well-documented default behavior. But the value of True was precisely so that data could be loaded from the outside world, with or without unique column labels.

So if, in real-world practice, it is always going to need to be True, what purpose would a default of False serve?

The current situation is that flags/allows_duplicate_labels=False is broken and/or untested in a variety of ways, some quite difficult to fix.

@rhshadrach and @phofl have argued that reset_index() and insert() should default to raising on duplicates, regardless of the status of flags/allows_duplicate_labels. That was what my #44755 was based on, and it does not require fixing flags.

jreback · 2021-12-29T17:53:08Z

@rhshadrach and @phofl have argued that reset_index() and insert() should default to raising on duplicates, regardless of the status of flags/allows_duplicate_labels. That was what my #44755 was based on, and it does not require fixing flags.

that's not correct. pandas does support duplicate columns names. sure a lot of times you want to actively prevent that but there are many times where you do allow this. hence the flag.

…bels

johnzangwill · 2021-12-29T18:29:57Z

Please note:

flag = df.flags.allows_duplicate_labels
df.set_flag(allows_duplicate_labels=True)
df.insert(...)
df.set_flag(allows_duplicate_labels=flag)

is different from the current

df.insert(..., allow_duplicates=True)

which raises if flags.allows_duplicate_labels is False.

Which do we prefer?

jbrockmendel · 2021-12-29T18:41:38Z

allows_duplicate_labels/flags is a neat idea, but it should either be fully implemented or deprecated, not left in a half-implemented state indefinitely. [me]

sure but we need to incrementally get there. [jreback]

If we want to support it fully, we should make sure that the flag is not lost when calling something (e.g. finish the finalize issue) before using it more actively I think [phofl]

I guess I'm less optimistic than others about this actually happening. ATM a third of the finalize tests and a quarter of the duplicate_labels tests are xfailed. Together these constitute ~18% of our overall xfails (which is a stat I'm definitely more-than-averagely focused on).

To be clear, I'm not advocating any particular change here. Mostly signaling that I'm open to it if it comes into conflict with bugfixes or other priorities.

jreback · 2021-12-29T19:30:40Z

Please note:
flag = df.flags.allows_duplicate_labels
df.set_flag(allows_duplicate_labels=True)
df.insert(...)
df.set_flag(allows_duplicate_labels=flag)
is different from the current

df.insert(..., allow_duplicates=True)

which raises if flags.allows_duplicate_labels is False.

Which do we prefer?

we prefer allowing both the option AND having it handle the flag. My point is until we can do that (reliabley / w/o xfailing the world etc). we cannot change any code here.

johnzangwill · 2022-01-19T14:32:36Z

This PR was to demonstrate to @jreback a solution using the frame flags to provide default insert behavior. Since neither I nor other reviewers believe that this is a valid or acceptable solution, I am closing it

johnzangwill added 4 commits December 29, 2021 10:13

Use flag

4bca496

xfail more tests

a445c4f

Update test_reset_index.py

b14779e

Update test_frame_value_counts.py

9e18fc3

Update test_put.py

44b02ca

johnzangwill marked this pull request as draft December 29, 2021 12:14

johnzangwill added 2 commits December 29, 2021 18:09

xfail the rest of the tests

93b65ef

Merge branch 'pandas-dev:master' into Using-flags.allows_duplicate_la…

902583a

…bels

johnzangwill closed this Jan 19, 2022

johnzangwill deleted the Using-flags.allows_duplicate_labels branch January 30, 2022 22:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Use flags.allows_duplicate_labels to define default insert behavior #45109

ENH: Use flags.allows_duplicate_labels to define default insert behavior #45109

johnzangwill commented Dec 29, 2021 •

edited

Loading

pep8speaks commented Dec 29, 2021 •

edited

Loading

phofl commented Dec 29, 2021

phofl commented Dec 29, 2021

johnzangwill commented Dec 29, 2021

jbrockmendel commented Dec 29, 2021

jreback commented Dec 29, 2021

phofl commented Dec 29, 2021

johnzangwill commented Dec 29, 2021

jreback commented Dec 29, 2021

johnzangwill commented Dec 29, 2021

jbrockmendel commented Dec 29, 2021

jreback commented Dec 29, 2021

johnzangwill commented Jan 19, 2022

ENH: Use flags.allows_duplicate_labels to define default insert behavior #45109

ENH: Use flags.allows_duplicate_labels to define default insert behavior #45109

Conversation

johnzangwill commented Dec 29, 2021 • edited Loading

pep8speaks commented Dec 29, 2021 • edited Loading

Comment last updated at 2021-12-29 18:10:07 UTC

phofl commented Dec 29, 2021

phofl commented Dec 29, 2021

johnzangwill commented Dec 29, 2021

jbrockmendel commented Dec 29, 2021

jreback commented Dec 29, 2021

phofl commented Dec 29, 2021

johnzangwill commented Dec 29, 2021

jreback commented Dec 29, 2021

johnzangwill commented Dec 29, 2021

jbrockmendel commented Dec 29, 2021

jreback commented Dec 29, 2021

johnzangwill commented Jan 19, 2022

johnzangwill commented Dec 29, 2021 •

edited

Loading

pep8speaks commented Dec 29, 2021 •

edited

Loading