-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Use flags.allows_duplicate_labels to define default insert behavior #45109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Use flags.allows_duplicate_labels to define default insert behavior #45109
Conversation
Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-12-29 18:10:07 UTC |
This is an API change which is not easy to see but has a big impact on users. We definitely can not do this without a deprecation cycle. That said, I don't like duplicated columns in DataFrames as a user and as a maintainer. This causes lots of problems at various places. Imho we should not make this the default. I am a big fan that insert raises if somehow you try to insert the column twice. Could you find out where the flag has an impact right now? |
Even if you set the flag to
|
Plus, my issue #44958 shows that the flag is not completely implemented. |
allows_duplicate_labels/flags is a neat idea, but it should either be fully implemented or deprecated, not left in a half-implemented state indefinitely. |
sure but we need to incrementally get there. |
If we want to support it fully, we should make sure that the flag is not lost when calling something (e.g. finish the finalize issue) before using it more actively I think |
I'm not so sure. I believe that it was intended to move towards defaulting to False, changing DataFrame's well-documented default behavior. But the value of True was precisely so that data could be loaded from the outside world, with or without unique column labels. So if, in real-world practice, it is always going to need to be True, what purpose would a default of False serve? The current situation is that flags/allows_duplicate_labels=False is broken and/or untested in a variety of ways, some quite difficult to fix. @rhshadrach and @phofl have argued that reset_index() and insert() should default to raising on duplicates, regardless of the status of flags/allows_duplicate_labels. That was what my #44755 was based on, and it does not require fixing flags. |
that's not correct. pandas does support duplicate columns names. sure a lot of times you want to actively prevent that but there are many times where you do allow this. hence the flag. |
Please note:
is different from the current
which raises if flags.allows_duplicate_labels is False. Which do we prefer? |
I guess I'm less optimistic than others about this actually happening. ATM a third of the finalize tests and a quarter of the duplicate_labels tests are xfailed. Together these constitute ~18% of our overall xfails (which is a stat I'm definitely more-than-averagely focused on). To be clear, I'm not advocating any particular change here. Mostly signaling that I'm open to it if it comes into conflict with bugfixes or other priorities. |
we prefer allowing both the option AND having it handle the flag. My point is until we can do that (reliabley / w/o xfailing the world etc). we cannot change any code here. |
This PR was to demonstrate to @jreback a solution using the frame flags to provide default insert behavior. Since neither I nor other reviewers believe that this is a valid or acceptable solution, I am closing it |
Alternative to #44755 to resolve #44410 and #44992 following @jreback's review requests.
The default for DataFrame.insert is now to use the flag, which defaults to allow duplicate column labels.
Comments please @rhshadrach, @phofl, @jbrockmendel
Some tests no longer raise and I have xfailed all of these these for the time being.
This is Draft and I have changed #44755 to Draft until there is agreement!