API: Consistent handling of duplicate input columns #47718
Comments
-1 on a global option, +1 on a single name across functions. We already use `allow_duplicates=bool` in various places; extending this would be good.
Thanks for the feedback. Does using a string instead of a bool, with the above options and a default, sound reasonable?
Use `raise` (mirrors what we do in the `errors` keyword).
So to clarify, an example docstring would be?
Not the cleanest, but maybe I'm not understanding correctly. Also, we probably need to be aware that "column names" doesn't necessarily mean that the labels are strings.
What I'm proposing is more like:
This would replace the current parameter. Another option maybe worth considering is not allowing the creation of duplicates in the first place, so not having that option at all. The approach is not that simple, but it's the best I can think of that allows all use cases. It should also allow us to raise an exception for incorrect values, if the provided value is not one of the listed options or is not a valid format. Note that I'd also allow using a custom format, for example.
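The format-string variant discussed above could be sketched roughly as follows. This is a hypothetical helper, not pandas API; the name `mangle_columns` and its signature are assumptions. The first occurrence of a label keeps its name, and later duplicates get an autonumeric suffix rendered from the format:

```python
from collections import Counter


def mangle_columns(names, fmt="{col}.{i}"):
    """Rename duplicated labels using fmt; first occurrence keeps its name.

    Hypothetical sketch, not part of pandas. fmt must contain {col} and {i}.
    """
    seen = Counter()
    result = []
    for name in names:
        if seen[name]:
            # Duplicate: render the suffix from the format string.
            result.append(fmt.format(col=name, i=seen[name]))
        else:
            result.append(name)
        seen[name] += 1
    return result


print(mangle_columns(["a", "b", "a", "a"]))        # ['a', 'b', 'a.1', 'a.2']
print(mangle_columns(["x", "x"], fmt="{col}_{i}"))  # ['x', 'x_1']
```

A real implementation would also need to handle the corner case where a mangled name (e.g. `a.1`) collides with an existing column label, which this sketch ignores.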
Great, that API makes sense to me. Maybe `duplicate_column_action`?
I like the flexibility. As a note:
I think it's tricky for all cases. Having duplicate column names can make things break easily, for example if two columns exist with the same name. Is there anything in particular for `read_xml`?
Yes, the implementation itself. The columns are stored as keys of a dictionary internally before we create the DataFrame. We would have to refactor this to allow duplicated column names.

Yes, you are right. I don't like duplicated columns at all as a user.
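The dictionary point can be illustrated with plain Python. This is a simplified sketch, not the actual `read_xml` internals: because a dict has unique keys, a repeated column label silently overwrites the earlier value.

```python
# Simplified sketch of why dict-keyed parsing loses duplicate columns
# (not the actual read_xml implementation).
parsed = [("id", 1), ("name", "a"), ("name", "b")]  # raw (column, value) pairs

row = {}
for col, value in parsed:
    row[col] = value  # the second "name" silently overwrites the first

print(row)  # only one "name" column survives
```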
I don't think we should allow duplicates for IO functions that don't already support them, and the same goes for dropping all columns but one. In my opinion, if we introduce these two options it should be only for backward compatibility, and we should eventually deprecate them.
If I read the issue description correctly, the only reason for `'allow'` is backward compatibility.
Any update on this? We actually have a use case where we want to raise if the user tries to load duplicate columns. I agree with the general consensus here that there should be an option to decide, and that `raise` should be the default.
Looks like `read_json(orient='split')` no longer reads in duplicate columns now. Great 🎉 Do we still want to move forward with `duplicate_column_action`, but without `'allow'` as an option?
It did surprise me that this behaviour changed. At the very least I think the documentation is not good here; the behaviour with duplicated columns should be mentioned.
I know it would be a big refactor to the parser, but tonight I may have stumbled upon a possible use case where `'allow'` might actually be a good thing to add: https://www.reddit.com/r/learnpython/comments/1fjgbq4/comment/lnol3ff/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

TL;DR: OP over at r/learnpython has a CSV with a duplicate column pattern.
When loading data into pandas with `pandas.read_X()` methods, the behavior when duplicate columns exist changes depending on the format.

- For `read_csv`, `read_fwf` and `read_excel` we have a `mangle_dupe_cols` parameter that we can provide. By default it appends `.1`, `.2`... to duplicated column names. Setting it to `False` raises an exception about not being implemented.
- `read_html` also appends `.1`... but the option is not provided.
- `read_json(orient='split')` loads data with the duplicate column names.
- `read_xml` drops the columns if they are duplicated (I assume one column keeps overwriting the previous one with the same name).

Personally, I think we should have consistency among all of them. What I would do is control this with an option (e.g. `io.duplicate_columns`). It could also be an argument for all the `read_*` methods, but these methods already have too many arguments, the number of cases where users want to change this is small, and it is very unlikely that they would want different ways of handling duplicate column names in different calls to read methods.

Whether it's an option or an argument, we could allow the next options (feel free to propose better names):

- `raise`: if duplicate column names exist, raise an exception.
- `drop`: keep one (maybe the first) and ignore the rest.
- `allow`: load data with duplicate columns. Based on discussions in the data APIs consortium and ENH: Support mangle_dupe_cols=False in pd.read_csv() #13262, I'd add this for backward compatibility only; we probably shouldn't allow duplicate column names after a deprecation period. Or we can simply remove this option.
- `{col}.{i}`, `{col}_{i}`...: allow appending an autonumeric with a custom format. By default, `'{col}.{i}'` could be used, as this seems to be the preferred way based on the current API. This would address shouldn't mangle_dupe_cols add an underscore rather than a dot in read_csv? #8908.

I think it'd be good to have a single function that receives the input column names and returns the final column names (indices of the columns to use may also be needed, for cases like `drop`), or raises when appropriate. And all `read_*` functions should use it if the format can have duplicate column names.

Thoughts?
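That single function might be sketched like this. The name `resolve_duplicate_columns` and its exact signature are hypothetical, not an agreed pandas API; it just implements the options listed above and returns the final names plus the indices of the columns to keep:

```python
from collections import Counter


def resolve_duplicate_columns(names, action="raise"):
    """Return (final_names, keep_indices) for raw input column names.

    action: 'raise', 'drop', 'allow', or a format string containing
    '{col}' and '{i}', such as '{col}.{i}'. Hypothetical sketch only.
    """
    duplicated = {n for n, c in Counter(names).items() if c > 1}

    if action == "raise":
        if duplicated:
            raise ValueError(f"Duplicate column names: {sorted(duplicated)}")
        return list(names), list(range(len(names)))

    if action == "drop":
        # Keep the first occurrence of each label, drop the rest.
        seen, final, keep = set(), [], []
        for i, name in enumerate(names):
            if name not in seen:
                seen.add(name)
                final.append(name)
                keep.append(i)
        return final, keep

    if action == "allow":
        return list(names), list(range(len(names)))

    if "{col}" in action and "{i}" in action:
        # Mangle duplicates with the custom format; first occurrence is kept.
        running = Counter()
        final = []
        for name in names:
            if running[name]:
                final.append(action.format(col=name, i=running[name]))
            else:
                final.append(name)
            running[name] += 1
        return final, list(range(len(names)))

    raise ValueError(f"Invalid duplicate column action: {action!r}")


print(resolve_duplicate_columns(["a", "b", "a"], "drop"))        # (['a', 'b'], [0, 1])
print(resolve_duplicate_columns(["a", "a", "a"], "{col}.{i}"))   # (['a', 'a.1', 'a.2'], [0, 1, 2])
```

Each `read_*` function would call this once on the parsed header and then use the returned indices to subset the data, so the duplicate-handling policy lives in one place.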