API: Consistent handling of duplicate input columns #47718

Open
datapythonista opened this issue Jul 14, 2022 · 15 comments
Labels
API - Consistency Internal Consistency of API/Behavior API Design Needs Discussion Requires discussion from core team before further action

@datapythonista
Member

datapythonista commented Jul 14, 2022

When loading data into pandas with pandas.read_X() methods, the behavior when duplicate columns exist changes depending on the format.

For read_csv, read_fwf and read_excel we have a mangle_dupe_cols parameter. By default it appends .1, .2, ... to duplicated column names; setting it to False raises an exception saying the option is not implemented.
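
For example, here is the default mangling in a minimal, runnable snippet:

import io
import pandas as pd

# Two columns named "a": by default read_csv renames the second one to "a.1"
csv = io.StringIO("a,a,b\n1,2,3\n")
df = pd.read_csv(csv)
print(df.columns.tolist())  # ['a', 'a.1', 'b']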

read_html also appends .1, ... but does not expose the option.

read_json(orient='split') loads data with the duplicate column names.

read_xml drops columns if they are duplicated (I assume each column keeps overwriting the previous one with the same name).

Personally, I think we should be consistent among all of them. What I would do is control this with an option (e.g. io.duplicate_columns). It could also be an argument for all the read_ methods, but these methods already have too many arguments; I expect the number of cases where users want to change this to be small, and it is very unlikely that they would want different handling of duplicate column names in different calls to read methods.

Whether it's an option or an argument, we could allow the following values (feel free to propose better names):

  • raise: If duplicate column names exist, raise an exception
  • drop: Keep one (maybe the first) and ignore the rest
  • allow: Load data with duplicate columns. Based on discussions in the data APIs consortium and ENH: Support mangle_dupe_cols=False in pd.read_csv() #13262, I'd add this for backward compatibility only; we probably shouldn't allow duplicate column names after a deprecation period. Or we could simply omit this option
  • {col}.{i}, {col}_{i}...: Allow appending an autonumeric suffix with a custom format. By default, '{col}.{i}' could be used, as this seems to be the preferred way based on the current API. This would address shouldn't mangle_dupe_cols add an underscore rather than a dot in read_csv? #8908.

I think it'd be good to have a single function that receives the input column names and returns the final column names (indices of the columns to use may also be needed, for cases like drop), or raises when appropriate. All read_ functions should use it if the format can have duplicate column names; a sketch follows.
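
A minimal sketch of what such a shared helper could look like (the name and signature are hypothetical):

def dedup_columns(names, action="{col}.{i}"):
    """Hypothetical helper: resolve duplicate column names per `action`.

    A real version would probably also return the indices of the columns
    to keep, for cases like 'drop'.
    """
    if action == "allow":
        return list(names)
    result, counts = [], {}
    for name in names:
        if name not in counts:
            counts[name] = 0
            result.append(name)
        elif action == "raise":
            raise ValueError(f"Duplicate column name: {name!r}")
        elif action == "drop":
            continue  # keep only the first column with this name
        else:
            counts[name] += 1
            result.append(action.format(col=name, i=counts[name]))
    return result

dedup_columns(["a", "a", "b"])                      # ['a', 'a.1', 'b']
dedup_columns(["a", "a", "b"], action="{col}_{i}")  # ['a', 'a_1', 'b']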

Thoughts?

@datapythonista datapythonista added API Design Needs Discussion Requires discussion from core team before further action API - Consistency Internal Consistency of API/Behavior labels Jul 14, 2022
@jreback
Contributor

jreback commented Jul 14, 2022

-1 on a global option

+1 in a single name across functions

we already use allow_duplicates=bool in various places; extending this would be good

@datapythonista
Member Author

-1 on a global option

+1 in a single name across functions

we already use allow_duplicates=bool in various places; extending this would be good

Thanks for the feedback. I see allow_duplicates is used in .insert and .reset_index. I guess we can also standardize the same behavior with them. And in any other case where column names can become duplicated.

Does using a string instead of a bool, with the above options and default {col}.{i} (col.1, col.2...) sound good to you?

@jreback
Contributor

jreback commented Jul 14, 2022

use raise (mirrors what we do in the errors keyword)

@mroeschke
Member

So to clarify, an example docstring would be?

allow_duplicates: bool, str, default None
    True: Load data with duplicate columns.
    False: If duplicate column names exist, raise an exception
    <any_string>: Load duplicate columns renamed to {col_label}{any_string}{autoincrementing int} (?)

Not the cleanest but maybe I'm not understanding correctly.

Also probably need to be aware that "column names" doesn't necessarily mean that the labels are strings.

@datapythonista
Member Author

What I'm proposing is more like:

duplicate_column_action : {'raise', 'drop', 'allow'} or str, default '{col}.{i}'
    Define the behavior in case the operation results in columns having repeated names.
    - 'raise': Raise an exception and don't allow the operation
    - 'drop': Keep the original or first column with the name, and discard the following
    - 'allow': Let the operation create multiple columns with the same name
    - str: Columns with repeated names will be renamed following this pattern. Use `{col}`
      as a placeholder for the column name, and `{i}` for an autonumeric number that will
      be increased by one for every column with the same name. For example, if `{col}_{i}`
      is used and a second `my_col` is created, the first column will be kept as `my_col`, the second
      will be renamed to `my_col_1`, the third to `my_col_2`, etc.

This would replace the current allow_duplicates as Jeff suggested, whose use cases would be handled by this new parameter. It would also make the never-implemented mangle_dupe_cols=False obsolete.

Another option maybe worth considering is not allowing the creation of duplicates at all, so not having the allow option, and also not having the drop option to keep just one. If we do that, maybe we can implement this as an optional string that is always the pattern, with None meaning raise.

The approach is not that simple, but it's the best I can think of that covers all the use cases, and it also lets us raise an exception for incorrect values, if the provided value is neither one of the listed options nor a format string containing `{col}` and `{i}` (or other names).

Note that I'd also allow using, for example, `Duplicate {i} of column {col}` or whatever, not only `{col}{whatever}{i}`. Otherwise, instead of a format string I'd simply use any other string as the separator (but I think this makes things not only more limited, but also harder to understand).
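
For instance, any str.format-style template with those placeholders would work:

"{col}_{i}".format(col="my_col", i=1)                      # 'my_col_1'
"Duplicate {i} of column {col}".format(col="my_col", i=1)  # 'Duplicate 1 of column my_col'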

@mroeschke
Member

Great, that API makes sense to me.

Maybe str could alternatively be Callable[[Any, int], str] (column_label: Any, position: int) -> str to give the user even more flexibility. No strong opinion though.
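
Something like this, for instance (hypothetical):

def renamer(column_label, position):
    # (column_label: Any, position: int) -> str
    return f"{column_label} (duplicate {position})"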

@phofl
Member

phofl commented Jul 20, 2022

I like the flexibility. As a note: "allow" is tricky for the read_csv case.

@datapythonista
Member Author

I like the flexibility. As a note: "allow" is tricky for the read_csv case.

I think it's tricky for all cases. Having duplicate column names can make things break easily. For example, if two columns exist with the name mycol then col = df['mycol'] returns a DataFrame instead of a Series, and the code will likely fail afterwards when col is used, which will be quite difficult to debug. Failing as soon as the repeated column names are detected makes things very easy to fix.
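
A quick demonstration of that failure mode:

import pandas as pd

df = pd.DataFrame([[1, 2]], columns=["mycol", "mycol"])
col = df["mycol"]
print(type(col))  # <class 'pandas.core.frame.DataFrame'>, not a Series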

Is there anything in particular for read_csv that you have in mind besides the general problems?

@phofl
Member

phofl commented Jul 20, 2022

Yes, the implementation itself. The columns are stored as keys of a dictionary internally before we create the DataFrame. We would have to refactor this to allow duplicated column names.
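
A minimal illustration of why a dict-based parser can't represent duplicates:

# dict keys are unique, so a {column_name: values} mapping silently
# collapses duplicates
data = {}
for name in ["a", "a", "b"]:
    data[name] = []
print(list(data))  # ['a', 'b'] -- the second 'a' overwrote the first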

Yes, you are right. I don’t like duplicated columns at all as a user.

@datapythonista
Member Author

I don't think we should allow duplicates for IO functions that don't already support them, and the same for dropping all columns with a given name but one. In my opinion, if we introduce these two options it should be only for backward compatibility, and we should eventually deprecate them.

@rgommers
Contributor

I don't think we should allow duplicates for IO functions that don't already support them, and the same for dropping all columns with a given name but one. In my opinion, if we introduce these two options it should be only for backward compatibility, and we should eventually deprecate them.

If I read the issue description correctly, the only reason for "allow" is that "read_json(orient='split') loads data with the duplicate column names". That looks like a simple mistake in the initial implementation of read_json; no other I/O function does this. Adding a new "allow" option for this keyword seems overly complicated to me. Why not deprecate the incorrect read_json behavior, and then remove it? There doesn't seem to be a justification for adding "allow" now; it's also clear from the comments above that it's very much undesirable.

@kasuteru
Contributor

kasuteru commented Mar 6, 2023

Any update on this? We actually have a use case where we want to raise if the user tries to read_csv a CSV file with duplicate columns (because it is very likely the user made a mistake in their preprocessing pipeline earlier, and we want them to know).

I agree with the general consensus here that there should be an option to decide, and that allow should not be one of the options. We recently had a very hard-to-find bug due to a DataFrame with duplicate column names. As others posted above, allowing duplicate column names breaks a lot of pandas core functionality (or worse, leads to unexpected outcomes).

@MarcoGorelli
Member

Looks like now, read_json(orient='split') no longer reads in duplicate columns. Great 🎉

Do we still want to move forward with duplicate_column_action, but without 'allow' as an option?

@olbermann

It did surprise me that
pd.read_json(StringIO('{"columns":["a","a"], "index":["b", "b"], "data":[["c","c"],["c", "c"]]}'), orient="split")
gives a different result from
pd.DataFrame(**{"columns":["a","a"], "index":["b", "b"], "data":[["c","c"],["c", "c"]]}).
Strictly speaking the latter is a constructor and not an IO function, but why would this be treated differently?

At the least, I think the documentation is not good here; the behaviour with duplicated columns should be mentioned.

@Phillyclause89

Looks like now, read_json(orient='split') no longer reads in duplicate columns. Great 🎉

Do we still want to move forward with duplicate_column_action, but without 'allow' as an option?

I know it would be a big refactor to the parser, but tonight I may have stumbled upon a possible use case where 'allow' might actually be a good thing to add: https://www.reddit.com/r/learnpython/comments/1fjgbq4/comment/lnol3ff/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

TLDR: OP over at r/learnpython has a csv with a dupe column pattern: ["Year","Animal","Number"]*total_years_in_dataset. Their end goal is to get a concatenated DataFrame with just ["Year","Animal","Number"] columns. I know asking for this to be possible entirely from the parser function is a bit much, but at least with duplicate_column_action='allow' the solutions you see in that reddit post could become something like:

import pandas as pd

df = pd.read_csv("test.csv", duplicate_column_action="allow")
pd.concat([df.iloc[:, i:i+3] for i in range(0, len(df.columns), 3)])
