API: Consistent handling of duplicate input columns #47718

Open
datapythonista opened this issue Jul 14, 2022 · 15 comments
Labels
API - Consistency Internal Consistency of API/Behavior API Design Needs Discussion Requires discussion from core team before further action

@datapythonista
Member

datapythonista commented Jul 14, 2022

When loading data into pandas with pandas.read_X() methods, the behavior when duplicate columns exist changes depending on the format.

For read_csv, read_fwf and read_excel we have a mangle_dupe_cols parameter. By default it appends .1, .2, ... to duplicated column names; setting it to False raises an exception saying the option is not implemented.
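
For example, here is the default mangling in a minimal, runnable snippet:

import io
import pandas as pd

# Two columns named "a": by default read_csv renames the second one to "a.1"
csv = io.StringIO("a,a,b\n1,2,3\n")
df = pd.read_csv(csv)
print(df.columns.tolist())  # ['a', 'a.1', 'b']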

read_html also appends .1, ... but does not expose the option.

read_json(orient='split') loads data with the duplicate column names.

read_xml drops columns if they are duplicated (I assume each column keeps overwriting the previous one with the same name).

Personally, I think we should be consistent among all of them. What I would do is control this with an option (e.g. io.duplicate_columns). It could also be an argument for all the read_ methods, but these methods already have too many arguments; I expect the number of cases where users want to change this to be small, and it is very unlikely that they would want different handling of duplicate column names in different calls to read methods.

Whether it's an option or an argument, we could allow the following values (feel free to propose better names):

  • raise: If duplicate column names exist, raise an exception
  • drop: Keep one (maybe the first) and ignore the rest
  • allow: Load data with duplicate columns. Based on discussions in the data APIs consortium and ENH: Support mangle_dupe_cols=False in pd.read_csv() #13262, I'd add this for backward compatibility only; we probably shouldn't allow duplicate column names after a deprecation period. Or we could simply omit this option
  • {col}.{i}, {col}_{i}...: Allow appending an autonumeric suffix with a custom format. By default, '{col}.{i}' could be used, as this seems to be the preferred way based on the current API. This would address shouldn't mangle_dupe_cols add an underscore rather than a dot in read_csv? #8908.

I think it'd be good to have a single function that receives the input column names and returns the final column names (indices of the columns to use may also be needed, for cases like drop), or raises when appropriate. All read_ functions should use it if the format can have duplicate column names; a sketch follows.
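
A minimal sketch of what such a shared helper could look like (the name and signature are hypothetical):

def dedup_columns(names, action="{col}.{i}"):
    """Hypothetical helper: resolve duplicate column names per `action`.

    A real version would probably also return the indices of the columns
    to keep, for cases like 'drop'.
    """
    if action == "allow":
        return list(names)
    result, counts = [], {}
    for name in names:
        if name not in counts:
            counts[name] = 0
            result.append(name)
        elif action == "raise":
            raise ValueError(f"Duplicate column name: {name!r}")
        elif action == "drop":
            continue  # keep only the first column with this name
        else:
            counts[name] += 1
            result.append(action.format(col=name, i=counts[name]))
    return result

dedup_columns(["a", "a", "b"])                      # ['a', 'a.1', 'b']
dedup_columns(["a", "a", "b"], action="{col}_{i}")  # ['a', 'a_1', 'b']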

Thoughts?

@datapythonista datapythonista added API Design Needs Discussion Requires discussion from core team before further action API - Consistency Internal Consistency of API/Behavior labels Jul 14, 2022
@jreback
Contributor

jreback commented Jul 14, 2022

-1 on a global option

+1 in a single name across functions

we already use allow_duplicates=bool in various places; extending this would be good

@datapythonista
Member Author

-1 on a global option

+1 in a single name across functions

we already use allow_duplicates=bool in various places; extending this would be good

Thanks for the feedback. I see allow_duplicates is used in .insert and .reset_index. I guess we can also standardize the same behavior with them. And in any other case where column names can become duplicated.

Does using a string instead of a bool, with the above options and default {col}.{i} (col.1, col.2...) sound good to you?

@jreback
Contributor

jreback commented Jul 14, 2022

use raise (mirrors what we do in the errors keyword)

@mroeschke
Member

So to clarify, an example docstring would be?

allow_duplicates: bool, str, default None
    True: Load data with duplicate columns.
    False: If duplicate column names exist, raise an exception
    <any_string>: Load duplicate columns renamed to {col_label}{any_string}{autoincrementing int} (?)

Not the cleanest but maybe I'm not understanding correctly.

Also probably need to be aware that "column names" doesn't necessarily mean that the labels are strings.

@datapythonista
Member Author

What I'm proposing is more like:

duplicate_column_action : {'raise', 'drop', 'allow'} or str, default '{col}.{i}'
    Define the behavior in case the operation results in columns having repeated names.
    - 'raise': Raise an exception and don't allow the operation
    - 'drop': Keep the original or first column with the name, and discard the following
    - 'allow': Let the operation create multiple columns with the same name
    - str: Columns with repeated names will be renamed following this pattern. Use `{col}`
      as a placeholder for the column name, and `{i}` for an autonumeric number that will
      be increased by one for every column with the same name. For example, if `{col}_{i}`
      is used and a second `my_col` is created, the first column will be kept as `my_col`, the second
      will be renamed to `my_col_1`, the third to `my_col_2`, etc.

This would replace the current allow_duplicates as Jeff suggested, whose use cases would be handled by this new parameter. It would also make the never-implemented mangle_dupe_cols=False obsolete.

Another option maybe worth considering is not allowing the creation of duplicates at all, so not having the allow option, and also not having the drop option to keep just one. If we do that, maybe we can implement this as an optional string that is always the pattern, with None meaning raise.

The approach is not that simple, but it's the best I can think of that covers all the use cases, and it also lets us raise an exception for incorrect values, if the provided value is neither one of the listed options nor a format string containing `{col}` and `{i}` (or other names).

Note that I'd also allow using, for example, `Duplicate {i} of column {col}` or whatever, not only `{col}{whatever}{i}`. Otherwise, instead of a format string I'd simply use any other string as the separator (but I think this makes things not only more limited, but also harder to understand).
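
For instance, any str.format-style template with those placeholders would work:

"{col}_{i}".format(col="my_col", i=1)                      # 'my_col_1'
"Duplicate {i} of column {col}".format(col="my_col", i=1)  # 'Duplicate 1 of column my_col'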

@mroeschke
Member

Great, that API makes sense to me.

Maybe str could alternatively be Callable[[Any, int], str] (column_label: Any, position: int) -> str to give the user even more flexibility. No strong opinion though.
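
Something like this, for instance (hypothetical):

def renamer(column_label, position):
    # (column_label: Any, position: int) -> str
    return f"{column_label} (duplicate {position})"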

@phofl
Member

phofl commented Jul 20, 2022

I like the flexibility. As a note: "allow" is tricky for the read_csv case.

@datapythonista
Member Author

I like the flexibility. As a note: "allow" is tricky for the read_csv case.

I think it's tricky for all cases. Having duplicate column names can make things break easily. For example, if two columns exist with the name mycol then col = df['mycol'] returns a DataFrame instead of a Series, and the code will likely fail afterwards when col is used, which will be quite difficult to debug. Failing as soon as the repeated column names are detected makes things very easy to fix.
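
A quick demonstration of that failure mode:

import pandas as pd

df = pd.DataFrame([[1, 2]], columns=["mycol", "mycol"])
col = df["mycol"]
print(type(col))  # <class 'pandas.core.frame.DataFrame'>, not a Series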

Is there anything in particular for read_csv that you have in mind besides the general problems?

@phofl
Member

phofl commented Jul 20, 2022

Yes, the implementation itself. The columns are stored as keys of a dictionary internally before we create the DataFrame. We would have to refactor this to allow duplicated column names.
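
A minimal illustration of why a dict-based parser can't represent duplicates:

# dict keys are unique, so a {column_name: values} mapping silently
# collapses duplicates
data = {}
for name in ["a", "a", "b"]:
    data[name] = []
print(list(data))  # ['a', 'b'] -- the second 'a' overwrote the first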

Yes, you are right. I don’t like duplicated columns at all as a user.

@datapythonista
Member Author

I don't think we should allow duplicates for IO functions that don't already support them, and the same for dropping all columns with a given name but one. In my opinion, if we introduce these two options it should be only for backward compatibility, and we should eventually deprecate them.

@rgommers
Contributor

I don't think we should allow duplicates for IO functions that don't already support them, and the same for dropping all columns with a given name but one. In my opinion, if we introduce these two options it should be only for backward compatibility, and we should eventually deprecate them.

If I read the issue description correctly, the only reason for "allow" is that "read_json(orient='split') loads data with the duplicate column names". That looks like a simple mistake in the initial implementation of read_json; no other I/O function does this. Adding a new "allow" option for this keyword seems overly complicated to me. Why not deprecate the incorrect read_json behavior, and then remove it? There doesn't seem to be a justification for adding "allow" now; it's also clear from the comments above that it's very much undesirable.

@kasuteru
Contributor

kasuteru commented Mar 6, 2023

Any update on this? We actually have a use case where we want to raise if the user tries to read_csv a CSV file with duplicate columns (because it is very likely the user made a mistake in their preprocessing pipeline earlier, and we want them to know).

I agree with the general consensus here that there should be an option to decide, and that allow should not be one of the options. We recently had a very hard-to-find bug due to a DataFrame with duplicate column names. As others posted above, allowing duplicate column names breaks a lot of pandas core functionality (or worse, leads to unexpected outcomes).

@MarcoGorelli
Member

Looks like now, read_json(orient='split') no longer reads in duplicate columns. Great 🎉

Do we still want to move forward with duplicate_column_action, but without 'allow' as an option?

@olbermann

It did surprise me that
pd.read_json(StringIO('{"columns":["a","a"], "index":["b", "b"], "data":[["c","c"],["c", "c"]]}'), orient="split")
gives a different result from
pd.DataFrame(**{"columns":["a","a"], "index":["b", "b"], "data":[["c","c"],["c", "c"]]}).
Strictly speaking the latter is a constructor and not an IO function, but why would this be treated differently?

At the least, I think the documentation is not good here; the behaviour with duplicated columns should be mentioned.

@Phillyclause89

Looks like now, read_json(orient='split') no longer reads in duplicate columns. Great 🎉

Do we still want to move forward with duplicate_column_action, but without 'allow' as an option?

I know it would be a big refactor to the parser, but tonight I may have stumbled upon a possible use case where 'allow' might actually be a good thing to add: https://www.reddit.com/r/learnpython/comments/1fjgbq4/comment/lnol3ff/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

TLDR: OP over at r/learnpython has a csv with a dupe column pattern: ["Year","Animal","Number"]*total_years_in_dataset. Their end goal is to get a concatenated DataFrame with just ["Year","Animal","Number"] columns. I know asking for this to be possible entirely from the parser function is a bit much, but at least with duplicate_column_action='allow' the solutions you see in that reddit post could become something like:

import pandas as pd

df = pd.read_csv("test.csv", duplicate_column_action="allow")
pd.concat([df.iloc[:, i:i+3] for i in range(0, len(df.columns), 3)])
