FR: Allow duplicate column names in pandas.read_csv #19383

njvack · 2018-01-24T21:19:19Z

Right now, pandas's read_csv() supports forcing column names read from CSV data to be unique:

>>> import pandas as pd
>>> pd.__version__
u'0.20.3'
>>> from StringIO import StringIO
>>> csv_data = """a,a,b
... 1,2,3
... 4,5,6"""
>>> df = pd.read_csv(StringIO(csv_data))
>>> df.columns.tolist()
['a', 'a.1', 'b']

The documentation suggests that passing mangle_dupe_cols=False to read_csv() will change this behavior to one where it'll overwrite data on load. That doesn't seem to be implemented as of this version:

>>> df = pd.read_csv(StringIO(csv_data), mangle_dupe_cols=False)
# snip
ValueError: Setting mangle_dupe_cols=False is not supported yet

However! Pandas doesn't fundamentally disallow duplicate column names. There's a simple (but somewhat convoluted) trick to get duplicate columns from a CSV file into your headers:

>>> df = pd.read_csv(StringIO(csv_data), header=None)
>>> df.columns = df.iloc[0]
>>> df = df.reindex(df.index.drop(0)).reset_index(drop=True)
>>> df
0  a  a  b
0  1  2  3
1  4  5  6

(Then again, maybe this isn't so simple; there's a funny "0" in there with the column headings and that seems weird.)
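For reference, here is a minimal Python 3 sketch of the same workaround. It assumes a reasonably recent pandas; `read_csv_keep_dupes` is a made-up helper name, not a pandas API. Assigning the header as a plain list avoids the stray "0" label. Because the header row is read as data, the sketch reads every column as strings and leaves type conversion to the caller.

```python
from io import StringIO

import pandas as pd


def read_csv_keep_dupes(source):
    """Hypothetical helper: read a CSV while keeping duplicate header names."""
    # Read everything as data, so pandas never sees (or mangles) a header row.
    raw = pd.read_csv(source, header=None, dtype=str)
    # Promote the first row to column labels; duplicates are preserved.
    header = raw.iloc[0].tolist()
    df = raw.iloc[1:].reset_index(drop=True)
    df.columns = header
    return df


csv_data = "a,a,b\n1,2,3\n4,5,6"
df = read_csv_keep_dupes(StringIO(csv_data))
print(df.columns.tolist())  # ['a', 'a', 'b']
```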

Problem description

I would like a native way to read CSV files with repeated headers. In my application, it's literally so I can warn people about duplicated column headers. Yes, I could use python's built-in csv module for this, but then I'm using two methods to read CSV files and it gets weird.
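For comparison, the csv-module route could look roughly like the sketch below. `duplicated_headers` is a hypothetical name, and the sketch assumes the header is the first row of the file.

```python
import csv
from collections import Counter
from io import StringIO


def duplicated_headers(fileobj):
    """Return header names that occur more than once, in first-seen order."""
    # Only the first row is consumed; an empty file yields an empty header.
    header = next(csv.reader(fileobj), [])
    counts = Counter(header)
    return [name for name in counts if counts[name] > 1]


print(duplicated_headers(StringIO("a,a,b\n1,2,3\n4,5,6")))  # ['a']
```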

Since mangle_dupe_cols=False is not yet implemented, I'd propose that it keep the duplicate column names as they are rather than overwrite data.

Output of `pd.show_versions()`

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None


njvack · 2018-01-25T16:13:06Z

Probably a slightly better "turn the first row into column headers" incantation:

df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)

Also, I've upgraded to pandas 0.22 and all this works in that version as well.

Checks to see if any columns (other than the id column) are duplicated, either in one file or across files. The behind-the-scenes change that *could* have repercussions is that this changes how we're reading the CSV files into dataframes. pandas mangles duplicated column names when reading CSV files; however, we can get around this by having pandas not interpret the header row and instead read it as a normal row, then use it as column headers. Yes, this is silly. See pandas-dev/pandas#19383

chris-b1 · 2018-01-25T19:40:37Z

Thanks for the report, this is a duplicate of #13262. A PR to support this would be welcome if you're interested!

njvack · 2018-01-25T20:41:22Z

Thanks — I suspected I wasn't the first to report this, but somehow failed to find that issue. I'll try and work on a PR for this (it shouldn't be too hard?) once I can figure out how pandas's source works some; this is a pretty intimidating codebase...

grofte · 2018-04-08T10:06:55Z

Since pandas does support non-unique column names it would be really great if pandas had some kind of function to warn about them. Maybe in df.info() and / or other functions commonly used when running into trouble.

Example of crashing: trying to use seaborn boxplot on a dataframe with duplicate column names. The crash comes from pandas but I don't know what the actual crash stems from.
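Until something like that exists, a check along these lines can be done by hand. This is only a sketch: `warn_on_duplicate_columns` is a hypothetical helper, not part of pandas, built on `Index.duplicated()`.

```python
import warnings

import pandas as pd


def warn_on_duplicate_columns(df):
    """Hypothetical helper: warn if df has repeated column labels."""
    # Index.duplicated() marks every repeat after the first occurrence.
    dupes = df.columns[df.columns.duplicated()].unique().tolist()
    if dupes:
        warnings.warn(f"duplicate column names: {dupes}")
    return dupes


df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])
warn_on_duplicate_columns(df)  # warns and returns ['a']
```

If only a yes/no answer is needed, `df.columns.is_unique` gives it directly.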

claresloggett · 2022-02-16T03:05:41Z

While this doesn't directly address the issue, for the specific use-case that @njvack is encountering - wanting to warn the user about duplicate columns - I've resorted to a pre-load of the header row with

pd.read_csv(myfile, header=None, nrows=1).iloc[0,:].value_counts()

and then any values with count > 1 can be reported to the user.
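Wrapped up as a function, that pre-load might look like the sketch below. `header_duplicates` is a made-up name, and `dtype=str` is an added assumption so that numeric-looking header names are compared as text.

```python
from io import StringIO

import pandas as pd


def header_duplicates(source):
    """Hypothetical helper: map each repeated header name to its count."""
    # header=None keeps names unmangled; nrows=1 reads only the header row.
    header = pd.read_csv(source, header=None, nrows=1, dtype=str).iloc[0]
    counts = header.value_counts()
    # Convert counts to plain ints so the result is an ordinary dict.
    return {name: int(n) for name, n in counts[counts > 1].items()}


print(header_duplicates(StringIO("a,a,b\n1,2,3\n4,5,6")))  # {'a': 2}
```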

As a general issue though, mangle_dupe_cols still needs implementation by the looks of it (as of Pandas version 1.4.1).

chris-b1 closed this as completed Jan 25, 2018

chris-b1 added the Duplicate Report label Jan 25, 2018

chris-b1 added this to the No action milestone Jan 25, 2018
