
read_csv(...,mangle_dupe_cols=True) causes silent data loss for certain column names. Request introduction of mangle_dupe_cols_str #14704

Closed
wilsoc5 opened this issue Nov 21, 2016 · 6 comments · Fixed by #44641
Labels
Enhancement IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@wilsoc5

wilsoc5 commented Nov 21, 2016

Let's say we have data with both column and row headers, and that the dataset simply writes a 0 to the CSV file for the cell at the intersection of the column and row headers (A1 in Excel notation, or cell (0,0) in the CSV file). Additionally, both "0" and "0.1" are valid column names:

RH\CH -0.1 0 0.1
10.0 123 456 789
20.0 012 345 678

Thus, "RH\CH" is replaced by "0" on export (of which I have no control).

import pandas as pd
from cStringIO import StringIO

data = '0,-0.1,0,0.1\n10.0,123,456,789\n20.0,012,345,678'
df = pd.read_csv(StringIO(data),index_col=0,header=0)
print(df)

      -0.1  0.1  0.1
0                   
10.0   123  789  789
20.0    12  678  678

The name mangling changes the duplicate "0" column containing data to "0.1"; thus, if a real data column is named "0.1", that data is copied (?) back into the mangled duplicate "0" column. Additionally, if the true "0.1" data series is missing, this name mangling is frustrating because there is no obvious way to determine whether the now-present "0.1" column is a mangled duplicate of "0" or a real "0.1" data series.

I propose adding a mangle_dupe_cols_str keyword option that defaults to '.' to preserve the current behavior, but that can be passed to read_csv as a kwarg in cases where the period-based name mangling could produce further duplicate columns.

In https://github.com/pandas-dev/pandas/blob/v0.19.1/pandas/io/parsers.py#L2109-L2111:

                        if cur_count > 0:
                            this_columns[i] = '%s.%d' % (col, cur_count)
                        counts[col] = cur_count + 1

should be adapted to

                        if cur_count > 0:
                            this_columns[i] = '%s%s%d' % (col, self.mangle_dupe_cols_str, cur_count)
                        counts[col] = cur_count + 1

and lines 1042-1043 in https://github.com/pandas-dev/pandas/blob/v0.19.1/pandas/io/parsers.py#L1042-L1043 should be changed to:

        self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True)
        self.mangle_dupe_cols_str = kwds.get('mangle_dupe_cols_str', '.')        
        self.infer_datetime_format = kwds.pop('infer_datetime_format', False)

Expected Output

Then the corrected output would be:

>>> data = '0,-0.1,0,0.1\n10.0,123,456,789\n20.0,012,345,678'
>>> df = pd.read_csv(StringIO(data),index_col=0,header=0,mangle_dupe_cols_str='__')
>>> print(df)

      -0.1  0__1  0.1
0
10.0   123   456  789
20.0    12   345  678

where subsequent operations could identify any columns containing the mangle_dupe_cols_str, if needed. The current behavior silently causes data loss for certain column names.
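
For example, with a distinctive separator, downstream code could flag the mangled duplicates explicitly. A minimal sketch, assuming the hypothetical mangle_dupe_cols_str='__' behavior above:

# Hypothetical: '__' was passed as mangle_dupe_cols_str, so any column name
# containing it was produced by the duplicate handling rather than the data.
mangled = [c for c in df.columns if '__' in str(c)]
print(mangled)  # ['0__1'] for the expected output above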

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: None
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

@wilsoc5
Author

wilsoc5 commented Nov 21, 2016

I realize this is related to issue #8908, but it extends it in a more general way. Alternatively, mangle_dupe_cols could be converted from a boolean to an actual string, though the case of a desired empty string would be harder to handle. Thus, I think introducing mangle_dupe_cols_str is the cleaner implementation.
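
A minimal sketch of that alternative (a hypothetical helper, not pandas code), illustrating why the empty string is awkward:

def resolve_mangle_option(mangle_dupe_cols=True):
    # Hypothetical: accept either a bool (current API) or a separator string.
    if isinstance(mangle_dupe_cols, str):
        # A string means "mangle, using this separator". Note that the empty
        # string is falsy, so a plain truthiness check would misread it as
        # "do not mangle" -- hence the explicit isinstance check.
        return True, mangle_dupe_cols
    return bool(mangle_dupe_cols), '.'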

@sinhrks
Member

sinhrks commented Nov 22, 2016

I think using names is simpler in such complex cases (see the sketch below).

I'm slightly -1 on adding further, less frequently used options to read_csv. One idea is to additionally allow mangle_dupe_cols to accept a str (columns would then be mangled using the specified str).
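
A minimal sketch of the names-based workaround, reusing the reproduction data from above (the replacement labels are hypothetical; header=0 discards the ambiguous header row and names supplies unique labels):

names = ['RH', 'minus_0.1', 'zero', 'plus_0.1']  # hypothetical unique labels
df = pd.read_csv(StringIO(data), header=0, names=names, index_col=0)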

@sinhrks sinhrks added API Design IO CSV read_csv, to_csv labels Nov 22, 2016
@wilsoc5
Author

wilsoc5 commented Nov 22, 2016

I would agree if I had such a simple dataset that didn't have row and column headers that change frequently. In one particular dataset, I have over 400 columns and 500+ rows; names becomes unwieldy in those cases and really should be populated from the dataset. I'm all for using mangle_dupe_cols as a string. The one question the community (or developers) should answer, if you go that route, is whether it should also support the empty string for concatenation (which evaluates as falsy).

@jreback
Contributor

jreback commented Nov 22, 2016

I am also -1 on expanding duplicate handling at all. Generally, if you have duplicates it is more sensible to skip the header row and set the values later (see the sketch below). I suppose allowing a string arg to mangle_dupe_cols is ok, but it would require a community-based PR.
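
A minimal sketch of that workaround, reusing the reproduction data from above (the labels assigned afterwards are up to the user):

# Skip the ambiguous header row entirely, then assign column labels yourself.
df = pd.read_csv(StringIO(data), header=None, skiprows=1, index_col=0)
df.columns = [-0.1, 0.0, 0.1]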

@jorisvandenbossche
Member

I think there are two different things to discuss here:

  • the fact that our "unduplicating column names" mangling can introduce a new duplicate column name is something that can be solved (and I would regard this as a bug). The code should check that a newly created column name does not already exist, and otherwise move on to the next number (so e.g. 0.2 instead of 0.1); see the sketch after this list.
  • whether we want to provide the ability to specify the string used for mangling. If we do, I agree that reusing the existing mangle_dupe_cols seems better, in order not to introduce yet another keyword. But if the mangling no longer creates new duplicates (previous point), the question is whether this is really needed.
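
A minimal sketch of that de-duplication idea (illustrative only, not the pandas implementation):

def dedup_columns(columns, sep='.'):
    # Rename later occurrences of a duplicate, skipping any suffix that would
    # collide with a name already present in the input or produced earlier.
    counts = {}
    result = []
    for col in columns:
        cur = counts.get(col, 0)
        if cur == 0:
            result.append(col)
        else:
            candidate = '%s%s%d' % (col, sep, cur)
            while candidate in columns or candidate in result:
                cur += 1
                candidate = '%s%s%d' % (col, sep, cur)
            result.append(candidate)
        counts[col] = cur + 1
    return result

dedup_columns(['0', '-0.1', '0', '0.1'])  # -> ['0', '-0.1', '0.2', '0.1']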

@jorisvandenbossche
Member

Actually, my first point seems already partly solved in latest master and 0.19.1:

In [2]: data = '0,-0.1,0,0.1\n10.0,123,456,789\n20.0,012,345,678'

In [3]: pd.read_csv(StringIO(data),index_col=0)
Out[3]: 
      -0.1  0.1  0.1.1
0                     
10.0   123  456    789
20.0    12  345    678

In [4]: pd.__version__
Out[4]: '0.19.1'

So the actual data in the resulting frame is now correct. It is perhaps only a bit surprising that the mangling renames 0 -> 0.1 and then 0.1 -> 0.1.1, where you could argue that only the 0 should have been mangled (e.g. to 0.2), since 0.1 was not a duplicate column.

chris-b1 added a commit to chris-b1/pandas that referenced this issue May 9, 2017
jreback pushed a commit that referenced this issue May 12, 2017
* BUG: pathlib.Path in io

* CLN: factor out pathlib roundtrip

* add localpath tests for other io

* fixup

* xfail SAS; type in parser

* missing import

* xfail for #14704

* fix to_csv

* lint

* lint cleanup

* add feather (xfail)
pcluo pushed a commit to pcluo/pandas that referenced this issue May 22, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue May 29, 2017
TomAugspurger pushed a commit that referenced this issue May 30, 2017
stangirala pushed a commit to stangirala/pandas that referenced this issue Jun 11, 2017
@mroeschke mroeschke added Enhancement Needs Discussion Requires discussion from core team before further action and removed API Design labels May 2, 2021
@jreback jreback added this to the 1.4 milestone Nov 28, 2021