read_csv(...,mangle_dupe_cols=True) causes silent data loss for certain column names. Request introduction of mangle_dupe_cols_str #14704
Comments
I realize this is related to Issue #8908, but extends it in a more general way. Alternatively, |
I think using I'm slightly -1 to add further less-frequently used options to |
I would agree if I had such a simple dataset that didn't have row and column headers that changed frequently. In one particular dataset, I have over 400 columns and 500+ rows. |
I am also -1 on expanding duplicate handling at all. Generally if you have duplicates it is more sensible to skip the header row and set the values later. I suppose allowing a string arg to |
I think there are two different things here to discuss:
|
Actually, my first point seems already partly solved in latest master and 0.19.1:
So the actual data in the resulting frame is now correct. It is maybe only a bit surprising that the mangling rename |
* BUG: pathlib.Path in io * CLN: factor out pathlib roundtrip * add localpath tests for other io * fixup * xfail SAS; type in parser * missing import * xfail for #14704 * fix to_csv * lint * lint cleanup * add feather (xfail)
* BUG: pathlib.Path in io * CLN: factor out pathlib roundtrip * add localpath tests for other io * fixup * xfail SAS; type in parser * missing import * xfail for pandas-dev#14704 * fix to_csv * lint * lint cleanup * add feather (xfail) (cherry picked from commit 4cd8458)
Let's say we have data with both column and row headers. Let's say this dataset also simply outputs a 0 in the csv file for the cell at the intersection of the column and row headers (A1 in Excel notation, or cell (0,0) in the csv file). Additionally, both "0" and "0.1" are valid column names:

Thus, "RH\CH" is replaced by "0" on export (over which I have no control).

The name mangling will change the duplicate "0" column containing data to "0.1"; thus, if a real data column has the name "0.1", this data will be copied (?) back to the mangled duplicate "0" column. Additionally, if the true "0.1" data name is missing, then this name mangling is frustrating, as there is no obvious way to determine whether the now-present "0.1" column is a duplicate "0" that has been mangled or a real "0.1" data series.

I propose the addition of a mangle_dupe_cols_str keyword option that defaults to '.' to preserve the current behavior. However, it can be passed as a kwarg to read_csv in cases where the period name mangling could result in further duplicate columns.

In https://github.com/pandas-dev/pandas/blob/v0.19.1/pandas/io/parsers.py#L2109-L2111:
should be adapted to
and lines 1042-1043 in https://github.com/pandas-dev/pandas/blob/v0.19.1/pandas/io/parsers.py#L1042-L1043 changed to:
Expected Output
Then the corrected output would be:
where subsequent operations could identify any columns with the mangle_dupe_cols_str, if needed. The current behavior silently causes data loss for certain column names.
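
To make the collision concrete, here is a minimal pure-Python sketch of count-based name mangling with a configurable separator. It is an illustration under stated assumptions, not pandas' actual implementation: the function name mangle_dupe_cols, its sep parameter (standing in for the proposed mangle_dupe_cols_str), and the "__dup" separator are all hypothetical.

```python
def mangle_dupe_cols(names, sep="."):
    """Rename repeated column names to '<name><sep><n>'.

    Hypothetical helper: `sep` plays the role of the proposed
    mangle_dupe_cols_str keyword; it is not part of the pandas API.
    """
    counts = {}
    mangled = []
    for name in names:
        seen = counts.get(name, 0)
        # The first occurrence keeps its name; later ones get a suffix.
        mangled.append(name if seen == 0 else "%s%s%d" % (name, sep, seen))
        counts[name] = seen + 1
    return mangled


# Header as described above: the row/column intersection cell is exported
# as "0", followed by real data columns named "0" and "0.1".
header = ["0", "0", "0.1"]

default = mangle_dupe_cols(header)              # separator '.'
custom = mangle_dupe_cols(header, sep="__dup")  # assumed alternative separator

print(default)  # ['0', '0.1', '0.1']  -> mangled name collides with real "0.1"
print(custom)   # ['0', '0__dup1', '0.1']  -> all names stay distinct
```

With the default '.', the mangled duplicate becomes indistinguishable from the genuine "0.1" column. A separator that cannot occur in real column names keeps the mangled columns identifiable in later processing.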
INSTALLED VERSIONS
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: None
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None