BUG: in read_excel for mangled columns only the original/first column dtype is correct, col.N is not parsed correctly #35211

jeroenbrouwer · 2020-07-10T11:25:18Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

This requires setting up an excel file. Reproducible by setting 2 columns with equal headers like a and a row with values 1. Schematic:

a a
1 1

Then using:

df = pd.read_excel(file, dtype={'a': str})
assert type(df['a'][0]) == type(df['a.1'][0])

Problem description

The problem is that column a is parsed as a str and column a.1 is parsed as a float64.

Expected Output

Both a and a.1 should be parsed as str, such that the assertion is passed.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 19.4.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 40.8.0
Cython : 0.29.20
pytest : 5.3.5
hypothesis : None
sphinx : 3.0.4
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.9
numba : 0.50.0

The text was updated successfully, but these errors were encountered:

simonjayhawkins · 2020-07-10T11:42:40Z

Thanks @jeroenbrouwer for the report. without looking at the code, the coercion is probably being applied after the name mangling otherwise I would expected the following to raise

>>> pd.__version__
'1.1.0.dev0+2067.g2c3edaaaa'
>>>
>>> import tempfile
>>>
>>> df = pd.DataFrame([[1, 1]], columns=list("aa"))
>>> df
   a  a
0  1  1
>>>
>>> with tempfile.TemporaryFile() as fp:
...     df.to_excel(fp)
...     res = pd.read_excel(fp, dtype={"a": str})
...
>>> print(res)
   Unnamed: 0  a  a.1
0           0  1    1
>>>
>>> res.dtypes
Unnamed: 0     int64
a             object
a.1            int64
dtype: object
>>>
>>> with tempfile.TemporaryFile() as fp:
...     df.to_excel(fp)
...     res = pd.read_excel(fp, dtype={"a": str, "a.1": str})
...
>>> print(res)
   Unnamed: 0  a a.1
0           0  1   1
>>>
>>> res.dtypes
Unnamed: 0     int64
a             object
a.1           object
dtype: object
>>>

further investigation and PRs welcome.

sharvanir · 2020-07-29T13:19:24Z

@simonjayhawkins can I check it out to see if I can fix?

sharvanir · 2020-07-29T13:24:15Z

take

jeroenbrouwer · 2020-09-05T12:34:28Z

Any luck so far @sharvanir ?

sharvanir · 2020-10-11T05:51:48Z

hi @jeroenbrouwer looks like an issue with python's AST module, trying to find a workaround

asishm · 2020-10-13T05:38:42Z

read_excel uses the PythonParser behind the scene to convert the parsed data to a dataframe.

Simpler repro would be

In [1]: import pandas as pd

In [2]: data = '''a,a\n1,1'''

In [3]: from io import StringIO

In [4]: df = pd.read_csv(StringIO(data), dtype={'a': str}, mangle_dupe_cols=True, engine='python')

In [9]: df
Out[9]: 
   a  a.1
0  1    1

In [10]: df.dtypes
Out[10]: 
a      object
a.1     int64
dtype: object

engine='c' has the same behavior as the python engine currently.

If this behavior is changed at the PythonParser level, then I believe a corresponding change should happen in the CParserWrapper to remain consistent.

sharvanir · 2020-10-13T11:01:04Z

sorry, got confused, it's not an ast issue. I will unassign this as I am unable to put in time due to professional commitments.
Will get back if possible

jeroenbrouwer added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 10, 2020

jeroenbrouwer changed the title ~~BUG: in read_excel for mangled columns only the original/first column is parsed, col.N is not parsed~~ BUG: in read_excel for mangled columns only the original/first column dtype is correct, col.N is not parsed correctly Jul 10, 2020

simonjayhawkins added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 10, 2020

simonjayhawkins added this to the Contributions Welcome milestone Jul 10, 2020

github-actions bot assigned sharvanir Jul 29, 2020

sharvanir removed their assignment Oct 13, 2020

mzeitlin11 added the IO CSV read_csv, to_csv label Dec 25, 2020

phofl mentioned this issue May 10, 2021

Bug in read_csv and read_excel not applying dtype to second col with dup cols #41411

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.3 May 12, 2021

jreback closed this as completed in #41411 May 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: in read_excel for mangled columns only the original/first column dtype is correct, col.N is not parsed correctly #35211

BUG: in read_excel for mangled columns only the original/first column dtype is correct, col.N is not parsed correctly #35211

jeroenbrouwer commented Jul 10, 2020 •

edited

Loading

INSTALLED VERSIONS

simonjayhawkins commented Jul 10, 2020

sharvanir commented Jul 29, 2020

sharvanir commented Jul 29, 2020

jeroenbrouwer commented Sep 5, 2020

sharvanir commented Oct 11, 2020 •

edited

Loading

asishm commented Oct 13, 2020 •

edited

Loading

sharvanir commented Oct 13, 2020

BUG: in read_excel for mangled columns only the original/first column dtype is correct, col.N is not parsed correctly #35211

BUG: in read_excel for mangled columns only the original/first column dtype is correct, col.N is not parsed correctly #35211

Comments

jeroenbrouwer commented Jul 10, 2020 • edited Loading

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

simonjayhawkins commented Jul 10, 2020

sharvanir commented Jul 29, 2020

sharvanir commented Jul 29, 2020

jeroenbrouwer commented Sep 5, 2020

sharvanir commented Oct 11, 2020 • edited Loading

asishm commented Oct 13, 2020 • edited Loading

sharvanir commented Oct 13, 2020

jeroenbrouwer commented Jul 10, 2020 •

edited

Loading

Output of `pd.show_versions()`

sharvanir commented Oct 11, 2020 •

edited

Loading

asishm commented Oct 13, 2020 •

edited

Loading