Skip to content

BUG: in read_excel for mangled columns only the original/first column dtype is correct, col.N is not parsed correctly #35211

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
jeroenbrouwer opened this issue Jul 10, 2020 · 7 comments · Fixed by #41411
Labels
Bug IO CSV read_csv, to_csv IO Excel read_excel, to_excel
Milestone

Comments

@jeroenbrouwer
Copy link

jeroenbrouwer commented Jul 10, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

This requires setting up an excel file. Reproducible by setting 2 columns with equal headers like a and a row with values 1. Schematic:

a a
1 1

Then using:

df = pd.read_excel(file, dtype={'a': str})
assert type(df['a'][0]) == type(df['a.1'][0]) 

Problem description

The problem is that column a is parsed as a str and column a.1 is parsed as a float64.

Expected Output

Both a and a.1 should be parsed as str, such that the assertion is passed.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 19.4.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 40.8.0
Cython : 0.29.20
pytest : 5.3.5
hypothesis : None
sphinx : 3.0.4
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.9
numba : 0.50.0

@jeroenbrouwer jeroenbrouwer added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 10, 2020
@jeroenbrouwer jeroenbrouwer changed the title BUG: in read_excel for mangled columns only the original/first column is parsed, col.N is not parsed BUG: in read_excel for mangled columns only the original/first column dtype is correct, col.N is not parsed correctly Jul 10, 2020
@simonjayhawkins
Copy link
Member

Thanks @jeroenbrouwer for the report. without looking at the code, the coercion is probably being applied after the name mangling otherwise I would expected the following to raise

>>> pd.__version__
'1.1.0.dev0+2067.g2c3edaaaa'
>>>
>>> import tempfile
>>>
>>> df = pd.DataFrame([[1, 1]], columns=list("aa"))
>>> df
   a  a
0  1  1
>>>
>>> with tempfile.TemporaryFile() as fp:
...     df.to_excel(fp)
...     res = pd.read_excel(fp, dtype={"a": str})
...
>>> print(res)
   Unnamed: 0  a  a.1
0           0  1    1
>>>
>>> res.dtypes
Unnamed: 0     int64
a             object
a.1            int64
dtype: object
>>>
>>> with tempfile.TemporaryFile() as fp:
...     df.to_excel(fp)
...     res = pd.read_excel(fp, dtype={"a": str, "a.1": str})
...
>>> print(res)
   Unnamed: 0  a a.1
0           0  1   1
>>>
>>> res.dtypes
Unnamed: 0     int64
a             object
a.1           object
dtype: object
>>>

further investigation and PRs welcome.

@simonjayhawkins simonjayhawkins added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 10, 2020
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Jul 10, 2020
@sharvanir
Copy link

@simonjayhawkins can I check it out to see if I can fix?

@sharvanir
Copy link

take

@jeroenbrouwer
Copy link
Author

Any luck so far @sharvanir ?

@sharvanir
Copy link

sharvanir commented Oct 11, 2020

hi @jeroenbrouwer looks like an issue with python's AST module, trying to find a workaround

@asishm
Copy link
Contributor

asishm commented Oct 13, 2020

read_excel uses the PythonParser behind the scene to convert the parsed data to a dataframe.

Simpler repro would be

In [1]: import pandas as pd

In [2]: data = '''a,a\n1,1'''

In [3]: from io import StringIO

In [4]: df = pd.read_csv(StringIO(data), dtype={'a': str}, mangle_dupe_cols=True, engine='python')

In [9]: df
Out[9]: 
   a  a.1
0  1    1

In [10]: df.dtypes
Out[10]: 
a      object
a.1     int64
dtype: object

engine='c' has the same behavior as the python engine currently.

If this behavior is changed at the PythonParser level, then I believe a corresponding change should happen in the CParserWrapper to remain consistent.

@sharvanir
Copy link

sorry, got confused, it's not an ast issue. I will unassign this as I am unable to put in time due to professional commitments.
Will get back if possible

@sharvanir sharvanir removed their assignment Oct 13, 2020
@mzeitlin11 mzeitlin11 added the IO CSV read_csv, to_csv label Dec 25, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.3 May 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants