BUG: ZERO WIDTH NO-BREAK SPACE in column name causes a reading failure #36343

Closed
2 of 3 tasks
omarbelaggoun opened this issue Sep 13, 2020 · 4 comments · Fixed by #36365
Labels
Bug; IO CSV (read_csv, to_csv); IO Data (IO issues that don't fit into a more specific label)

@omarbelaggoun
omarbelaggoun commented Sep 13, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

test excel.xlsx

# Read the attached workbook; its first header cell starts with U+FEFF.
import pandas as pd

df = pd.read_excel("test excel.xlsx")
df

Problem description

A customer sends me a file whose first header cell begins with a ZERO WIDTH NO-BREAK SPACE (U+FEFF).
When I read the file, only the first header is returned unless I delete that character.
I created a repro by copying the header row from the original file into a new Excel workbook.
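
Because U+FEFF is invisible, it is easy to miss in a header cell. The following is a minimal sketch of one way to check header strings for it, using only the standard library. The helper name find_invisible and the sample headers are illustrative assumptions, not taken from the original file.

```python
import unicodedata


def find_invisible(headers):
    """Return (header_index, char_index, unicode_name) for each format character."""
    hits = []
    for i, header in enumerate(headers):
        for j, ch in enumerate(header):
            # Unicode category "Cf" covers format characters such as U+FEFF.
            if unicodedata.category(ch) == "Cf":
                hits.append((i, j, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Hypothetical header row mirroring the shape described in the report.
headers = ["\ufeffBuilding Park Name", "Building Name", "Class"]
print(find_invisible(headers))
```

Each reported position points at a character that would have to be removed for the header to contain only visible text.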

Expected Output

Building Park Name	Building Name	New / Renewal	Address	City	State	Class 

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.1.1
setuptools : 41.2.0
Cython : None
pytest : 5.3.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : None
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.12
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.8

@omarbelaggoun added the Bug and Needs Triage labels Sep 13, 2020
@MarcoGorelli
Member

Thanks @omarbelaggoun for the report, here's something copy-and-pasteable:

import pandas as pd                                                     

df = pd.read_excel('https://github.com/pandas-dev/pandas/files/5214984/test.excel.xlsx')
df

which returns

Empty DataFrame
Columns: [Building Park Name]
Index: []

@MarcoGorelli added the IO Excel label and removed the Needs Triage label Sep 14, 2020
@asishm
Contributor

asishm commented Sep 14, 2020

The problem arises from this section, which is called from self._infer_columns():

https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L2942-L2946

        # This was the first line of the file,
        # which could contain the BOM at the
        # beginning of it.
        if self.pos == 1:
            line = self._check_for_bom(line)

This changes line from ['\ufeffBuilding Park Name', 'Building Name', 'New / Renewal', 'Address', 'City', 'State', 'Class'] to ['Building Park Name'], so every column after the first is dropped.
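
To make the effect concrete, here is a simplified sketch of the kind of logic that would produce this result, next to a variant that keeps the remaining cells. It is not the pandas source: both function names are hypothetical, and the body of the first is only an assumption consistent with the before/after values quoted above.

```python
BOM = "\ufeff"


def strip_bom_dropping_rest(first_row):
    """Illustrative only: strips the BOM but returns just the first cell."""
    if first_row and isinstance(first_row[0], str) and first_row[0].startswith(BOM):
        # The remaining cells (first_row[1:]) are lost here.
        return [first_row[0][1:]]
    return first_row


def strip_bom_keeping_rest(first_row):
    """Illustrative only: strips the BOM and keeps every other cell."""
    if first_row and isinstance(first_row[0], str) and first_row[0].startswith(BOM):
        return [first_row[0][1:]] + first_row[1:]
    return first_row


line = ["\ufeffBuilding Park Name", "Building Name", "New / Renewal",
        "Address", "City", "State", "Class"]
print(strip_bom_dropping_rest(line))
print(strip_bom_keeping_rest(line))
```

The first call reproduces the single-column header seen in the report; the second returns all seven headers with the BOM removed.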

@asishm
Contributor

asishm commented Sep 14, 2020

A simpler example that does not involve Excel at all; the default engine returns all three columns, while engine='python' returns only the first:

In [1]: import pandas as pd
In [11]: from io import StringIO

In [20]: data = '''\ufeffHead1,Head2,Head3'''

In [21]: pd.read_csv(StringIO(data))
Out[21]: 
Empty DataFrame
Columns: [Head1, Head2, Head3]
Index: []

In [22]: pd.read_csv(StringIO(data), engine='python')
Out[22]: 
Empty DataFrame
Columns: [Head1]
Index: []
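
Until a fix is available, one possible workaround is to remove the leading U+FEFF before the text reaches the parser, in line with the original observation that the file reads correctly once the character is deleted. The sketch below uses only the standard library so that it runs without pandas; the variable names are illustrative.

```python
import csv
from io import StringIO

data = "\ufeffHead1,Head2,Head3"

# Drop any leading BOM characters; the rest of the text is left unchanged.
cleaned = data.lstrip("\ufeff")

# Parse the header row with the stdlib csv reader to confirm all cells survive.
header = next(csv.reader(StringIO(cleaned)))
print(header)
```

The cleaned string could then be passed on as pd.read_csv(StringIO(cleaned), engine='python'); with no BOM left at the start of the first cell, the BOM check should have nothing to strip. For text files on disk, opening with encoding='utf-8-sig' also removes a leading BOM, although that does not help when the character is stored inside an Excel cell.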

@MarcoGorelli added the IO Data label and removed the IO Excel label Sep 14, 2020
@MarcoGorelli
Member

Thanks @asishm !
