read_csv does not parse in header with BOM utf-8 #4793
Comments
Which version of Python are you using? |
Python 2.7.5 :: Anaconda 1.6.1 (x86_64). The code is written in an IPython notebook, using IPython 1.0.0. |
@john-orange-aa, can you provide a reproducible example (link to a file if you need to)? |
@jreback, here is the link to the folder: BOM-temp.csv is the offending file. BOM-temp2.csv is the same file with headers removed. The "pandas BOM utf-8 bug.ipynb" is the IPython notebook that illustrates the bug.
|
This isn't exactly the same issue, but I'm also having trouble with BOMs. File with a utf-8 BOM here.
|
It looks like you should use 'utf-8-sig' as the encoding for utf-8 files with a BOM, so my comment is likely invalid. |
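To illustrate the difference between the two codec names, here is a minimal sketch using only the standard library. The sample bytes are made up for the example; the point is only that 'utf-8' keeps the BOM as the character U+FEFF while 'utf-8-sig' drops it.

```python
import codecs

# Hypothetical sample: a UTF-8 BOM followed by a tiny quoted CSV.
raw = codecs.BOM_UTF8 + '"name"\n"foo"'.encode("utf-8")

as_utf8 = raw.decode("utf-8")          # BOM survives as a leading U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")  # BOM is consumed by the codec

print(repr(as_utf8[:3]))
print(repr(as_utf8_sig[:3]))
```

This only shows what the codecs do. Whether a given pandas version then handles the quoted header correctly is a separate question, as the later comments show.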
Is it possible for Pandas to infer the encoding from the BOM automatically - or is it really required to pass this information through in the encoding? I think the purpose of the BOM was to provide this kind of capability? |
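As a rough sketch of what such inference could look like on the caller's side (this is not how pandas works internally, and the helper name `sniff_bom_encoding` is hypothetical), one can compare the first bytes of a file against the BOM constants in the standard `codecs` module:

```python
import codecs

# Longer BOMs are checked first, because the UTF-32 LE BOM begins with
# the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom_encoding(head: bytes, default: str = "utf-8") -> str:
    """Return a codec name that consumes the BOM found in `head`, if any."""
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    return default  # no BOM: fall back to the caller's assumption


sample = codecs.BOM_UTF8 + b'"name"\n"foo"'
print(sniff_bom_encoding(sample[:4]))
```

The returned name could then be passed as the `encoding` argument when opening the file. A file without a BOM gives no information, so the default remains a guess.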
'utf-8-sig' does not resolve the issue. I faced the same issue, but using 'utf-8-sig' just got me another decoding problem. |
I ran into this problem as well (version 0.15.2). I tried 'utf-8-sig' encoding, and though I didn't see an error, the result was not quite right as the first key is quoted and none of the other keys are, though all column headers/values are quoted throughout the file.
Note the extra set of quotes on the first key |
With 0.15.2, I am able to use |
In 0.15.2, I find that the BOM disappears, but the quotes around the first column header are erroneously preserved, while quotes around all other column headers (and all other values) are stripped. So the problem with utf-8-sig seems to only affect quoted column headers. Here's an example file to try https://dl.dropboxusercontent.com/u/27287953/bom.csv ... |
Hi, I'm using pandas 0.16 in python 3.4 (anaconda distro). I'm reading the file with:
When I print the column names:
So, there seems to be a 'TimeStamp' column. Let me check what is in there:
And, I don't seem to be able to find out what is the real name of the TimeStamp column =/ I had this in a longer script and it took me forever to understand where the problem came from. Adding the |
@zelite In this example, your column was actually spelled |
@shoyer I don't think that was a space thing; it was definitely the BOM. If you do something like df.columns[0] you will see the utf-8 characters. |
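A small standard-library sketch of why the column is so hard to find: the BOM character is invisible when printed, but it shows up in `repr` and makes the name compare unequal. The header name 'TimeStamp' is taken from the comment above; the second column and the data row are made up for the example.

```python
import csv
import io

# Hypothetical file contents decoded as plain 'utf-8', so the BOM is kept.
text = "\ufeffTimeStamp,Value\n2015-01-01,1\n"

header = next(csv.reader(io.StringIO(text)))

print(header[0])                 # usually renders like a plain TimeStamp
print(repr(header[0]))           # repr exposes the hidden leading character
print(header[0] == "TimeStamp")  # the lookup by the visible name fails

# One possible cleanup: strip a leading BOM from every column name.
cleaned = [name.lstrip("\ufeff") for name in header]
print(cleaned)
```

The same `lstrip("\ufeff")` cleanup could be applied to a list of DataFrame column names after loading, as a workaround rather than a fix.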
OK, I'm clearly lost. Just looked up exactly what "BOM" is :). |
I can reproduce the issue with pandas 0.16.0. I get the same error in the first column name with read_csv, and I also get a corrupted first cell in the first column when the file has no header row and I pass names= in the read_csv call. It sounds like read_csv correctly interprets encoding='utf-8-sig' by skipping the first 3 bytes of the file and then decoding the rest as UTF-8. However, the bug makes me think that pandas "forgets" to skip those 3 bytes when it starts to parse the file and build the dataframe. It is as if the offset of the start of the actual data was not advanced by len(UTF8_BOM), so the BOM ends up in the first column name or in the first cell of the dataframe. The most misleading part is that the characters are not visible when the dataframe or the column names are displayed in IPython, but the BOM is clearly still there in that cell's string, as pointed out in a previous comment above. HTH |
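The offset idea can be sketched outside pandas as follows. This is only an illustration of skipping len(BOM) bytes before parsing, not the actual parser code; `skip_utf8_bom` is a hypothetical helper and the sample bytes are made up.

```python
import codecs
import csv
import io


def skip_utf8_bom(buf: bytes) -> bytes:
    """Return `buf` with a leading UTF-8 BOM (3 bytes: EF BB BF) removed."""
    offset = len(codecs.BOM_UTF8) if buf.startswith(codecs.BOM_UTF8) else 0
    return buf[offset:]


raw = codecs.BOM_UTF8 + b'"name"\n"foo"'

# With the BOM skipped, the opening quote is the first character of the
# field again, so the csv module strips the quotes as expected.
rows = list(csv.reader(io.StringIO(skip_utf8_bom(raw).decode("utf-8"))))
print(rows)
```

If the BOM is left in place, the first field no longer starts with a quote character, which matches the symptom of the first header keeping its quotes.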
With encoding 'utf-8-sig', the BOM is correctly skipped rather than prepended to the first column label. However, as described by others, the quotes around the first label remain. The linked zip archive contains a script for demonstration as well as 2 CSV files that differ only in having or not having a BOM. The one without a BOM is parsed correctly; with a BOM, the first label remains quoted. Python 3.5.1 (x86), pandas 0.18.1, Win7 x64 |
A minimal example for future reference:
>>> from io import BytesIO
>>> from pandas import read_csv
>>> import codecs
>>>
>>> BOM = codecs.BOM_UTF8
>>> data = '"name"\n"foo"'.encode('utf-8')
>>>
>>> read_csv(BytesIO(data), encoding='utf-8', engine='c')
# same result if engine='python'
  name
0  foo
>>>
>>> read_csv(BytesIO(BOM + data), encoding='utf-8', engine='c')
# same result if engine='python'
  "name"
0    foo
While I agree that there is a bug in the C engine, I don't believe the same can be said of the Python engine, because the standard library's csv reader behaves the same way:
>>> from io import TextIOWrapper
>>> from csv import reader
>>> for row in reader(TextIOWrapper(BytesIO(BOM + data), encoding='utf-8')): print(row)
['\ufeff"name"']
['foo']
Since the Python engine failure is beyond our control, the question is then: can this issue be closed if we patch the C engine? |
This is mainly an issue on Windows, where BOM markers easily end up in files, so a patch would be good if possible. |
Sorry for my ignorance, not sure why this issue is closed. I have a file UTF-8 with BOM. The (quoted) content is:
... with the problem that the first header 'node_ID' has kept its quotes, same as dr-leo's comment on Jun 23, 2016 above. All other quotes were removed correctly. If I use a UTF-8 file without BOM and
With the quotes correctly removed from the first header.
|
The issue was closed in 0.19.0. Try using a more recent version. |
I'm using pandas.read_fwf in 0.19.0 and seeing a similar issue. INSTALLED VERSIONS commit: None pandas: 0.19.0 |
Try with pandas 0.20.3.
… On 17.10.2017 at 23:41, sf_jac ***@***.*** wrote:
I'm using pandas.read_fwf in 0.19.0 and seeing a similar issue.
INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.0
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.25.2
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.3.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.1
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.3.2
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: 0.7.9.None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None
|
Use sep='\t'. |
(False alarm -- I was opening the file with |
I ran into this problem as well. I exported a table as a .csv file from DBeaver. There is an option to "Insert BOM" that was selected. The header was not read in properly by read_csv() (only the first column name was read). When I exported with "Insert BOM" un-selected read_csv() worked correctly. I am running Python 3.7.2 with Pandas 0.23.4 on Linux Mint 19.1. |
Bug is still persistent in 1.0.3 as of today. Replicated with all combinations of the Python UTF-8 encoding string, with or without hyphens, underscores and "sig" extensions. Stepping through some of the code showed that the encoding was stuck in code page 1252 for a very long time before it was read as UTF-8. Is it not being set early enough?
ETA: Nothing online appears to catch this, but it appears this can be replicated (and solved) as follows:
import pandas as pd
...
with open(filename, encoding="xxx") as f_handle:
    data = pd.read_csv(f_handle, encoding="yyy")
It was not clear from the documentation that encoding |
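A self-contained sketch of the pattern described in this comment, using the standard csv module in place of pd.read_csv so that it runs without pandas. The temporary file and its contents are made up for the example. For a handle opened in text mode, decoding happens in open(), so the encoding chosen there decides whether the BOM reaches the parser.

```python
import codecs
import csv
import os
import tempfile

# Write a hypothetical CSV file that starts with a UTF-8 BOM.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + b'"name"\n"foo"\n')
    path = tmp.name

try:
    # 'utf-8-sig' strips the BOM while decoding, before any parsing starts.
    with open(path, encoding="utf-8-sig", newline="") as f_handle:
        rows = list(csv.reader(f_handle))
finally:
    os.remove(path)

print(rows)
```

Opening the same file with a different text encoding (for example a Windows code page) would hand the parser already-decoded text containing the BOM bytes as ordinary characters.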
I am using Pandas version 0.12.0 on a Mac.
I noticed that when there is a BOM utf-8 file, and if the header row is in the first line, the read_csv() method will leave a leading quotation mark in the first column's name. However, if the header row is further down the file and I use the "header=" option, then the whole header row gets parsed correctly.
Here is some example code:
bing_kw = pd.read_csv('../../data/sem/Bing-Keyword_daily.csv', header=9, thousands=',', encoding='utf-8')
Parses the header correctly.
bing_kw = pd.read_csv('../../data/sem/Bing-Keyword_daily.csv', thousands=',', encoding='utf-8')
Parses the first header column name incorrectly by leaving the leading quotation mark.