csv_read() fails on properly decoding latin-1(i.e. non utf8) encoded file from URL #10424
Comments
Why don't you give this change a try and see if it breaks anything else? See here for instructions: http://pandas.pydata.org/pandas-docs/stable/contributing.html
Well, I'm not really used to contributing here (in fact, it's the first time I've contributed to a project outside of my job), so I'm not sure about the contribution process.
If you install pandas from git, you can run the tests locally. Another option is to commit the changes and open a pull request; the tests will then be run automatically by our continuous integration system (Travis CI).
Another test case, with a workaround:

    import pandas as pd
    import urllib.request

    # Data encoded with CP1252, just one non-ASCII byte 0x92 == U+2019 RIGHT SINGLE QUOTATION MARK
    url = "https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv"

    # Reading straight from the URL: the text comes back mis-decoded
    df1 = pd.read_csv(url, sep=';', encoding='cp1252')
    print(df1[' '][102])  # Korea, Dem. Peopleâ€™s Rep.
    print(df1[' '][102].encode('cp1252').decode('utf8'))  # Korea, Dem. People’s Rep.

    # Passing an open HTTP response instead: decoded correctly
    with urllib.request.urlopen(url) as resp:
        df2 = pd.read_csv(resp, sep=";", encoding='cp1252')
    print(df2[' '][102])  # Korea, Dem. People’s Rep.

Pandas seems to have decoded as CP1252 twice, with an intermediary UTF-8 encoding applied. You get the same data when you manually decode the data as CP1252, then encode it again as UTF-8, then decode it once more as CP1252:

    >>> b'Korea, Dem. People\x92s Rep.'.decode('cp1252')
    'Korea, Dem. People’s Rep.'
    >>> b'Korea, Dem. People\x92s Rep.'.decode('cp1252').encode('utf8').decode('cp1252')
    'Korea, Dem. Peopleâ€™s Rep.'

Passing in a file-like object from
Problem
Here is a problem that a colleague and I ran into while working on data available on an FTP (or HTTP) server (internal network; sorry, we can't point to a proper example file).
Reading a CSV file (with read_csv) that is not UTF-8 encoded (e.g. latin-1) and has special characters in its header fails to decode the header properly when the file is accessed through a URL (HTTP or FTP), but not when the file is local, nor when the file is UTF-8 (local or remote).
The result looks like the file was decoded twice.
An example should be clearer.
Let's say we have two CSV files (on a remote server), data.latin1.csv and data.utf8.csv, encoded in latin-1 and UTF-8 respectively, and both containing:
Then the following code:
will give:
This was tested with Python 2.7.6 + Pandas 0.13.1 and Python 3.4.0 + Pandas 0.15.2, with the same result.
The same operation on local files gives the expected result, i.e. like the previous 'utf8' encoding output (this REALLY IS a matter of URL + latin-1, or anything but UTF-8). It looks like the data was decoded twice: as the output length shows, the latin-1 code for '°' seems to be treated as a "normal" character and converted to UTF-8.
This test raises an error ("UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 3: ordinal not in range(128)") when the Python engine is used for read_csv().
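The suspected double decoding can be reproduced without pandas. The following is only a minimal sketch of the symptom under Python 3, using the single character '°' mentioned above; it is not taken from the pandas code base and makes no claim about where in pandas the second decode happens.

```python
# '°' (U+00B0) encoded as latin-1 is the single byte 0xB0.
raw = b"\xb0"

# Correct handling: decode the bytes exactly once.
once = raw.decode("latin-1")

# Suspected faulty handling: the decoded text is re-encoded as UTF-8
# and then decoded as latin-1 a second time.
twice = once.encode("utf-8").decode("latin-1")

print(repr(once), len(once))    # '°' 1
print(repr(twice), len(twice))  # 'Â°' 2

# Forcing the correctly decoded character through the ASCII codec fails,
# which is the kind of UnicodeEncodeError reported for the Python engine.
try:
    once.encode("ascii")
except UnicodeEncodeError as exc:
    print(type(exc).__name__)
```

The extra character in the second result is what makes the output one character longer than expected.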
In pandas code
Now, having a look at pandas' code, I would focus on two points in pandas.io.parsers:
This would explain the double decoding when the file is a URL, and the normal decoding when the file is local.
Furthermore, in pandas.io.common, when replacing (in the maybe_read_encoded_stream() function):
by:
the problem seems to be solved (which is logical when we look at what the StringIO/BytesIO names point to, depending on the Python version, and which data they handle).
So it seems to me that the problem is located at that point, and it would then be a bug.
However, it could be a feature ;-) as I don't know whether there could be side effects in cases other than the one discussed here, especially if StringIO was intentionally used for a purpose I can't figure out.
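To make the StringIO versus BytesIO point concrete, here is a small stand-alone model of the hypothesis. Everything in it is an assumption made for illustration: `naive_parser` is a hypothetical stand-in for a reader that expects bytes and decodes them itself, the payload is an invented header, and none of it is the actual pandas implementation.

```python
import io


def naive_parser(stream, encoding):
    """Hypothetical bytes-oriented reader (not pandas code)."""
    data = stream.read()
    if isinstance(data, str):
        # Text arrived instead of bytes: turn it back into UTF-8 bytes
        # before applying the user-supplied encoding.
        data = data.encode("utf-8")
    return data.decode(encoding)


# Invented latin-1 payload, as it might arrive from an HTTP or FTP response.
payload = "n°;value\n".encode("latin-1")

# Variant 1: decode up front and hand over text (StringIO).
# The reader then applies the encoding a second time.
as_text = io.StringIO(payload.decode("latin-1"))

# Variant 2: hand over the raw bytes (BytesIO).
# The reader applies the encoding exactly once.
as_bytes = io.BytesIO(payload)

double = naive_parser(as_text, "latin-1")
single = naive_parser(as_bytes, "latin-1")

print(repr(double))  # 'nÂ°;value\n'
print(repr(single))  # 'n°;value\n'
```

If the downstream reader behaves roughly like this model, handing it undecoded bytes avoids the second decode, which would be consistent with the change described above fixing the URL case.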