csv_read() fails on properly decoding latin-1(i.e. non utf8) encoded file from URL #10424

BotoKopo · 2015-06-24T07:05:36Z

Problem

A colleague and I ran into this problem while working with data on an FTP (or HTTP) server. The server is on an internal network, so unfortunately we can't point to a real example file.

Reading a CSV file with read_csv() fails to decode the header properly when the file uses a non-UTF-8 encoding (such as latin-1), the header contains a special character, and the file is accessed through a URL (http or ftp). The problem does not occur when the file is local, nor when the file is UTF-8 encoded (local or remote).
The result looks as if the file had been decoded twice.

An example should make this clearer.

Let's say we have two CSV files on a remote server, data.latin1.csv and data.utf8.csv, encoded in latin-1 and UTF-8 respectively, both containing:

a,b°
1.1,2.2

Then the following code:

import pandas as pd

# Placeholder URL: "encoding" is swapped for the actual encoding name below.
path = "ftp://sorry/I/cant/supply/such/a/path/for/the/example/data.encoding.csv"

for enc in ('latin1', 'utf8'):
    f = path.replace('encoding', enc)
    data = pd.read_csv(f, encoding=enc)
    # Show the second column name and its length in characters.
    print("encoding {0} : non-ascii={1} , length={2}".format(enc, data.columns[1].encode('utf8'), len(data.columns[1])))

will give :

encoding latin1 : non-ascii=bÂ° , length=3
encoding utf8 : non-ascii=b° , length=2

This was tested with Python 2.7.6 + Pandas 0.13.1 and Python 3.4.0 + Pandas 0.15.2, with the same result.

The same operation on local files gives the correct result, i.e. the same as the 'utf8' output above (this REALLY IS a matter of URL + latin-1, or any encoding other than UTF-8). It looks like the data was decoded twice, as the output length shows: the already-decoded '°' appears to have been treated as a "normal" character, converted to UTF-8 (two bytes), and then decoded as latin-1 again, which yields two characters.
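To make the suspected round trip concrete, here is a minimal sketch using plain Python strings only (no pandas, no network). It assumes the mechanism is "decode as latin-1, re-encode as UTF-8, decode as latin-1 again"; the variable names are illustrative only.

```python
# Bytes of the header cell "b°" as stored in the latin-1 file.
raw = b"b\xb0"

# First decode: the correct one, giving the two-character string "b°".
once = raw.decode("latin1")

# Assumed round trip: the already-decoded text is re-encoded as UTF-8
# and then decoded as latin-1 a second time.
twice = once.encode("utf8").decode("latin1")

print(repr(once), len(once))    # 'b°' 2
print(repr(twice), len(twice))  # 'bÂ°' 3
```

The second line matches the column name and length reported above for the latin-1 file read from a URL.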

This test raises an error ("UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 3: ordinal not in range(128)") when the Python engine is used for read_csv().

In pandas code

Now, looking at the pandas code, I would focus on two points in pandas.io.parsers:

when the file is a URL, the data is opened through urllib (or urllib2), then read, decoded (according to the requested encoding), and the result is fed into a StringIO stream (cf. pandas.io.common.maybe_read_encoded_stream()),
as far as I could trace it, the data seems to be decoded again later, in particular for the 'c' engine in the pandas.io.parsers.CParserWrapper.read() method (ultimately by _parser.read(), which is the C parser).

This would explain the double decoding when the file is a URL, and the normal decoding when the file is local.

Furthermore, in pandas.io.common, when replacing (in maybe_read_encoded_stream() function) :

from pandas.compat import StringIO
...
reader = StringIO(reader.read().decode(encoding, errors))

by :

from pandas.compat import StringIO, BytesIO
...
reader = BytesIO(reader.read())

this problem seems to be solved (which is logical when we look at what StringIO/BytesIO point to, depending on the Python version, and which data they handle).
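The difference between the two variants can be illustrated with a toy model. This is only a sketch: toy_parse_header is a hypothetical stand-in for a byte-oriented parser, not the pandas C parser, and it assumes that text input gets turned into UTF-8 bytes before the requested encoding is applied.

```python
import io

def toy_parse_header(stream, encoding):
    """Hypothetical stand-in for a byte-oriented parser, not pandas' code.

    Assumption: the parser works on bytes, so text input is first turned
    into UTF-8 bytes, and everything is then decoded with `encoding`.
    """
    data = stream.read()
    if isinstance(data, str):
        data = data.encode("utf8")
    first_line = data.splitlines()[0]
    return first_line.decode(encoding).split(",")

payload = b"a,b\xb0\n1.1,2.2\n"  # latin-1 bytes of the example file

# Behaviour as described above: decode first, wrap the text in a StringIO.
via_text = toy_parse_header(io.StringIO(payload.decode("latin1")), "latin1")

# Proposed change: hand the undecoded bytes over in a BytesIO.
via_bytes = toy_parse_header(io.BytesIO(payload), "latin1")

print(via_text)   # ['a', 'bÂ°']
print(via_bytes)  # ['a', 'b°']
```

Under that assumption, the stream is decoded exactly once only when the parser receives raw bytes, which is what the BytesIO variant does.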

So it seems to me that the problem is located there, and it would then be a bug.
However, it could be a feature ;-) since I don't know whether the change would have side effects in cases other than the one discussed here, especially if StringIO was used intentionally for a purpose I can't figure out.


shoyer · 2015-06-24T07:36:41Z

Why don't you give this change a try and see if it breaks anything else? See here for instructions: http://pandas.pydata.org/pandas-docs/stable/contributing.html

BotoKopo · 2015-06-25T08:15:11Z

Well, I'm not really used to contributing here -- in fact, this would be my first contribution to a project outside my job -- so I'm not sure about the contribution process.
So, just to be sure I understand correctly: you mean I should commit this change and open a pull request (sorry for the naive question)?

shoyer · 2015-06-25T14:42:41Z

If you install pandas from git, you can run the tests locally.

Another option is to commit the changes and issue a pull request. Then the tests will be run automatically by our continuous integration system (Travis CI).

mjpieters · 2017-09-01T12:52:27Z

Another test case, and a workaround: use urllib.request to load the data instead of leaving this to pandas:

import pandas as pd
import urllib.request  # a bare "import urllib" does not reliably expose urllib.request

# Data encoded with CP1252, just one non-ASCII byte 0x92 == U+2019 RIGHT SINGLE QUOTATION MARK
url = "https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv"

df1 = pd.read_csv(url, sep=';', encoding='cp1252')

print(df1[' '][102])  # Korea, Dem. Peopleâ€™s Rep.
print(df1[' '][102].encode('cp1252').decode('utf8'))  # Korea, Dem. People’s Rep.

with urllib.request.urlopen(url) as resp:
    df2 = pd.read_csv(resp, sep=";", encoding='cp1252')
print(df2[' '][102])  # Korea, Dem. People’s Rep.

Pandas seems to have decoded as CP1252 twice, with an intermediary UTF-8 encoding applied. You get the same data when you manually decode the data as CP1252, then encode it again as UTF-8, then decode it once more as CP1252:

>>> b'Korea, Dem. People\x92s Rep.'.decode('cp1252')
'Korea, Dem. People’s Rep.'
>>> b'Korea, Dem. People\x92s Rep.'.decode('cp1252').encode('utf8').decode('cp1252')
'Korea, Dem. Peopleâ€™s Rep.'

Passing in a file-like object from urllib.request neatly evades the issue. Perhaps Pandas is getting confused by the text/plain; charset=utf-8 Content-Type header?
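For data that has already been loaded with mangled strings, the reverse round trip used in the print statement above can be wrapped in a small helper. This is a sketch only: undo_double_decode is a hypothetical name, and it assumes the damage is exactly "decode, UTF-8 encode, decode again" with a known encoding.

```python
def undo_double_decode(text, encoding="cp1252"):
    """Reverse a decode -> UTF-8 encode -> decode round trip.

    Hypothetical helper. The input is returned unchanged when the
    reverse round trip fails, i.e. when the text does not look
    double-decoded.
    """
    try:
        return text.encode(encoding).decode("utf8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Reproduce the mangled value from the session above, then repair it.
mangled = (
    b"Korea, Dem. People\x92s Rep."
    .decode("cp1252")
    .encode("utf8")
    .decode("cp1252")
)
print(undo_double_decode(mangled))  # Korea, Dem. People’s Rep.
```

The helper is a repair for values already in memory, not a fix for the underlying issue. A correctly decoded string whose characters happen to form valid UTF-8 when re-encoded would also be altered, so it should only be applied to data known to be affected.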

BotoKopo added a commit to BotoKopo/pandas that referenced this issue Jul 8, 2015

BUG : read_csv() twice decodes stream on URL file pandas-dev#10424

7a0c3fc

BotoKopo mentioned this issue Jul 8, 2015

BUG : read_csv() twice decodes stream on URL file #10424 #10529

Closed

jreback added the Bug, IO Data (IO issues that don't fit into a more specific label), and Unicode (Unicode strings) labels Jul 8, 2015

jreback added this to the Next Major Release milestone Jul 8, 2015

jreback mentioned this issue Aug 17, 2015

Categorical can not be used as key in merges #10832

Closed

jbrockmendel added the IO CSV (read_csv, to_csv) label Jul 25, 2018

jbrockmendel removed the IO Data (IO issues that don't fit into a more specific label) label Dec 1, 2019

mroeschke added the IO Network (Local or Cloud (AWS, GCS, etc.) IO Issues) label Apr 14, 2020

twoertwein mentioned this issue Aug 15, 2020

TST: encoding for URLs in read_csv #35742

Merged


jreback modified the milestones: Contributions Welcome, 1.2 Aug 17, 2020

jreback closed this as completed in #35742 Aug 17, 2020
