BUG: read_csv throws UnicodeDecodeError with unicode aliases #13571


Closed
Changes from 10 commits
62 commits
d485c4a
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 5, 2016
ae62350
BUG: `read_csv` throws UnicodeDecodeError with unicode
nateGeorge Jul 6, 2016
36bcdd8
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Jul 6, 2016
285ccf9
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
173c38b
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
78d46d6
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
35dfb13
chore: matched master
nateGeorge Jul 12, 2016
71f084e
DOC: add pd.read_csv bug #13549
nateGeorge Jul 12, 2016
da8fce4
TST: out-> result and tm.ensure_clean
nateGeorge Jul 12, 2016
1825486
TST: conform to PEP8
nateGeorge Jul 12, 2016
1d30333
TST: condense test_read_utf_aliases test
nateGeorge Jul 12, 2016
4f680d7
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Jul 12, 2016
b582195
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Jul 13, 2016
e26c92a
CLN: remove unnecessary BytesIO import
nateGeorge Jul 13, 2016
d14b69e
CLN: remove unnecessary csv write line
nateGeorge Jul 13, 2016
eeb7011
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Jul 13, 2016
b8d78c4
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 5, 2016
75869f4
BUG: `read_csv` throws UnicodeDecodeError with unicode
nateGeorge Jul 6, 2016
9c88919
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
6725536
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
671ad41
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
3c4a798
BUG: Groupby.nth includes group key inconsistently #12839
adneu Jul 6, 2016
5675b82
In gbq, use googleapiclient instead of apiclient #13454 (#13458)
parthea Jul 7, 2016
ff6117e
RLS: switch master from 0.18.2 to 0.19.0 (#13586)
jorisvandenbossche Jul 8, 2016
b983957
BUG: Datetime64Formatter not respecting ``formatter``
haleemur Jul 8, 2016
451c054
BUG: Fix TimeDelta to Timedelta (#13600)
yui-knk Jul 9, 2016
33278a9
COMPAT: 32-bit compat fixes mainly in testing
jreback Jul 7, 2016
181cecd
BUG: DatetimeIndex - Period shows ununderstandable error
sinhrks Jul 10, 2016
a2e5d54
ENH: add downcast to pd.to_numeric
gfyoung Jul 10, 2016
6c8b21b
CLN: remove radd workaround in ops.py
sinhrks Jul 10, 2016
5d99cff
DEPR: rename Timestamp.offset to .freq
sinhrks Jul 10, 2016
8e7904f
CLN: Remove the engine parameter in CSVFormatter and to_csv
gfyoung Jun 10, 2016
a07b5d3
BUG: Block/DTI doesnt handle tzlocal properly
sinhrks Jul 10, 2016
ff2a335
BUG: Series contains NaT with object dtype comparison incorrect (#13592)
sinhrks Jul 11, 2016
1f8cc7f
CLN/TST: Add tests for nan/nat mixed input (#13477)
sinhrks Jul 11, 2016
f743eb3
BUG: groupby apply on selected columns yielding scalar (GH13568) (#13…
jorisvandenbossche Jul 11, 2016
e161699
TST: Clean up tests of DataFrame.sort_{index,values} (#13496)
IamJeffG Jul 11, 2016
5765b92
DOC: add pd.read_csv bug #13549
nateGeorge Jul 12, 2016
ac18b36
TST: out-> result and tm.ensure_clean
nateGeorge Jul 12, 2016
1fc6b90
TST: conform to PEP8
nateGeorge Jul 12, 2016
6b0e2ca
TST: condense test_read_utf_aliases test
nateGeorge Jul 12, 2016
41a6fae
DOC: asfreq clarify original NaNs are not filled (GH9963) (#13617)
jorisvandenbossche Jul 12, 2016
f730e60
BUG: Invalid Timedelta op may raise ValueError
sinhrks Jul 12, 2016
05a2d04
CLN: Cleanup ops.py
sinhrks Jul 12, 2016
c4e93bd
CLN: Removed outtype in DataFrame.to_dict (#13627)
gfyoung Jul 12, 2016
430273d
CLN: Fix compile time warnings
yui-knk Jul 13, 2016
1fa91b9
CLN: remove unnecessary BytesIO import
nateGeorge Jul 13, 2016
e379e9f
CLN: remove unnecessary csv write line
nateGeorge Jul 13, 2016
a35521e
Pin IPython for doc build to 4.x (see #13639)
jorisvandenbossche Jul 13, 2016
6c09821
CLN: reorg type inference & introspection
jreback Jul 13, 2016
5584dff
BLD: included pandas.api.* in setup.py (#13640)
gfyoung Jul 13, 2016
9463dee
docs: add note about read_csv() bug
nateGeorge Aug 15, 2016
5198179
cln: trying to merge with master
nateGeorge Aug 15, 2016
3c30cd0
CLN: merge with master
nateGeorge Aug 15, 2016
e77ac2d
Merge branch 'fix/read_csv-utf-aliases' of github.com:nateGeorge/pand…
nateGeorge Aug 19, 2016
69ab536
CLN: reset to master branch
nateGeorge Aug 19, 2016
1eb478d
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Aug 19, 2016
a2f178f
CLN: fix small diff from upstream/master
nateGeorge Aug 19, 2016
8e05f7e
BUG: _read encoding fix
nateGeorge Aug 19, 2016
ab153d5
DOC: add note on read_csv bug
nateGeorge Aug 19, 2016
0c1de9f
TST: add test for read_csv with unicode bug
nateGeorge Aug 19, 2016
77ec966
CLN: fix indents and spacings
nateGeorge Aug 19, 2016
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.19.0.txt
@@ -549,3 +549,5 @@ Bug Fixes
- Bug in ``groupby`` with ``as_index=False`` returns all NaN's when grouping on multiple columns including a categorical one (:issue:`13204`)

- Bug where ``pd.read_gbq()`` could throw ``ImportError: No module named discovery`` as a result of a naming conflict with another python package called apiclient (:issue:`13454`)

- Bug in ``pd.read_csv()``, where aliases for utf-xx (e.g. UTF-xx, UTF_xx, utf_xx) raised ``UnicodeDecodeError`` (:issue:`13549`)
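For context on the whatsnew entry above: Python's own codec registry already treats these spellings as aliases of one codec, which is why normalizing them before the parser compares encoding names is safe. A stdlib-only sketch (no pandas required) showing each alias resolving to the canonical name:

```python
import codecs

# Each alias spelling should resolve to the same canonical codec name.
for byte in (8, 16):
    canonical = 'utf-' + str(byte)
    for alias in ('utf-%d' % byte, 'utf_%d' % byte,
                  'UTF-%d' % byte, 'UTF_%d' % byte):
        assert codecs.lookup(alias).name == canonical
```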
4 changes: 4 additions & 0 deletions pandas/io/parsers.py
@@ -339,6 +339,10 @@ def _validate_nrows(nrows):
def _read(filepath_or_buffer, kwds):
"Generic reader of line files."
encoding = kwds.get('encoding', None)
if encoding is not None:
encoding = re.sub('_', '-', encoding).lower()
kwds['encoding'] = encoding

skipfooter = kwds.pop('skipfooter', None)
if skipfooter is not None:
kwds['skip_footer'] = skipfooter
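The two lines added to `_read` above can be exercised in isolation. This is a minimal sketch of the same normalization; the helper name is illustrative, not part of the patch:

```python
import re

def normalize_encoding(encoding):
    # Same transformation the patch applies in _read():
    # replace underscores with hyphens, then lowercase.
    if encoding is not None:
        encoding = re.sub('_', '-', encoding).lower()
    return encoding

assert normalize_encoding('UTF_16') == 'utf-16'
assert normalize_encoding('utf_8') == 'utf-8'
assert normalize_encoding(None) is None
```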
78 changes: 49 additions & 29 deletions pandas/io/tests/parser/common.py
@@ -1339,8 +1339,8 @@ def test_compact_ints_use_unsigned(self):
'b': np.array([9], dtype=np.int64),
'c': np.array([258], dtype=np.int64),
})
out = self.read_csv(StringIO(data))
tm.assert_frame_equal(out, expected)
result = self.read_csv(StringIO(data))
tm.assert_frame_equal(result, expected)

expected = DataFrame({
'a': np.array([1], dtype=np.int8),
@@ -1351,14 +1351,14 @@ def test_compact_ints_use_unsigned(self):
# default behaviour for 'use_unsigned'
with tm.assert_produces_warning(
FutureWarning, check_stacklevel=False):
out = self.read_csv(StringIO(data), compact_ints=True)
tm.assert_frame_equal(out, expected)
result = self.read_csv(StringIO(data), compact_ints=True)
tm.assert_frame_equal(result, expected)

with tm.assert_produces_warning(
FutureWarning, check_stacklevel=False):
out = self.read_csv(StringIO(data), compact_ints=True,
use_unsigned=False)
tm.assert_frame_equal(out, expected)
result = self.read_csv(StringIO(data), compact_ints=True,
use_unsigned=False)
tm.assert_frame_equal(result, expected)

expected = DataFrame({
'a': np.array([1], dtype=np.uint8),
@@ -1368,9 +1368,9 @@

with tm.assert_produces_warning(
FutureWarning, check_stacklevel=False):
out = self.read_csv(StringIO(data), compact_ints=True,
use_unsigned=True)
tm.assert_frame_equal(out, expected)
result = self.read_csv(StringIO(data), compact_ints=True,
use_unsigned=True)
tm.assert_frame_equal(result, expected)

def test_compact_ints_as_recarray(self):
data = ('0,1,0,0\n'
@@ -1399,27 +1399,28 @@ def test_as_recarray(self):
data = 'a,b\n1,a\n2,b'
expected = np.array([(1, 'a'), (2, 'b')],
dtype=[('a', '<i8'), ('b', 'O')])
out = self.read_csv(StringIO(data), as_recarray=True)
tm.assert_numpy_array_equal(out, expected)
result = self.read_csv(StringIO(data), as_recarray=True)
tm.assert_numpy_array_equal(result, expected)

# index_col ignored
with tm.assert_produces_warning(
FutureWarning, check_stacklevel=False):
data = 'a,b\n1,a\n2,b'
expected = np.array([(1, 'a'), (2, 'b')],
dtype=[('a', '<i8'), ('b', 'O')])
out = self.read_csv(StringIO(data), as_recarray=True, index_col=0)
tm.assert_numpy_array_equal(out, expected)
result = self.read_csv(
StringIO(data), as_recarray=True, index_col=0)
tm.assert_numpy_array_equal(result, expected)

# respects names
with tm.assert_produces_warning(
FutureWarning, check_stacklevel=False):
data = '1,a\n2,b'
expected = np.array([(1, 'a'), (2, 'b')],
dtype=[('a', '<i8'), ('b', 'O')])
out = self.read_csv(StringIO(data), names=['a', 'b'],
header=None, as_recarray=True)
tm.assert_numpy_array_equal(out, expected)
result = self.read_csv(StringIO(data), names=['a', 'b'],
header=None, as_recarray=True)
tm.assert_numpy_array_equal(result, expected)

# header order is respected even though it conflicts
# with the natural ordering of the column names
@@ -1428,16 +1429,17 @@ def test_as_recarray(self):
data = 'b,a\n1,a\n2,b'
expected = np.array([(1, 'a'), (2, 'b')],
dtype=[('b', '<i8'), ('a', 'O')])
out = self.read_csv(StringIO(data), as_recarray=True)
tm.assert_numpy_array_equal(out, expected)
result = self.read_csv(StringIO(data), as_recarray=True)
tm.assert_numpy_array_equal(result, expected)

# overrides the squeeze parameter
with tm.assert_produces_warning(
FutureWarning, check_stacklevel=False):
data = 'a\n1'
expected = np.array([(1,)], dtype=[('a', '<i8')])
out = self.read_csv(StringIO(data), as_recarray=True, squeeze=True)
tm.assert_numpy_array_equal(out, expected)
result = self.read_csv(
StringIO(data), as_recarray=True, squeeze=True)
tm.assert_numpy_array_equal(result, expected)

# does data conversions before doing recarray conversion
with tm.assert_produces_warning(
@@ -1446,18 +1448,18 @@ def test_as_recarray(self):
conv = lambda x: int(x) + 1
expected = np.array([(2, 'a'), (3, 'b')],
dtype=[('a', '<i8'), ('b', 'O')])
out = self.read_csv(StringIO(data), as_recarray=True,
converters={'a': conv})
tm.assert_numpy_array_equal(out, expected)
result = self.read_csv(StringIO(data), as_recarray=True,
converters={'a': conv})
tm.assert_numpy_array_equal(result, expected)

# filters by usecols before doing recarray conversion
with tm.assert_produces_warning(
FutureWarning, check_stacklevel=False):
data = 'a,b\n1,a\n2,b'
expected = np.array([(1,), (2,)], dtype=[('a', '<i8')])
out = self.read_csv(StringIO(data), as_recarray=True,
usecols=['a'])
tm.assert_numpy_array_equal(out, expected)
result = self.read_csv(StringIO(data), as_recarray=True,
usecols=['a'])
tm.assert_numpy_array_equal(result, expected)

def test_memory_map(self):
mmap_file = os.path.join(self.dirpath, 'test_mmap.csv')
@@ -1467,5 +1469,23 @@ def test_memory_map(self):
'c': ['I', 'II', 'III']
})

out = self.read_csv(mmap_file, memory_map=True)
tm.assert_frame_equal(out, expected)
result = self.read_csv(mmap_file, memory_map=True)
tm.assert_frame_equal(result, expected)

def test_read_csv_utf_aliases(self):
# see gh issue 13549
path = 'test.csv'
Review comment (Contributor):

Use the context manager

    with tm.ensure_clean(path) as path:

and remove the os.remove(..) call.

expected = DataFrame({'A': [0, 1], 'B': [2, 3],
Review comment (@gfyoung, Member, Jul 12, 2016):

1. We like to have tests that are as compact as possible. Do we really need to have this many rows for this test? Can we get away with just one? This becomes pertinent for my next point:

2. To make these tests as unit-like as possible, we would prefer NOT to use to_csv (if possible) and follow the StringIO(data) paradigm. I believe that is possible here because you can encode strings as utf-8 or utf-16.

Reply (Contributor Author):

I suppose we could do one row as

    expected = pd.DataFrame({'mb_num': [4.8], 'multibyte': ['test']})

I used BytesIO because I don't think StringIO can support different encodings (I tried and wasn't able to get StringIO to work).

'multibyte_test': ['testing123', 'bananabis'],
'mb_nums': [154.868, 457.8798]})
with tm.ensure_clean(path) as path:
for byte in [8, 16]:
expected.to_csv(path, encoding='utf-' + str(byte), index=False)
for fmt in ['utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}']:
encoding = fmt.format(byte)
for engine in ['c', 'python', None]:
Review comment (@gfyoung, Member, Jul 12, 2016):

This is not necessary (nor is the engine keyword). The test suite will cover both engines (and the default case is not needed here). That's why it's self.read_csv and not read_csv. You are in fact running the test TWICE for each engine the way you have written it. You just need to write:

    # 'path' can most likely be changed as I referenced above
    result = self.read_csv(path, encoding=encoding)
    tm.assert_frame_equal(result, expected)

Reply (Contributor Author):

Alright.

result = self.read_csv(
path,
engine=engine,
encoding=encoding)
tm.assert_frame_equal(result, expected)
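On the StringIO-versus-BytesIO point raised in the review: StringIO holds already-decoded text and cannot carry a specific byte encoding, while BytesIO holds raw bytes and can. A stdlib-only sketch (pandas not required) round-tripping CSV text through both encodings the test covers:

```python
import io

data = u'mb_num,multibyte\n4.8,test\n'
for enc in ('utf-8', 'utf-16'):
    # Encode to bytes as a file written with this encoding would be,
    # then read back through BytesIO and decode.
    buf = io.BytesIO(data.encode(enc))
    assert buf.read().decode(enc) == data
```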