Commit 6645b2b

nateGeorge authored and jorisvandenbossche committed
BUG: fix read_csv c engine to accept unicode aliases for encoding (pandas-dev#14060)
1 parent ba2df22 commit 6645b2b

3 files changed, +15 -0 lines changed

doc/source/whatsnew/v0.19.0.txt

+2
@@ -1095,3 +1095,5 @@ Bug Fixes
 - Bug in ``Index`` raises ``KeyError`` displaying incorrect column when column is not in the df and columns contains duplicate values (:issue:`13822`)
 - Bug in ``Period`` and ``PeriodIndex`` creating wrong dates when frequency has combined offset aliases (:issue:`13874`)
 - Bug in ``.to_string()`` when called with an integer ``line_width`` and ``index=False`` raises an UnboundLocalError exception because ``idx`` referenced before assignment.
+
+- Bug in ``read_csv()``, where aliases for utf-xx (e.g. UTF-xx, UTF_xx, utf_xx) raised UnicodeDecodeError (:issue:`13549`)

pandas/io/parsers.py

+3
@@ -343,6 +343,9 @@ def _validate_nrows(nrows):
 def _read(filepath_or_buffer, kwds):
     "Generic reader of line files."
     encoding = kwds.get('encoding', None)
+    if encoding is not None:
+        encoding = re.sub('_', '-', encoding).lower()
+        kwds['encoding'] = encoding
 
     # If the input could be a filename, check for a recognizable compression
     # extension. If we're reading from a URL, the `get_filepath_or_buffer`
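
Note: the normalization added in this hunk can be tried in isolation. The following is a minimal sketch, not the pandas source: `normalize_encoding` is a hypothetical helper name that applies the same `re.sub` and `.lower()` steps as the patch. The standard library's `codecs.lookup` is used only to illustrate that Python itself already resolves these spellings to one codec.

```python
import codecs
import re


def normalize_encoding(encoding):
    """Hypothetical helper mirroring the three lines added to _read()."""
    if encoding is not None:
        # Same steps as the patch: underscores become hyphens, then lowercase.
        encoding = re.sub('_', '-', encoding).lower()
    return encoding


for alias in ['utf-8', 'utf_8', 'UTF-8', 'UTF_8', 'UTF_16', None]:
    print(repr(alias), '->', repr(normalize_encoding(alias)))

# Python's codec registry treats both spellings as the same codec.
assert codecs.lookup('UTF_16').name == codecs.lookup('utf-16').name
```

Passing `None` through unchanged matches the `if encoding is not None` guard in the patch, so callers that do not set an encoding are unaffected.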

pandas/io/tests/parser/common.py

+10
@@ -1583,3 +1583,13 @@ def test_temporary_file(self):
         new_file.close()
         expected = DataFrame([[0, 0]])
         tm.assert_frame_equal(result, expected)
+
+    def test_read_csv_utf_aliases(self):
+        # see gh issue 13549
+        expected = pd.DataFrame({'mb_num': [4.8], 'multibyte': ['test']})
+        for byte in [8, 16]:
+            for fmt in ['utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}']:
+                encoding = fmt.format(byte)
+                data = 'mb_num,multibyte\n4.8,test'.encode(encoding)
+                result = self.read_csv(BytesIO(data), encoding=encoding)
+                tm.assert_frame_equal(result, expected)
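
Note: the test's alias loop can be approximated without pandas. This is a rough sketch under stated assumptions: `roundtrip` is a hypothetical helper, and the standard library's `csv` module stands in for `read_csv`, so it exercises only the encode, normalize, and decode steps and not the C parser engine that the commit fixes.

```python
import csv
import io
import re

TEXT = 'mb_num,multibyte\n4.8,test'


def roundtrip(alias):
    """Hypothetical helper: encode with an alias, decode with the normalized name."""
    data = TEXT.encode(alias)  # Python's encoder accepts the alias directly
    normalized = re.sub('_', '-', alias).lower()  # same steps as the patch
    decoded = data.decode(normalized)
    return list(csv.reader(io.StringIO(decoded)))


for byte in [8, 16]:
    for fmt in ['utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}']:
        rows = roundtrip(fmt.format(byte))
        assert rows == [['mb_num', 'multibyte'], ['4.8', 'test']]
```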
