Commit 518ebf2

gfyoung authored and Pingviinituutti committed
COMPAT: Properly encode filenames in read_csv (pandas-dev#24758)
Python 3.6+ changes the default filesystem encoding on Windows to UTF-8 (PEP 529), which conflicts with the native Windows encoding (MBCS). This fix checks whether we are running Python 3.6+ on Windows and, if so, forces the filename encoding to "mbcs". Closes gh-15086.
1 parent 5d511bf commit 518ebf2
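
To make the mismatch in the commit message concrete, here is a minimal sketch of how the same filename turns into different bytes under UTF-8 and under a Windows ANSI code page. It is an illustration, not part of the commit: the "mbcs" codec exists only on Windows, so `cp1252` is used as a stand-in (an assumption; the real ANSI code page depends on the system locale).

```python
# Illustration only: cp1252 stands in for the Windows-only "mbcs" codec.
name = "sé-es-vé.csv"

utf8_bytes = name.encode("utf-8")    # what Python 3.6+ would produce by default
ansi_bytes = name.encode("cp1252")   # what an ANSI (MBCS) file API would expect

# Each "é" takes two bytes in UTF-8 but one byte in cp1252,
# so the two encodings of the same name do not match.
assert utf8_bytes != ansi_bytes

# If UTF-8 bytes are interpreted as cp1252, the name no longer
# refers to the file that was written.
misread = utf8_bytes.decode("cp1252")
print(misread)
```

A byte-oriented file API that receives the UTF-8 bytes would presumably look for the misread name, which is consistent with the failure reported in gh-15086.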

3 files changed: +20 -1 lines changed

doc/source/whatsnew/v0.24.0.rst

+1
@@ -1790,6 +1790,7 @@ I/O
 - Bug in :meth:`DataFrame.to_dict` when the resulting dict contains non-Python scalars in the case of numeric data (:issue:`23753`)
 - :func:`DataFrame.to_string()`, :func:`DataFrame.to_html()`, :func:`DataFrame.to_latex()` will correctly format output when a string is passed as the ``float_format`` argument (:issue:`21625`, :issue:`22270`)
 - Bug in :func:`read_csv` that caused it to raise ``OverflowError`` when trying to use 'inf' as ``na_value`` with integer index column (:issue:`17128`)
+- Bug in :func:`read_csv` that caused the C engine on Python 3.6+ on Windows to improperly read CSV filenames with accented or special characters (:issue:`15086`)
 - Bug in :func:`read_fwf` in which the compression type of a file was not being properly inferred (:issue:`22199`)
 - Bug in :func:`pandas.io.json.json_normalize` that caused it to raise ``TypeError`` when two consecutive elements of ``record_path`` are dicts (:issue:`22706`)
 - Bug in :meth:`DataFrame.to_stata`, :class:`pandas.io.stata.StataWriter` and :class:`pandas.io.stata.StataWriter117` where a exception would leave a partially written and invalid dta file (:issue:`23573`)

pandas/_libs/parsers.pyx

+7 -1
@@ -677,7 +677,13 @@ cdef class TextReader:
 
         if isinstance(source, basestring):
             if not isinstance(source, bytes):
-                source = source.encode(sys.getfilesystemencoding() or 'utf-8')
+                if compat.PY36 and compat.is_platform_windows():
+                    # see gh-15086.
+                    encoding = "mbcs"
+                else:
+                    encoding = sys.getfilesystemencoding() or "utf-8"
+
+                source = source.encode(encoding)
 
             if self.memory_map:
                 ptr = new_mmap(source)
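
The encoding selection in this hunk can be sketched in plain Python for readers who want to try it outside of Cython. The helper name `filename_encoding` and its two override parameters are hypothetical and exist only so both branches can be exercised on any platform. The `sys.platform == "win32"` check is an assumption standing in for `compat.is_platform_windows()`, and the version check stands in for `compat.PY36`.

```python
import sys


def filename_encoding(is_py36_plus=None, is_windows=None):
    """Hypothetical helper mirroring the branch added in this commit.

    Both arguments default to values detected from the running
    interpreter; pass booleans to simulate another environment.
    """
    if is_py36_plus is None:
        is_py36_plus = sys.version_info >= (3, 6)
    if is_windows is None:
        # Assumption: a simplified stand-in for compat.is_platform_windows().
        is_windows = sys.platform == "win32"

    if is_py36_plus and is_windows:
        # see gh-15086: force the Windows ANSI code page.
        return "mbcs"

    # Previous behaviour, kept for every other environment.
    return sys.getfilesystemencoding() or "utf-8"


# Simulate the affected environment and an unaffected one.
print(filename_encoding(is_py36_plus=True, is_windows=True))
print(filename_encoding(is_py36_plus=True, is_windows=False))
```

Note that the helper only returns the codec name. Actually calling `str.encode("mbcs")` works only on Windows, which is why the commit guards it behind the platform check.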

pandas/tests/io/parser/test_common.py

+12
@@ -1904,6 +1904,18 @@ def test_suppress_error_output(all_parsers, capsys):
     assert captured.err == ""
 
 
+def test_filename_with_special_chars(all_parsers):
+    # see gh-15086.
+    parser = all_parsers
+    df = DataFrame({"a": [1, 2, 3]})
+
+    with tm.ensure_clean("sé-es-vé.csv") as path:
+        df.to_csv(path, index=False)
+
+        result = parser.read_csv(path)
+        tm.assert_frame_equal(result, df)
+
+
 def test_read_table_deprecated(all_parsers):
     # see gh-21948
     parser = all_parsers
