BUG: encode filepaths on windows in 3.6 #15092

JGoutin · 2017-01-09T20:07:07Z

Not for Merge, just a small test for #15086 with AppVeyor.

jreback · 2017-01-09T20:12:17Z

pandas/tests/test_accent.py

+from os.path import join
+from pandas import read_csv
+import pandas.util.testing as tm
+


FYI, you can simply write these files dynamically, e.g. use

with ensure_clean(.....) as path:
    df.to_csv(path)
    result = pd.read_csv(path)

jreback · 2017-01-09T22:20:05Z

So it looks like your test failed (good!).

If you can update the test to make it more generic, like I showed, then you can do the fix, which will be to encode the filename on Windows: https://docs.python.org/3/whatsnew/3.6.html#pep-529-change-windows-filesystem-encoding-to-utf-8

I would probably do this only on Windows and only on >= 3.6:

import sys

from pandas.compat import PY3, is_platform_windows

# NOTE: PY3 is a coarse check; the fix is only meant for Python >= 3.6
if PY3 and is_platform_windows():
    fn = fn.encode(sys.getfilesystemencoding())

codecov-io · 2017-01-10T01:24:46Z

Codecov Report

Merging #15092 into master will decrease coverage by 1.58%.

@@            Coverage Diff             @@
##           master   #15092      +/-   ##
==========================================
- Coverage   86.33%   84.75%   -1.58%     
==========================================
  Files         139      145       +6     
  Lines       51149    51279     +130     
==========================================
- Hits        44157    43464     -693     
- Misses       6992     7815     +823

Impacted Files	Coverage Δ
pandas/io/common.py	`67.65% <80%> (+0.26%)`	✅
pandas/core/groupby.py	`77.24% <ø> (-17.91%)`	❌
pandas/tools/plotting.py	`68.48% <ø> (-3.31%)`	❌
pandas/types/cast.py	`84.18% <ø> (-1.24%)`	❌
pandas/core/base.py	`94.05% <ø> (-1.11%)`	❌
pandas/types/dtypes.py	`94.3% <ø> (-1.04%)`	❌
pandas/tools/hashing.py	`98.14% <ø> (-0.88%)`	❌
pandas/indexes/multi.py	`95.81% <ø> (-0.77%)`	❌
pandas/compat/numpy/function.py	`93.5% <ø> (-0.77%)`	❌
pandas/compat/__init__.py	`61.71% <ø> (-0.46%)`	❌
... and 34 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c26e5bb...dc5877b.

JGoutin · 2017-01-11T08:16:13Z

With Python 3.6, sys.getfilesystemencoding() returns utf-8.

Calling sys._enablelegacywindowsfsencoding() before calling Pandas fixes this bug. This function sets the file system encoding to mbcs (as on Python < 3.6).
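
The behaviour described above can be checked from a Python prompt. The sketch below is only an illustration: the two helper names are made up for this example, and sys._enablelegacywindowsfsencoding() exists only on Windows builds of Python 3.6+, so the code guards for its absence instead of assuming it.

```python
import sys


def current_fs_encoding():
    # Encoding Python uses to convert str file names to bytes.
    return sys.getfilesystemencoding()


def enable_legacy_fs_encoding():
    # Hypothetical helper: switch back to the pre-3.6 Windows behaviour
    # where the hook exists; return whether the hook was called.
    hook = getattr(sys, "_enablelegacywindowsfsencoding", None)
    if hook is None:
        return False
    hook()
    return True


if __name__ == "__main__":
    print("before:", current_fs_encoding())
    print("legacy hook applied:", enable_legacy_fs_encoding())
    print("after:", current_fs_encoding())
```

On platforms other than Windows the hook is missing, so the helper returns False and the reported encoding stays unchanged.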

jreback · 2017-01-11T11:46:43Z

pandas/tests/test_accent.py

+from pandas.util.testing import ensure_clean
+
+class FileSystemEncodingTest(unittest.TestCase):
+    """Test compatibility with file system encoding"""


needs to go with tests in pandas/io/tests/parser

try to find a good place there

OK, I'll move it when everything works.

jreback · 2017-01-11T11:48:03Z

pandas/parser.pyx

@@ -662,7 +662,11 @@ cdef class TextReader:

        if isinstance(source, basestring):
            if not isinstance(source, bytes):
-                source = source.encode(sys.getfilesystemencoding() or 'utf-8')
+                if compat.PY36 and compat.is_platform_windows():
+                    fs_encoding = 'mbcs'


This also needs to go in pandas/io/common in _get_filehandle_or_buffer, so it probably needs to be in a common routine.

Further, this is not actually fixing the issue, but using the legacy code path.

Since sys.getfilesystemencoding() returns the wrong encoding (UTF-8) and is already called on source, I forced the encoding value.

This may not be the best way to do it (I don't know the Pandas low-level machinery; there may be a better way in the C code of the new_file_source function from parser/io.h), but AppVeyor says it works.

Pls move these changes to io/common.py, as this needs to be used in several places. Make a function called _get_filename_encoding.

jreback · 2017-01-11T11:59:50Z

pandas/parser.pyx

@@ -662,7 +662,11 @@ cdef class TextReader:

        if isinstance(source, basestring):
            if not isinstance(source, bytes):
-                source = source.encode(sys.getfilesystemencoding() or 'utf-8')
+                if compat.PY36 and compat.is_platform_windows():
+                    fs_encoding = 'mbcs'


Further, this is not actually fixing the issue, but using the legacy code path.

jreback · 2017-01-11T12:01:09Z

pandas/tests/test_accent.py

+    """Test compatibility with file system encoding"""
+    def test_load(self):
+        for filename in ('test_e.txt', 'test_é.txt'):
+            with ensure_clean(filename) as path:


This is also not testing the issue: unicode paths should already be ok; it is byte paths that are not working.

The problem is with Unicode paths, not byte paths.

I used os.path.join(pandas.util.testing.get_data_path(), 'test_é.txt') in the previous version of the test, which is also a Unicode path, and it failed as intended on AppVeyor.

jreback · 2017-01-12T13:37:35Z

pandas/io/tests/parser/common.py

@@ -1635,3 +1635,12 @@ def test_file_handles(self):
                if PY3:
                    self.assertFalse(m.closed)
                m.close()
+
+    def test_file_system_encoding(self):
+        """


don't use a comment here with triple-quotes, instead use #
add the issue number as a comment

jreback · 2017-01-12T13:37:53Z

pandas/io/tests/parser/common.py

+        """
+        Test compatibility with file system encoding.
+        """
+        for filename in ('test_e.txt', 'test_é.txt'):


also use a byte file name

jreback · 2017-01-12T13:38:49Z

pandas/io/tests/parser/common.py

+        for filename in ('test_e.txt', 'test_é.txt'):
+            with tm.ensure_clean(filename) as path:
+                pd.DataFrame().to_csv(path)
+                pd.read_csv(path)


Use self.read_csv to read with the various parsers.

Use a non-empty DataFrame, then compare the returned results.
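
The test shape requested here (write a non-empty table under both an ASCII and an accented file name, read it back, compare) can be sketched without pandas. In this stand-in the csv module replaces to_csv / self.read_csv and tempfile.TemporaryDirectory replaces tm.ensure_clean; the function name and the sample rows are invented for illustration.

```python
import csv
import os
import tempfile


def roundtrip(filename, rows):
    """Write rows to filename in a temporary directory and read them back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, filename)
        with open(path, "w", newline="", encoding="utf-8") as fh:
            csv.writer(fh).writerows(rows)
        with open(path, newline="", encoding="utf-8") as fh:
            return list(csv.reader(fh))


expected = [["a", "b"], ["1.5", "2.5"]]
for name in ("test_e.txt", "test_é.txt"):
    # The comparison is the point: an empty table would pass trivially.
    assert roundtrip(name, expected) == expected
```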

jreback · 2017-01-12T13:40:34Z

pandas/parser.pyx

@@ -662,7 +662,11 @@ cdef class TextReader:

        if isinstance(source, basestring):
            if not isinstance(source, bytes):
-                source = source.encode(sys.getfilesystemencoding() or 'utf-8')
+                if compat.PY36 and compat.is_platform_windows():
+                    fs_encoding = 'mbcs'


Pls move these changes to io/common.py, as this needs to be used in several places. Make a function called _get_filename_encoding.

jreback · 2017-01-13T17:23:35Z

This looks reasonable so far.

Can you add some tests for Excel and HDF5 similar to the above? I think you will need to modify
about here:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/common.py#L228

to encode the filename. IOW, I think after the _expand_user call you can have a line which checks if it's a string and, if so, gets the encoding and encodes it before returning it.

Alternatively, you might need to do this in _get_handle instead, right before actually opening the file.

JGoutin · 2017-01-13T17:30:19Z

The problem is not present in all pandas I/O functions. So far I see it only with read_csv and read_table, but not with read_excel or read_fwf, for example. This needs to be tested before modifying anything else.

jreback · 2017-01-13T17:34:04Z

@JGoutin agreed. If you want to add tests for non-parser things, that would be great. It's possible this only affects the C parser, as that is the only thing that actually encodes the filenames (the others use open(...) directly and might not be affected by this change).
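
To illustrate the distinction drawn here, the sketch below reads the same accented file name once through a str path handed straight to open() and once through a bytes path produced by os.fsencode. It is a standard-library illustration with an invented function name, not pandas code, and it does not reproduce the C parser's behaviour.

```python
import os
import tempfile


def read_via_str_and_bytes(filename, text="a,b\n1,2\n"):
    """Write text under filename, then read it via a str and a bytes path."""
    with tempfile.TemporaryDirectory() as tmpdir:
        str_path = os.path.join(tmpdir, filename)
        with open(str_path, "w", encoding="utf-8") as fh:
            fh.write(text)
        # open() takes the str path as-is; no manual encoding is involved.
        with open(str_path, encoding="utf-8") as fh:
            from_str = fh.read()
        # os.fsencode applies Python's own file system encoding.
        with open(os.fsencode(str_path), encoding="utf-8") as fh:
            from_bytes = fh.read()
    return from_str, from_bytes
```

Both reads stay inside Python's own path handling, which is why code that calls open(...) directly would be expected to keep working.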

jreback · 2017-01-21T15:36:29Z

can you rebase and push again to force the ci.

jreback · 2017-01-21T15:38:32Z

Can you add a similar test for read_excel and read_hdf? I think they will work, but just checking.

jreback · 2017-01-25T12:51:19Z

pls add a whatsnew note

JGoutin · 2017-01-25T20:19:03Z

In which file is your whatsnew?

jreback · 2017-01-25T20:20:44Z

https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v0.20.0.txt

jreback · 2017-01-25T21:14:38Z

doc/source/whatsnew/v0.20.0.txt

@@ -360,3 +360,4 @@ Bug Fixes


 - Bug in ``pd.read_csv()`` for the C engine where ``usecols`` were being indexed incorrectly with ``parse_dates`` (:issue:`14792`)
+- Bug in ``pd.read_csv()``/``pd.DataFrame.to_csv()`` for Python 3.6 on Windows with filename encoding.


Add the issue number here. Put a pointer to the PEP here as well.

jreback · 2017-01-25T21:14:49Z

pandas/io/common.py

+        Filename encoding
+    """
+    if compat.PY36 and compat.is_platform_windows():
+        # Python 3.6 use UTF-8 as internal platform encoding on Windows


add a pointer to the pep

pls do this

jreback · 2017-01-25T23:50:35Z

pandas/io/tests/test_excel.py

+            with tm.ensure_clean(filename) as path:
+                expected.to_excel(path)
+                df = read_excel(path)
+                tm.assert_frame_equal(df, expected)


make all of your columns floats and this will work

jreback · 2017-01-30T14:22:52Z

can you rebase and push again. I want to see all green on the CI.

jreback · 2017-01-30T14:23:09Z

doc/source/whatsnew/v0.20.0.txt

@@ -360,3 +360,5 @@ Bug Fixes


 - Bug in ``pd.read_csv()`` for the C engine where ``usecols`` were being indexed incorrectly with ``parse_dates`` (:issue:`14792`)
+- Bug in ``pd.read_csv()``/``pd.DataFrame.to_csv()`` for Python 3.6 (and PEP529) on Windows with filename encoding (:issue:`15086`)


put the link in for the PEP

JGoutin · 2017-01-31T18:41:17Z

All work was lost while trying to rebase... I had never done this before and followed a bad tutorial. 😒

Sorry, I don't have the time to redo this.

jreback · 2017-01-31T19:09:50Z

@JGoutin do you have your existing branch? (You can recover it using reflog.) If you push up what you had (or anything even close), I will fix it.

jreback · 2017-01-31T19:10:07Z

Or even copy-pasting your changes here is ok too.

jreback · 2017-01-31T19:10:52Z

http://stackoverflow.com/questions/3973994/how-can-i-recover-from-an-erronous-git-push-f-origin-master

JGoutin · 2017-02-07T17:15:48Z

All work seems to be lost. I tried reflog and some other methods, but the repo seems to have been reset to its early post-fork state.

jreback reviewed Jan 9, 2017

jreback added Bug Python 3.6 Windows Windows OS labels Jan 9, 2017

jreback changed the title ~~AppVeyor test for #15086~~ BUG: encode filepaths on windows in 3.6 Jan 9, 2017

jreback requested changes Jan 11, 2017

jreback requested changes Jan 12, 2017

jreback approved these changes Jan 21, 2017

jreback reviewed Jan 25, 2017

jreback reviewed Jan 30, 2017

JGoutin closed this Jan 31, 2017

jreback mentioned this pull request Aug 24, 2017

OSError when reading file with accents in file path #15086

Closed

jreback mentioned this pull request Oct 4, 2017

read_csv fails to read file if there are cyrillic symbols in filename #17773

Closed
