Skip to content

BUG: encode filepaths on windows in 3.6 #15092

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 0 commits into from
Closed

BUG: encode filepaths on windows in 3.6 #15092

wants to merge 0 commits into from

Conversation

JGoutin
Copy link

@JGoutin JGoutin commented Jan 9, 2017

Not for Merge, just a small test for #15086 with AppVeyor.

from os.path import join
from pandas import read_csv
import pandas.util.testing as tm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, you can simply write these files dynamically, e.g use

with ensure_clean(.....) as path:
    df.to_csv(path)
    result = pd.read_csv(path)

@jreback
Copy link
Contributor

jreback commented Jan 9, 2017

so looks your test failed (good!)

so if you can update the test to make more generic like I showed. then you can do the fix (which will be to encode the filename on windows : https://docs.python.org/3/whatsnew/3.6.html#pep-529-change-windows-filesystem-encoding-to-utf-8

so I would probably do this only on windows and only on >= 3.6

from pandas.compat import PY3, is_platform_windows

if PY3 and is_platform_windows():
    fn = fn.encode(sys.getfilesystemencoding())

@jreback jreback changed the title AppVeyor test for #15086 BUG: encode filepaths on windows in 3.6 Jan 9, 2017
@codecov-io
Copy link

codecov-io commented Jan 10, 2017

Codecov Report

Merging #15092 into master will increase coverage by -1.58%.

@@            Coverage Diff             @@
##           master   #15092      +/-   ##
==========================================
- Coverage   86.33%   84.75%   -1.58%     
==========================================
  Files         139      145       +6     
  Lines       51149    51279     +130     
==========================================
- Hits        44157    43464     -693     
- Misses       6992     7815     +823
Impacted Files Coverage Δ
pandas/io/common.py 67.65% <80%> (+0.26%)
pandas/core/groupby.py 77.24% <ø> (-17.91%)
pandas/tools/plotting.py 68.48% <ø> (-3.31%)
pandas/types/cast.py 84.18% <ø> (-1.24%)
pandas/core/base.py 94.05% <ø> (-1.11%)
pandas/types/dtypes.py 94.3% <ø> (-1.04%)
pandas/tools/hashing.py 98.14% <ø> (-0.88%)
pandas/indexes/multi.py 95.81% <ø> (-0.77%)
pandas/compat/numpy/function.py 93.5% <ø> (-0.77%)
pandas/compat/init.py 61.71% <ø> (-0.46%)
... and 34 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c26e5bb...dc5877b. Read the comment docs.

@JGoutin
Copy link
Author

JGoutin commented Jan 11, 2017

With Python 3.6 sys.getfilesystemencoding() return utf-8.

Calling sys._enablelegacywindowsfsencoding() before call Pandas fix this bug. This function set files system encoding to mbcs (Like on Python < 3.6).

from pandas.util.testing import ensure_clean

class FileSystemEncodingTest(unittest.TestCase):
"""Test compatibility with file system encoding"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to go with tests in pandas/io/tests/parser

try to find a good place there

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll move it when everything work.

@@ -662,7 +662,11 @@ cdef class TextReader:

if isinstance(source, basestring):
if not isinstance(source, bytes):
source = source.encode(sys.getfilesystemencoding() or 'utf-8')
if compat.PY36 and compat.is_platform_windows():
fs_encoding = 'mbcs'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also need to go in pandas/io/common in _get_filehandle_or_buffer so prob needs to be in a common routine

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

further this is not actually fixing the issue but using the legacy code path

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since sys.getfilesystemencoding() return the bad encoding (Utf-8) and is already called on source, I forced the encoding value.

Maybe not the best way to do (I don't know Pandas low-level machinery, maybe better way in the C code of the new_file_source function from parser/io.h), but AppVeyor say it work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls move these changes to io/common.py as this needs to be used in several places. make a function called _get_filename_encoding

@@ -662,7 +662,11 @@ cdef class TextReader:

if isinstance(source, basestring):
if not isinstance(source, bytes):
source = source.encode(sys.getfilesystemencoding() or 'utf-8')
if compat.PY36 and compat.is_platform_windows():
fs_encoding = 'mbcs'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

further this is not actually fixing the issue but using the legacy code path

"""Test compatibility with file system encoding"""
def test_load(self):
for filename in ('test_e.txt', 'test_é.txt'):
with ensure_clean(filename) as path:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also not testing the issue - unicode paths should already be ok ; it is byte paths that are not working

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is with Unicode paths, not byte path.

I used os.path.join(pandas.util.testing.get_data_path(), 'test_é.txt') in the previous test version which is also a Unicode path and failed as intended with AppVeyor.

@@ -1635,3 +1635,12 @@ def test_file_handles(self):
if PY3:
self.assertFalse(m.closed)
m.close()

def test_file_system_encoding(self):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use a comment here with triple-quotes, instead use #
add the issue number as a comment

"""
Test compatibility with file system encoding.
"""
for filename in ('test_e.txt', 'test_é.txt'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all use a byte file name

for filename in ('test_e.txt', 'test_é.txt'):
with tm.ensure_clean(filename) as path:
pd.DataFrame().to_csv(path)
pd.read_csv(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use self.read_csv to read with the various parsers

use a non-zero dataframe, then compare the return results.

@@ -662,7 +662,11 @@ cdef class TextReader:

if isinstance(source, basestring):
if not isinstance(source, bytes):
source = source.encode(sys.getfilesystemencoding() or 'utf-8')
if compat.PY36 and compat.is_platform_windows():
fs_encoding = 'mbcs'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls move these changes to io/common.py as this needs to be used in several places. make a function called _get_filename_encoding

@jreback
Copy link
Contributor

jreback commented Jan 13, 2017

this looks reasonable so far.

can you add some tests for excel, hdf5 similar to the above; I think you will need to modify
about here:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/common.py#L228

to call (and encode) the filename, IOW, I think after the _expand_user you can have a line which checks if its a string, if so, then get the encoding and encode it before returning it.

alternatively might need to do this in _get_handle instead, right before actually opening the file.

@JGoutin
Copy link
Author

JGoutin commented Jan 13, 2017

The problem is not present for all I/O functions from pandas. Actually, I see it only with read_csv and read_table, but not for read_excel or read_fwf by example. This need to be tested before modifying anything else.

@jreback
Copy link
Contributor

jreback commented Jan 13, 2017

@JGoutin agreed. If you want to add tests for non-parser things would be great. Its possible this only affects the c-parser as that is the only thing that actually encodes the filenames (others use open(...) directly) and might not be affected by this change.

@jreback
Copy link
Contributor

jreback commented Jan 21, 2017

can you rebase and push again to force the ci.

@jreback
Copy link
Contributor

jreback commented Jan 21, 2017

can you add a similar test for

read_excel, read_hdf. I think they will work, but just checking.

@jreback
Copy link
Contributor

jreback commented Jan 25, 2017

pls add a whatsnew note

@JGoutin
Copy link
Author

JGoutin commented Jan 25, 2017

In which file is your whatsnew ?

@jreback
Copy link
Contributor

jreback commented Jan 25, 2017

@@ -360,3 +360,4 @@ Bug Fixes


- Bug in ``pd.read_csv()`` for the C engine where ``usecols`` were being indexed incorrectly with ``parse_dates`` (:issue:`14792`)
- Bug in ``pd.read_csv()``/``pd.DataFrame.to_csv()`` for Python 3.6 on Windows with filename encoding.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number here. put a pointer to the pep here as well .

Filename encoding
"""
if compat.PY36 and compat.is_platform_windows():
# Python 3.6 use UTF-8 as internal platform encoding on Windows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a pointer to the pep

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls do this

with tm.ensure_clean(filename) as path:
expected.to_excel(path)
df = read_excel(path)
tm.assert_frame_equal(df, expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make all of your columns floats and this will work

@jreback
Copy link
Contributor

jreback commented Jan 30, 2017

can you rebase and push again. I want to see all green on the CI.

@@ -360,3 +360,5 @@ Bug Fixes


- Bug in ``pd.read_csv()`` for the C engine where ``usecols`` were being indexed incorrectly with ``parse_dates`` (:issue:`14792`)
- Bug in ``pd.read_csv()``/``pd.DataFrame.to_csv()`` for Python 3.6 (and PEP529) on Windows with filename encoding (:issue:`15086`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the link in for the PEP

@JGoutin JGoutin closed this Jan 31, 2017
@JGoutin
Copy link
Author

JGoutin commented Jan 31, 2017

All work was lost when trying to rebasing... I never did this before and read a bad tutorial on this. 😒

Sorry, I don't have the time to re-doing this.

@jreback
Copy link
Contributor

jreback commented Jan 31, 2017

@JGoutin do you have your existing branch? (you can recover doing reflog). If you push up what you had (or anything even close), I will fix it.

@jreback
Copy link
Contributor

jreback commented Jan 31, 2017

or even copy-past your changes here is ok too.

@JGoutin
Copy link
Author

JGoutin commented Feb 7, 2017

All work seem to be lost. I tried reflog and somes other methods but the repo seem to be reseted to its early post-fork state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Windows Windows OS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants