-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: encode filepaths on windows in 3.6 #15092
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/tests/test_accent.py
Outdated
from os.path import join | ||
from pandas import read_csv | ||
import pandas.util.testing as tm | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, you can simply write these files dynamically, e.g use
with ensure_clean(.....) as path:
df.to_csv(path)
result = pd.read_csv(path)
so looks your test failed (good!) so if you can update the test to make more generic like I showed. then you can do the fix (which will be to encode the filename on windows : https://docs.python.org/3/whatsnew/3.6.html#pep-529-change-windows-filesystem-encoding-to-utf-8 so I would probably do this only on windows and only on >= 3.6
|
Codecov Report
@@ Coverage Diff @@
## master #15092 +/- ##
==========================================
- Coverage 86.33% 84.75% -1.58%
==========================================
Files 139 145 +6
Lines 51149 51279 +130
==========================================
- Hits 44157 43464 -693
- Misses 6992 7815 +823
Continue to review full report at Codecov.
|
With Python 3.6 Calling |
pandas/tests/test_accent.py
Outdated
from pandas.util.testing import ensure_clean | ||
|
||
class FileSystemEncodingTest(unittest.TestCase): | ||
"""Test compatibility with file system encoding""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs to go with tests in pandas/io/tests/parser
try to find a good place there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll move it when everything work.
pandas/parser.pyx
Outdated
@@ -662,7 +662,11 @@ cdef class TextReader: | |||
|
|||
if isinstance(source, basestring): | |||
if not isinstance(source, bytes): | |||
source = source.encode(sys.getfilesystemencoding() or 'utf-8') | |||
if compat.PY36 and compat.is_platform_windows(): | |||
fs_encoding = 'mbcs' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also need to go in pandas/io/common in _get_filehandle_or_buffer so prob needs to be in a common routine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
further this is not actually fixing the issue but using the legacy code path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since sys.getfilesystemencoding()
return the bad encoding (Utf-8) and is already called on source
, I forced the encoding value.
Maybe not the best way to do (I don't know Pandas low-level machinery, maybe better way in the C code of the new_file_source
function from parser/io.h
), but AppVeyor say it work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls move these changes to io/common.py as this needs to be used in several places. make a function called _get_filename_encoding
pandas/parser.pyx
Outdated
@@ -662,7 +662,11 @@ cdef class TextReader: | |||
|
|||
if isinstance(source, basestring): | |||
if not isinstance(source, bytes): | |||
source = source.encode(sys.getfilesystemencoding() or 'utf-8') | |||
if compat.PY36 and compat.is_platform_windows(): | |||
fs_encoding = 'mbcs' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
further this is not actually fixing the issue but using the legacy code path
pandas/tests/test_accent.py
Outdated
"""Test compatibility with file system encoding""" | ||
def test_load(self): | ||
for filename in ('test_e.txt', 'test_é.txt'): | ||
with ensure_clean(filename) as path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also not testing the issue - unicode paths should already be ok ; it is byte paths that are not working
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is with Unicode paths, not byte path.
I used os.path.join(pandas.util.testing.get_data_path(), 'test_é.txt')
in the previous test version which is also a Unicode path and failed as intended with AppVeyor.
pandas/io/tests/parser/common.py
Outdated
@@ -1635,3 +1635,12 @@ def test_file_handles(self): | |||
if PY3: | |||
self.assertFalse(m.closed) | |||
m.close() | |||
|
|||
def test_file_system_encoding(self): | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use a comment here with triple-quotes, instead use #
add the issue number as a comment
pandas/io/tests/parser/common.py
Outdated
""" | ||
Test compatibility with file system encoding. | ||
""" | ||
for filename in ('test_e.txt', 'test_é.txt'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all use a byte file name
pandas/io/tests/parser/common.py
Outdated
for filename in ('test_e.txt', 'test_é.txt'): | ||
with tm.ensure_clean(filename) as path: | ||
pd.DataFrame().to_csv(path) | ||
pd.read_csv(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use self.read_csv
to read with the various parsers
use a non-zero dataframe, then compare the return results.
pandas/parser.pyx
Outdated
@@ -662,7 +662,11 @@ cdef class TextReader: | |||
|
|||
if isinstance(source, basestring): | |||
if not isinstance(source, bytes): | |||
source = source.encode(sys.getfilesystemencoding() or 'utf-8') | |||
if compat.PY36 and compat.is_platform_windows(): | |||
fs_encoding = 'mbcs' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls move these changes to io/common.py as this needs to be used in several places. make a function called _get_filename_encoding
this looks reasonable so far. can you add some tests for excel, hdf5 similar to the above; I think you will need to modify to call (and encode) the filename, IOW, I think after the alternatively might need to do this in |
The problem is not present for all I/O functions from pandas. Actually, I see it only with read_csv and read_table, but not for read_excel or read_fwf by example. This need to be tested before modifying anything else. |
@JGoutin agreed. If you want to add tests for non-parser things would be great. Its possible this only affects the c-parser as that is the only thing that actually encodes the filenames (others use |
can you rebase and push again to force the ci. |
can you add a similar test for
|
pls add a whatsnew note |
In which file is your whatsnew ? |
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -360,3 +360,4 @@ Bug Fixes | |||
|
|||
|
|||
- Bug in ``pd.read_csv()`` for the C engine where ``usecols`` were being indexed incorrectly with ``parse_dates`` (:issue:`14792`) | |||
- Bug in ``pd.read_csv()``/``pd.DataFrame.to_csv()`` for Python 3.6 on Windows with filename encoding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number here. put a pointer to the pep here as well .
pandas/io/common.py
Outdated
Filename encoding | ||
""" | ||
if compat.PY36 and compat.is_platform_windows(): | ||
# Python 3.6 use UTF-8 as internal platform encoding on Windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a pointer to the pep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls do this
pandas/io/tests/test_excel.py
Outdated
with tm.ensure_clean(filename) as path: | ||
expected.to_excel(path) | ||
df = read_excel(path) | ||
tm.assert_frame_equal(df, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make all of your columns floats and this will work
can you rebase and push again. I want to see all green on the CI. |
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -360,3 +360,5 @@ Bug Fixes | |||
|
|||
|
|||
- Bug in ``pd.read_csv()`` for the C engine where ``usecols`` were being indexed incorrectly with ``parse_dates`` (:issue:`14792`) | |||
- Bug in ``pd.read_csv()``/``pd.DataFrame.to_csv()`` for Python 3.6 (and PEP529) on Windows with filename encoding (:issue:`15086`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the link in for the PEP
All work was lost when trying to rebasing... I never did this before and read a bad tutorial on this. 😒 Sorry, I don't have the time to re-doing this. |
@JGoutin do you have your existing branch? (you can recover doing |
or even copy-past your changes here is ok too. |
All work seem to be lost. I tried reflog and somes other methods but the repo seem to be reseted to its early post-fork state. |
Not for Merge, just a small test for #15086 with AppVeyor.