BUG: reading windows utf8 filenames in py3.6 #25769

vnlitvinov · 2019-03-18T15:35:16Z

closes OSError when reading file with accents in file path #15086 properly
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Test that was added in prior fix was failing on my machine, where my locale was set to CP-1251 (which apparently cannot handle some unicode symbols in the test even when using MBCS, as MBCS encoding is locale-dependent).

This PR fixes the issue removing the workaround by switching to converting UTF8 to UTF16 and using Win32 Unicode file API (much like Python >= 3.6 does internally).

P.S. This as a bonus adds possibility to use \\?\C\long\file\name addressing if desired to overcome 260 symbols limitation on path length.

simonjayhawkins · 2019-03-18T15:49:25Z

Test that was added in prior fix was failing on my machine, where my locale was set to CP-1251

is there an open issue about this?

cc @gfyoung

vnlitvinov · 2019-03-18T15:52:14Z

I didn't open an issue when I faced that, instead I have fixed the code.
Should I file an issue before the code could be accepted?

simonjayhawkins · 2019-03-18T15:55:03Z

It normally helps to create a code sample which can be used as the test and for others to recreate the problem.

vnlitvinov · 2019-03-18T16:04:50Z

The test that failed was the same test that was introduced in #24758, it just didn't pass on my machine, most likely due to different locale set up. My fix is locale-independent, so it should pass on any machine with any locale set.

simonjayhawkins · 2019-03-18T16:09:02Z

so i guess that test_filename_with_special_chars should be parametrised over different locales. would monkeypatch allow this here?

vnlitvinov · 2019-03-18T18:41:26Z

My diggings show that when Python faces a request to convert from Unicode to bytestring using mbcs encoding on Windows it uses CP_ACP as codepage for WideCharToMultiByte argument (that is, default ANSI code page).

According to https://stackoverflow.com/a/9039278 there is no programmatic way to change CP_ACP value, as this is a system-wide setting controlled via Region settings in Windows Control Panel.

So, in short, the answer to this question:

would monkeypatch allow this here?

is "no" :(

What does seem to be possible is to change the test so it would fail both for 1252 and 1251 codepages unless fixed as I do here by using a superset of characters some of which are not representable in 1251 and some in 1252.

Note that it would require some dancing in the test to not break Python < 3.6, though.

gfyoung · 2019-03-18T18:54:42Z

is there an open issue about this?

Not that I'm aware of.

What does seem to be possible is to change the test so it would fail both for 1252 and 1251 codepages unless fixed as I do here by using a superset of characters some of which are not representable in 1251 and some in 1252.

I would just add a new test actually.

Second, if you could post the output that you were getting prior to this fix for that test, that would be good for everyone's reference.

Finally, if you could add a whatsnew note for 0.25.0. Even though it's fixing an already-closed issue, you're adding a more robust fix. Otherwise, so far so good!

vnlitvinov · 2019-03-18T19:04:19Z

I would just add a new test actually.

I'm not sure what second test would add here, what I'm thinking is simply crafting a string that isn't representable in neither 1251 nor 1252 (my current stab at it is sé-es-vé-sй.csv - it cannot be encoded neither as 1251 nor as 1252 bytestrings). The only quirk that remains there is that I'd probably have to disable the test for Python < 3.6 (otherwise it would forever fail there on strange locales - as prior to 3.6 I cannot be sure that what this function gets as an input parameter is a UTF-8-encoded string).

Second, if you could post the output that you were getting prior to this fix for that test, that would be good for everyone's reference.

Failure without my patch is easy:

________________ test_filename_with_special_chars[c_high] ____________________________________

all_parsers = <pandas.tests.io.parser.conftest.CParserHighMemory object at 0x000002CA11F92320>

    def test_filename_with_special_chars(all_parsers):
        # see gh-15086.
        parser = all_parsers
        df = DataFrame({"a": [1, 2, 3]})

        with tm.ensure_clean("sé-es-vé.csv") as path:
            df.to_csv(path, index=False)

>           result = parser.read_csv(path)

pandas\tests\io\parser\test_common.py:1915:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\tests\io\parser\conftest.py:22: in read_csv
    return read_csv(*args, **kwargs)
pandas\io\parsers.py:702: in parser_f
    return _read(filepath_or_buffer, kwds)
pandas\io\parsers.py:429: in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
pandas\io\parsers.py:895: in __init__
    self._make_engine(self.engine)
pandas\io\parsers.py:1122: in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
pandas\io\parsers.py:1853: in __init__
    self._reader = parsers.TextReader(src, **kwds)
pandas\_libs\parsers.pyx:388: in pandas._libs.parsers.TextReader.__cinit__
    self._setup_parser_source(source)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   source = source.encode(encoding)
E   UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

pandas\_libs\parsers.pyx:687: UnicodeEncodeError

Finally, if you could add a whatsnew note for 0.25.0

Will do tomorrow.

gfyoung · 2019-03-18T19:10:45Z

The only quirk that remains there is that I'd probably have to disable the test for Python < 3.6 (otherwise it would forever fail there on strange locales - as prior to 3.6 I cannot be sure that what this function gets as an input parameter is a UTF-8-encoded string)

No worries. Feel free to skip the test for Python < 3.6 (we're already in talks about phasing out 3.5 support once we finish with 2.7).

jreback · 2019-03-18T19:17:38Z

pls don’t skip for 3.5

gfyoung · 2019-03-18T19:19:17Z

pls don’t skip for 3.5

@jreback : @vnlitvin fix relies on leveraging Python 3.6+ internal functionality that can't be replicated in Python 3.5. That is why I suggested he add a new test (instead of modifying the existing one) that can then be skipped.

The changes he has made are otherwise compatible with the rest of our tests.

jreback · 2019-03-18T19:25:09Z

ah I c. carry on then.

vnlitvinov · 2019-03-18T19:46:17Z

the existing one

The problem with existing one is that it would fail on Python 3.5 with e.g. 1251 (Cyrillic) locale set.
What can I add is I can skip the test on non-1252 locales for Python < 3.6 (though it seems that the test is kinda useless for < 3.6 - it basically tests that Windows open() function works - doesn't seem to be a job for a Pandas test).

gfyoung · 2019-03-18T20:09:47Z

What can I add is I can skip the test on non-1252 locales for Python < 3.6

Yep, that sounds like a plan. Go for it!

pep8speaks · 2019-03-19T14:59:06Z

Hello @vnlitvin! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-03-20 10:01:09 UTC

codecov · 2019-03-19T14:59:13Z

Codecov Report

Merging #25769 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #25769   +/-   ##
=======================================
  Coverage   91.25%   91.25%           
=======================================
  Files         172      172           
  Lines       52965    52965           
=======================================
  Hits        48335    48335           
  Misses       4630     4630

Flag	Coverage Δ
#multiple	`89.83% <ø> (ø)`	⬆️
#single	`41.74% <ø> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e8d951d...cf2e062. Read the comment docs.

codecov · 2019-03-19T14:59:16Z

Codecov Report

Merging #25769 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #25769   +/-   ##
=======================================
  Coverage   91.26%   91.26%           
=======================================
  Files         173      173           
  Lines       52982    52982           
=======================================
  Hits        48356    48356           
  Misses       4626     4626

Flag	Coverage Δ
#multiple	`89.83% <ø> (ø)`	⬆️
#single	`41.75% <ø> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4663951...6095ea3. Read the comment docs.

doc/source/whatsnew/v0.25.0.rst

pandas/tests/io/parser/test_common.py

vnlitvinov · 2019-03-19T15:43:14Z

Note for reviewers: it seems that testing is somehow broken in the upstream, as I didn't change anything related to numpy or pytables, but they now fail. It seems that the image used for testing has changed.

vnlitvinov · 2019-03-19T16:51:44Z

@jreback should I mark as resolved the conversations I fixed?

jreback · 2019-03-19T16:53:42Z

@vnlitvin yes you can resolve things.

we already know that the CI is failing. I will ask you to rebase at some point.

pandas/tests/io/parser/test_common.py

jreback · 2019-03-19T18:19:02Z

can you merge master

…s of chars Updated whatsnew

Use CP-1252 and CP-1251 filenames separately, skip the test on Windows on < 3.6 as it won't pass

vnlitvinov · 2019-03-20T10:02:00Z

can you merge master

I've rebased on latest master, so commit history will be clean.

jreback · 2019-03-20T12:22:42Z

thanks @vnlitvin

* upstream/master: (55 commits) PERF: Improve performance of StataReader (pandas-dev#25780) Speed up tokenizing of a row in csv and xstrtod parsing (pandas-dev#25784) BUG: Fix _binop for operators for serials which has more than one returns (divmod/rdivmod). (pandas-dev#25588) BUG-24971 copying blocks also considers ndim (pandas-dev#25521) CLN: Panel reference from documentation (pandas-dev#25649) ENH: Quoting column names containing spaces with backticks to use them in query and eval. (pandas-dev#24955) BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769) DOC: clean bug fix section in whatsnew (pandas-dev#25792) DOC: Fixed PeriodArray api ref (pandas-dev#25526) Move locale code out of tm, into _config (pandas-dev#25757) Unpin pycodestyle (pandas-dev#25789) Add test for rdivmod on EA array (GH23287) (pandas-dev#24047) ENH: Support datetime.timezone objects (pandas-dev#25065) Cython language level 3 (pandas-dev#24538) API: concat on sparse values (pandas-dev#25719) TST: assert_produces_warning works with filterwarnings (pandas-dev#25721) make core.config self-contained (pandas-dev#25613) CLN: replace %s syntax with .format in pandas.io.parsers (pandas-dev#24721) TST: Check pytables<3.5.1 when skipping (pandas-dev#25773) DOC: Fix typo in docstring of DataFrame.memory_usage (pandas-dev#25770) ...

@jreback

* Fix pandas-devgh-15086 properly instead of making a workaround * fix code style * Make sure test_filename_with_special_chars properly tests combinations of chars Updated whatsnew * Address comments by @jreback * Parametrize test_filename_with_special_chars Use CP-1252 and CP-1251 filenames separately, skip the test on Windows on < 3.6 as it won't pass

gfyoung added Bug IO CSV read_csv, to_csv Windows Windows OS and removed IO CSV read_csv, to_csv labels Mar 18, 2019

vnlitvinov force-pushed the fix-gh-15086 branch from cf2e062 to c3a1d15 Compare March 19, 2019 15:00

jreback requested changes Mar 19, 2019

View reviewed changes

doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved

pandas/tests/io/parser/test_common.py Outdated Show resolved Hide resolved

pandas/tests/io/parser/test_common.py Outdated Show resolved Hide resolved

pandas/tests/io/parser/test_common.py Outdated Show resolved Hide resolved

jreback changed the title ~~Fix gh-15086 properly instead of making a workaround~~ BUG: reading windows utf8 filenames in py3.6 Mar 19, 2019

gfyoung reviewed Mar 19, 2019

View reviewed changes

pandas/tests/io/parser/test_common.py Outdated Show resolved Hide resolved

Fix pandas-devgh-15086 properly instead of making a workaround

d3dede6

anmyachev and others added 4 commits March 20, 2019 12:49

fix code style

810416e

Make sure test_filename_with_special_chars properly tests combination…

8fd2214

…s of chars Updated whatsnew

Address comments by @jreback

7961889

Parametrize test_filename_with_special_chars

6095ea3

Use CP-1252 and CP-1251 filenames separately, skip the test on Windows on < 3.6 as it won't pass

vnlitvinov force-pushed the fix-gh-15086 branch from 816452a to 6095ea3 Compare March 20, 2019 10:01

jreback added this to the 0.25.0 milestone Mar 20, 2019

jreback approved these changes Mar 20, 2019

View reviewed changes

jreback merged commit 6e979d8 into pandas-dev:master Mar 20, 2019

vnlitvinov deleted the fix-gh-15086 branch March 28, 2019 14:30

Uh oh!

BUG: reading windows utf8 filenames in py3.6 #25769

BUG: reading windows utf8 filenames in py3.6 #25769

Uh oh!

Conversation

vnlitvinov commented Mar 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

simonjayhawkins commented Mar 18, 2019

Uh oh!

vnlitvinov commented Mar 18, 2019

Uh oh!

simonjayhawkins commented Mar 18, 2019

Uh oh!

vnlitvinov commented Mar 18, 2019

Uh oh!

simonjayhawkins commented Mar 18, 2019

Uh oh!

vnlitvinov commented Mar 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gfyoung commented Mar 18, 2019

Uh oh!

vnlitvinov commented Mar 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gfyoung commented Mar 18, 2019

Uh oh!

jreback commented Mar 18, 2019

Uh oh!

gfyoung commented Mar 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented Mar 18, 2019

Uh oh!

vnlitvinov commented Mar 18, 2019

Uh oh!

gfyoung commented Mar 18, 2019

Uh oh!

pep8speaks commented Mar 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated at 2019-03-20 10:01:09 UTC

Uh oh!

codecov bot commented Mar 19, 2019

Codecov Report

Uh oh!

codecov bot commented Mar 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vnlitvinov commented Mar 19, 2019

Uh oh!

vnlitvinov commented Mar 19, 2019

Uh oh!

jreback commented Mar 19, 2019

Uh oh!

Uh oh!

jreback commented Mar 19, 2019

Uh oh!

vnlitvinov commented Mar 20, 2019

Uh oh!

jreback commented Mar 20, 2019

Uh oh!

Uh oh!

vnlitvinov commented Mar 18, 2019 •

edited

Loading

vnlitvinov commented Mar 18, 2019 •

edited

Loading

vnlitvinov commented Mar 18, 2019 •

edited

Loading

gfyoung commented Mar 18, 2019 •

edited

Loading

pep8speaks commented Mar 19, 2019 •

edited

Loading

codecov bot commented Mar 19, 2019 •

edited

Loading