Skip to content

BUG: reading windows utf8 filenames in py3.6 #25769

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 20, 2019

Conversation

vnlitvinov
Copy link
Contributor

@vnlitvinov vnlitvinov commented Mar 18, 2019

Test that was added in prior fix was failing on my machine, where my locale was set to CP-1251 (which apparently cannot handle some unicode symbols in the test even when using MBCS, as MBCS encoding is locale-dependent).

This PR fixes the issue removing the workaround by switching to converting UTF8 to UTF16 and using Win32 Unicode file API (much like Python >= 3.6 does internally).

P.S. This as a bonus adds possibility to use \\?\C\long\file\name addressing if desired to overcome 260 symbols limitation on path length.

@simonjayhawkins
Copy link
Member

Test that was added in prior fix was failing on my machine, where my locale was set to CP-1251

is there an open issue about this?

cc @gfyoung

@vnlitvinov
Copy link
Contributor Author

I didn't open an issue when I faced that, instead I have fixed the code.
Should I file an issue before the code could be accepted?

@simonjayhawkins
Copy link
Member

It normally helps to create a code sample which can be used as the test and for others to recreate the problem.

@vnlitvinov
Copy link
Contributor Author

The test that failed was the same test that was introduced in #24758, it just didn't pass on my machine, most likely due to different locale set up. My fix is locale-independent, so it should pass on any machine with any locale set.

@simonjayhawkins
Copy link
Member

so i guess that test_filename_with_special_chars should be parametrised over different locales. would monkeypatch allow this here?

@vnlitvinov
Copy link
Contributor Author

vnlitvinov commented Mar 18, 2019

My diggings show that when Python faces a request to convert from Unicode to bytestring using mbcs encoding on Windows it uses CP_ACP as codepage for WideCharToMultiByte argument (that is, default ANSI code page).

According to https://stackoverflow.com/a/9039278 there is no programmatic way to change CP_ACP value, as this is a system-wide setting controlled via Region settings in Windows Control Panel.

So, in short, the answer to this question:

would monkeypatch allow this here?

is "no" :(

What does seem to be possible is to change the test so it would fail both for 1252 and 1251 codepages unless fixed as I do here by using a superset of characters some of which are not representable in 1251 and some in 1252.

Note that it would require some dancing in the test to not break Python < 3.6, though.

@gfyoung
Copy link
Member

gfyoung commented Mar 18, 2019

is there an open issue about this?

Not that I'm aware of.

What does seem to be possible is to change the test so it would fail both for 1252 and 1251 codepages unless fixed as I do here by using a superset of characters some of which are not representable in 1251 and some in 1252.

I would just add a new test actually.

Second, if you could post the output that you were getting prior to this fix for that test, that would be good for everyone's reference.

Finally, if you could add a whatsnew note for 0.25.0. Even though it's fixing an already-closed issue, you're adding a more robust fix. Otherwise, so far so good!

@gfyoung gfyoung added Bug IO CSV read_csv, to_csv Windows Windows OS and removed IO CSV read_csv, to_csv labels Mar 18, 2019
@vnlitvinov
Copy link
Contributor Author

vnlitvinov commented Mar 18, 2019

I would just add a new test actually.

I'm not sure what second test would add here, what I'm thinking is simply crafting a string that isn't representable in neither 1251 nor 1252 (my current stab at it is sé-es-vé-sй.csv - it cannot be encoded neither as 1251 nor as 1252 bytestrings). The only quirk that remains there is that I'd probably have to disable the test for Python < 3.6 (otherwise it would forever fail there on strange locales - as prior to 3.6 I cannot be sure that what this function gets as an input parameter is a UTF-8-encoded string).

Second, if you could post the output that you were getting prior to this fix for that test, that would be good for everyone's reference.

Failure without my patch is easy:

________________ test_filename_with_special_chars[c_high] ____________________________________

all_parsers = <pandas.tests.io.parser.conftest.CParserHighMemory object at 0x000002CA11F92320>

    def test_filename_with_special_chars(all_parsers):
        # see gh-15086.
        parser = all_parsers
        df = DataFrame({"a": [1, 2, 3]})

        with tm.ensure_clean("sé-es-vé.csv") as path:
            df.to_csv(path, index=False)

>           result = parser.read_csv(path)

pandas\tests\io\parser\test_common.py:1915:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\tests\io\parser\conftest.py:22: in read_csv
    return read_csv(*args, **kwargs)
pandas\io\parsers.py:702: in parser_f
    return _read(filepath_or_buffer, kwds)
pandas\io\parsers.py:429: in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
pandas\io\parsers.py:895: in __init__
    self._make_engine(self.engine)
pandas\io\parsers.py:1122: in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
pandas\io\parsers.py:1853: in __init__
    self._reader = parsers.TextReader(src, **kwds)
pandas\_libs\parsers.pyx:388: in pandas._libs.parsers.TextReader.__cinit__
    self._setup_parser_source(source)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   source = source.encode(encoding)
E   UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

pandas\_libs\parsers.pyx:687: UnicodeEncodeError

Finally, if you could add a whatsnew note for 0.25.0

Will do tomorrow.

@gfyoung
Copy link
Member

gfyoung commented Mar 18, 2019

The only quirk that remains there is that I'd probably have to disable the test for Python < 3.6 (otherwise it would forever fail there on strange locales - as prior to 3.6 I cannot be sure that what this function gets as an input parameter is a UTF-8-encoded string)

No worries. Feel free to skip the test for Python < 3.6 (we're already in talks about phasing out 3.5 support once we finish with 2.7).

@jreback
Copy link
Contributor

jreback commented Mar 18, 2019

pls don’t skip for 3.5

@gfyoung
Copy link
Member

gfyoung commented Mar 18, 2019

pls don’t skip for 3.5

@jreback : @vnlitvin fix relies on leveraging Python 3.6+ internal functionality that can't be replicated in Python 3.5. That is why I suggested he add a new test (instead of modifying the existing one) that can then be skipped.

The changes he has made are otherwise compatible with the rest of our tests.

@jreback
Copy link
Contributor

jreback commented Mar 18, 2019

ah I c. carry on then.

@vnlitvinov
Copy link
Contributor Author

the existing one

The problem with existing one is that it would fail on Python 3.5 with e.g. 1251 (Cyrillic) locale set.
What can I add is I can skip the test on non-1252 locales for Python < 3.6 (though it seems that the test is kinda useless for < 3.6 - it basically tests that Windows open() function works - doesn't seem to be a job for a Pandas test).

@gfyoung
Copy link
Member

gfyoung commented Mar 18, 2019

What can I add is I can skip the test on non-1252 locales for Python < 3.6

Yep, that sounds like a plan. Go for it!

@pep8speaks
Copy link

pep8speaks commented Mar 19, 2019

Hello @vnlitvin! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-03-20 10:01:09 UTC

@codecov
Copy link

codecov bot commented Mar 19, 2019

Codecov Report

Merging #25769 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #25769   +/-   ##
=======================================
  Coverage   91.25%   91.25%           
=======================================
  Files         172      172           
  Lines       52965    52965           
=======================================
  Hits        48335    48335           
  Misses       4630     4630
Flag Coverage Δ
#multiple 89.83% <ø> (ø) ⬆️
#single 41.74% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e8d951d...cf2e062. Read the comment docs.

@codecov
Copy link

codecov bot commented Mar 19, 2019

Codecov Report

Merging #25769 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #25769   +/-   ##
=======================================
  Coverage   91.26%   91.26%           
=======================================
  Files         173      173           
  Lines       52982    52982           
=======================================
  Hits        48356    48356           
  Misses       4626     4626
Flag Coverage Δ
#multiple 89.83% <ø> (ø) ⬆️
#single 41.75% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4663951...6095ea3. Read the comment docs.

@vnlitvinov
Copy link
Contributor Author

Note for reviewers: it seems that testing is somehow broken in the upstream, as I didn't change anything related to numpy or pytables, but they now fail. It seems that the image used for testing has changed.

@jreback jreback changed the title Fix gh-15086 properly instead of making a workaround BUG: reading windows utf8 filenames in py3.6 Mar 19, 2019
@vnlitvinov
Copy link
Contributor Author

@jreback should I mark as resolved the conversations I fixed?

@jreback
Copy link
Contributor

jreback commented Mar 19, 2019

@vnlitvin yes you can resolve things.

we already know that the CI is failing. I will ask you to rebase at some point.

@jreback
Copy link
Contributor

jreback commented Mar 19, 2019

can you merge master

anmyachev and others added 4 commits March 20, 2019 12:49
Use CP-1252 and CP-1251 filenames separately,
skip the test on Windows on < 3.6 as it won't pass
@vnlitvinov
Copy link
Contributor Author

can you merge master

I've rebased on latest master, so commit history will be clean.

@jreback jreback added this to the 0.25.0 milestone Mar 20, 2019
@jreback jreback merged commit 6e979d8 into pandas-dev:master Mar 20, 2019
@jreback
Copy link
Contributor

jreback commented Mar 20, 2019

thanks @vnlitvin

thoo added a commit to thoo/pandas that referenced this pull request Mar 20, 2019
* upstream/master: (55 commits)
  PERF: Improve performance of StataReader (pandas-dev#25780)
  Speed up tokenizing of a row in csv and xstrtod parsing (pandas-dev#25784)
  BUG: Fix _binop for operators for serials which has more than one returns (divmod/rdivmod). (pandas-dev#25588)
  BUG-24971 copying blocks also considers ndim (pandas-dev#25521)
  CLN: Panel reference from documentation (pandas-dev#25649)
  ENH: Quoting column names containing spaces with backticks to use them in query and eval. (pandas-dev#24955)
  BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769)
  DOC: clean bug fix section in whatsnew (pandas-dev#25792)
  DOC: Fixed PeriodArray api ref (pandas-dev#25526)
  Move locale code out of tm, into _config (pandas-dev#25757)
  Unpin pycodestyle (pandas-dev#25789)
  Add test for rdivmod on EA array (GH23287) (pandas-dev#24047)
  ENH: Support datetime.timezone objects (pandas-dev#25065)
  Cython language level 3 (pandas-dev#24538)
  API: concat on sparse values (pandas-dev#25719)
  TST: assert_produces_warning works with filterwarnings (pandas-dev#25721)
  make core.config self-contained (pandas-dev#25613)
  CLN: replace %s syntax with .format in pandas.io.parsers (pandas-dev#24721)
  TST: Check pytables<3.5.1 when skipping (pandas-dev#25773)
  DOC: Fix typo in docstring of DataFrame.memory_usage  (pandas-dev#25770)
  ...
@vnlitvinov vnlitvinov deleted the fix-gh-15086 branch March 28, 2019 14:30
anmyachev pushed a commit to anmyachev/pandas that referenced this pull request Apr 18, 2019
* Fix pandas-devgh-15086 properly instead of making a workaround

* fix code style

* Make sure test_filename_with_special_chars properly tests combinations of chars
Updated whatsnew

* Address comments by @jreback

* Parametrize test_filename_with_special_chars

Use CP-1252 and CP-1251 filenames separately,
skip the test on Windows on < 3.6 as it won't pass
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Windows Windows OS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OSError when reading file with accents in file path
6 participants