-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: reading windows utf8 filenames in py3.6 #25769
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
is there an open issue about this? cc @gfyoung |
I didn't open an issue when I faced that, instead I have fixed the code. |
It normally helps to create a code sample which can be used as the test and for others to recreate the problem. |
The test that failed was the same test that was introduced in #24758, it just didn't pass on my machine, most likely due to different locale set up. My fix is locale-independent, so it should pass on any machine with any locale set. |
so i guess that |
My diggings show that when Python faces a request to convert from Unicode to bytestring using According to https://stackoverflow.com/a/9039278 there is no programmatic way to change So, in short, the answer to this question:
is "no" :( What does seem to be possible is to change the test so it would fail both for Note that it would require some dancing in the test to not break Python < 3.6, though. |
Not that I'm aware of.
I would just add a new test actually. Second, if you could post the output that you were getting prior to this fix for that test, that would be good for everyone's reference. Finally, if you could add a whatsnew note for |
I'm not sure what second test would add here, what I'm thinking is simply crafting a string that isn't representable in neither
Failure without my patch is easy:
Will do tomorrow. |
No worries. Feel free to skip the test for Python < 3.6 (we're already in talks about phasing out 3.5 support once we finish with 2.7). |
pls don’t skip for 3.5 |
@jreback : @vnlitvin fix relies on leveraging Python 3.6+ internal functionality that can't be replicated in Python 3.5. That is why I suggested he add a new test (instead of modifying the existing one) that can then be skipped. The changes he has made are otherwise compatible with the rest of our tests. |
ah I c. carry on then. |
The problem with existing one is that it would fail on Python 3.5 with e.g. 1251 (Cyrillic) locale set. |
Yep, that sounds like a plan. Go for it! |
Codecov Report
@@ Coverage Diff @@
## master #25769 +/- ##
=======================================
Coverage 91.25% 91.25%
=======================================
Files 172 172
Lines 52965 52965
=======================================
Hits 48335 48335
Misses 4630 4630
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #25769 +/- ##
=======================================
Coverage 91.26% 91.26%
=======================================
Files 173 173
Lines 52982 52982
=======================================
Hits 48356 48356
Misses 4626 4626
Continue to review full report at Codecov.
|
cf2e062
to
c3a1d15
Compare
Note for reviewers: it seems that testing is somehow broken in the upstream, as I didn't change anything related to |
@jreback should I mark as resolved the conversations I fixed? |
@vnlitvin yes you can resolve things. we already know that the CI is failing. I will ask you to rebase at some point. |
can you merge master |
…s of chars Updated whatsnew
Use CP-1252 and CP-1251 filenames separately, skip the test on Windows on < 3.6 as it won't pass
816452a
to
6095ea3
Compare
I've rebased on latest master, so commit history will be clean. |
thanks @vnlitvin |
* upstream/master: (55 commits) PERF: Improve performance of StataReader (pandas-dev#25780) Speed up tokenizing of a row in csv and xstrtod parsing (pandas-dev#25784) BUG: Fix _binop for operators for serials which has more than one returns (divmod/rdivmod). (pandas-dev#25588) BUG-24971 copying blocks also considers ndim (pandas-dev#25521) CLN: Panel reference from documentation (pandas-dev#25649) ENH: Quoting column names containing spaces with backticks to use them in query and eval. (pandas-dev#24955) BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769) DOC: clean bug fix section in whatsnew (pandas-dev#25792) DOC: Fixed PeriodArray api ref (pandas-dev#25526) Move locale code out of tm, into _config (pandas-dev#25757) Unpin pycodestyle (pandas-dev#25789) Add test for rdivmod on EA array (GH23287) (pandas-dev#24047) ENH: Support datetime.timezone objects (pandas-dev#25065) Cython language level 3 (pandas-dev#24538) API: concat on sparse values (pandas-dev#25719) TST: assert_produces_warning works with filterwarnings (pandas-dev#25721) make core.config self-contained (pandas-dev#25613) CLN: replace %s syntax with .format in pandas.io.parsers (pandas-dev#24721) TST: Check pytables<3.5.1 when skipping (pandas-dev#25773) DOC: Fix typo in docstring of DataFrame.memory_usage (pandas-dev#25770) ...
* Fix pandas-devgh-15086 properly instead of making a workaround * fix code style * Make sure test_filename_with_special_chars properly tests combinations of chars Updated whatsnew * Address comments by @jreback * Parametrize test_filename_with_special_chars Use CP-1252 and CP-1251 filenames separately, skip the test on Windows on < 3.6 as it won't pass
git diff upstream/master -u -- "*.py" | flake8 --diff
Test that was added in prior fix was failing on my machine, where my locale was set to
CP-1251
(which apparently cannot handle some unicode symbols in the test even when using MBCS, as MBCS encoding is locale-dependent).This PR fixes the issue removing the workaround by switching to converting UTF8 to UTF16 and using Win32 Unicode file API (much like Python >= 3.6 does internally).
P.S. This as a bonus adds possibility to use
\\?\C\long\file\name
addressing if desired to overcome 260 symbols limitation on path length.