-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fixed Unicode decoding error in Period.strftime
when a locale-specific directive is used
#46405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fixed Unicode decoding error in Period.strftime
when a locale-specific directive is used
#46405
Conversation
Period.strftime
and PeriodIndex.strftime
when a locale-specific directive is usedPeriod.strftime
when a locale-specific directive is used
I have an issue: I can reproduce the bug locally but not on CI. The issue can only be reproduced when the current locale encoding is not utf8: (zh_CN defaults encoding is gb2312) What is the recommended way to proceed ?
I'll try the second option for now (when github issues are fixed, it seems down for now) |
gha runner was not triggered apparently :( re-trying |
Unnfortunately it does not work :(
EDIT: retrying with zh_CN.gb2312 |
ok with lower case |
…ure/46319_locale_strftime_period � Conflicts: � pandas/tests/io/formats/test_format.py
I made progress: I was able to install the missing locale. However this leads to a segmentation fault ! https://github.com/pandas-dev/pandas/runs/5606041220?check_suite_focus=true#step:12:22 I will not try to resolve this in this PR. Instead, my new attempt will be to install (but not activate) the |
It worked ! :) https://github.com/pandas-dev/pandas/runs/5606395924?check_suite_focus=true#step:12:57 Now that the bug can be reproduced on CI, I will push the fix, probably monday. |
All set - eventually I did it today :) |
pandas/conftest.py
Outdated
params=[ | ||
pytest.param(None, id=str(locale.getlocale())), | ||
"it_IT.utf8", | ||
"zh_CN", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe
"zh_CN", | |
"zh_CN", # Note: encoding is automatically 'gb2312' |
What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to add zh_CH.utf8
and zh_CH.gb2312
explicitly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll add zh_CH.utf8
.
Strangely enough, zh_CN.gb2312
did not seem to be working, but zh_CN
does involve gb2312 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will also add it_IT
by the way, so that both locales have their utf8 AND non-utf8 encoding activated in the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
.github/workflows/posix.yml
Outdated
- name: Generate extra locales | ||
# These extra locales will be available for locale.setlocale() calls in tests | ||
run: | | ||
sudo locale-gen ${{ env.EXTRA_LOC }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean the Locale: it_IT.utf8
isnt working either?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends what you mean by "working". Do you refer to the bug #46319 or to something more general ?
- The issue BUG:
UnicodeError
when usingPeriod.strftime
with non-utf8 locale #46319 happens only when the current locale uses a non-unicode encoding (no utf-8). So it does not happen in any of the workers before this PR - So in the test fixture parameters I added the
zh_CN
locale, that has a non-utf-8 encoding. - Unfortunately this particular parameter value was never used in any of the workers (the test was skipped), since
locale.setlocale
was raising an error (locale not available) - I finally found a way to make the locale available on linux workers:
sudo locale-gen <locale>
Does that answer your question ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was assumed setting the environment variables
LANG: ${{ matrix.lang || '' }}
LC_ALL: ${{ matrix.lc_all || '' }}
would ensure the locale for the test run would not be the default (en_US.UTF-8) for the other locale builds.
So IIUC, this is necessary if setting a non-utf8 locale?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, this is what I thought too.
Unfortunately when I tried setting LANG
and LC_ALL
to zh_CN
(non-utf8), then there was a strange segmentation fault appearing (see #46405 (comment))
So this is why I now do not set it through these env variables but I only install it.
Maybe the problem is that when the env variables are set, many side effects happen in the build chain, that I do not quite understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're setting the locale this way, can we remove setting the LANG
and LC_ALL
environment variables here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I have no idea :) At least for the tests in this PR, yes these env variables are not used.
However these env vars are probably related to other locale-related tests, including being able to compile pandas on alternate locale, etc. It seems that in the CI those env variables are used to even add a specific line at the beginning of the init file of pandas ! (see ci/setup_env.sh
)
This seems a bit beyond the scope of this PR. However it is maybe worth keeping track of this, in a dedicated ticket.
…ure/46319_locale_strftime_period � Conflicts: � .github/workflows/ubuntu.yml
@mroeschke @jreback @jbrockmendel all set now. Sorry for the delay. Maybe still on time for 1.5 milestone ? |
Sorry, since we released the rc already, we are only backporting fixes that are uncovered in the 1.5.0rc. Could you move the whatsnew note to |
Hello @smarie! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
|
@mroeschke this should be ok now. |
Thanks @smarie for your patience and persistence! Very nice fix |
Thanks a lot @mroeschke ! |
…specific directive is used (pandas-dev#46405) * Added test representative of pandas-dev#46319. Should fail on CI * Added a gha worker with non utf 8 zh_CN encoding * Attempt to fix the encoding so that locale works * Added the fix, but not using it for now, until CI is able to reproduce the issue. * Crazy idea: maybe simply removing the .utf8 modifier will use the right encoding ! * Hopefully fixing the locale not available error * Now simply generating the locale, not updating the ubuntu one * Trying to install the locale without enabling it * Stupid mistake * Testing the optional locale generator condition * Put back all runners * Added whatsnew * Now using the fix * As per code review: moved locale-switching fixture `overridden_locale` to conftest * Flake8 * Added comments on the runner * Added a non-utf8 locale in the `it_IT` runner. Added the zh_CN.utf8 locale in the tests * Improved readability of fixture `overridden_locale` as per code review * Added two comments on default encoding * Fixed pandas-dev#46319 by adding a new `char_to_string_locale` function in the `tslibs.util` module, able to decode char* using the current locale. * As per code review: modified the test to contain non-utf8 chars. Fixed the resulting issue. * Split the test in two for clarity * Fixed test and flake8 error. * Updated whatsnew to ref pandas-dev#46468 . Updated test name * Removing wrong whatsnew bullet * Nitpick on whatsnew as per code review * Fixed build error rst directive * Names incorrectly reverted in last merge commit * Fixed test_localization so that pandas-dev#46595 can be demonstrated on windows targets (even if today these do not run on windows targets, see pandas-dev#46597) * Fixed `tm.set_locale` context manager, it could error and leak when category LC_ALL was used. Fixed pandas-dev#46595 * Removed the fixture as per code review, and added corresponding parametrization in tests. * Dummy mod to trigger CI again * reverted dummy mod * Attempt to fix the remaining error on the numpy worker * Fixed issue in `_from_ordinal` * Added asserts to try to understand * Reverted debugging asserts and applied fix for numpy repeat from pandas-dev#47670. * Fixed the last issue on numpy dev: a TypeError message had changed * Code review: Removed `EXTRA_LOC` * Code review: removed commented line * Code review: reverted out of scope change * Code review: reverted out of scope change * Fixed unused import * Fixed revert mistake * Moved whatsnew to 1.6.0 * Update pandas/tests/io/parser/test_quoting.py Co-authored-by: Sylvain MARIE <[email protected]>
UnicodeError
when usingPeriod.strftime
with non-utf8 locale #46319 , closes Not tested:Period.strftime
andPeriodIndex.strftime
with non-ascii char present in the formatting string #46468doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Note: this was extracted from #46116 to decompose the big PR into reviewable bits. More to come soon