Skip to content

BUG: Fixed Unicode decoding error in Period.strftime when a locale-specific directive is used #46405

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

smarie
Copy link
Contributor

@smarie smarie commented Mar 17, 2022

Note: this was extracted from #46116 to decompose the big PR into reviewable bits. More to come soon

@smarie smarie changed the title Fixed Unicode decoding error in Period.strftime and PeriodIndex.strftime when a locale-specific directive is used Fixed Unicode decoding error in Period.strftime when a locale-specific directive is used Mar 17, 2022
@smarie
Copy link
Contributor Author

smarie commented Mar 17, 2022

I have an issue: I can reproduce the bug locally but not on CI.
It seems that there is no CI runner as of today that uses non-utf8 encoded locale.

The issue can only be reproduced when the current locale encoding is not utf8:

image

(zh_CN defaults encoding is gb2312)

What is the recommended way to proceed ?

  • fix the issue without having the CI reproduce it ? (not recommended...)
  • add a workflow in posix.yml
  • other ?

I'll try the second option for now (when github issues are fixed, it seems down for now)

@smarie
Copy link
Contributor Author

smarie commented Mar 17, 2022

gha runner was not triggered apparently :( re-trying

@smarie
Copy link
Contributor Author

smarie commented Mar 17, 2022

Unnfortunately it does not work :(

/usr/bin/bash: warning: setlocale: LC_ALL: cannot change locale (zh_CN.GB2312)

EDIT: retrying with zh_CN.gb2312

@smarie
Copy link
Contributor Author

smarie commented Mar 17, 2022

ok with lower case zh_CN.gb2312 it is worse: the runner does not even start :(
Can anyone help me to define a runner configured with this locale+encoding ? (@jreback maybe ?) Otherwise I am not able to reproduce the issue on CI, which is a bit problematic to prevent regression once I push the fix.

@smarie
Copy link
Contributor Author

smarie commented Mar 18, 2022

I made progress: I was able to install the missing locale. However this leads to a segmentation fault !

https://github.com/pandas-dev/pandas/runs/5606041220?check_suite_focus=true#step:12:22

I will not try to resolve this in this PR.

Instead, my new attempt will be to install (but not activate) the zh_CN locale in the zh_ZH.utf-8 worker. It will then only be activated in the test requiring it.

@smarie
Copy link
Contributor Author

smarie commented Mar 19, 2022

It worked ! :)

https://github.com/pandas-dev/pandas/runs/5606395924?check_suite_focus=true#step:12:57

Now that the bug can be reproduced on CI, I will push the fix, probably monday.

@smarie
Copy link
Contributor Author

smarie commented Mar 19, 2022

All set - eventually I did it today :)
Ready for review. Maybe @mroeschke or @jbrockmendel ?

params=[
pytest.param(None, id=str(locale.getlocale())),
"it_IT.utf8",
"zh_CN",
Copy link
Contributor Author

@smarie smarie Mar 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

Suggested change
"zh_CN",
"zh_CN", # Note: encoding is automatically 'gb2312'

What do you think ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to add zh_CH.utf8 and zh_CH.gb2312 explicitly

Copy link
Contributor Author

@smarie smarie Mar 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll add zh_CH.utf8.
Strangely enough, zh_CN.gb2312 did not seem to be working, but zh_CN does involve gb2312 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will also add it_IT by the way, so that both locales have their utf8 AND non-utf8 encoding activated in the test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

- name: Generate extra locales
# These extra locales will be available for locale.setlocale() calls in tests
run: |
sudo locale-gen ${{ env.EXTRA_LOC }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the Locale: it_IT.utf8 isnt working either?

Copy link
Contributor Author

@smarie smarie Mar 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends what you mean by "working". Do you refer to the bug #46319 or to something more general ?

  • The issue BUG: UnicodeError when using Period.strftime with non-utf8 locale #46319 happens only when the current locale uses a non-unicode encoding (no utf-8). So it does not happen in any of the workers before this PR
  • So in the test fixture parameters I added the zh_CN locale, that has a non-utf-8 encoding.
  • Unfortunately this particular parameter value was never used in any of the workers (the test was skipped), since locale.setlocale was raising an error (locale not available)
  • I finally found a way to make the locale available on linux workers: sudo locale-gen <locale>

Does that answer your question ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was assumed setting the environment variables

      LANG: ${{ matrix.lang || '' }}
      LC_ALL: ${{ matrix.lc_all || '' }}

would ensure the locale for the test run would not be the default (en_US.UTF-8) for the other locale builds.

So IIUC, this is necessary if setting a non-utf8 locale?

Copy link
Contributor Author

@smarie smarie Mar 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this is what I thought too.
Unfortunately when I tried setting LANG and LC_ALL to zh_CN (non-utf8), then there was a strange segmentation fault appearing (see #46405 (comment))

So this is why I now do not set it through these env variables but I only install it.

Maybe the problem is that when the env variables are set, many side effects happen in the build chain, that I do not quite understand.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're setting the locale this way, can we remove setting the LANG and LC_ALL environment variables here?

Copy link
Contributor Author

@smarie smarie Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I have no idea :) At least for the tests in this PR, yes these env variables are not used.

However these env vars are probably related to other locale-related tests, including being able to compile pandas on alternate locale, etc. It seems that in the CI those env variables are used to even add a specific line at the beginning of the init file of pandas ! (see ci/setup_env.sh)

This seems a bit beyond the scope of this PR. However it is maybe worth keeping track of this, in a dedicated ticket.

@mroeschke mroeschke removed this from the 1.5 milestone Aug 22, 2022
@smarie
Copy link
Contributor Author

smarie commented Sep 4, 2022

@mroeschke @jreback @jbrockmendel all set now. Sorry for the delay.
Note that the only failing CI build seems completely unrelated (the issue is about a TLS certificate bundle)

Maybe still on time for 1.5 milestone ?

@smarie smarie requested a review from mroeschke September 6, 2022 06:50
@mroeschke
Copy link
Member

Maybe still on time for 1.5 milestone ?

Sorry, since we released the rc already, we are only backporting fixes that are uncovered in the 1.5.0rc. Could you move the whatsnew note to 1.6.0.rst?

@pep8speaks
Copy link

Hello @smarie! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 43:36: E261 at least two spaces before inline comment

@smarie
Copy link
Contributor Author

smarie commented Sep 6, 2022

@mroeschke this should be ok now.

@smarie smarie requested a review from mroeschke September 8, 2022 09:17
@mroeschke mroeschke removed the Stale label Sep 8, 2022
@mroeschke mroeschke added this to the 1.6 milestone Sep 8, 2022
@mroeschke mroeschke merged commit ae6dc97 into pandas-dev:main Sep 8, 2022
@mroeschke
Copy link
Member

Thanks @smarie for your patience and persistence! Very nice fix

@smarie
Copy link
Contributor Author

smarie commented Sep 8, 2022

Thanks a lot @mroeschke !
I can not proceed back to #46116

@smarie smarie deleted the feature/46319_locale_strftime_period branch September 8, 2022 20:23
@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
…specific directive is used (pandas-dev#46405)

* Added test representative of pandas-dev#46319. Should fail on CI

* Added a gha worker with non utf 8 zh_CN encoding

* Attempt to fix the encoding so that locale works

* Added the fix, but not using it for now, until CI is able to reproduce the issue.

* Crazy idea: maybe simply removing the .utf8 modifier will use the right encoding !

* Hopefully fixing the locale not available error

* Now simply generating the locale, not updating the ubuntu one

* Trying to install the locale without enabling it

* Stupid mistake

* Testing the optional locale generator condition

* Put back all runners

* Added whatsnew

* Now using the fix

* As per code review: moved locale-switching fixture `overridden_locale` to conftest

* Flake8

* Added comments on the runner

* Added a non-utf8 locale in the `it_IT` runner. Added the zh_CN.utf8 locale in the tests

* Improved readability of fixture `overridden_locale` as per code review

* Added two comments on default encoding

* Fixed pandas-dev#46319 by adding a new `char_to_string_locale` function in the `tslibs.util` module, able to decode char* using the current locale.

* As per code review: modified the test to contain non-utf8 chars. Fixed the resulting issue.

* Split the test in two for clarity

* Fixed test and flake8 error.

* Updated whatsnew to ref pandas-dev#46468 . Updated test name

* Removing wrong whatsnew bullet

* Nitpick on whatsnew as per code review

* Fixed build error rst directive

* Names incorrectly reverted in last merge commit

* Fixed test_localization so that pandas-dev#46595 can be demonstrated on windows targets (even if today these do not run on windows targets, see pandas-dev#46597)

* Fixed `tm.set_locale` context manager, it could error and leak when category LC_ALL was used. Fixed pandas-dev#46595

* Removed the fixture as per code review, and added corresponding parametrization in tests.

* Dummy mod to trigger CI again

* reverted dummy mod

* Attempt to fix the remaining error on the numpy worker

* Fixed issue in `_from_ordinal`

* Added asserts to try to understand

* Reverted debugging asserts and applied fix for numpy repeat from pandas-dev#47670.

* Fixed the last issue on numpy dev: a TypeError message had changed

* Code review: Removed `EXTRA_LOC`

* Code review: removed commented line

* Code review: reverted out of scope change

* Code review: reverted out of scope change

* Fixed unused import

* Fixed revert mistake

* Moved whatsnew to 1.6.0

* Update pandas/tests/io/parser/test_quoting.py

Co-authored-by: Sylvain MARIE <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Period Period data type Unicode Unicode strings
Projects
None yet
5 participants