BUG: UnicodeError when using Period.strftime with non-utf8 locale #46319

Closed
3 tasks done
smarie opened this issue Mar 11, 2022 · 5 comments · Fixed by #46405
Labels
Bug · Output-Formatting (__repr__ of pandas objects, to_string) · Period (Period data type)

Comments

@smarie
Contributor

smarie commented Mar 11, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import locale

locale.setlocale(locale.LC_ALL, "zh_CN")
p = pd.Period("2018-03-11 13:00", freq="H")
print(p.strftime("%p"))  # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf ...

Issue Description

Period.strftime("%p") returns the locale-specific representation of AM or PM. When the locale uses an encoding that is not UTF-8 compatible, the call raises a UnicodeDecodeError.

The bug does not occur with other strftime implementations, for example Timestamp.strftime.

Expected Behavior

No error, printing the actual string representing AM or PM

Installed Versions

This is on the main branch head

@smarie added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Mar 11, 2022
@smarie
Contributor Author

smarie commented Mar 11, 2022

I'll try to fix it in #46116

@smarie
Contributor Author

smarie commented Mar 11, 2022

It seems that util.char_to_string is applied to the output of c_strftime. That helper relies on PyUnicode_FromString, which decodes the buffer as UTF-8.

Maybe we should use PyUnicode_DecodeLocale instead: https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_DecodeLocale
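
To illustrate the difference in plain Python, here is a sketch of the decoding step only, not the Cython code in period.pyx. The bytes stand in for what the C strftime might return for %p, assuming the zh_CN locale encodes its strings as GBK. The helper names are hypothetical.

```python
# Stand-in for the char* buffer returned by the C strftime for "%p".
# Assumption: the zh_CN locale encodes its strings as GBK.
raw = "下午".encode("gbk")


def decode_as_utf8(buf: bytes) -> str:
    """Mimic a decode that always assumes UTF-8 input."""
    return buf.decode("utf-8")


def decode_with_encoding(buf: bytes, encoding: str) -> str:
    """Mimic a locale-aware decode by naming the encoding explicitly."""
    return buf.decode(encoding)


try:
    decode_as_utf8(raw)
    utf8_failed = False
except UnicodeDecodeError:
    # Same error class as the one in the report.
    utf8_failed = True

# Decoding with the encoding that produced the bytes recovers the label.
pm_label = decode_with_encoding(raw, "gbk")
```

In a real process the encoding would come from the active locale rather than being hard-coded as it is here.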

smarie pushed a commit to smarie/pandas that referenced this issue Mar 11, 2022
…e string returned by `c_strftime`.
@mroeschke added the Output-Formatting (__repr__ of pandas objects, to_string) and Period (Period data type) labels and removed the Needs Triage (issue that has not been reviewed by a pandas team member) label Mar 17, 2022
smarie pushed a commit to smarie/pandas that referenced this issue Mar 17, 2022
@smarie
Contributor Author

smarie commented Mar 19, 2022

Note that this issue may also occur when plotting time series with matplotlib, because the x-axis label formatter appears to use Period.strftime. For example, this is a stack trace I got from sphinx-gallery with figures using matplotlib:

WARNING: C:\(...)\doc\examples\1_demo.py failed to execute correctly: Traceback (most recent call last):
  File "C:\(...)\.nox\doc\lib\site-packages\sphinx_gallery\scrapers.py", line 378, in save_figures
    rst = scraper(block, block_vars, gallery_conf)
  File "C:\(...)\.nox\doc\lib\site-packages\sphinx_gallery\scrapers.py", line 171, in matplotlib_scraper
    fig.savefig(image_path, **these_kwargs)
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\figure.py", line 3019, in savefig
    self.canvas.print_figure(fname, **kwargs)
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\backend_bases.py", line 2319, in print_figure
    result = print_method(
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\backend_bases.py", line 1648, in wrapper
    return func(*args, **kwargs)
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\_api\deprecation.py", line 412, in wrapper
    return func(*inner_args, **inner_kwargs)
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\backends\backend_agg.py", line 540, in print_png
    FigureCanvasAgg.draw(self)
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\backends\backend_agg.py", line 436, in draw
    self.figure.draw(self.renderer)
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\artist.py", line 73, in draw_wrapper
    result = draw(artist, renderer, *args, **kwargs)
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\artist.py", line 50, in draw_wrapper
    return draw(artist, renderer)
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\figure.py", line 2810, in draw
    mimage._draw_list_compositing_images(
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\image.py", line 132, in _draw_list_compositing_images
    a.draw(renderer)
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\artist.py", line 50, in draw_wrapper
    return draw(artist, renderer)
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\axes\_base.py", line 3082, in draw
    mimage._draw_list_compositing_images(
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\image.py", line 132, in _draw_list_compositing_images
    a.draw(renderer)
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\artist.py", line 50, in draw_wrapper
    return draw(artist, renderer)
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\axis.py", line 1158, in draw
    ticks_to_draw = self._update_ticks()
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\axis.py", line 1046, in _update_ticks
    major_labels = self.major.formatter.format_ticks(major_locs)
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\ticker.py", line 224, in format_ticks
    return [self(value, i) for i, value in enumerate(values)]
  File "C:\(...)\.nox\doc\lib\site-packages\matplotlib\ticker.py", line 224, in <listcomp>
    return [self(value, i) for i, value in enumerate(values)]
  File "C:\(...)\.nox\doc\lib\site-packages\pandas\plotting\_matplotlib\converter.py", line 1074, in __call__
    return Period(ordinal=int(x), freq=self.freq).strftime(fmt)
  File "pandas\_libs\tslibs\period.pyx", line 2458, in pandas._libs.tslibs.period._Period.strftime
  File "pandas\_libs\tslibs\period.pyx", line 1225, in pandas._libs.tslibs.period.period_format
  File "pandas\_libs\tslibs\period.pyx", line 1258, in pandas._libs.tslibs.period._period_strftime
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte

smarie pushed a commit to smarie/pandas that referenced this issue Mar 21, 2022
…on in the `tslibs.util` module, able to decode char* using the current locale.
@smarie
Contributor Author

smarie commented Mar 21, 2022

Finally, note that a problem also occurs when the format string itself contains a non-ASCII character:

import pandas as pd
import locale

locale.setlocale(locale.LC_ALL, "fr_FR")
per = pd.Period("2018-03-11 13:00", freq="H")
assert per.strftime("é") == "é"  # AssertionError

In that case no error is raised, but the output string does not match the expected one.

Thanks @jreback for suggesting that this might fail!
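
A minimal sketch of how a silent mismatch like this can arise in general, assuming (hypothetically) that the format string is converted to bytes with one codec and the formatted result is converted back with another. It does not reproduce the pandas internals; latin-1 stands in here for the encoding of an fr_FR locale.

```python
fmt = "é"

# Hypothetical asymmetric path: bytes produced as UTF-8, then read back
# with a different single-byte codec. No exception is raised, but the
# text is corrupted.
mismatched = fmt.encode("utf-8").decode("latin-1")

# Symmetric path: the same codec is used in both directions, so the
# literal character survives the round trip.
round_tripped = fmt.encode("latin-1").decode("latin-1")
```

The point of the sketch is only that encoding and decoding need to agree; which codec is correct depends on the active locale.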

@smarie
Contributor Author

smarie commented Mar 22, 2022

I moved the above into a dedicated issue for clarity: the two issues have similar causes but are distinct bugs.

@jreback added this to the 1.5 milestone Mar 22, 2022
@mroeschke removed this from the 1.5 milestone Aug 15, 2022
noatamir pushed a commit to noatamir/pandas that referenced this issue Nov 9, 2022
…specific directive is used (pandas-dev#46405)

* Added test representative of pandas-dev#46319. Should fail on CI

* Added a gha worker with non utf 8 zh_CN encoding

* Attempt to fix the encoding so that locale works

* Added the fix, but not using it for now, until CI is able to reproduce the issue.

* Crazy idea: maybe simply removing the .utf8 modifier will use the right encoding !

* Hopefully fixing the locale not available error

* Now simply generating the locale, not updating the ubuntu one

* Trying to install the locale without enabling it

* Stupid mistake

* Testing the optional locale generator condition

* Put back all runners

* Added whatsnew

* Now using the fix

* As per code review: moved locale-switching fixture `overridden_locale` to conftest

* Flake8

* Added comments on the runner

* Added a non-utf8 locale in the `it_IT` runner. Added the zh_CN.utf8 locale in the tests

* Improved readability of fixture `overridden_locale` as per code review

* Added two comments on default encoding

* Fixed pandas-dev#46319 by adding a new `char_to_string_locale` function in the `tslibs.util` module, able to decode char* using the current locale.

* As per code review: modified the test to contain non-utf8 chars. Fixed the resulting issue.

* Split the test in two for clarity

* Fixed test and flake8 error.

* Updated whatsnew to ref pandas-dev#46468 . Updated test name

* Removing wrong whatsnew bullet

* Nitpick on whatsnew as per code review

* Fixed build error rst directive

* Names incorrectly reverted in last merge commit

* Fixed test_localization so that pandas-dev#46595 can be demonstrated on windows targets (even if today these do not run on windows targets, see pandas-dev#46597)

* Fixed `tm.set_locale` context manager, it could error and leak when category LC_ALL was used. Fixed pandas-dev#46595

* Removed the fixture as per code review, and added corresponding parametrization in tests.

* Dummy mod to trigger CI again

* reverted dummy mod

* Attempt to fix the remaining error on the numpy worker

* Fixed issue in `_from_ordinal`

* Added asserts to try to understand

* Reverted debugging asserts and applied fix for numpy repeat from pandas-dev#47670.

* Fixed the last issue on numpy dev: a TypeError message had changed

* Code review: Removed `EXTRA_LOC`

* Code review: removed commented line

* Code review: reverted out of scope change

* Code review: reverted out of scope change

* Fixed unused import

* Fixed revert mistake

* Moved whatsnew to 1.6.0

* Update pandas/tests/io/parser/test_quoting.py

Co-authored-by: Sylvain MARIE <[email protected]>
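
Several of the commits above deal with switching locales safely in tests. As a rough sketch (not the `tm.set_locale` implementation; the name `temporary_locale` is hypothetical), a context manager can restore the previous locale even when its body raises. It reads the current setting with `locale.setlocale(category)`, since `locale.getlocale` is documented as not supporting `LC_ALL`.

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_locale(new_locale, category=locale.LC_ALL):
    """Switch to new_locale for the duration of the with block."""
    # Calling setlocale with no second argument only queries the current
    # setting; the returned string can be passed back to setlocale later.
    saved = locale.setlocale(category)
    try:
        locale.setlocale(category, new_locale)
        yield
    finally:
        # Always restore, so a failing body cannot leak the locale.
        locale.setlocale(category, saved)


# Usage: the "C" locale is available on every platform.
with temporary_locale("C"):
    inside = locale.setlocale(locale.LC_CTYPE)
```

If the requested locale is not installed, `locale.setlocale` raises `locale.Error` inside the `try`, and the `finally` clause still restores the saved setting.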