Skip to content

Commit ae6dc97

Browse files
smarieSylvain MARIE
and
Sylvain MARIE
authored
BUG: Fixed Unicode decoding error in Period.strftime when a locale-specific directive is used (#46405)
* Added test representative of #46319. Should fail on CI * Added a gha worker with non utf 8 zh_CN encoding * Attempt to fix the encoding so that locale works * Added the fix, but not using it for now, until CI is able to reproduce the issue. * Crazy idea: maybe simply removing the .utf8 modifier will use the right encoding ! * Hopefully fixing the locale not available error * Now simply generating the locale, not updating the ubuntu one * Trying to install the locale without enabling it * Stupid mistake * Testing the optional locale generator condition * Put back all runners * Added whatsnew * Now using the fix * As per code review: moved locale-switching fixture `overridden_locale` to conftest * Flake8 * Added comments on the runner * Added a non-utf8 locale in the `it_IT` runner. Added the zh_CN.utf8 locale in the tests * Improved readability of fixture `overridden_locale` as per code review * Added two comments on default encoding * Fixed #46319 by adding a new `char_to_string_locale` function in the `tslibs.util` module, able to decode char* using the current locale. * As per code review: modified the test to contain non-utf8 chars. Fixed the resulting issue. * Split the test in two for clarity * Fixed test and flake8 error. * Updated whatsnew to ref #46468 . Updated test name * Removing wrong whatsnew bullet * Nitpick on whatsnew as per code review * Fixed build error rst directive * Names incorrectly reverted in last merge commit * Fixed test_localization so that #46595 can be demonstrated on windows targets (even if today these do not run on windows targets, see #46597) * Fixed `tm.set_locale` context manager, it could error and leak when category LC_ALL was used. Fixed #46595 * Removed the fixture as per code review, and added corresponding parametrization in tests. * Dummy mod to trigger CI again * reverted dummy mod * Attempt to fix the remaining error on the numpy worker * Fixed issue in `_from_ordinal` * Added asserts to try to understand * Reverted debugging asserts and applied fix for numpy repeat from #47670. * Fixed the last issue on numpy dev: a TypeError message had changed * Code review: Removed `EXTRA_LOC` * Code review: removed commented line * Code review: reverted out of scope change * Code review: reverted out of scope change * Fixed unused import * Fixed revert mistake * Moved whatsnew to 1.6.0 * Update pandas/tests/io/parser/test_quoting.py Co-authored-by: Sylvain MARIE <[email protected]>
1 parent c30456f commit ae6dc97

File tree

5 files changed

+109
-6
lines changed

5 files changed

+109
-6
lines changed

.github/workflows/ubuntu.yml

+16-2
Original file line numberDiff line numberDiff line change
@@ -42,18 +42,26 @@ jobs:
4242
- name: "Minimum Versions"
4343
env_file: actions-38-minimum_versions.yaml
4444
pattern: "not slow and not network and not single_cpu"
45-
- name: "Locale: it_IT.utf8"
45+
- name: "Locale: it_IT"
4646
env_file: actions-38.yaml
4747
pattern: "not slow and not network and not single_cpu"
4848
extra_apt: "language-pack-it"
49+
# Use the utf8 version as the default, it has no bad side-effect.
4950
lang: "it_IT.utf8"
5051
lc_all: "it_IT.utf8"
51-
- name: "Locale: zh_CN.utf8"
52+
# Also install it_IT (its encoding is ISO8859-1) but do not activate it.
53+
# It will be temporarily activated during tests with locale.setlocale
54+
extra_loc: "it_IT"
55+
- name: "Locale: zh_CN"
5256
env_file: actions-38.yaml
5357
pattern: "not slow and not network and not single_cpu"
5458
extra_apt: "language-pack-zh-hans"
59+
# Use the utf8 version as the default, it has no bad side-effect.
5560
lang: "zh_CN.utf8"
5661
lc_all: "zh_CN.utf8"
62+
# Also install zh_CN (its encoding is gb2312) but do not activate it.
63+
# It will be temporarily activated during tests with locale.setlocale
64+
extra_loc: "zh_CN"
5765
- name: "Copy-on-Write"
5866
env_file: actions-310.yaml
5967
pattern: "not slow and not network and not single_cpu"
@@ -148,6 +156,12 @@ jobs:
148156
# xsel for clipboard tests
149157
run: sudo apt-get update && sudo apt-get install -y xsel ${{ env.EXTRA_APT }}
150158

159+
- name: Generate extra locales
160+
# These extra locales will be available for locale.setlocale() calls in tests
161+
run: |
162+
sudo locale-gen ${{ matrix.extra_loc }}
163+
if: ${{ matrix.extra_loc }}
164+
151165
- name: Set up Conda
152166
uses: ./.github/actions/setup-conda
153167
with:

doc/source/whatsnew/v1.6.0.rst

+1-1
Original file line numberDiff line numberDiff line change
@@ -175,7 +175,7 @@ I/O
175175

176176
Period
177177
^^^^^^
178-
-
178+
- Bug in :meth:`Period.strftime` and :meth:`PeriodIndex.strftime`, raising ``UnicodeDecodeError`` when a locale-specific directive was passed (:issue:`46319`)
179179
-
180180

181181
Plotting

pandas/_libs/tslibs/period.pyx

+4-2
Original file line numberDiff line numberDiff line change
@@ -1160,7 +1160,8 @@ cdef str period_format(int64_t value, int freq, object fmt=None):
11601160
return "NaT"
11611161

11621162
if isinstance(fmt, str):
1163-
fmt = fmt.encode("utf-8")
1163+
# Encode using current locale, in case fmt contains non-utf8 chars
1164+
fmt = <bytes>util.string_encode_locale(fmt)
11641165

11651166
if fmt is None:
11661167
freq_group = get_freq_group(freq)
@@ -1229,7 +1230,8 @@ cdef str _period_strftime(int64_t value, int freq, bytes fmt):
12291230
# Execute c_strftime to process the usual datetime directives
12301231
formatted = c_strftime(&dts, <char*>fmt)
12311232

1232-
result = util.char_to_string(formatted)
1233+
# Decode result according to current locale
1234+
result = util.char_to_string_locale(formatted)
12331235
free(formatted)
12341236

12351237
# Now we will fill the placeholders corresponding to our additional directives

pandas/_libs/tslibs/util.pxd

+14
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,10 @@ cdef extern from "Python.h":
2626
const char* PyUnicode_AsUTF8AndSize(object obj,
2727
Py_ssize_t* length) except NULL
2828

29+
object PyUnicode_EncodeLocale(object obj, const char *errors) nogil
30+
object PyUnicode_DecodeLocale(const char *str, const char *errors) nogil
31+
32+
2933
from numpy cimport (
3034
float64_t,
3135
int64_t,
@@ -219,3 +223,13 @@ cdef inline const char* get_c_string_buf_and_size(str py_string,
219223

220224
cdef inline const char* get_c_string(str py_string) except NULL:
221225
return get_c_string_buf_and_size(py_string, NULL)
226+
227+
228+
cdef inline bytes string_encode_locale(str py_string):
229+
"""As opposed to PyUnicode_Encode, use current system locale to encode."""
230+
return PyUnicode_EncodeLocale(py_string, NULL)
231+
232+
233+
cdef inline object char_to_string_locale(const char* data):
234+
"""As opposed to PyUnicode_FromString, use current system locale to decode."""
235+
return PyUnicode_DecodeLocale(data, NULL)

pandas/tests/io/formats/test_format.py

+74-1
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,14 @@
11
"""
22
Test output formatting for Series/DataFrame, including to_string & reprs
33
"""
4-
from datetime import datetime
4+
from contextlib import nullcontext
5+
from datetime import (
6+
datetime,
7+
time,
8+
)
59
from io import StringIO
610
import itertools
11+
import locale
712
from operator import methodcaller
813
import os
914
from pathlib import Path
@@ -46,6 +51,13 @@
4651
use_32bit_repr = is_platform_windows() or not IS64
4752

4853

54+
def get_local_am_pm():
55+
"""Return the AM and PM strings returned by strftime in current locale."""
56+
am_local = time(1).strftime("%p")
57+
pm_local = time(13).strftime("%p")
58+
return am_local, pm_local
59+
60+
4961
@pytest.fixture(params=["string", "pathlike", "buffer"])
5062
def filepath_or_buffer_id(request):
5163
"""
@@ -3219,6 +3231,67 @@ def test_period_tz(self):
32193231
per = dt.to_period(freq="H")
32203232
assert per.format()[0] == "2013-01-01 00:00"
32213233

3234+
@pytest.mark.parametrize(
3235+
"locale_str",
3236+
[
3237+
pytest.param(None, id=str(locale.getlocale())),
3238+
"it_IT.utf8",
3239+
"it_IT", # Note: encoding will be 'ISO8859-1'
3240+
"zh_CN.utf8",
3241+
"zh_CN", # Note: encoding will be 'gb2312'
3242+
],
3243+
)
3244+
def test_period_non_ascii_fmt(self, locale_str):
3245+
# GH#46468 non-ascii char in input format string leads to wrong output
3246+
3247+
# Skip if locale cannot be set
3248+
if locale_str is not None and not tm.can_set_locale(locale_str, locale.LC_ALL):
3249+
pytest.skip(f"Skipping as locale '{locale_str}' cannot be set on host.")
3250+
3251+
# Change locale temporarily for this test.
3252+
with tm.set_locale(locale_str, locale.LC_ALL) if locale_str else nullcontext():
3253+
# Scalar
3254+
per = pd.Period("2018-03-11 13:00", freq="H")
3255+
assert per.strftime("%y é") == "18 é"
3256+
3257+
# Index
3258+
per = pd.period_range("2003-01-01 01:00:00", periods=2, freq="12h")
3259+
formatted = per.format(date_format="%y é")
3260+
assert formatted[0] == "03 é"
3261+
assert formatted[1] == "03 é"
3262+
3263+
@pytest.mark.parametrize(
3264+
"locale_str",
3265+
[
3266+
pytest.param(None, id=str(locale.getlocale())),
3267+
"it_IT.utf8",
3268+
"it_IT", # Note: encoding will be 'ISO8859-1'
3269+
"zh_CN.utf8",
3270+
"zh_CN", # Note: encoding will be 'gb2312'
3271+
],
3272+
)
3273+
def test_period_custom_locale_directive(self, locale_str):
3274+
# GH#46319 locale-specific directive leads to non-utf8 c strftime char* result
3275+
3276+
# Skip if locale cannot be set
3277+
if locale_str is not None and not tm.can_set_locale(locale_str, locale.LC_ALL):
3278+
pytest.skip(f"Skipping as locale '{locale_str}' cannot be set on host.")
3279+
3280+
# Change locale temporarily for this test.
3281+
with tm.set_locale(locale_str, locale.LC_ALL) if locale_str else nullcontext():
3282+
# Get locale-specific reference
3283+
am_local, pm_local = get_local_am_pm()
3284+
3285+
# Scalar
3286+
per = pd.Period("2018-03-11 13:00", freq="H")
3287+
assert per.strftime("%p") == pm_local
3288+
3289+
# Index
3290+
per = pd.period_range("2003-01-01 01:00:00", periods=2, freq="12h")
3291+
formatted = per.format(date_format="%y %I:%M:%S%p")
3292+
assert formatted[0] == f"03 01:00:00{am_local}"
3293+
assert formatted[1] == f"03 01:00:00{pm_local}"
3294+
32223295

32233296
class TestDatetimeIndexFormat:
32243297
def test_datetime(self):

0 commit comments

Comments
 (0)