Skip to content

BUG: Fixed Unicode decoding error in Period.strftime when a locale-specific directive is used #46405

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
60 commits
Select commit Hold shift + click to select a range
d177852
Added test representative of #46319. Should fail on CI
Mar 17, 2022
c09b76d
Added a gha worker with non utf 8 zh_CN encoding
Mar 17, 2022
9dff153
Attempt to fix the encoding so that locale works
Mar 17, 2022
2fa0736
Added the fix, but not using it for now, until CI is able to reproduc…
Mar 17, 2022
28a0921
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 18, 2022
6b85fb2
Crazy idea: maybe simply removing the .utf8 modifier will use the rig…
Mar 18, 2022
409d196
Hopefully fixing the locale not available error
Mar 18, 2022
4d09db9
Now simply generating the locale, not updating the ubuntu one
Mar 18, 2022
3c1b80c
Trying to install the locale without enabling it
Mar 18, 2022
4c687a8
Stupid mistake
Mar 18, 2022
e01e051
Testing the optional locale generator condition
Mar 19, 2022
a535249
Put back all runners
Mar 19, 2022
b4ad6dd
Added whatsnew
Mar 19, 2022
0eafd80
Now using the fix
Mar 19, 2022
b721238
As per code review: moved locale-switching fixture `overridden_locale…
Mar 19, 2022
cebed78
Flake8
Mar 19, 2022
a25f3a9
Added comments on the runner
Mar 19, 2022
d175c36
Added a non-utf8 locale in the `it_IT` runner. Added the zh_CN.utf8 l…
Mar 20, 2022
184e480
Improved readability of fixture `overridden_locale` as per code review
Mar 20, 2022
e5f3062
Added two comments on default encoding
Mar 21, 2022
22b0ae4
Fixed #46319 by adding a new `char_to_string_locale` function in the …
Mar 21, 2022
bbaa0c1
As per code review: modified the test to contain non-utf8 chars. Fixe…
Mar 21, 2022
f6857d6
Split the test in two for clarity
Mar 21, 2022
60c9e69
Fixed test and flake8 error.
Mar 21, 2022
db0e3cc
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 21, 2022
06a9567
Updated whatsnew to ref #46468 . Updated test name
Mar 22, 2022
411ccb7
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 22, 2022
e5ab44e
Removing wrong whatsnew bullet
smarie Mar 22, 2022
fe35aac
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 23, 2022
0790e72
Nitpick on whatsnew as per code review
Mar 23, 2022
d6914f4
Fixed build error rst directive
Mar 23, 2022
22c11e7
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 24, 2022
236a2a0
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 25, 2022
befa0a0
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Apr 1, 2022
0105ac9
Names incorrectly reverted in last merge commit
Apr 1, 2022
a1f3773
Fixed test_localization so that #46595 can be demonstrated on windows…
Apr 1, 2022
0fcea6c
Fixed `tm.set_locale` context manager, it could error and leak when c…
Apr 1, 2022
7e18375
Removed the fixture as per code review, and added corresponding param…
Apr 1, 2022
a1c6d83
Dummy mod to trigger CI again
Apr 1, 2022
78ac13d
reverted dummy mod
Apr 1, 2022
3fc0e25
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Apr 7, 2022
9ad0ee1
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Apr 12, 2022
406d304
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Jul 1, 2022
bc1b222
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Jul 12, 2022
fcd2ce2
Attempt to fix the remaining error on the numpy worker
Jul 12, 2022
b69b074
Fixed issue in `_from_ordinal`
Jul 12, 2022
65c3a1d
Added asserts to try to understand
Jul 12, 2022
e5bca9f
Reverted debugging asserts and applied fix for numpy repeat from #47670.
Jul 13, 2022
fbfefec
Fixed the last issue on numpy dev: a TypeError message had changed
Jul 13, 2022
b64548b
Merge branch 'main' into feature/46319_locale_strftime_period
smarie Jul 16, 2022
4d026cf
Code review: Removed `EXTRA_LOC`
Sep 3, 2022
cf3be80
Code review: removed commented line
Sep 3, 2022
8f65bd9
Code review: reverted out of scope change
Sep 3, 2022
7b414a9
Code review: reverted out of scope change
Sep 3, 2022
cb43fd5
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Sep 3, 2022
59b4fe3
Fixed unused import
Sep 3, 2022
4950e06
Fixed revert mistake
Sep 6, 2022
14f854b
Moved whatsnew to 1.6.0
Sep 6, 2022
90c31cb
Update pandas/tests/io/parser/test_quoting.py
smarie Sep 6, 2022
5d467e2
Merge branch 'main' into feature/46319_locale_strftime_period
smarie Sep 6, 2022
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 16 additions & 2 deletions .github/workflows/ubuntu.yml
Original file line number Diff line number Diff line change
Expand Up @@ -42,18 +42,26 @@ jobs:
- name: "Minimum Versions"
env_file: actions-38-minimum_versions.yaml
pattern: "not slow and not network and not single_cpu"
- name: "Locale: it_IT.utf8"
- name: "Locale: it_IT"
env_file: actions-38.yaml
pattern: "not slow and not network and not single_cpu"
extra_apt: "language-pack-it"
# Use the utf8 version as the default, it has no bad side-effect.
lang: "it_IT.utf8"
lc_all: "it_IT.utf8"
- name: "Locale: zh_CN.utf8"
# Also install it_IT (its encoding is ISO8859-1) but do not activate it.
# It will be temporarily activated during tests with locale.setlocale
extra_loc: "it_IT"
- name: "Locale: zh_CN"
env_file: actions-38.yaml
pattern: "not slow and not network and not single_cpu"
extra_apt: "language-pack-zh-hans"
# Use the utf8 version as the default, it has no bad side-effect.
lang: "zh_CN.utf8"
lc_all: "zh_CN.utf8"
# Also install zh_CN (its encoding is gb2312) but do not activate it.
# It will be temporarily activated during tests with locale.setlocale
extra_loc: "zh_CN"
- name: "Copy-on-Write"
env_file: actions-310.yaml
pattern: "not slow and not network and not single_cpu"
Expand Down Expand Up @@ -148,6 +156,12 @@ jobs:
# xsel for clipboard tests
run: sudo apt-get update && sudo apt-get install -y xsel ${{ env.EXTRA_APT }}

- name: Generate extra locales
# These extra locales will be available for locale.setlocale() calls in tests
run: |
sudo locale-gen ${{ matrix.extra_loc }}
if: ${{ matrix.extra_loc }}

- name: Set up Conda
uses: ./.github/actions/setup-conda
with:
Expand Down
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.6.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -175,7 +175,7 @@ I/O

Period
^^^^^^
-
- Bug in :meth:`Period.strftime` and :meth:`PeriodIndex.strftime`, raising ``UnicodeDecodeError`` when a locale-specific directive was passed (:issue:`46319`)
-

Plotting
Expand Down
6 changes: 4 additions & 2 deletions pandas/_libs/tslibs/period.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -1162,7 +1162,8 @@ cdef str period_format(int64_t value, int freq, object fmt=None):
return "NaT"

if isinstance(fmt, str):
fmt = fmt.encode("utf-8")
# Encode using current locale, in case fmt contains non-utf8 chars
fmt = <bytes>util.string_encode_locale(fmt)

if fmt is None:
freq_group = get_freq_group(freq)
Expand Down Expand Up @@ -1231,7 +1232,8 @@ cdef str _period_strftime(int64_t value, int freq, bytes fmt):
# Execute c_strftime to process the usual datetime directives
formatted = c_strftime(&dts, <char*>fmt)

result = util.char_to_string(formatted)
# Decode result according to current locale
result = util.char_to_string_locale(formatted)
free(formatted)

# Now we will fill the placeholders corresponding to our additional directives
Expand Down
14 changes: 14 additions & 0 deletions pandas/_libs/tslibs/util.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,10 @@ cdef extern from "Python.h":
const char* PyUnicode_AsUTF8AndSize(object obj,
Py_ssize_t* length) except NULL

object PyUnicode_EncodeLocale(object obj, const char *errors) nogil
object PyUnicode_DecodeLocale(const char *str, const char *errors) nogil


from numpy cimport (
float64_t,
int64_t,
Expand Down Expand Up @@ -219,3 +223,13 @@ cdef inline const char* get_c_string_buf_and_size(str py_string,

cdef inline const char* get_c_string(str py_string) except NULL:
return get_c_string_buf_and_size(py_string, NULL)


cdef inline bytes string_encode_locale(str py_string):
"""As opposed to PyUnicode_Encode, use current system locale to encode."""
return PyUnicode_EncodeLocale(py_string, NULL)


cdef inline object char_to_string_locale(const char* data):
"""As opposed to PyUnicode_FromString, use current system locale to decode."""
return PyUnicode_DecodeLocale(data, NULL)
75 changes: 74 additions & 1 deletion pandas/tests/io/formats/test_format.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,14 @@
"""
Test output formatting for Series/DataFrame, including to_string & reprs
"""
from datetime import datetime
from contextlib import nullcontext
from datetime import (
datetime,
time,
)
from io import StringIO
import itertools
import locale
from operator import methodcaller
import os
from pathlib import Path
Expand Down Expand Up @@ -46,6 +51,13 @@
use_32bit_repr = is_platform_windows() or not IS64


def get_local_am_pm():
"""Return the AM and PM strings returned by strftime in current locale."""
am_local = time(1).strftime("%p")
pm_local = time(13).strftime("%p")
return am_local, pm_local


@pytest.fixture(params=["string", "pathlike", "buffer"])
def filepath_or_buffer_id(request):
"""
Expand Down Expand Up @@ -3219,6 +3231,67 @@ def test_period_tz(self):
per = dt.to_period(freq="H")
assert per.format()[0] == "2013-01-01 00:00"

@pytest.mark.parametrize(
"locale_str",
[
pytest.param(None, id=str(locale.getlocale())),
"it_IT.utf8",
"it_IT", # Note: encoding will be 'ISO8859-1'
"zh_CN.utf8",
"zh_CN", # Note: encoding will be 'gb2312'
],
)
def test_period_non_ascii_fmt(self, locale_str):
# GH#46468 non-ascii char in input format string leads to wrong output

# Skip if locale cannot be set
if locale_str is not None and not tm.can_set_locale(locale_str, locale.LC_ALL):
pytest.skip(f"Skipping as locale '{locale_str}' cannot be set on host.")

# Change locale temporarily for this test.
with tm.set_locale(locale_str, locale.LC_ALL) if locale_str else nullcontext():
# Scalar
per = pd.Period("2018-03-11 13:00", freq="H")
assert per.strftime("%y é") == "18 é"

# Index
per = pd.period_range("2003-01-01 01:00:00", periods=2, freq="12h")
formatted = per.format(date_format="%y é")
assert formatted[0] == "03 é"
assert formatted[1] == "03 é"

@pytest.mark.parametrize(
"locale_str",
[
pytest.param(None, id=str(locale.getlocale())),
"it_IT.utf8",
"it_IT", # Note: encoding will be 'ISO8859-1'
"zh_CN.utf8",
"zh_CN", # Note: encoding will be 'gb2312'
],
)
def test_period_custom_locale_directive(self, locale_str):
# GH#46319 locale-specific directive leads to non-utf8 c strftime char* result

# Skip if locale cannot be set
if locale_str is not None and not tm.can_set_locale(locale_str, locale.LC_ALL):
pytest.skip(f"Skipping as locale '{locale_str}' cannot be set on host.")

# Change locale temporarily for this test.
with tm.set_locale(locale_str, locale.LC_ALL) if locale_str else nullcontext():
# Get locale-specific reference
am_local, pm_local = get_local_am_pm()

# Scalar
per = pd.Period("2018-03-11 13:00", freq="H")
assert per.strftime("%p") == pm_local

# Index
per = pd.period_range("2003-01-01 01:00:00", periods=2, freq="12h")
formatted = per.format(date_format="%y %I:%M:%S%p")
assert formatted[0] == f"03 01:00:00{am_local}"
assert formatted[1] == f"03 01:00:00{pm_local}"


class TestDatetimeIndexFormat:
def test_datetime(self):
Expand Down