Skip to content

BUG: Fixed Unicode decoding error in Period.strftime when a locale-specific directive is used #46405

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from 17 commits
Commits
Show all changes
60 commits
Select commit Hold shift + click to select a range
d177852
Added test representative of #46319. Should fail on CI
Mar 17, 2022
c09b76d
Added a gha worker with non utf 8 zh_CN encoding
Mar 17, 2022
9dff153
Attempt to fix the encoding so that locale works
Mar 17, 2022
2fa0736
Added the fix, but not using it for now, until CI is able to reproduc…
Mar 17, 2022
28a0921
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 18, 2022
6b85fb2
Crazy idea: maybe simply removing the .utf8 modifier will use the rig…
Mar 18, 2022
409d196
Hopefully fixing the locale not available error
Mar 18, 2022
4d09db9
Now simply generating the locale, not updating the ubuntu one
Mar 18, 2022
3c1b80c
Trying to install the locale without enabling it
Mar 18, 2022
4c687a8
Stupid mistake
Mar 18, 2022
e01e051
Testing the optional locale generator condition
Mar 19, 2022
a535249
Put back all runners
Mar 19, 2022
b4ad6dd
Added whatsnew
Mar 19, 2022
0eafd80
Now using the fix
Mar 19, 2022
b721238
As per code review: moved locale-switching fixture `overridden_locale…
Mar 19, 2022
cebed78
Flake8
Mar 19, 2022
a25f3a9
Added comments on the runner
Mar 19, 2022
d175c36
Added a non-utf8 locale in the `it_IT` runner. Added the zh_CN.utf8 l…
Mar 20, 2022
184e480
Improved readability of fixture `overridden_locale` as per code review
Mar 20, 2022
e5f3062
Added two comments on default encoding
Mar 21, 2022
22b0ae4
Fixed #46319 by adding a new `char_to_string_locale` function in the …
Mar 21, 2022
bbaa0c1
As per code review: modified the test to contain non-utf8 chars. Fixe…
Mar 21, 2022
f6857d6
Split the test in two for clarity
Mar 21, 2022
60c9e69
Fixed test and flake8 error.
Mar 21, 2022
db0e3cc
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 21, 2022
06a9567
Updated whatsnew to ref #46468 . Updated test name
Mar 22, 2022
411ccb7
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 22, 2022
e5ab44e
Removing wrong whatsnew bullet
smarie Mar 22, 2022
fe35aac
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 23, 2022
0790e72
Nitpick on whatsnew as per code review
Mar 23, 2022
d6914f4
Fixed build error rst directive
Mar 23, 2022
22c11e7
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 24, 2022
236a2a0
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Mar 25, 2022
befa0a0
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Apr 1, 2022
0105ac9
Names incorrectly reverted in last merge commit
Apr 1, 2022
a1f3773
Fixed test_localization so that #46595 can be demonstrated on windows…
Apr 1, 2022
0fcea6c
Fixed `tm.set_locale` context manager, it could error and leak when c…
Apr 1, 2022
7e18375
Removed the fixture as per code review, and added corresponding param…
Apr 1, 2022
a1c6d83
Dummy mod to trigger CI again
Apr 1, 2022
78ac13d
reverted dummy mod
Apr 1, 2022
3fc0e25
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Apr 7, 2022
9ad0ee1
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Apr 12, 2022
406d304
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Jul 1, 2022
bc1b222
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Jul 12, 2022
fcd2ce2
Attempt to fix the remaining error on the numpy worker
Jul 12, 2022
b69b074
Fixed issue in `_from_ordinal`
Jul 12, 2022
65c3a1d
Added asserts to try to understand
Jul 12, 2022
e5bca9f
Reverted debugging asserts and applied fix for numpy repeat from #47670.
Jul 13, 2022
fbfefec
Fixed the last issue on numpy dev: a TypeError message had changed
Jul 13, 2022
b64548b
Merge branch 'main' into feature/46319_locale_strftime_period
smarie Jul 16, 2022
4d026cf
Code review: Removed `EXTRA_LOC`
Sep 3, 2022
cf3be80
Code review: removed commented line
Sep 3, 2022
8f65bd9
Code review: reverted out of scope change
Sep 3, 2022
7b414a9
Code review: reverted out of scope change
Sep 3, 2022
cb43fd5
Merge branch 'main' of https://github.com/pandas-dev/pandas into feat…
Sep 3, 2022
59b4fe3
Fixed unused import
Sep 3, 2022
4950e06
Fixed revert mistake
Sep 6, 2022
14f854b
Moved whatsnew to 1.6.0
Sep 6, 2022
90c31cb
Update pandas/tests/io/parser/test_quoting.py
smarie Sep 6, 2022
5d467e2
Merge branch 'main' into feature/46319_locale_strftime_period
smarie Sep 6, 2022
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 12 additions & 1 deletion .github/workflows/posix.yml
Original file line number Diff line number Diff line change
Expand Up @@ -46,9 +46,13 @@ jobs:
- env_file: actions-38.yaml
pattern: "not slow and not network and not single_cpu"
extra_apt: "language-pack-zh-hans"
# Use the utf8 version as the default, it has no bad side-effect.
lang: "zh_CN.utf8"
lc_all: "zh_CN.utf8"
name: "Locale: zh_CN.utf8"
# Also install zh_CN (its encoding is gb2312) but do not activate it.
# It will be temporarily activated during tests with locale.setlocale
extra_loc: "zh_CN"
name: "Locale: zh_CN.utf8 + extra zh_CN (gb2312)"
- env_file: actions-38.yaml
pattern: "not slow and not network and not single_cpu"
pandas_data_manager: "array"
Expand All @@ -68,6 +72,7 @@ jobs:
ENV_FILE: ci/deps/${{ matrix.env_file }}
PATTERN: ${{ matrix.pattern }}
EXTRA_APT: ${{ matrix.extra_apt || '' }}
EXTRA_LOC: ${{ matrix.extra_loc || '' }}
LANG: ${{ matrix.lang || '' }}
LC_ALL: ${{ matrix.lc_all || '' }}
PANDAS_TESTING_MODE: ${{ matrix.pandas_testing_mode || '' }}
Expand Down Expand Up @@ -138,6 +143,12 @@ jobs:
# xsel for clipboard tests
run: sudo apt-get update && sudo apt-get install -y libc6-dev-i386 xsel ${{ env.EXTRA_APT }}

- name: Generate extra locales
# These extra locales will be available for locale.setlocale() calls in tests
run: |
sudo locale-gen ${{ env.EXTRA_LOC }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the Locale: it_IT.utf8 isnt working either?

Copy link
Contributor Author

@smarie smarie Mar 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends what you mean by "working". Do you refer to the bug #46319 or to something more general ?

  • The issue BUG: UnicodeError when using Period.strftime with non-utf8 locale #46319 happens only when the current locale uses a non-unicode encoding (no utf-8). So it does not happen in any of the workers before this PR
  • So in the test fixture parameters I added the zh_CN locale, that has a non-utf-8 encoding.
  • Unfortunately this particular parameter value was never used in any of the workers (the test was skipped), since locale.setlocale was raising an error (locale not available)
  • I finally found a way to make the locale available on linux workers: sudo locale-gen <locale>

Does that answer your question ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was assumed setting the environment variables

      LANG: ${{ matrix.lang || '' }}
      LC_ALL: ${{ matrix.lc_all || '' }}

would ensure the locale for the test run would not be the default (en_US.UTF-8) for the other locale builds.

So IIUC, this is necessary if setting a non-utf8 locale?

Copy link
Contributor Author

@smarie smarie Mar 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this is what I thought too.
Unfortunately when I tried setting LANG and LC_ALL to zh_CN (non-utf8), then there was a strange segmentation fault appearing (see #46405 (comment))

So this is why I now do not set it through these env variables but I only install it.

Maybe the problem is that when the env variables are set, many side effects happen in the build chain, that I do not quite understand.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're setting the locale this way, can we remove setting the LANG and LC_ALL environment variables here?

Copy link
Contributor Author

@smarie smarie Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I have no idea :) At least for the tests in this PR, yes these env variables are not used.

However these env vars are probably related to other locale-related tests, including being able to compile pandas on alternate locale, etc. It seems that in the CI those env variables are used to even add a specific line at the beginning of the init file of pandas ! (see ci/setup_env.sh)

This seems a bit beyond the scope of this PR. However it is maybe worth keeping track of this, in a dedicated ticket.

if: ${{ env.EXTRA_LOC != '' }}

- uses: conda-incubator/setup-miniconda@v2
with:
mamba-version: "*"
Expand Down
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.5.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -443,6 +443,7 @@ Period
^^^^^^
- Bug in subtraction of :class:`Period` from :class:`PeriodArray` returning wrong results (:issue:`45999`)
- Bug in :meth:`Period.strftime` and :meth:`PeriodIndex.strftime`, directives ``%l`` and ``%u`` were giving wrong results (:issue:`46252`)
- Bug in :meth:`Period.strftime` and :meth:`PeriodIndex.strftime`, Unicode decoding error when a locale-specific directive is used (:issue:`46319`)
-

Plotting
Expand Down
3 changes: 2 additions & 1 deletion pandas/_libs/tslibs/period.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -1236,7 +1236,8 @@ cdef str _period_strftime(int64_t value, int freq, bytes fmt):
# Execute c_strftime to process the usual datetime directives
formatted = c_strftime(&dts, <char*>fmt)

result = util.char_to_string(formatted)
# Decode result according to current locale
result = util.char_to_string_locale(formatted)
free(formatted)

# Now we will fill the placeholders corresponding to our additional directives
Expand Down
8 changes: 8 additions & 0 deletions pandas/_libs/tslibs/util.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,9 @@ cdef extern from "Python.h":
const char* PyUnicode_AsUTF8AndSize(object obj,
Py_ssize_t* length) except NULL

object PyUnicode_DecodeLocale(const char *str, const char *errors) nogil


from numpy cimport (
float64_t,
int64_t,
Expand Down Expand Up @@ -220,3 +223,8 @@ cdef inline const char* get_c_string_buf_and_size(str py_string,

cdef inline const char* get_c_string(str py_string) except NULL:
return get_c_string_buf_and_size(py_string, NULL)


cdef inline object char_to_string_locale(const char* data):
"""As opposed to PyUnicode_FromString, use current system locale to decode."""
return PyUnicode_DecodeLocale(data, NULL)
34 changes: 34 additions & 0 deletions pandas/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@
timezone,
)
from decimal import Decimal
import locale
import operator
import os

Expand Down Expand Up @@ -1202,6 +1203,39 @@ def utc_fixture(request):
utc_fixture2 = utc_fixture


@pytest.fixture(
params=[
pytest.param(None, id=str(locale.getlocale())),
"it_IT.utf8",
"zh_CN",
Copy link
Contributor Author

@smarie smarie Mar 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

Suggested change
"zh_CN",
"zh_CN", # Note: encoding is automatically 'gb2312'

What do you think ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to add zh_CH.utf8 and zh_CH.gb2312 explicitly

Copy link
Contributor Author

@smarie smarie Mar 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll add zh_CH.utf8.
Strangely enough, zh_CN.gb2312 did not seem to be working, but zh_CN does involve gb2312 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will also add it_IT by the way, so that both locales have their utf8 AND non-utf8 encoding activated in the test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

]
)
def overridden_locale(request):
"""
Fixture to temporarily change the locale.

If a locale cannot be set (because it is not available on the host)
the test is skipped.
"""
old = locale.setlocale(locale.LC_ALL)
target = request.param
if target is None:
# Current locale - don't change
yield old
else:
try:
# Try changing the locale.
locale.setlocale(locale.LC_ALL, target)
except locale.Error as e:
# Not available on this host. Skip test.
pytest.skip(f"Skipping as locale cannot be set. {type(e).__name__}: {e}")
else:
# Run test with the temporary local
yield target
# Set back to normal
locale.setlocale(locale.LC_ALL, old)


# ----------------------------------------------------------------
# Dtypes
# ----------------------------------------------------------------
Expand Down
28 changes: 27 additions & 1 deletion pandas/tests/io/formats/test_format.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,10 @@
"""
Test output formatting for Series/DataFrame, including to_string & reprs
"""
from datetime import datetime
from datetime import (
datetime,
time,
)
from io import StringIO
import itertools
from operator import methodcaller
Expand Down Expand Up @@ -46,6 +49,13 @@
use_32bit_repr = is_platform_windows() or not IS64


def get_local_am_pm():
"""Return the AM and PM strings returned by strftime in current locale."""
am_local = time(1).strftime("%p")
pm_local = time(13).strftime("%p")
return am_local, pm_local


@pytest.fixture(params=["string", "pathlike", "buffer"])
def filepath_or_buffer_id(request):
"""
Expand Down Expand Up @@ -3224,6 +3234,22 @@ def test_period_tz(self):
per = dt.to_period(freq="H")
assert per.format()[0] == "2013-01-01 00:00"

def test_period_custom_locale(self, overridden_locale):
# GH#46319 locale-specific formatting directive leads to non-utf8 str result

# Get locale-specific reference
am_local, pm_local = get_local_am_pm()

# Scalar
per = pd.Period("2018-03-11 13:00", freq="H")
assert per.strftime("%p") == pm_local

# Index
per = pd.period_range("2003-01-01 01:00:00", periods=2, freq="12h")
formatted = per.format(date_format="%y %I:%M:%S%p")
assert formatted[0] == f"03 01:00:00{am_local}"
assert formatted[1] == f"03 01:00:00{pm_local}"


class TestDatetimeIndexFormat:
def test_datetime(self):
Expand Down