Commit e7c6187

python3: set text file streams encoding to utf-8
The problem: test files and result files contain UTF-8 symbols that are outside the ASCII range. A plain `open(file, 'r').read()` without the `encoding='utf-8'` argument fails to decode them when the default encoding for text file streams is not 'utf-8'.

We hit this situation on Python 3.6.8 (provided by CentOS 7 and CentOS 8) when the POSIX locale is set (`LC_ALL=C`).

The solution is described in the code comment: replace the `open()` built-in function so that text streams default to `encoding='utf-8'`. That's a hacky way, but it looks better than changing every `open()` call across the code and remembering to do the same in all future code (while keeping Python 2 compatibility in mind). But maybe we'll revisit the approach later.

There is another way to hack the `open()` behaviour that works for me on Python 3.6.8:

 | import _bootlocale
 | _bootlocale.getpreferredencoding = (lambda *args: 'utf8')

However, it leans on Python internals and looks less reliable than the implemented one.

Part of #20
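The failure described in the commit message can be reproduced on any Python 3 version with a minimal sketch like the one below. It is an illustration, not part of the commit: the sample text is hypothetical, and `encoding='ascii'` is passed explicitly to stand in for the default that Python 3.6 picks under `LC_ALL=C`.

```python
import os
import tempfile

# Hypothetical sample content with symbols outside the ASCII range.
text = 'result: \u043f\u0440\u0438\u0432\u0435\u0442'

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Write the sample as raw UTF-8 bytes, like a test or result file.
    with open(path, 'wb') as f:
        f.write(text.encode('utf-8'))

    # Assumption: 'ascii' emulates the default text encoding on
    # Python 3.6 when the POSIX locale is set.
    try:
        with open(path, 'r', encoding='ascii') as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # With an explicit UTF-8 encoding the same file decodes fine.
    with open(path, 'r', encoding='utf-8') as f:
        decoded = f.read()
finally:
    os.remove(path)
```

Here `ascii_failed` ends up true, while the UTF-8 read returns the original text, which is why the commit forces UTF-8 as the default.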
1 parent 2770763 commit e7c6187

1 file changed: +30 -0 lines changed

test-run.py

@@ -236,6 +236,36 @@ def main_consistent():
     if PY3:
         multiprocessing.set_start_method('fork')
 
+    # test-run assumes that text file streams are UTF-8 (as
+    # contrary to ASCII) on Python 3. It is necessary to process
+    # non ASCII symbols in test files, result files and so on.
+    #
+    # Default text file stream encoding depends on a system
+    # locale with exception for the POSIX locale (C locale): in
+    # this case UTF-8 is used (see PEP-0540). Sadly, this
+    # behaviour is in effect since Python 3.7.
+    #
+    # We want to achieve the same behaviour on lower Python
+    # versions, at least on 3.6.8, which is provided by CentOS 7
+    # and CentOS 8.
+    #
+    # So we hack the open() builtin.
+    #
+    # https://stackoverflow.com/a/53347548/1598057
+    if PY3 and sys.version_info[0:2] < (3, 7):
+        std_open = __builtins__.open
+
+        def open_as_utf8(*args, **kwargs):
+            if len(args) >= 2:
+                mode = args[1]
+            else:
+                mode = kwargs.get('mode', '')
+            if 'b' not in mode:
+                kwargs.setdefault('encoding', 'utf-8')
+            return std_open(*args, **kwargs)
+
+        __builtins__.open = open_as_utf8
+
     # don't sure why, but it values 1 or 2 gives 1.5x speedup for parallel
     # test-run (and almost doesn't affect consistent test-run)
     os.environ['OMP_NUM_THREADS'] = '2'
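
The wrapper from this diff can be exercised in isolation with a sketch along the following lines. It deviates from the commit in two ways, both assumptions made for illustration: it patches through the `builtins` module instead of `__builtins__` (which is a module only in `__main__`), and it skips the default when the encoding is passed as the fourth positional argument of `open()`. The file content and variable names are hypothetical.

```python
import builtins
import os
import tempfile

std_open = builtins.open


def open_as_utf8(*args, **kwargs):
    # The mode is the second positional argument or the 'mode' keyword.
    mode = args[1] if len(args) >= 2 else kwargs.get('mode', 'r')
    # Default to UTF-8 only for text streams, and only when the
    # encoding was not already given positionally (4th argument).
    if 'b' not in mode and len(args) < 4:
        kwargs.setdefault('encoding', 'utf-8')
    return std_open(*args, **kwargs)


builtins.open = open_as_utf8
try:
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Text mode without an explicit encoding now uses UTF-8.
        with open(path, 'w') as f:
            text_encoding = f.encoding
            f.write('caf\u00e9')
        # Binary mode is passed through untouched.
        with open(path, 'rb') as f:
            raw = f.read()
        with open(path) as f:
            text = f.read()
        # An explicit positional encoding is respected.
        with open(path, 'r', -1, 'latin-1') as f:
            positional_encoding = f.encoding
    finally:
        os.remove(path)
finally:
    # Restore the original builtin so the patch does not leak.
    builtins.open = std_open
```

Restoring `builtins.open` in the `finally` block keeps the sketch side-effect free. The commit itself leaves the patch installed for the whole test-run process, which is the intended behaviour there.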
