python3: set text file streams encoding to utf-8

Totktonada · Totktonada · commit e7c61875a7bb · 2021-03-13T00:00:44.000+03:00
The problem: test files and result files contain UTF-8 symbols that are outside the ASCII range. A plain `open(file, 'r').read()` without the `encoding='utf-8'` argument fails to decode them when the default encoding for text file streams is not UTF-8. We hit this situation on Python 3.6.8 (provided by CentOS 7 and CentOS 8) when the POSIX locale is set (`LC_ALL=C`).

The solution is described in the code comment: replace the `open()` built-in function so that it sets `encoding='utf-8'` by default for text-mode streams. That's a hacky way, but it looks better than changing every `open()` call across the code and remembering to do the same in all future code (while keeping Python 2 compatibility in mind). Maybe we'll revisit the approach later.

There is another way to hack the `open()` behaviour that works for me on Python 3.6.8:

 | import _bootlocale
 | _bootlocale.getpreferredencoding = (lambda *args: 'utf8')

However, it leans on Python internals and looks less reliable than the implemented one.

Part of #20
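The failure mode described above can be reproduced without changing the locale. The following is a minimal sketch, not part of the commit: it assumes that the default text encoding under `LC_ALL=C` on Python 3.6 resolves to ASCII and simulates that by passing `encoding='ascii'` explicitly. The sample content and the helper names `read_with` and `demo` are hypothetical.

```python
import os
import tempfile

# Hypothetical sample content with symbols outside the ASCII range.
SAMPLE = 'box.space.test:insert{1, "\u043f\u0440\u0438\u0432\u0435\u0442"}\n'


def read_with(path, encoding):
    """Read a text file using an explicitly chosen encoding."""
    with open(path, 'r', encoding=encoding) as f:
        return f.read()


def demo():
    """Return (ascii_failed, utf8_ok) for a file holding UTF-8 bytes."""
    fd, path = tempfile.mkstemp(suffix='.result')
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(SAMPLE.encode('utf-8'))
        # 'ascii' stands in for the default assumed under LC_ALL=C
        # on Python 3.6; this is an assumption of the sketch.
        try:
            read_with(path, 'ascii')
            ascii_failed = False
        except UnicodeDecodeError:
            ascii_failed = True
        utf8_ok = read_with(path, 'utf-8') == SAMPLE
        return ascii_failed, utf8_ok
    finally:
        os.remove(path)
```

Calling `demo()` shows that the same bytes are rejected by an ASCII text stream and read back intact by a UTF-8 one, which is why forcing the encoding is sufficient here.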
diff --git a/test-run.py b/test-run.py
@@ -236,6 +236,36 @@ def main_consistent():
     if PY3:
         multiprocessing.set_start_method('fork')
 
+    # test-run assumes that text file streams are UTF-8 (as
+    # opposed to ASCII) on Python 3. It is necessary to process
+    # non-ASCII symbols in test files, result files and so on.
+    #
+    # Default text file stream encoding depends on a system
+    # locale with the exception of the POSIX locale (C locale):
+    # in this case UTF-8 is used (see PEP-0540). Sadly, this
+    # behaviour is in effect only since Python 3.7.
+    #
+    # We want to achieve the same behaviour on lower Python
+    # versions, at least on 3.6.8, which is provided by CentOS 7
+    # and CentOS 8.
+    #
+    # So we hack the open() builtin.
+    #
+    # https://stackoverflow.com/a/53347548/1598057
+    if PY3 and sys.version_info[0:2] < (3, 7):
+        std_open = __builtins__.open
+
+        def open_as_utf8(*args, **kwargs):
+            if len(args) >= 2:
+                mode = args[1]
+            else:
+                mode = kwargs.get('mode', '')
+            if 'b' not in mode:
+                kwargs.setdefault('encoding', 'utf-8')
+            return std_open(*args, **kwargs)
+
+        __builtins__.open = open_as_utf8
+
     # don't sure why, but it values 1 or 2 gives 1.5x speedup for parallel
     # test-run (and almost doesn't affect consistent test-run)
     os.environ['OMP_NUM_THREADS'] = '2'
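
The wrapping technique from the hunk above can be tried in isolation. The following is a sketch under stated assumptions rather than the commit's code: it patches through the `builtins` module instead of `__builtins__`, targets Python 3 only, and restores the original `open()` on exit so the change stays local. The names `make_utf8_open`, `utf8_open_patched` and `roundtrip` are hypothetical.

```python
import builtins
import contextlib
import os
import tempfile


def make_utf8_open(std_open):
    """Wrap an open()-like callable so text mode defaults to UTF-8."""
    def open_as_utf8(*args, **kwargs):
        # mode is the second positional argument of open().
        mode = args[1] if len(args) >= 2 else kwargs.get('mode', 'r')
        if 'b' not in mode:
            # An explicitly passed encoding keyword is left untouched.
            kwargs.setdefault('encoding', 'utf-8')
        return std_open(*args, **kwargs)
    return open_as_utf8


@contextlib.contextmanager
def utf8_open_patched():
    """Temporarily replace builtins.open and restore it on exit."""
    std_open = builtins.open
    builtins.open = make_utf8_open(std_open)
    try:
        yield
    finally:
        builtins.open = std_open


def roundtrip(text):
    """Write and read text via the patched open(); return what was seen."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with utf8_open_patched():
            with open(path, 'w') as f:
                f.write(text)
            with open(path) as f:
                encoding = f.encoding
                data = f.read()
            # Binary mode must not receive an encoding argument.
            with open(path, 'rb') as f:
                raw = f.read()
        return encoding, data, raw
    finally:
        os.remove(path)
```

Like the wrapper in the diff, this sketch only inspects `mode`; a call that passes `encoding` positionally (the fourth argument of `open()`) is not handled and would conflict with the keyword added by `setdefault`.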