Commit a53fa80

Jayman2000 authored and adrienverge committed
decoder: Autodetect encoding of most YAML files
Before this change, yamllint would open YAML files using open()’s default encoding. As long as UTF-8 mode isn’t enabled, open() defaults to using the system’s locale encoding [1][2]. This can cause problems in multiple different scenarios.

The first scenario involves linting UTF-8 YAML files on Linux systems. Most of the time, the locale encoding on Linux systems is set to UTF-8 [3][4], but it can be set to something else [5]. In the unlikely event that someone was using Linux with a locale encoding other than UTF-8, there was a chance that yamllint would crash with a UnicodeDecodeError.

The second scenario involves linting UTF-8 YAML files on Windows systems. The locale encoding on Windows systems is the system’s ANSI code page [6]. The ANSI code page on Windows systems is NOT set to UTF-8 by default [7]. In the very likely event that someone was using Windows with a locale encoding other than UTF-8, there was a chance that yamllint would crash with a UnicodeDecodeError.

Additionally, using open()’s default encoding is a violation of the YAML spec. Chapter 5.2 says:

    “On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported. If a character stream begins with a byte order mark, the character encoding will be taken to be as indicated by the byte order mark. Otherwise, the stream must begin with an ASCII character. This allows the encoding to be deduced by the pattern of null (x00) characters.” [8]

In most cases, this change fixes all of those problems by implementing the YAML spec’s character encoding detection algorithm. Now, as long as YAML files begin with either a byte order mark or an ASCII character, yamllint will (in most cases) automatically detect them as being UTF-8, UTF-16 or UTF-32. Other character encodings are not supported at the moment.

Even with this change, there is one specific situation where yamllint still uses the wrong character encoding. Specifically, this change does not affect the character encoding used for stdin. This means that at the moment, these two commands may use different character encodings when decoding file.yaml:

    $ yamllint file.yaml
    $ cat file.yaml | yamllint -

A future commit will update yamllint so that it uses the same character encoding detection algorithm for stdin.

It’s possible that this change will break things for existing yamllint users. This change allows users to set the YAMLLINT_FILE_ENCODING environment variable to override the autodetection algorithm, just in case they’ve been using yamllint on weird nonstandard YAML files.

Credit for the idea of having tests with pre-encoded strings and having an environment variable for overriding the character encoding autodetection algorithm goes to @adrienverge [9].

Fixes #218. Fixes #238. Fixes #347.

[1]: <https://docs.python.org/3.12/library/functions.html#open>
[2]: <https://docs.python.org/3.12/library/os.html#utf8-mode>
[3]: <https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html>
[4]: <https://wiki.musl-libc.org/functional-differences-from-glibc.html#Character-sets-and-locale>
[5]: <https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED;h=c8b63cc2fe2b4547f2fb1bff6193da68d70bd563;hb=36f2487f13e3540be9ee0fb51876b1da72176d3f>
[6]: <https://docs.python.org/3.12/glossary.html#term-locale-encoding>
[7]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page>
[8]: <https://yaml.org/spec/1.2.2/#52-character-encodings>
[9]: <#630 (comment)>
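
The detection rule quoted from chapter 5.2 of the YAML spec can be sketched in a few lines of Python. This is only an illustration of the byte order mark and null-byte patterns described there, not the code this commit adds to yamllint; the function name `detect_yaml_encoding` is made up for the example.

```python
import codecs


def detect_yaml_encoding(data: bytes) -> str:
    """Guess a Python codec name from the first bytes of a YAML stream.

    Hypothetical helper that follows the BOM / null-byte table in
    chapter 5.2 of the YAML 1.2.2 spec. UTF-32 is checked before UTF-16
    because the UTF-32LE BOM begins with the UTF-16LE BOM.
    """
    # Explicit UTF-32 byte order marks. The generic 'utf_32' decoder
    # reads the BOM, picks the byte order and drops the BOM.
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return 'utf_32'
    # No BOM: an ASCII first character leaves three null bytes.
    if len(data) >= 4 and data[:3] == b'\x00\x00\x00':
        return 'utf_32_be'
    if len(data) >= 4 and data[1:4] == b'\x00\x00\x00':
        return 'utf_32_le'
    # Explicit UTF-16 byte order marks.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return 'utf_16'
    # No BOM: an ASCII first character leaves one null byte.
    if len(data) >= 2 and data[0] == 0:
        return 'utf_16_be'
    if len(data) >= 2 and data[1] == 0:
        return 'utf_16_le'
    # 'utf_8_sig' strips the UTF-8 BOM while decoding.
    if data.startswith(codecs.BOM_UTF8):
        return 'utf_8_sig'
    # Everything else is treated as UTF-8.
    return 'utf_8'
```

With a helper like this, `data.decode(detect_yaml_encoding(data))` gives the same text for a file saved in any of the three encoding families, provided the file follows the spec's rule about starting with a byte order mark or an ASCII character.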
1 parent 0b3abe5 commit a53fa80

10 files changed: +790 -40 lines changed

docs/character_encoding.rst

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+Character Encoding
+==================
+
+When yamllint reads a file (whether it’s a configuration file or a file to
+lint), yamllint will try to automatically detect that file’s character
+encoding. In order for the automatic detection to work properly, files must
+follow these two rules (see `this section of the YAML specification for details
+<https://yaml.org/spec/1.2.2/#52-character-encodings>`_):
+
+* The file must be encoded in UTF-8, UTF-16 or UTF-32.
+
+* The file must begin with either a byte order mark or an ASCII character.
+
+Override character encoding
+---------------------------
+
+Previous versions of yamllint did not try to autodetect the character encoding
+of files. Previous versions of yamllint assumed that files used the current
+locale’s character encoding. This meant that older versions of yamllint would
+sometimes correctly decode files that didn’t follow those two rules. For the
+sake of backwards compatibility, the current version of yamllint allows you to
+disable automatic character encoding detection by setting the
+``YAMLLINT_FILE_ENCODING`` environment variable. If you set the
+``YAMLLINT_FILE_ENCODING`` environment variable to `the name of one of
+Python’s standard character encodings
+<https://docs.python.org/3/library/codecs.html#standard-encodings>`_, then
+yamllint will use that character encoding instead of trying to autodetect the
+character encoding.
+
+The ``YAMLLINT_FILE_ENCODING`` environment variable should only be used as a
+stopgap solution. If you need to use ``YAMLLINT_FILE_ENCODING``, then you
+should really update your YAML files so that their character encoding can
+automatically be detected, or else you may run into compatibility problems.
+Future versions of yamllint may remove support for the
+``YAMLLINT_FILE_ENCODING`` environment variable, and other YAML processors may
+misinterpret your YAML files.
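
The override described in this documentation amounts to "use the environment variable if it is set, otherwise autodetect". A minimal sketch of that precedence follows. It is not yamllint's implementation: the helper name `choose_encoding`, its parameters, and the decision to treat an empty value as unset are assumptions made for illustration.

```python
import codecs
import os


def choose_encoding(detected, environ=os.environ):
    """Return the codec name to use when decoding a file.

    Hypothetical helper. `detected` is whatever an autodetection step
    produced. A non-empty YAMLLINT_FILE_ENCODING value takes priority
    (treating an empty value as unset is an assumption of this sketch).
    """
    override = environ.get('YAMLLINT_FILE_ENCODING')
    if override:
        # codecs.lookup() raises LookupError for an unknown codec name,
        # so a typo fails early instead of being silently ignored.
        return codecs.lookup(override).name
    return detected
```

For example, with `YAMLLINT_FILE_ENCODING=cp1252` in the environment this helper returns `cp1252` regardless of what was detected, which is the behaviour a Windows user with legacy ANSI-encoded files would rely on.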

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -27,3 +27,4 @@ Table of contents
    development
    text_editors
    integration
+   character_encoding

tests/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -18,8 +18,9 @@
 import os
 
 locale.setlocale(locale.LC_ALL, 'C')
-# yamllint uses these environment variables to find a config file.
 env_vars_that_could_interfere_with_tests = (
+    'YAMLLINT_FILE_ENCODING',
+    # yamllint uses these environment variables to find a config file.
     'YAMLLINT_CONFIG_FILE',
     'XDG_CONFIG_HOME',
     # These variables are used to determine where the user’s home

tests/common.py

Lines changed: 153 additions & 34 deletions
@@ -1,4 +1,5 @@
 # Copyright (C) 2016 Adrien Vergé
+# Copyright (C) 2023–2025 Jason Yundt
 #
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -13,20 +14,172 @@
 # You should have received a copy of the GNU General Public License
 # along with this program. If not, see <http://www.gnu.org/licenses/>.
 
+import codecs
 import contextlib
 from io import StringIO
 import os
 import shutil
 import sys
 import tempfile
 import unittest
+import warnings
+from codecs import CodecInfo
 
 import yaml
 
 from yamllint import linter
 from yamllint.config import YamlLintConfig
 
 
+# Encoding related stuff:
+UTF_CODECS = (
+    'utf_32_be',
+    'utf_32_be_sig',
+    'utf_32_le',
+    'utf_32_le_sig',
+    'utf_16_be',
+    'utf_16_be_sig',
+    'utf_16_le',
+    'utf_16_le_sig',
+    'utf_8',
+    'utf_8_sig'
+)
+
+
+def encode_utf_32_be_sig(obj):
+    return (
+        codecs.BOM_UTF32_BE + codecs.encode(obj, 'utf_32_be', 'strict'),
+        len(obj)
+    )
+
+
+def encode_utf_32_le_sig(obj):
+    return (
+        codecs.BOM_UTF32_LE + codecs.encode(obj, 'utf_32_le', 'strict'),
+        len(obj)
+    )
+
+
+def encode_utf_16_be_sig(obj):
+    return (
+        codecs.BOM_UTF16_BE + codecs.encode(obj, 'utf_16_be', 'strict'),
+        len(obj)
+    )
+
+
+def encode_utf_16_le_sig(obj):
+    return (
+        codecs.BOM_UTF16_LE + codecs.encode(obj, 'utf_16_le', 'strict'),
+        len(obj)
+    )
+
+
+test_codec_infos = {
+    'utf_32_be_sig':
+    CodecInfo(encode_utf_32_be_sig, codecs.getdecoder('utf_32')),
+    'utf_32_le_sig':
+    CodecInfo(encode_utf_32_le_sig, codecs.getdecoder('utf_32')),
+    'utf_16_be_sig':
+    CodecInfo(encode_utf_16_be_sig, codecs.getdecoder('utf_16')),
+    'utf_16_le_sig':
+    CodecInfo(encode_utf_16_le_sig, codecs.getdecoder('utf_16')),
+}
+
+
+def register_test_codecs():
+    codecs.register(test_codec_infos.get)
+
+
+def unregister_test_codecs():
+    if sys.version_info >= (3, 10, 0):
+        codecs.unregister(test_codec_infos.get)
+    else:
+        warnings.warn(
+            "This version of Python doesn’t allow us to unregister codecs.",
+            stacklevel=1
+        )
+
+
+def is_test_codec(codec):
+    return codec in test_codec_infos.keys()
+
+
+def test_codec_built_in_equivalent(test_codec):
+    return_value = test_codec
+    for suffix in ('_sig', '_be', '_le'):
+        return_value = return_value.replace(suffix, '')
+    return return_value
+
+
+def uses_bom(codec):
+    for suffix in ('_32', '_16', '_sig'):
+        if codec.endswith(suffix):
+            return True
+    return False
+
+
+def encoding_detectable(string, codec):
+    """
+    Returns True if encoding can be detected after string is encoded
+
+    Encoding detection only works if you’re using a BOM or the first character
+    is ASCII. See yamllint.decoder.auto_decode()’s docstring.
+    """
+    return uses_bom(codec) or (len(string) > 0 and string[0].isascii())
+
+
+# Workspace related stuff:
+class Blob:
+    def __init__(self, text, encoding):
+        self.text = text
+        self.encoding = encoding
+
+
+def build_temp_workspace(files):
+    tempdir = tempfile.mkdtemp(prefix='yamllint-tests-')
+
+    for path, content in files.items():
+        path = os.fsencode(os.path.join(tempdir, path))
+        if not os.path.exists(os.path.dirname(path)):
+            os.makedirs(os.path.dirname(path))
+
+        if isinstance(content, list):
+            os.mkdir(path)
+        elif isinstance(content, str) and content.startswith('symlink://'):
+            os.symlink(content[10:], path)
+        else:
+            if isinstance(content, Blob):
+                content = content.text.encode(content.encoding)
+            mode = 'wb' if isinstance(content, bytes) else 'w'
+            with open(path, mode) as f:
+                f.write(content)
+
+    return tempdir
+
+
+@contextlib.contextmanager
+def temp_workspace(files):
+    """Provide a temporary workspace that is automatically cleaned up."""
+    backup_wd = os.getcwd()
+    wd = build_temp_workspace(files)
+
+    try:
+        os.chdir(wd)
+        yield
+    finally:
+        os.chdir(backup_wd)
+        shutil.rmtree(wd)
+
+
+def temp_workspace_with_files_in_many_codecs(path_template, text):
+    workspace = {}
+    for codec in UTF_CODECS:
+        if encoding_detectable(text, codec):
+            workspace[path_template.format(codec)] = Blob(text, codec)
+    return workspace
+
+
+# Miscellaneous stuff:
 class RuleTestCase(unittest.TestCase):
     def build_fake_config(self, conf):
         if conf is None:
@@ -81,37 +234,3 @@ def __exit__(self, *exc_info):
     @property
     def returncode(self):
         return self._raises_ctx.exception.code
-
-
-def build_temp_workspace(files):
-    tempdir = tempfile.mkdtemp(prefix='yamllint-tests-')
-
-    for path, content in files.items():
-        path = os.fsencode(os.path.join(tempdir, path))
-        if not os.path.exists(os.path.dirname(path)):
-            os.makedirs(os.path.dirname(path))
-
-        if isinstance(content, list):
-            os.mkdir(path)
-        elif isinstance(content, str) and content.startswith('symlink://'):
-            os.symlink(content[10:], path)
-        else:
-            mode = 'wb' if isinstance(content, bytes) else 'w'
-            with open(path, mode) as f:
-                f.write(content)
-
-    return tempdir
-
-
-@contextlib.contextmanager
-def temp_workspace(files):
-    """Provide a temporary workspace that is automatically cleaned up."""
-    backup_wd = os.getcwd()
-    wd = build_temp_workspace(files)
-
-    try:
-        os.chdir(wd)
-        yield
-    finally:
-        os.chdir(backup_wd)
-        shutil.rmtree(wd)
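
The `encode_*_sig` helpers in tests/common.py exist because Python's fixed-endianness codecs (such as `utf_16_be`) never write a byte order mark. The short, standalone sketch below shows that behaviour and why the generic `utf_16` decoder is a suitable counterpart. It uses only the standard `codecs` module; the variable names are illustrative.

```python
import codecs

text = 'key: value\n'

# The fixed-endianness codec encodes the characters only, with no BOM.
plain = text.encode('utf_16_be')
has_bom = plain.startswith(codecs.BOM_UTF16_BE)

# A "_sig"-style variant therefore has to prepend the BOM by hand.
with_bom = codecs.BOM_UTF16_BE + plain

# The generic 'utf_16' decoder reads the BOM to pick the byte order and
# leaves it out of the result. It returns (text, bytes consumed).
decoded, consumed = codecs.getdecoder('utf_16')(with_bom)

# Decoding the same bytes with the fixed-endianness codec keeps the BOM
# as a U+FEFF character at the start of the text.
kept = with_bom.decode('utf_16_be')
```

This is why the test codecs pair a hand-written encoder with `codecs.getdecoder('utf_16')` or `codecs.getdecoder('utf_32')` rather than with the `_be`/`_le` decoders.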

tests/test_cli.py

Lines changed: 58 additions & 1 deletion
@@ -1,4 +1,5 @@
 # Copyright (C) 2016 Adrien Vergé
+# Copyright (C) 2023–2025 Jason Yundt
 #
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -22,7 +23,14 @@
 import unittest
 from io import StringIO
 
-from tests.common import build_temp_workspace, RunContext, temp_workspace
+from tests.common import (
+    RunContext,
+    build_temp_workspace,
+    register_test_codecs,
+    temp_workspace,
+    temp_workspace_with_files_in_many_codecs,
+    unregister_test_codecs
+)
 
 from yamllint import cli, config
 
@@ -796,3 +804,52 @@ def test_multiple_parent_config_file(self):
         self.assertEqual((ctx.returncode, ctx.stdout, ctx.stderr),
                          (0, './4spaces.yml:2:5: [warning] wrong indentation: '
                           'expected 3 but found 4 (indentation)\n', ''))
+
+
+class CommandLineEncodingTestCase(unittest.TestCase):
+    @classmethod
+    def setUpClass(cls):
+        super().setUpClass()
+        register_test_codecs()
+
+    @classmethod
+    def tearDownClass(cls):
+        super().tearDownClass()
+        unregister_test_codecs()
+
+    def test_valid_encodings(self):
+        conf = ('---\n'
+                'rules:\n'
+                ' key-ordering: enable\n')
+        config_files = temp_workspace_with_files_in_many_codecs(
+            'config_{}.yaml',
+            conf
+        )
+        sorted_correctly = ('---\n'
+                            'A: YAML\n'
+                            'Z: YAML\n')
+        sorted_correctly_files = temp_workspace_with_files_in_many_codecs(
+            'sorted_correctly/{}.yaml',
+            sorted_correctly
+        )
+        sorted_incorrectly = ('---\n'
+                              'Z: YAML\n'
+                              'A: YAML\n')
+        sorted_incorrectly_files = temp_workspace_with_files_in_many_codecs(
+            'sorted_incorrectly/{}.yaml',
+            sorted_incorrectly
+        )
+        workspace = {
+            **config_files,
+            **sorted_correctly_files,
+            **sorted_incorrectly_files
+        }
+
+        with temp_workspace(workspace):
+            for config_path in config_files.keys():
+                with RunContext(self) as ctx:
+                    cli.run(('-c', config_path, 'sorted_correctly'))
+                self.assertEqual(ctx.returncode, 0)
+                with RunContext(self) as ctx:
+                    cli.run(('-c', config_path, 'sorted_incorrectly'))
+                self.assertNotEqual(ctx.returncode, 0)
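
The `setUpClass`/`tearDownClass` pair above relies on Python's codec search-function mechanism. The standalone sketch below shows the register, use, unregister cycle with a made-up codec name (`demo_utf_16_be_sig`) and a made-up search function (`demo_search`); neither is part of yamllint. It keeps one reference to the search function and passes that same object to both calls, which is a cautious choice for this example rather than a statement about how the commit's helpers behave.

```python
import codecs


def encode_with_bom(text, errors='strict'):
    # Same return shape as a codec encoder: (bytes, characters consumed).
    return codecs.BOM_UTF16_BE + text.encode('utf_16_be', errors), len(text)


# 'demo_utf_16_be_sig' is a hypothetical codec name used only here.
DEMO_CODECS = {
    'demo_utf_16_be_sig': codecs.CodecInfo(
        encode_with_bom, codecs.getdecoder('utf_16')
    ),
}


def demo_search(name):
    # Search functions receive the normalized codec name and return a
    # CodecInfo object, or None when they do not know the codec.
    return DEMO_CODECS.get(name)


codecs.register(demo_search)
try:
    data = 'A: YAML\n'.encode('demo_utf_16_be_sig')
    roundtrip = data.decode('demo_utf_16_be_sig')
finally:
    # codecs.unregister() only exists on Python 3.10 and later.
    if hasattr(codecs, 'unregister'):
        codecs.unregister(demo_search)
```

Registering in class-level setup and unregistering in teardown, as the test case does, keeps the extra codec names from leaking into unrelated tests on Python versions that support unregistering.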
