Skip to content

UnicodeDecodeError when using mysql client 8.0 to connect to mysql server 5.7 #504

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
yonran opened this issue Oct 7, 2021 · 3 comments · Fixed by #509
Closed

UnicodeDecodeError when using mysql client 8.0 to connect to mysql server 5.7 #504

yonran opened this issue Oct 7, 2021 · 3 comments · Fixed by #509

Comments

@yonran
Copy link

yonran commented Oct 7, 2021

Describe the bug

The underlying issue is described in MySQL 8.0 Reference Manual / Character Sets, Collations, Unicode / Connection Character Sets and Collations: Connect-Time Error Handling. When you use the MySQL 8.0 client library to connect to a MySQL 5.7 server with the utf8mb4 encoding, then the server falls back to latin1 instead, and the server will send result sets that are invalid UTF-8, resulting in UnicodeDecodeError when the client tries to decode them. The symptoms are similar to a server with --skip-character-set-client-handshake.

As the Oracle document describes, when MySQLdb is linked with the mysql-client 8.0 library and attempts to connect to a mysql 5.7 server with "charset": "utf8mb4", then the mysql client library sends a HandshakeResponse41 packet requesting charset+collation number 255: utf8mb4_0900_ai_ci, which is the new default collation for the utf8mb4 charset in MySQL 8.0. However, this new collation does not exist on a MySQL 5.7 server, so the server silently falls back to the character_set_server (latin1) and collation_server (latin1_swedish_ci). In MySQL 5.7, the default collation for utf8mb4 was charset+collation number 45: utf8mb4_general_ci.

We currently call the C function mysql_character_set_name, but that function is a client-side lookup and does not verify the character set in case the server silently ignored the charset+collation from the handshake.

The connection’s collation is apparently only used for comparing literal strings, not for comparing columns (which have their own collation), so all this trouble is for a pretty uncommon use case (character_set_collation).

This bug occurs if you use the OSX Homebrew [email protected] package, but surprisingly it does not occur when you use [email protected] (brew install [email protected]; PATH=/usr/local/opt/[email protected]/bin:$PATH pip install -e /path/to/mysqlclient). This is because mysql-client was compiled with -DDEFAULT_COLLATION=utf8mb4_general_ci, whereas the mysql formula does not change the DEFAULT_COLLATION. With the default collation altered, mysql-client sends 45: utf8mb4_general_ci to the server in the HandshakeResponse41 packet, which mysql 5.7 recognizes.

Googling, I saw that this seems to have occurred to other people too:

To Reproduce

Server

docker run --name=mysql \
-e MYSQL_ALLOW_EMPTY_PASSWORD=y \
-e MYSQL_ROOT_PASSWORD= \
-e MYSQL_INITDB_SKIP_TZINFO=y \
-e MYSQL_DATABASE=db \
-p 127.0.0.1:3306:3306 mysql:5.7.33

Code

On Mac OSX, you can use the mysql package:

brew install [email protected]
pip install mysqlclient
con = MySQLdb.connect(host='127.0.0.1', user='root', passwd='', port=3306, client_flag=2, charset='utf8mb4', use_unicode=True)
with con.cursor() as c:
  # use charset literals https://dev.mysql.com/doc/refman/8.0/en/charset-literal.html
  # to explicitly create a latin1 string
  c._query(b"select _latin1'\x92'")
  c._executed = b"select _latin1'\x92'" # needed 1.3.7, not 1.4.6
  c.fetchall()

Error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/yonran/third-party/mysqlclient/MySQLdb/cursors.py", line 321, in _query
    self._post_get_result()
  File "/Users/yonran/third-party/mysqlclient/MySQLdb/cursors.py", line 355, in _post_get_result
    self._rows = self._fetch_row(0)
  File "/Users/yonran/third-party/mysqlclient/MySQLdb/cursors.py", line 328, in _fetch_row
    return self._result.fetch_row(size, self._fetch_type)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 0: invalid start byte

You also query the server to see that character_set_results is set to latin1 rather than utf8mb4:

with con.cursor() as c:
  c.execute("show variables like 'character_set_%'")
  for row in c:
    print(row)
(b'character_set_client', b'latin1')
(b'character_set_connection', b'latin1')
(b'character_set_database', b'latin1')
(b'character_set_filesystem', b'binary')
(b'character_set_results', b'latin1')
(b'character_set_server', b'latin1')
(b'character_set_system', b'utf8')
(b'character_sets_dir', b'/usr/share/mysql/charsets/')

Other implementations

How do other implementations handle the handshake?

  • node mysql sends utf8mb4_general_ci in the handshake response, never utf8mb4_0900_ai_ci.
  • rust mysql_common (used by the mysql crate) sends utf8mb4_general_ci in the handshake response packet if the server is ≥5.5.3, but never utf8mb4_0900_ai_ci.
  • MySQL Connector/J has a complicated algorithm in 8.0.26 that they describe somewhat in 6.7 Using Character Sets and Unicode. I think what it does is:
    • if the server version ≥8.0.1, then handshake with utf8mb4_0900_ai_ci; otherwise handshake with utf8mb4_general_ci (configurePreHandshake) to encode the username/password
    • After the handshake, the client queries the server for the connection settings such as @@character_set_results (loadServerVariables)
    • Then the client executes an additional SET NAMES statement if the charset/collation do not match what is required (configurePostHandshake)

Workaround

If we are using the MySQL 8.0 client library to connect to a MySQL 5.7 server, we need to perform an additional SET NAMES to set the charset to utf8mb4.

Currently, Connection.set_character_set (which is called during Connection.__init__) only executes SET NAMES if it thinks that the parameters changed after connect. But because we can’t trust the client-side mysql_character_set_name function to return the server’s value of @@character_set_results, we should just SET NAMES unconditionally.

con.query("SET NAMES utf8mb4")
con.store_result()

Environment

MySQL Server

  • Server OS: Linux RDS or Docker
  • Server Version: MySQL 5.7.33

MySQL Client

  • OS (e.g. Windows 10, Ubuntu 20.04): OS X

  • Python: Homebrew Python 3.9.7 (but it also occurs in 2.7.10 with a compatible mysqlclient 1.4.6)

  • Connector/C: Homebrew mysql-client 8.0.26

@methane
Copy link
Member

methane commented Oct 8, 2021

Hm. There are some issue reports caused by not sending SET NAMES query after 1.4.4.

On the other hand, some database don't support SET NAMES query.

I will change to send SET NAMES query automatically during initialization.

@yonran
Copy link
Author

yonran commented Oct 9, 2021

On reflection, I do not think this was an intermittent issue. (I thought it was intermittent because I was changing the client installation to try to bisect different versions of the library and I wasn’t careful enough). I think this was a problem of using a [email protected] library to connect to a [email protected] server. The MySQL 8.0 documentation describes the issue in MySQL 8.0 Reference Manual / Character Sets, Collations, Unicode / Connection Character Sets and Collations: Connect-Time Error Handling I will update the title to describe my new understanding.

@yonran yonran changed the title Intermittent UnicodeDecodeError when charset is set during connect UnicodeDecodeError when using mysql-client 8.0 to connect to mysql 5.7 (or --skip-character-set-client-handshake) Oct 9, 2021
@methane
Copy link
Member

methane commented Oct 9, 2021

MySQL 8.0 changed the default collation for utf8 charset from utf8mb4_general_ci to utf8mb4_0900_ai_ci.

Since MySQL 5.7 don't know utf8mb4_0900_ai_ci, MySQL 5.7 use the default charset:collation instead of collation specified in handshake packet.

@yonran yonran changed the title UnicodeDecodeError when using mysql-client 8.0 to connect to mysql 5.7 (or --skip-character-set-client-handshake) UnicodeDecodeError when using mysql client 8.0 to connect to mysql server 5.7 Oct 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants