Character set errors when using mysqlclient 1.4.4 and above #422

Closed
kierenpitts opened this issue Jan 14, 2020 · 9 comments

@kierenpitts

kierenpitts commented Jan 14, 2020

Hi

We've been using mysqlclient 1.3.14 with Django 2.2.9 and MySQL 5.7.21 enterprise with encrypted databases (all on Centos 7 using Python 3.6.8) for some time. Our database and tables use utf8mb4 throughout. When we upgrade to using mysqlclient 1.4.6 in our test environment we can no longer enter emojis into text fields through the Django web application. The MySQL error being generated by our test suite is:

MySQLdb._exceptions.OperationalError: (1366, "Incorrect string value: '\xF0\x9F\x98\x8A.' for column 'some_field' at row 1")
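For context, here is a quick check (a sketch added for illustration, not part of the original report) of the four bytes quoted in the error. They are the UTF-8 encoding of a single emoji, and MySQL's utf8 character set stores at most three bytes per character, so a connection that is not using utf8mb4 would be expected to reject it:

```python
# Bytes copied from the error message (without the trailing '.').
raw = b"\xF0\x9F\x98\x8A"

# Decode them as UTF-8: the result is a single character.
emoji = raw.decode("utf-8")

# MySQL's 'utf8' (utf8mb3) holds at most 3 bytes per character,
# so any character that needs 4 bytes requires utf8mb4.
needs_utf8mb4 = len(emoji.encode("utf-8")) > 3

print(f"code point: U+{ord(emoji):04X}, UTF-8 length: {len(raw)} bytes")
print("needs utf8mb4:", needs_utf8mb4)
```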

We have also tested this in the browser (i.e. separately from our test suite) using the web application and have confirmed that text containing an emoji is rejected with a MySQL error.

If we downgrade mysqlclient to 1.4.3, the data is inserted correctly (into the same database that previously rejected the insert). If we update to 1.4.4, the error appears again (as before, using the exact same database). If we go back to 1.3.14 (which we currently use), the data can also be inserted. In each case we are making no code changes other than installing a different version of mysqlclient using pip. Our application has supported emojis for several years.

As far as we can tell, with no other code changes, the error is only occurring once we start using mysqlclient 1.4.4 and above. The changelog shows that the way charsets were being handled in connections changed with 1.4.4. We're wondering if this has caused an issue, possibly with connections to encrypted MySQL databases falling back to a default charset for some reason? We are specifying 'utf8mb4' as the charset used in the DATABASES settings in Django and this is an existing application that has not experienced similar errors before. If the connection is falling back to the default charset (i.e. not 'utf8mb4') then this is the error we would expect to see.

Any help/suggestions would be gratefully received.

Cheers

Kieren

MySQL Server

  • Server OS: Centos 7
  • Server Version: MySQL 5.7.21-enterprise

MySQL Client

  • OS: Centos 7
  • Python: Python 3.6.8
  • Connector/C: mysqlclient 1.3.14 - 1.4.3 are fine, 1.4.4 and above result in errors

Application uses Django 2.2.9

@methane
Member

methane commented Jan 14, 2020

Try running SHOW VARIABLES LIKE 'character_set%'; from Python.

@kierenpitts
Author

kierenpitts commented Jan 14, 2020

If I do this via a Django shell then it does illustrate the issue:

With mysqlclient 1.4.6

>>> from django.db import connection
>>> cursor = connection.cursor()
>>> cursor.execute("SHOW VARIABLES LIKE 'character_set%';")
8
>>> cursor.fetchall()
(('character_set_client', 'utf8'),
 ('character_set_connection', 'utf8'),
 ('character_set_database', 'utf8mb4'),
 ('character_set_filesystem', 'binary'),
 ('character_set_results', 'utf8'),
 ('character_set_server', 'utf8'),
 ('character_set_system', 'utf8'),
 ('character_sets_dir', '/usr/share/mysql/charsets/'))

If I then exit the Django shell and run this in the venv to downgrade:

pip install mysqlclient==1.4.3

and then re-run the above in the Django shell with no other changes:

>>> from django.db import connection
>>> cursor = connection.cursor()
>>> cursor.execute("SHOW VARIABLES LIKE 'character_set%';")
8
>>> cursor.fetchall()
(('character_set_client', 'utf8mb4'),
 ('character_set_connection', 'utf8mb4'),
 ('character_set_database', 'utf8mb4'),
 ('character_set_filesystem', 'binary'),
 ('character_set_results', 'utf8mb4'),
 ('character_set_server', 'utf8'),
 ('character_set_system', 'utf8'),
 ('character_sets_dir', '/usr/share/mysql/charsets/'))
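The difference between the two runs is limited to the three per-connection variables. A small helper can make that comparison explicit. This is only a sketch: the function name `mismatched_session_charsets` and the constant `SESSION_VARS` are made up for illustration, and the rows are copied from the two outputs above.

```python
# The three variables that describe the character set of the current connection.
SESSION_VARS = (
    "character_set_client",
    "character_set_connection",
    "character_set_results",
)


def mismatched_session_charsets(rows, expected="utf8mb4"):
    """Return the connection-level variables whose value is not `expected`.

    `rows` is the (name, value) sequence returned by cursor.fetchall()
    after SHOW VARIABLES LIKE 'character_set%'.
    """
    variables = dict(rows)
    return {
        name: variables.get(name)
        for name in SESSION_VARS
        if variables.get(name) != expected
    }


# Rows as reported in this thread (server-level variables omitted for brevity).
rows_1_4_6 = (
    ("character_set_client", "utf8"),
    ("character_set_connection", "utf8"),
    ("character_set_database", "utf8mb4"),
    ("character_set_results", "utf8"),
)
rows_1_4_3 = (
    ("character_set_client", "utf8mb4"),
    ("character_set_connection", "utf8mb4"),
    ("character_set_database", "utf8mb4"),
    ("character_set_results", "utf8mb4"),
)

print("1.4.6:", mismatched_session_charsets(rows_1_4_6))
print("1.4.3:", mismatched_session_charsets(rows_1_4_3))
```

In a Django shell the same helper could be fed the result of `cursor.fetchall()` directly.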

@methane
Member

methane commented Jan 15, 2020

How do you build mysqlclient?
Do you use an old libmysqlclient?

@methane
Member

methane commented Jan 15, 2020

If your libmysqlclient is very old and doesn't support utf8mb4 correctly,
you can use the init_command='SET NAMES "utf8mb4"' option.
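In a Django project, this option would typically go in the OPTIONS dictionary of the DATABASES setting, which the MySQL backend passes through to the driver's connect call. The following is a minimal sketch; the database name, user, password, and host are placeholders, not values from this thread.

```python
# settings.py (sketch): force utf8mb4 on every new connection.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "example_db",        # placeholder
        "USER": "example_user",      # placeholder
        "PASSWORD": "change-me",     # placeholder
        "HOST": "127.0.0.1",         # placeholder
        "PORT": "3306",
        "OPTIONS": {
            # Character set requested by the client library.
            "charset": "utf8mb4",
            # Statement executed right after connecting, so the session
            # uses utf8mb4 even if the handshake charset is not applied.
            "init_command": "SET NAMES 'utf8mb4'",
        },
    }
}
```

Keeping both `charset` and `init_command` is redundant when the handshake works as intended, but it costs only one extra statement per connection.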

@kierenpitts
Author

kierenpitts commented Jan 15, 2020

Hi @methane

Thanks for the reply. Our libmysqlclient is installed directly from the official MySQL repos for Centos and isn't very old:

mysql-community-libs-compat-5.7.29-1.el7.x86_64 : Shared compat libraries for MySQL 5.6.45 database client applications
Repo        : @mysql57-community
Matched from:
Filename    : /usr/lib64/mysql/libmysqlclient.so.18

If libmysqlclient were at fault, I'd have expected us to have had lots of problems with utf8mb4 before now, but we have had no problems at all. As I mentioned before, we only start seeing issues when we use mysqlclient-python 1.4.4 and above; we have no issues with mysqlclient-python <1.4.4 or the 1.3.x series. We have been using the 1.3.x series since 2017, and the test that users can insert emojis in content is of a similar age.

@methane
Member

methane commented Jan 15, 2020

Before 1.4.4, mysqlclient sent SET NAMES "utf8mb4" after connecting.
So mysqlclient worked even if libmysqlclient didn't know the correct collation id for "utf8mb4".

Since 1.4.4, mysqlclient initializes the charset before connecting. libmysqlclient chooses the collation id for the charset and sends it in the handshake packet, and SET NAMES "utf8mb4" is no longer sent.

So if libmysqlclient doesn't know the correct collation id, it fails only from 1.4.4 onwards.

Try capturing the handshake packets with tcpdump and inspecting them in Wireshark.
You can see which collation id libmysqlclient uses.
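Once a collation id has been read from the captured handshake, the server itself can translate it into a collation and character set name through information_schema.COLLATIONS. This is a sketch only: `collation_for_id` is a hypothetical helper, and it assumes a DB-API cursor that uses the %s parameter style, as MySQLdb and Django cursors do.

```python
# Query the server's own table of collations by numeric id.
COLLATION_BY_ID_SQL = (
    "SELECT COLLATION_NAME, CHARACTER_SET_NAME "
    "FROM information_schema.COLLATIONS WHERE ID = %s"
)


def collation_for_id(cursor, collation_id):
    """Return (collation_name, charset_name) for a collation id, or None."""
    cursor.execute(COLLATION_BY_ID_SQL, [collation_id])
    return cursor.fetchone()
```

From a Django shell this could be called as `collation_for_id(connection.cursor(), some_id)`, where `some_id` is the number shown in the login request packet.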

@kierenpitts
Author

Thanks for the information. I had a look in Wireshark and can confirm that mysqlclient 1.4.6 is sending the correct information during the login request:

Charset: utf8mb4 COLLATE utf8mb4_general_ci (45)

mysqlclient 1.4.3 is sending Charset: latin1 COLLATE latin1_swedish_ci (8)

We hadn't realised that, prior to 1.4.4, utf8mb4 was being set using SET NAMES utf8mb4. We think the MySQL server we're using has been configured to ignore the character set that clients send in the handshake. We weren't aware of this and are currently trying to get confirmation from the team that maintains the MySQL instance. This would explain what we're seeing: even with that configuration, the server would still have honoured SET NAMES, so removing SET NAMES and relying on the character set specified in the handshake from 1.4.4 onwards would result in the server falling back to its default.
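The reasoning can be written down as a toy model. This is a deliberate simplification for illustration, not the server's actual logic: the function `effective_client_charset` and its parameters are made up, and it assumes the server default is utf8, as the character_set_server value reported earlier suggests.

```python
def effective_client_charset(handshake_charset, server_default,
                             ignore_handshake, set_names=None):
    """Toy model of which charset a session ends up using.

    handshake_charset: charset the client library sends when logging in
    server_default:    charset the server falls back to
    ignore_handshake:  True if the server disregards the handshake charset
    set_names:         charset from an explicit SET NAMES statement, if any
    """
    charset = server_default if ignore_handshake else handshake_charset
    if set_names is not None:
        # An explicit SET NAMES always overrides what was negotiated.
        charset = set_names
    return charset


# mysqlclient < 1.4.4: latin1 in the handshake, then SET NAMES utf8mb4.
before_1_4_4 = effective_client_charset(
    "latin1", "utf8", ignore_handshake=True, set_names="utf8mb4")

# mysqlclient >= 1.4.4: utf8mb4 in the handshake, no SET NAMES.
from_1_4_4 = effective_client_charset(
    "utf8mb4", "utf8", ignore_handshake=True)

print(before_1_4_4, from_1_4_4)
```

Under these assumptions the older client ends up with utf8mb4 and the newer one with utf8, which matches the SHOW VARIABLES output for the two versions.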

Thank you very much for your help and advice on this.

@methane
Member

methane commented Jan 15, 2020

Uh! "--skip-character-set-client-handshake"!
I forgot that option exists.
I don't know why some people use it.

@methane methane closed this as completed Jan 16, 2020
@kierenpitts
Author

@methane Thanks again for all the help with this, it was much appreciated.
