Skip to content

MR #346 breaks sphinxsearch connection #381

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
tumb1er opened this issue Aug 9, 2019 · 14 comments
Closed

MR #346 breaks sphinxsearch connection #381

tumb1er opened this issue Aug 9, 2019 · 14 comments

Comments

@tumb1er
Copy link

tumb1er commented Aug 9, 2019

MR #346 makes all string values returned from sphinxsearch connection via mysql protocol to be bytes.
Steps to reproduce:

# clone project
#> git clone https://github.com/tumb1er/django_sphinxsearch
#> cd django_sphinxsearch
# setup virtual env
#> virtualenv --python=python3.6 .
#> . bin/activate
# install requirements
#> pip install -r test-requires.txt
# start searchd with indices at /tmp
#> docker-compose up sphinxsearch 
# run tests
#> ./manage.py test
Ran 181 tests in 1.244s

OK
# update mysqlclient to MR 346
#> pip install git+git://github.com/PyMySQL/mysqlclient-python.git@e04c5972f526805e7c310aa213d1eadbabb45371
# run tests again
#> ./manage.py test
AssertionError: b'another string' != 'another string' : attr_string

----------------------------------------------------------------------
Ran 181 tests in 1.123s

FAILED (failures=47, errors=1)

It seems that every single string column is now broken.

I'm not sure that this is a bug of mysqlclient, but need help with getting strings back.

@tumb1er
Copy link
Author

tumb1er commented Aug 9, 2019

Fixed with manual string converters:

conversions = converters.conversions.copy()
conversions[constants.FIELD_TYPE.STRING] = lambda x: x.decode('utf-8')
options.setdefault('use_unicode', False)
options.setdefault('conv', conversions)
connection = connect(..., **options)

This helps partially (sphinxsearch specific)

options.setdefault('use_unicode', True)
connection = connect(..., **options)

@methane
Copy link
Member

methane commented Aug 9, 2019

It's totally wrong solution. Don't use the converter at all.

@methane
Copy link
Member

methane commented Aug 9, 2019

Actually speaking, you broke your library by this commit: tumb1er/django_sphinxsearch@9519c5f

mysqlclient automatically decodes the string. But you override it so mysqlclient doesn't decode the string anymore. But a bug in mysqlclient (#343) hided your bug. Fixing #343 revealed your bug.

I hate converter. It is source of tons of bugs. It is not tested well. Overriding converter will cause surprising bugs like yours.

@tumb1er
Copy link
Author

tumb1er commented Aug 10, 2019

Yes, I agree. That was an attempt to fix ascii decoding errors appeared nobody knows when and where. Most probable candidates are sphinxsearch update from 2.3.x to 3.1.x and mysqlclient bugfixes.

I tried to enable use_unicode flag without converter, but I have 1 single error left:

AssertionError: 'за вдв' != 'за вдв'

Russian characters are not decoded correctly. Do you have any ideas, how to determine cause of with error (sphinxsearch vs mysqlclient usage)?

@tumb1er
Copy link
Author

tumb1er commented Aug 10, 2019

Looks like it's decoding on python side:

mysql> select attr_string from sphinx___testapp_testmodel where id = 2;
+-------------+
| attr_string |
+-------------+
| за вдв      |
+-------------+
1 row in set (0.00 sec)

What is the difference of using converters vs use_unicode for strings?

@methane
Copy link
Member

methane commented Aug 10, 2019

Have you specified charset? Until MySQL 8.0, the default connection charset was latin1.
You can override the encoding by charset="utf8mb4" option.

@tumb1er
Copy link
Author

tumb1er commented Aug 10, 2019

options['charset'] = "latin1"

UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)

It seems, defaults for sphinxsearch are different. Setting charset to "utf8mb4" or "utf8" does not fix decoding error.

options['charset'] = "cp1251"

This is fun. Russian letters are ok (it's default windows charset for russian language), but now encoding is broken :-/

@tumb1er
Copy link
Author

tumb1er commented Aug 10, 2019

Interesting. use_unicode=True without converters works exactly as

conversions[constants.FIELD_TYPE.STRING] = lambda x: x.decode('latin1')

@methane
Copy link
Member

methane commented Aug 10, 2019

It seems you are confused by Django wrappers.
Play Sphinx with bare MySQLdb and learn the behavior.
After that, read source of Django backend.

@tumb1er
Copy link
Author

tumb1er commented Aug 10, 2019

Plain MySQLdb also works in confusing manner...

from MySQLdb import *
from MySQLdb import converters, constants
conv = converters.conversions.copy()
conv[constants.FIELD_TYPE.STRING] = lambda x: x.decode('utf-8')

conn = connect(host='127.0.0.1', port=9307,
               use_unicode=True,
               # conv=converters.conversions,
               charset='utf8'
               )

c = conn.cursor()
c.execute("select id, attr_string from sphinx___testapp_testmodel;")
print(c.fetchall())
# ((1, 'за вдв'),)

conn = connect(host='127.0.0.1', port=9307, use_unicode=False, conv=conv)
c = conn.cursor()
c.execute("select id, attr_string from sphinx___testapp_testmodel;")
print(c.fetchall())
# ((1, 'за вдв'),)

I'm also confused with this line in Connection.__init__:

        if use_unicode:
            for t in (FIELD_TYPE.STRING, FIELD_TYPE.VAR_STRING, FIELD_TYPE.VARCHAR, FIELD_TYPE.TINY_BLOB,
                      FIELD_TYPE.MEDIUM_BLOB, FIELD_TYPE.LONG_BLOB, FIELD_TYPE.BLOB):
                self.converter[t] = _bytes_or_str

Which encoding uses _bytes_or_str?

@methane
Copy link
Member

methane commented Aug 10, 2019

I expect data in your table is broken already.
Try MySQL CLI.

@tumb1er
Copy link
Author

tumb1er commented Aug 10, 2019

Tried already, no errors.
I mean, in MySQL CLI there is beautiful symmetry: send/receive via UPDATE/SELECT are working correctly, without changing encoding.
These two lines have asymmetry in encoding handling:

c.execute("update sphinx___testapp_testmodel set attr_string=%s WHERE id = %s",
          ("за вдв", 1))
c.execute("select id, attr_string from sphinx___testapp_testmodel;")

This is strange and wrong.

@methane methane reopened this Aug 11, 2019
@methane
Copy link
Member

methane commented Aug 11, 2019

I confirmed the issue. And #382 will fix it.

mysqlclient used mysql_set_character_set after connection is made. It uses SET NAMES utf8 query.
But sphinxsearch doesn't support SET NAMES utf8. So mysql_set_character_set doesn't change connection charset and default client charset (latin1 before MySQL 8, utf8 after MySQL 8) are used.

#382 configure the charset before calling mysql_real_connect. It doesn't use SET NAMES query.

@methane methane closed this as completed Aug 11, 2019
@methane
Copy link
Member

methane commented Aug 11, 2019

I released 1.4.4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants