-
Notifications
You must be signed in to change notification settings - Fork 605
Ignore escaped LIKE wildcards in MySQL #1735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
tests/sqlparser_mysql.rs
Outdated
@@ -2530,6 +2530,16 @@ fn parse_rlike_and_regexp() { | |||
} | |||
} | |||
|
|||
#[test] | |||
fn parse_like_with_escape() { | |||
mysql().verified_only_select(r#"SELECT * FROM customer WHERE name LIKE 'a\%c'"#); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look at the comments and the tests but Im not sure I currently understood the behavior/problem being solved for.
Figured here to try ask with a concrete example, given this test case with the string 'a\%c'
what was the previous vs current behavior was it the case that we failed to parse the string in the previous behavior or that it produced a different kind of output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it took me a bit to realize what was going on when I encountered this. I'll try to explain it better in comments and commit messages.
These tests demonstrate that the backslash is not being stripped. Here is this same test without the fix:
thread 'parse_like_with_escape' panicked at src/test_utils.rs:164:13:
assertion `left == right` failed
left: "SELECT * FROM customer WHERE name LIKE 'a\\%c'"
right: "SELECT * FROM customer WHERE name LIKE 'a%c'"
The final 2 asserts on the big "Testing:" string in the parse_escaped_string_with_unescape
test in tests/sqlparser_common.rs
may make it bit clearer than this mysql-only test, because you can see the 2 different escaping behaviors side by side.
In the current behavior (which is correct for all escaping dialects except MySQL afaik), '\%'
and '\_'
are turned into'%'
and '_'
. This is the same as the 'default' escape rule: if there's no escaping rule for a given character, the backslash is simply stripped and the escaped character left in place. This 'default' is visible in parse_escaped_string_with_unescape
with '\h \ '
becoming 'h '
: there is no specific escape rule for h
or
.
The specific (and weird) escape rule that MySQL has for %
and _
is that they are left alone but the backslash is not stripped. The reason for this is so that they can be escaped in LIKE
patterns without double-escaping them like you have to do in, e.g. Snowflake (as described in the Snowflake docs and the comment on Dialect::ignores_like_wildcard_escapes
). But MySQL chose to make it a special case not just for LIKE patterns, but for all string literals.
MySQL:
mysql> select '\_', hex('\\'), hex('_'), hex('\_'), hex('h'), hex('\h');
+----+-----------+----------+-----------+----------+-----------+
| \_ | hex('\\') | hex('_') | hex('\_') | hex('h') | hex('\h') |
+----+-----------+----------+-----------+----------+-----------+
| \_ | 5C | 5F | 5C5F | 68 | 68 |
+----+-----------+----------+-----------+----------+-----------+
1 row in set (0.00 sec)
mysql> select '_' LIKE '\%', '_' LIKE '\_', '\_' LIKE '\_';
+---------------+---------------+----------------+
| '_' LIKE '\%' | '_' LIKE '\_' | '\_' LIKE '\_' |
+---------------+---------------+----------------+
| 0 | 1 | 0 |
+---------------+---------------+----------------+
1 row in set (0.00 sec)
(Note in the last LIKE expression, there is nothing to match the unstripped backslash character in the scrutinee, so it doesn't match.)
As far as I can tell from the Snowflake docs (BigQuery and Clickhouse are less clear), to express the equivalent LIKE expressions you would have to double-escape the backslashes, because this special case for _
and %
don't exist:
select '_' LIKE '\\%', '_' LIKE '\\_', '\\_' LIKE '\\_';
(And, of course, to make things more fun, double-escaping also works in MySQL...)
Like I said, I'll see what I can do to explain this better in comments/tests, but let me know if this explanation makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah that makes sense, thanks for clarifying!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet, let me know if the new comment(s) are sufficient. Thanks!
3e33640
to
50ac50b
Compare
MySQL has a special case for escaped LIKE wildcards (%, _) appearing in string literals: the escaping is ignored, whereas normally for any other (non-special) character, the backslash would be stripped. This is to allow them to be used in LIKE patterns without double-escaping as is needed in other escaping dialects (e.g. Snowflake), like so: MySQL matching a literal _ character: ```sql SELECT * FROM users WHERE name LIKE '%\_%'; ``` Snowflake doing the same thing: ```sql SELECT * FROM users WHERE name LIKE '%\\_%'; ``` Note that in MySQL, this escaping rule does not just affect LIKE patterns, but all string literals: ``` mysql> select '\_', hex('\\'), hex('_'), hex('\_'); +----+-----------+----------+-----------+ | \_ | hex('\\') | hex('_') | hex('\_') | +----+-----------+----------+-----------+ | \_ | 5C | 5F | 5C5F | +----+-----------+----------+-----------+ 1 row in set (0.00 sec) ``` This is implemented with a new flag onq dialect which gets passed into the tokenizer, because I don't know if any other dialects have similar behavior and wanted to make it easy to add them if they do. I can't currently test Snowflake, BigQuery, or Clickhouse on this point, so I'm just going off my best guess based on docs and examples. [MySQL docs](https://dev.mysql.com/doc/refman/8.4/en/string-literals.html)
50ac50b
to
4d712e3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MySQL has a special case for escaped LIKE wildcards appearing in string literals: the escaping is ignored, whereas normally for any other (non-special) character, the backslash would be stripped.
This is implemented with a new flag on dialect which gets passed into the tokenizer, because I don't know if any other dialects have similar behavior and wanted to make it easy to add them if they do. I can't currently test Snowflake, BigQuery, or Clickhouse on this point, so I'm just going off my best guess based on docs and examples.
MySQL docs (see "Table 11.1 Special Character Escape Sequences" and note following)