Skip to content

Ignore escaped LIKE wildcards in MySQL #1735

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 1, 2025
Merged

Conversation

mvzink
Copy link
Contributor

@mvzink mvzink commented Feb 20, 2025

MySQL has a special case for escaped LIKE wildcards appearing in string literals: the escaping is ignored, whereas normally for any other (non-special) character, the backslash would be stripped.

This is implemented with a new flag on dialect which gets passed into the tokenizer, because I don't know if any other dialects have similar behavior and wanted to make it easy to add them if they do. I can't currently test Snowflake, BigQuery, or Clickhouse on this point, so I'm just going off my best guess based on docs and examples.

MySQL docs (see "Table 11.1 Special Character Escape Sequences" and note following)

@@ -2530,6 +2530,16 @@ fn parse_rlike_and_regexp() {
}
}

#[test]
fn parse_like_with_escape() {
mysql().verified_only_select(r#"SELECT * FROM customer WHERE name LIKE 'a\%c'"#);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at the comments and the tests but Im not sure I currently understood the behavior/problem being solved for.

Figured here to try ask with a concrete example, given this test case with the string 'a\%c' what was the previous vs current behavior was it the case that we failed to parse the string in the previous behavior or that it produced a different kind of output?

Copy link
Contributor Author

@mvzink mvzink Feb 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it took me a bit to realize what was going on when I encountered this. I'll try to explain it better in comments and commit messages.

These tests demonstrate that the backslash is not being stripped. Here is this same test without the fix:

thread 'parse_like_with_escape' panicked at src/test_utils.rs:164:13:
assertion `left == right` failed
  left: "SELECT * FROM customer WHERE name LIKE 'a\\%c'"
 right: "SELECT * FROM customer WHERE name LIKE 'a%c'"

The final 2 asserts on the big "Testing:" string in the parse_escaped_string_with_unescape test in tests/sqlparser_common.rs may make it bit clearer than this mysql-only test, because you can see the 2 different escaping behaviors side by side.

In the current behavior (which is correct for all escaping dialects except MySQL afaik), '\%' and '\_' are turned into'%' and '_'. This is the same as the 'default' escape rule: if there's no escaping rule for a given character, the backslash is simply stripped and the escaped character left in place. This 'default' is visible in parse_escaped_string_with_unescape with '\h \ ' becoming 'h ': there is no specific escape rule for h or .

The specific (and weird) escape rule that MySQL has for % and _ is that they are left alone but the backslash is not stripped. The reason for this is so that they can be escaped in LIKE patterns without double-escaping them like you have to do in, e.g. Snowflake (as described in the Snowflake docs and the comment on Dialect::ignores_like_wildcard_escapes). But MySQL chose to make it a special case not just for LIKE patterns, but for all string literals.

MySQL:

mysql> select '\_', hex('\\'), hex('_'), hex('\_'), hex('h'), hex('\h');
+----+-----------+----------+-----------+----------+-----------+
| \_ | hex('\\') | hex('_') | hex('\_') | hex('h') | hex('\h') |
+----+-----------+----------+-----------+----------+-----------+
| \_ | 5C        | 5F       | 5C5F      | 68       | 68        |
+----+-----------+----------+-----------+----------+-----------+
1 row in set (0.00 sec)

mysql> select '_' LIKE '\%', '_' LIKE '\_', '\_' LIKE '\_';
+---------------+---------------+----------------+
| '_' LIKE '\%' | '_' LIKE '\_' | '\_' LIKE '\_' |
+---------------+---------------+----------------+
|             0 |             1 |              0 |
+---------------+---------------+----------------+
1 row in set (0.00 sec)

(Note in the last LIKE expression, there is nothing to match the unstripped backslash character in the scrutinee, so it doesn't match.)

As far as I can tell from the Snowflake docs (BigQuery and Clickhouse are less clear), to express the equivalent LIKE expressions you would have to double-escape the backslashes, because this special case for _ and % don't exist:

select '_' LIKE '\\%', '_' LIKE '\\_', '\\_' LIKE '\\_';

(And, of course, to make things more fun, double-escaping also works in MySQL...)

Like I said, I'll see what I can do to explain this better in comments/tests, but let me know if this explanation makes sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah that makes sense, thanks for clarifying!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet, let me know if the new comment(s) are sufficient. Thanks!

@mvzink mvzink force-pushed the push-lqqyszlwynqm branch 3 times, most recently from 3e33640 to 50ac50b Compare February 25, 2025 20:08
MySQL has a special case for escaped LIKE wildcards (%, _) appearing in
string literals: the escaping is ignored, whereas normally for any other
(non-special) character, the backslash would be stripped. This is to
allow them to be used in LIKE patterns without double-escaping as is
needed in other escaping dialects (e.g. Snowflake), like so:

MySQL matching a literal _ character:

```sql
SELECT * FROM users WHERE name LIKE '%\_%';
```

Snowflake doing the same thing:

```sql
SELECT * FROM users WHERE name LIKE '%\\_%';
```

Note that in MySQL, this escaping rule does not just affect LIKE
patterns, but all string literals:

```
mysql> select '\_', hex('\\'), hex('_'), hex('\_');
+----+-----------+----------+-----------+
| \_ | hex('\\') | hex('_') | hex('\_') |
+----+-----------+----------+-----------+
| \_ | 5C        | 5F       | 5C5F      |
+----+-----------+----------+-----------+
1 row in set (0.00 sec)
```

This is implemented with a new flag onq dialect which gets passed into
the tokenizer, because I don't know if any other dialects have similar
behavior and wanted to make it easy to add them if they do. I can't
currently test Snowflake, BigQuery, or Clickhouse on this point, so I'm
just going off my best guess based on docs and examples.

[MySQL docs](https://dev.mysql.com/doc/refman/8.4/en/string-literals.html)
Copy link
Contributor

@iffyio iffyio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @mvzink!
cc @alamb

@iffyio iffyio merged commit a629ddf into apache:main Mar 1, 2025
9 checks passed
QuenKar pushed a commit to QuenKar/datafusion-sqlparser-rs that referenced this pull request Mar 25, 2025
ayman-sigma pushed a commit to sigmacomputing/sqlparser-rs that referenced this pull request Apr 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants