
Added support for MySQL backslash escapes #844


Merged
merged 1 commit into apache:main on Apr 10, 2023

Conversation

@cobyge (Contributor) commented Apr 6, 2023

Hi,
This fixes issue #800.

This adds proper support for MySQL backslash escaping (as described here).
Previously, the escaping support only worked for single quotes and backslashes. This adds full support for all Special Character Escape Sequences currently supported by MySQL (as defined in the table in the link above).
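The escape table referenced above can be sketched as a small decoder. This is a hypothetical illustration of the mapping, not the PR's actual tokenizer code (the function name is invented):

```rust
// Hypothetical sketch of MySQL backslash-escape decoding (not the PR's code).
// Covers the Special Character Escape Sequences from the MySQL docs.
fn decode_mysql_escapes(s: &str) -> String {
    let mut out = String::new();
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('0') => out.push('\0'),     // NUL
            Some('b') => out.push('\u{8}'),  // backspace
            Some('n') => out.push('\n'),     // newline
            Some('r') => out.push('\r'),     // carriage return
            Some('t') => out.push('\t'),     // tab
            Some('Z') => out.push('\u{1a}'), // Ctrl+Z
            Some('\\') => out.push('\\'),
            Some('\'') => out.push('\''),
            Some('"') => out.push('"'),
            // \% and \_ keep the backslash so LIKE-pattern semantics survive
            Some(c @ ('%' | '_')) => {
                out.push('\\');
                out.push(c);
            }
            // for any other character, MySQL drops the backslash
            Some(c) => out.push(c),
            None => out.push('\\'),
        }
    }
    out
}

fn main() {
    assert_eq!(decode_mysql_escapes(r"1 \b"), "1 \u{8}");
    assert_eq!(decode_mysql_escapes(r"\%"), r"\%");
    assert_eq!(decode_mysql_escapes(r"a\x"), "ax");
    println!("ok");
}
```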

I also added some tests to make sure that the special characters are parsed as they should be.

Let me know if there is anything else I can do to get this merged.

@cobyge (Contributor, Author) commented Apr 6, 2023

Now that I think about it, I realize that serializing special characters back into a string will not end well.
I'm debating whether I should add the correct implementation to Display for Token (specifically, doing the reverse translation for Token::SingleQuotedString and Token::DoubleQuotedString, as those are the only relevant ones according to the spec).
Alternatively, I could leave special characters unconverted as "raw" escape codes (I'd still have to take care of escaping backslashes and quotes).
I'd appreciate maintainer input on this (I might be missing something).
Either way, tomorrow I'll add tests to make sure the SQL is serialized back into a query correctly.

@cobyge (Contributor, Author) commented Apr 8, 2023

As far as I can tell, in order to correctly convert characters back into escape sequences (for example, converting a null byte back to \0), the implementation of fmt::Display for Value (only Value::SingleQuotedString and Value::DoubleQuotedString are relevant here) would need to be dialect-aware, which as far as I can tell it is not.

This presents a few problems:

  • When serializing a value back to a string, what is the correct representation of a null byte (or \Z, or others)? Is it always \0? I would assume not every SQL dialect supports that.
    • This leads me to prefer NOT converting the special characters, so upon parsing the quote SELECT '1 \b', the value of Value::SingleQuotedString would be '1 \b' and not '1 \u{8}'. This leads to somewhat incorrect and possibly unexpected behavior.
  • When parsing \% or \_, the 'correct' implementation would be to convert them into % or _ respectively, but as stated in the docs: "The \% and \_ sequences are used to search for literal instances of % and _ in pattern-matching contexts where they would otherwise be interpreted as wildcard characters. If you use \% or \_ outside of pattern-matching contexts, they evaluate to the strings \% and \_, not to % and _." So in order to preserve query semantics, the correct thing to do is to keep the backslash when parsing those two sequences. Like the previous point, this is not obvious behavior.
  • Some cases are not escaped properly; I will attempt to fix that in the following commit.
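A dialect-aware Display, if it existed, might look roughly like the sketch below. This is purely illustrative: the Dialect enum and function here are invented, and sqlparser's actual fmt::Display for Value takes no dialect parameter.

```rust
// Hypothetical sketch of dialect-aware serialization of a single-quoted
// string value. Invented names; not sqlparser's actual API.
enum Dialect {
    MySql,
    Generic,
}

fn serialize_single_quoted(s: &str, dialect: &Dialect) -> String {
    let body: String = s
        .chars()
        .map(|c| match (dialect, c) {
            // MySQL can re-escape control characters with backslashes
            (Dialect::MySql, '\0') => r"\0".to_string(),
            (Dialect::MySql, '\u{8}') => r"\b".to_string(),
            (Dialect::MySql, '\u{1a}') => r"\Z".to_string(),
            (Dialect::MySql, '\\') => r"\\".to_string(),
            // standard SQL escapes a quote by doubling it
            (_, '\'') => "''".to_string(),
            (_, c) => c.to_string(),
        })
        .collect();
    format!("'{}'", body)
}

fn main() {
    assert_eq!(serialize_single_quoted("1 \u{8}", &Dialect::MySql), r"'1 \b'");
    assert_eq!(serialize_single_quoted("it's", &Dialect::Generic), "'it''s'");
    println!("ok");
}
```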

I'd love to hear any thoughts about the correct way these things should work, and I'll be happy to implement them

@cobyge (Contributor, Author) commented Apr 8, 2023

Another issue:
Assuming we keep \b (and the others) as-is, without converting to \u{8}, how can we differentiate between the following two queries?

  • SELECT '\b'
  • SELECT '\\b'

Both of them would be serialized to SELECT '\b', and when parsing them back, there is no way to know what the original behavior was.
The 'easy' fix is to store a raw '\' as '\\', but then we lose most of the semantics of parsing the value.
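The collision can be demonstrated with a tiny sketch of the "keep raw escapes" strategy (the function is hypothetical, not the PR's code): both inputs decode to the same string, so the round trip is ambiguous.

```rust
// Hypothetical "keep raw escapes" decoder: only \\ and \' are resolved,
// every other escape is kept verbatim.
fn decode_keep_raw(s: &str) -> String {
    let mut out = String::new();
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('\\') => out.push('\\'),
            Some('\'') => out.push('\''),
            // unknown escapes are kept raw, backslash and all
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            None => out.push('\\'),
        }
    }
    out
}

fn main() {
    // SELECT '\b' and SELECT '\\b' collapse to the same decoded value:
    assert_eq!(decode_keep_raw(r"\b"), r"\b");
    assert_eq!(decode_keep_raw(r"\\b"), r"\b");
    println!("ok");
}
```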

@alamb (Contributor) left a comment

Looks good to me -- thank you @cobyge

@coveralls commented

Pull Request Test Coverage Report for Build 4632867435

  • 29 of 29 (100.0%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.008%) to 86.293%

Files with coverage reduction (new missed lines / %):
  • src/tokenizer.rs: 1 / 89.24%

Totals (Coverage Status):
  • Change from base Build 4524349703: +0.008%
  • Covered lines: 13920
  • Relevant lines: 16131

💛 - Coveralls

@alamb (Contributor) commented Apr 10, 2023

> I'd love to hear any thoughts about the correct way these things should work, and I'll be happy to implement them

In general I would prefer we be use-case driven. The issues you highlight seem like real issues, but I am not sure how important it is to ensure full round-trip display compatibility at this time, especially when there are multiple potential ways to escape/encode a particular character.

Perhaps you can write some test cases that show the behavior (or lack thereof) that you are worried about, and we can use those to drive the discussion of what the expected behavior is.

> Both of them would be serialized to SELECT '\b', and when deserializing them, there is no way to know what the original behavior is.

I am not sure it is critical to recover the original query byte for byte. Things such as whitespace and newlines are not preserved either.

@alamb alamb merged commit 04d9f3a into apache:main Apr 10, 2023