
Added support for MySQL backslash escapes #844


Merged
merged 1 commit into apache:main on Apr 10, 2023

Conversation

@cobyge (Contributor) commented Apr 6, 2023

Hi,
This fixes issue #800.

This adds proper support for MySQL backslash escaping (as described here).
Previously, the escaping support only worked for single quotes and backslashes. This adds full support for all Special Character Escape Sequences currently supported by MySQL (as defined in the table in the link above).
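The escape table referenced above can be sketched as a small decoder. This is a hypothetical illustration of the mapping, not the PR's actual tokenizer code (the function name is invented):

```rust
// Hypothetical sketch of MySQL backslash-escape decoding (not the PR's code).
// Covers the Special Character Escape Sequences from the MySQL docs.
fn decode_mysql_escapes(s: &str) -> String {
    let mut out = String::new();
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('0') => out.push('\0'),     // NUL
            Some('b') => out.push('\u{8}'),  // backspace
            Some('n') => out.push('\n'),     // newline
            Some('r') => out.push('\r'),     // carriage return
            Some('t') => out.push('\t'),     // tab
            Some('Z') => out.push('\u{1a}'), // Ctrl+Z
            Some('\\') => out.push('\\'),
            Some('\'') => out.push('\''),
            Some('"') => out.push('"'),
            // \% and \_ keep the backslash so LIKE-pattern semantics survive
            Some(c @ ('%' | '_')) => {
                out.push('\\');
                out.push(c);
            }
            // for any other character, MySQL drops the backslash
            Some(c) => out.push(c),
            None => out.push('\\'),
        }
    }
    out
}

fn main() {
    assert_eq!(decode_mysql_escapes(r"1 \b"), "1 \u{8}");
    assert_eq!(decode_mysql_escapes(r"\%"), r"\%");
    assert_eq!(decode_mysql_escapes(r"a\x"), "ax");
    println!("ok");
}
```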

I also added some tests to make sure that the special characters are parsed as they should be.

Let me know if there is anything else I can do to get this merged.

@cobyge (Contributor, Author) commented Apr 6, 2023

Now that I think about it, I realize that serializing special characters back into a string will not end well.
I'm debating whether I should add the correct implementation to Display for Token (specifically, doing the reverse translation for Token::SingleQuotedString and Token::DoubleQuotedString, as those are the only relevant ones according to the spec).
Alternatively, I could leave special characters unconverted as "raw" escape codes (I'd still have to take care of escaping backslashes and quotes).
I'd appreciate maintainer input on this (I might be missing something).
Either way, tomorrow I'll add tests to make sure the SQL is serialized back into a query correctly.

@cobyge (Contributor, Author) commented Apr 8, 2023

As far as I can tell, in order to correctly convert characters back into escape sequences (for example, converting a null byte back to \0), the implementation of fmt::Display for Value (only Value::SingleQuotedString and Value::DoubleQuotedString are relevant here) would need to be dialect-aware, which as far as I can tell it is not.

This presents a few problems:

  • When serializing a value back to a string, what is the correct representation of a null byte (or \Z, or others)? Is it always \0? I would assume not every SQL dialect supports that.
    • This leads me to prefer NOT converting the special characters, so upon parsing the quote SELECT '1 \b', the value of Value::SingleQuotedString would be '1 \b' and not '1 \u{8}'. This leads to somewhat incorrect and possibly unexpected behavior.
  • When parsing \% or \_, the 'correct' implementation would be to convert them into % or _ respectively, but as stated in the docs: "The \% and \_ sequences are used to search for literal instances of % and _ in pattern-matching contexts where they would otherwise be interpreted as wildcard characters. If you use \% or \_ outside of pattern-matching contexts, they evaluate to the strings \% and \_, not to % and _." So in order to preserve query semantics, the correct thing to do is to keep the backslash when parsing those two sequences. Like the previous point, this is not obvious behavior.
  • Some cases are not escaped properly; I will attempt to fix that in the following commit.
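A dialect-aware Display, if it existed, might look roughly like the sketch below. This is purely illustrative: the Dialect enum and function here are invented, and sqlparser's actual fmt::Display for Value takes no dialect parameter.

```rust
// Hypothetical sketch of dialect-aware serialization of a single-quoted
// string value. Invented names; not sqlparser's actual API.
enum Dialect {
    MySql,
    Generic,
}

fn serialize_single_quoted(s: &str, dialect: &Dialect) -> String {
    let body: String = s
        .chars()
        .map(|c| match (dialect, c) {
            // MySQL can re-escape control characters with backslashes
            (Dialect::MySql, '\0') => r"\0".to_string(),
            (Dialect::MySql, '\u{8}') => r"\b".to_string(),
            (Dialect::MySql, '\u{1a}') => r"\Z".to_string(),
            (Dialect::MySql, '\\') => r"\\".to_string(),
            // standard SQL escapes a quote by doubling it
            (_, '\'') => "''".to_string(),
            (_, c) => c.to_string(),
        })
        .collect();
    format!("'{}'", body)
}

fn main() {
    assert_eq!(serialize_single_quoted("1 \u{8}", &Dialect::MySql), r"'1 \b'");
    assert_eq!(serialize_single_quoted("it's", &Dialect::Generic), "'it''s'");
    println!("ok");
}
```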

I'd love to hear any thoughts about the correct way these things should work, and I'll be happy to implement them

@cobyge (Contributor, Author) commented Apr 8, 2023

Another issue:
Assuming we keep \b (and the others) as-is, without converting to \u{8}, how can we differentiate between the following two queries?

  • SELECT '\b'
  • SELECT '\\b'

Both of them would be serialized to SELECT '\b', and when parsing them back, there is no way to know what the original behavior was.
The 'easy' fix is to store a raw '\' as '\\', but then we lose most of the semantics of parsing the value.
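The collision can be demonstrated with a tiny sketch of the "keep raw escapes" strategy (the function is hypothetical, not the PR's code): both inputs decode to the same string, so the round trip is ambiguous.

```rust
// Hypothetical "keep raw escapes" decoder: only \\ and \' are resolved,
// every other escape is kept verbatim.
fn decode_keep_raw(s: &str) -> String {
    let mut out = String::new();
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('\\') => out.push('\\'),
            Some('\'') => out.push('\''),
            // unknown escapes are kept raw, backslash and all
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            None => out.push('\\'),
        }
    }
    out
}

fn main() {
    // SELECT '\b' and SELECT '\\b' collapse to the same decoded value:
    assert_eq!(decode_keep_raw(r"\b"), r"\b");
    assert_eq!(decode_keep_raw(r"\\b"), r"\b");
    println!("ok");
}
```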

@alamb (Contributor) left a comment

Looks good to me -- thank you @cobyge

@coveralls commented

Pull Request Test Coverage Report for Build 4632867435

  • 29 of 29 (100.0%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.008%) to 86.293%

Files with coverage reduction (new missed lines / %):
  • src/tokenizer.rs: 1 / 89.24%

Totals (Coverage Status):
  • Change from base Build 4524349703: +0.008%
  • Covered lines: 13920
  • Relevant lines: 16131

💛 - Coveralls

@alamb (Contributor) commented Apr 10, 2023

> I'd love to hear any thoughts about the correct way these things should work, and I'll be happy to implement them

In general I would prefer we be use-case driven. The issues you highlight seem like real issues, but I am not sure how important it is to ensure full round-trip display compatibility at this time, especially when there are multiple potential ways to escape/encode a particular character.

Perhaps you can write some test cases that show the behavior (or lack thereof) that you are worried about, and we can use those to drive the discussion of what the expected behavior is.

> Both of them would be serialized to SELECT '\b', and when deserializing them, there is no way to know what the original behavior is.

I am not sure it is critical to recover the original query byte for byte. Things such as whitespace and newlines are not preserved either.

@alamb alamb merged commit 04d9f3a into apache:main Apr 10, 2023