feat: mysql no-escape mode #870


Merged: 21 commits merged into apache:main on Jul 19, 2023

Conversation

@canalun canalun (Contributor) commented May 8, 2023

for #861

@canalun canalun marked this pull request as ready for review May 9, 2023 03:37
Box::new(BigQueryDialect {}),
Box::new(SQLiteDialect {}),
],
options: None,
}
}

pub fn all_dialects_other_than_mysqlnoescape() -> TestedDialects {

@canalun canalun (Contributor Author) commented May 9, 2023

The "XX_other_than_mysqlnoescape" functions, including this one, are meant for tests concerning escaping.
I defined this kind of function because some widely used test utilities assume that all dialects produce the same output.

@alamb alamb (Contributor) left a comment

Thanks @canalun -- I'll try and find time to review this PR over the next few days

Cargo.toml Outdated
@@ -34,6 +34,7 @@ serde = { version = "1.0", features = ["derive"], optional = true }
# https://github.com/rust-lang/cargo/issues/1596
serde_json = { version = "1.0", optional = true }
sqlparser_derive = { version = "0.1.1", path = "derive", optional = true }
duplicate = "1.0.0"

Contributor

in general we try and keep the dependencies of this crate as minimal as possible -- can you please remove this dependency?

Contributor Author

Got it!
Please refer to this commit: 598d9e9

@alamb alamb (Contributor) left a comment

Thank you @canalun for the contribution and the well worded description on #861

After reviewing this PR, I feel like "dialect" isn't quite the right abstraction for this behavior as it doesn't change the SQL that is supported, but rather it changes the internal representation of how the data is stored after parsing

Thus, perhaps adding it as a flag on ParserOptions might be better: https://docs.rs/sqlparser/0.33.0/sqlparser/parser/struct.ParserOptions.html

In terms of encoding the value, I wonder if we could add another variant to Value that could store the original, unescaped, characters. Perhaps something like the following

/// Primitive SQL values such as number and string
pub enum Value {
    // ...
    /// 'string value'
    SingleQuotedString(String),
    /// Value and its original (escaped) form
    WithOriginal {
        /// The parsed / unescaped value (so "foo\nbar" would have a newline in it)
        value: Box<Value>,
        /// Characters in the original string (so "foo\nbar" would still have the slash)
        original: String,
    },
    // ...
}

This would then let you isolate the parser changes to the parsing of values, as well as make it clear how to format each variant. This might make the implementation and testing more straightforward
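
As an illustration of how formatting might use such a variant, here is a minimal self-contained sketch (the names mirror the proposal above; it is not the crate's actual implementation):

```rust
use std::fmt;

// Hypothetical, simplified Value enum mirroring the proposal above.
pub enum Value {
    /// 'string value'
    SingleQuotedString(String),
    /// Value and its original (escaped) form
    WithOriginal { value: Box<Value>, original: String },
}

impl fmt::Display for Value {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Default path: serialize the parsed value, re-escaping quotes.
            Value::SingleQuotedString(s) => write!(f, "'{}'", s.replace('\'', "''")),
            // When the original text was kept, emit it verbatim.
            Value::WithOriginal { original, .. } => write!(f, "'{}'", original),
        }
    }
}

fn main() {
    // Parsed form: the escape was resolved to a real newline.
    let parsed = Value::SingleQuotedString("foo\nbar".to_string());
    // With the original kept, the backslash survives serialization.
    let kept = Value::WithOriginal {
        value: Box::new(Value::SingleQuotedString("foo\nbar".to_string())),
        original: "foo\\nbar".to_string(),
    };
    assert_eq!(parsed.to_string(), "'foo\nbar'");
    assert_eq!(kept.to_string(), "'foo\\nbar'");
}
```

This keeps each variant's Display unambiguous, since the variant itself says whether the stored string is escaped or not.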

What do you think ?

@coveralls coveralls commented May 17, 2023

Pull Request Test Coverage Report for Build 5603686662

  • 226 of 233 (97.0%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.06%) to 87.131%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
src/tokenizer.rs | 73 | 74 | 98.65%
tests/sqlparser_common.rs | 12 | 13 | 92.31%
tests/sqlparser_mysql.rs | 107 | 109 | 98.17%
src/ast/value.rs | 19 | 22 | 86.36%
Totals Coverage Status
Change from base Build 5592480441: 0.06%
Covered Lines: 15471
Relevant Lines: 17756

💛 - Coveralls

@canalun canalun (Contributor Author) commented May 26, 2023

@alamb Thank you for reviewing it!
I agree with your idea👶 Handling this situation via dialects is a bit tricky, but the alternative is straightforward.
I will work on it, but it may take some time...! (thank you for your patience!)

@alamb alamb (Contributor) commented May 26, 2023

No worries ! Thank you @canalun

FYI I don't plan the next sqlparser release until June 2023 so there is no time pressure from my end #887

@canalun canalun (Contributor Author) commented Jun 11, 2023

I'm working on it...! Sorry for the late response.
I'll post an update tomorrow!

@canalun canalun mentioned this pull request Jun 13, 2023
@canalun canalun (Contributor Author) commented Jun 13, 2023

Hi @alamb, sorry for the late update.

I've started working on it and found a few things,
and I'd like to let you know these points before implementing it fully, just in case...!
(btw, it would be useful to refer to PR #894, which is what I prototyped these days👶, when following the points)

First, I want to share what I discovered about the implementation details of the current strategy.

  • When parsing SQL queries, the point where the escape information (e.g. backslashes) vanishes is inside next_token(). Specifically, some functions such as tokenize_quoted_string() drop details like backslashes.
  • So we have to define and use new functions that preserve the escape information, instead of those escape-dropping functions.
  • In addition, the tokenizer will start to receive options, and a new enum of Value and Token will be defined (as you said😊).

Given this, I think the current strategy also has some costs.

  1. With the no-escape option, the AST will not carry each token's type information. This is because, for example, both Token::SingleQuotedString and Token::DoubleQuotedString would become the same Token::OriginalString under no-escape mode.
    In reality, Token::OriginalString would look like Token::OriginalString("\"example\"") or Token::OriginalString("'example'"), so the type information is never truly lost, but I think it would become a bit harder to analyze the AST programmatically.
  2. Knowledge about each token's prefix would spill over into next_token().
    For example, please see this line of the prototype.
  3. As you can see from point 2, the logic of next_token() would get messy...! I don't have a good idea here. TT

If I may, I still think the current strategy is better than the previous one, because the previous one was a bit tricky.
The fact that the same Value variant could hold strings in different escape modes would be really confusing,
especially when it comes to implementing fmt::Display for each Value variant. (https://github.com/sqlparser-rs/sqlparser-rs/pull/870/files#diff-9bcd0911924444fc7dd51971f9114291f2e6b619da2ce64ad48e91d6dd1abf8eR187)
This difficulty would affect development almost permanently.

However, it is a difficult problem; both approaches have costly points.
I'd like to hear your opinion🐤

Thank you for reading the long message...!

@alamb alamb (Contributor) commented Jun 16, 2023

Thank you for reading the long message...!

Thanks @canalun -- I agree it is a tricky problem. Maybe @ankrgyl has some other thoughts.

Basically my thinking is that if we are going to update the tokenizer in some backwards compatible way, we should also consider adding original line/offset information as well.

both Token::SingleQuotedString and Token::DoubleQuotedString will become the same Token::OriginalString under the no-escape mode.

What if we kept Token::SingleQuotedString and Token::DoubleQuotedString but made them something like

SingleQuotedString(Token)

Or something ?

I realize that would be a (very) disruptive API change 🤔

@ankrgyl ankrgyl (Contributor) commented Jun 16, 2023

👋 This may be a very dumb idea — but if no-escape is a "mode" that we pass to the tokenizer, do we need to represent the "escaped" version too? If not, why can't we use the existing structs/enums and just change how the tokenizer populates the string representations inside of them? Especially if it doesn't impact (downstream) how the parser works?

In terms of making backwards incompatible changes, I do have a specific idea on how to do this that I just haven't had time to implement, which is to use features (so users can opt into backwards incompatibility) and then eventually deprecate the old modes.

@alamb alamb (Contributor) commented Jun 16, 2023

👋 This may be a very dumb idea — but if no-escape is a "mode" that we pass to the tokenizer, do we need to represent the "escaped" version too? If not, why can't we use the existing structs/enums and just change how the tokenizer populates the string representations inside of them? Especially if it doesn't impact (downstream) how the parser works?

So in this mode, a string like

"Foo"

Which today is parsed as

Token::DoubleQuotedString(String::from("Foo"))

Would be parsed as this (the " would still be present)?:

Token::DoubleQuotedString(String::from("\"Foo\""))

That sounds like a genius (rather than dumb) idea to me ... 🤔

In terms of making backwards incompatible changes, I do have a specific idea on how to do this that I just haven't had time to implement, which is to use features (so users can opt into backwards incompatibility) and then eventually deprecate the old modes.

❤️

@ankrgyl ankrgyl (Contributor) commented Jun 16, 2023

Yes that's exactly right. I think all that matters is that you can get back strings with the same semantics as mysql's no-escape mode. It's basically like a dialect for the tokenizer.

@canalun canalun (Contributor Author) commented Jun 18, 2023

@alamb @ankrgyl
Thank you all for giving the idea! I agree with it :)

I understand that queries would be parsed and serialized like this.

phrase of original query | mode | node of AST | serialized
"A""B""A" | no-escape | DoubleQuotedString(String::from("A\"\"B\"\"A")) | "A""B""A"
"A""B""A" | default | DoubleQuotedString(String::from("A\"B\"A")) | "A""B""A"
"A\"B\"A" | no-escape | DoubleQuotedString(String::from("A\\\"B\\\"A")) | "A\"B\"A"
"A\"B\"A" | default | DoubleQuotedString(String::from("A\"B\"A")) | "A""B""A"

Assumptions:

  1. In no-escape mode, each AST node holds the exact original string.
  2. In default mode, we don't have to care whether a " comes from \" or "" in the original query (this is the current behavior of this library, I think).
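
The behavior summarized above can be illustrated with a toy reader for double-quoted strings (a self-contained sketch, not the crate's code; no_escape mirrors the proposed option, and "" escapes a quote as in the table):

```rust
// Toy reader for a double-quoted string where `""` escapes a quote.
// In no-escape mode the body is kept verbatim; otherwise `""` collapses to `"`.
fn read_double_quoted(src: &str, no_escape: bool) -> Option<String> {
    let mut chars = src.chars().peekable();
    if chars.next()? != '"' {
        return None; // not a double-quoted string
    }
    let mut out = String::new();
    while let Some(c) = chars.next() {
        if c == '"' {
            if chars.peek() == Some(&'"') {
                chars.next();
                // Keep both quotes in no-escape mode, only one otherwise.
                out.push('"');
                if no_escape {
                    out.push('"');
                }
            } else {
                return Some(out); // closing quote
            }
        } else {
            out.push(c);
        }
    }
    None // unterminated string
}

fn main() {
    let sql = r#""A""B""A""#;
    // no-escape mode keeps the exact original body.
    assert_eq!(read_double_quoted(sql, true).as_deref(), Some(r#"A""B""A"#));
    // default mode resolves the `""` escapes.
    assert_eq!(read_double_quoted(sql, false).as_deref(), Some(r#"A"B"A"#));
}
```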

@canalun canalun (Contributor Author) commented Jun 18, 2023

I will work on it!
Fortunately, only a few modifications to the present PR are needed :)

@canalun canalun (Contributor Author) commented Jun 18, 2023

I've added all the needed changes; could you please take a look? :)

@alamb alamb (Contributor) commented Jun 20, 2023

Thanks @canalun -- I am a bit behind on reviews. I plan to review this tomorrow. Sorry for the delay

@canalun canalun (Contributor Author) commented Jun 21, 2023

Thanks @canalun -- I am a bit behind on reviews. I plan to review this tomorrow. Sorry for the delay

absolutely no problem, thank you for reviewing it :)
please take your time and tell me if there was any misunderstanding between us😮

@alamb alamb (Contributor) left a comment

Thanks @canalun -- I think this approach is looking good.

The CI seems to be failing

I had some suggestions related to API design, and I think the new options need some docstrings (which I can help with) but I think this PR is quite close

src/tokenizer.rs Outdated
}

impl<'a> Tokenizer<'a> {
/// Create a new SQL tokenizer for the specified SQL statement
pub fn new(dialect: &'a dyn Dialect, query: &'a str) -> Self {
Self { dialect, query }
pub fn new(dialect: &'a dyn Dialect, query: &'a str, options: &'a TokenizerOptions) -> Self {

Contributor

Since this is one of the main interfaces to SqlParser, I think changing the signature will be a fairly large change. If we are going to do so I would prefer to combine the dialect and options into a single structure

pub struct TokenizerOptions<'a> {
    pub no_escape: bool,
    dialect: &'a dyn Dialect,
}
// ...

impl<'a> Tokenizer<'a> {
    pub fn new(options: TokenizerOptions<'a>, query: &'a str) -> Self {
        // ...
    }
}

That way the API gets more general and we won't have to change it again if we add new parameters

Contributor Author

@alamb

Since this is one of the main interfaces to SqlParser, I think changing the signature will be a fairly large change.

While I strongly agree with this idea, I think dialect and other options such as no_escape would be better kept separate. :)
There are two reasons for this.

  1. If we separate dialect from the other options (TokenizerOptions), you won't have to change the interface of Tokenizer even when you want to add new parameters, because all you have to do is add the new option to TokenizerOptions. So the signature of Tokenizer::new() itself won't change, I think.
  2. Parser has the interface below, and dialect and options are separated there.
    https://github.com/canalun/sqlparser-rs/blob/2e6502e40b9b777e0ee590e5aa172a3c1bc2a42d/src/parser.rs#L204
    This separation is also assumed in the logic of dialect_of, which is called against both Parser and Tokenizer.
    https://github.com/canalun/sqlparser-rs/blob/2e6502e40b9b777e0ee590e5aa172a3c1bc2a42d/src/dialect/mod.rs#L54
    So the two interfaces of Parser and Tokenizer should be aligned, and if point 1 holds, I think it would be better to give Tokenizer the same interface as Parser.

What do you think? :)

src/tokenizer.rs Outdated
@@ -1259,7 +1283,7 @@ mod tests {
fn tokenize_select_1() {
let sql = String::from("SELECT 1");
let dialect = GenericDialect {};
let mut tokenizer = Tokenizer::new(&dialect, &sql);
let mut tokenizer = Tokenizer::new(&dialect, &sql, &TokenizerOptions { no_escape: false });

Contributor

Could we make this look more like:

Suggested change
let mut tokenizer = Tokenizer::new(&dialect, &sql, &TokenizerOptions { no_escape: false });
let mut tokenizer = Tokenizer::new(TokenizerOptions::new(&dialect), &sql);

Contributor Author

Thanks!!!
I implemented and used TokenizerOptions::default() and TokenizerOptions::with_no_escape(bool). :)

29cb1e2
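
The builder shape mentioned here could be sketched as follows (hypothetical and simplified; the field and method names follow the comment above, not necessarily the crate's final API):

```rust
/// Hypothetical options struct following the builder names mentioned above.
#[derive(Debug, Default, Clone, Copy)]
pub struct TokenizerOptions {
    pub no_escape: bool,
}

impl TokenizerOptions {
    /// Consuming setter so calls can be chained off `default()`.
    pub fn with_no_escape(mut self, no_escape: bool) -> Self {
        self.no_escape = no_escape;
        self
    }
}

fn main() {
    // Defaults keep the current (escaping) behavior.
    assert!(!TokenizerOptions::default().no_escape);
    // Opting into no-escape mode via the builder.
    let opts = TokenizerOptions::default().with_no_escape(true);
    assert!(opts.no_escape);
}
```

A consuming builder like this lets new flags be added later without changing the Tokenizer constructor's signature, which was the motivation discussed above.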

src/tokenizer.rs Outdated
fn tokenize_quoted_identifier_with_no_escape() {
let sql = r#" "a "" b" "a """ "c """"" "#;
let dialect = GenericDialect {};
let mut tokenizer = Tokenizer::new(&dialect, sql, &TokenizerOptions { no_escape: true });

Contributor

Suggested change
let mut tokenizer = Tokenizer::new(&dialect, sql, &TokenizerOptions { no_escape: true });
let opts = TokenizerOptions::new(&dialect).with_no_escape(true);
let mut tokenizer = Tokenizer::new(opts, sql);

Contributor Author

29cb1e2

I think this commit mentioned above captures this point. ;)

@alamb alamb (Contributor) commented Jun 21, 2023

absolutely no problem, thank you for reviewing it :)
please take your time and tell me if there was any misunderstanding between us😮

Not at all! I just feel bad that I spend so little time maintaining this crate

@canalun canalun (Contributor Author) commented Jun 24, 2023

cargo test, cargo fmt, and cargo clippy now pass locally😊

@canalun canalun (Contributor Author) commented Jul 1, 2023

Thank you for running CI; I see it failed!!
The command I ran was not correct, so I'm fixing it!

@canalun canalun (Contributor Author) commented Jul 1, 2023

fixed :)

@alamb alamb (Contributor) left a comment

@canalun thank you for bearing with me

I think the core escaping / unescaping logic and tests in this PR are great. I do have concerns about the API, but rather than going back and forth with you again and delaying this PR more, I made a PR with some proposed API changes that target this branch

canalun#1

Please let me know what you think of them

@canalun canalun (Contributor Author) commented Jul 19, 2023

@alamb
Thank you so much...!!! I think your modification is a really good idea.
I commented on and approved your PR :)

Simplify setting Tokenizer options, add docs and comments
@canalun canalun (Contributor Author) commented Jul 19, 2023

@alamb
merged your PR ;)

@alamb alamb (Contributor) left a comment

Thanks @canalun !

@alamb alamb merged commit f98a2f9 into apache:main Jul 19, 2023
@alamb alamb (Contributor) commented Jul 19, 2023

😅 -- thanks again for bearing with this @canalun -- I think we got to a good place in the end

@canalun canalun (Contributor Author) commented Jul 20, 2023

@alamb Thank you so much for the continuous support...!!! I hope this feature helps someone in the future :)

@alamb alamb (Contributor) commented Jul 20, 2023

I think it is a really nice feature and well aligned with using sqlparser for sql analysis / rewriting

serprex pushed a commit to serprex/sqlparser-rs that referenced this pull request Nov 6, 2023