feat: mysql no-escape mode #870


Merged: 21 commits merged into apache:main on Jul 19, 2023

Conversation

@canalun canalun (Contributor) commented May 8, 2023

for #861

@canalun canalun marked this pull request as ready for review May 9, 2023 03:37
Box::new(BigQueryDialect {}),
Box::new(SQLiteDialect {}),
],
options: None,
}
}

pub fn all_dialects_other_than_mysqlnoescape() -> TestedDialects {

@canalun canalun (Contributor Author) commented May 9, 2023

The "XX_other_than_mysqlnoescape" functions, including this one, are meant for tests concerning escaping.
I defined this kind of function because some widely used test utilities assume that all dialects produce the same output.

@alamb alamb (Contributor) left a comment

Thanks @canalun -- I'll try and find time to review this PR over the next few days

Cargo.toml Outdated
@@ -34,6 +34,7 @@ serde = { version = "1.0", features = ["derive"], optional = true }
# https://github.com/rust-lang/cargo/issues/1596
serde_json = { version = "1.0", optional = true }
sqlparser_derive = { version = "0.1.1", path = "derive", optional = true }
duplicate = "1.0.0"

Contributor

in general we try and keep the dependencies of this crate as minimal as possible -- can you please remove this dependency?

Contributor Author

Got it!
Please refer to this commit: 598d9e9

@alamb alamb (Contributor) left a comment

Thank you @canalun for the contribution and the well worded description on #861

After reviewing this PR, I feel like "dialect" isn't quite the right abstraction for this behavior as it doesn't change the SQL that is supported, but rather it changes the internal representation of how the data is stored after parsing

Thus, perhaps adding it as a flag on ParserOptions might be better: https://docs.rs/sqlparser/0.33.0/sqlparser/parser/struct.ParserOptions.html

In terms of encoding the value, I wonder if we could add another variant to Value that could store the original, unescaped, characters. Perhaps something like the following

/// Primitive SQL values such as number and string
pub enum Value {
    // ...
    /// 'string value'
    SingleQuotedString(String),
    /// Value and its original (escaped) form
    WithOriginal {
        /// The parsed / unescaped value (so "foo\nbar" would have a newline in it)
        value: Box<Value>,
        /// Characters in the original string (so "foo\nbar" would still have the slash)
        original: String,
    },
    // ...
}

This would then let you isolate the parser changes to the parsing of values, as well as make it clear how to format each variant. This might make the implementation and testing more straightforward
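
As an illustration of how formatting might use such a variant, here is a minimal self-contained sketch (the names mirror the proposal above; it is not the crate's actual implementation):

```rust
use std::fmt;

// Hypothetical, simplified Value enum mirroring the proposal above.
pub enum Value {
    /// 'string value'
    SingleQuotedString(String),
    /// Value and its original (escaped) form
    WithOriginal { value: Box<Value>, original: String },
}

impl fmt::Display for Value {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Default path: serialize the parsed value, re-escaping quotes.
            Value::SingleQuotedString(s) => write!(f, "'{}'", s.replace('\'', "''")),
            // When the original text was kept, emit it verbatim.
            Value::WithOriginal { original, .. } => write!(f, "'{}'", original),
        }
    }
}

fn main() {
    // Parsed form: the escape was resolved to a real newline.
    let parsed = Value::SingleQuotedString("foo\nbar".to_string());
    // With the original kept, the backslash survives serialization.
    let kept = Value::WithOriginal {
        value: Box::new(Value::SingleQuotedString("foo\nbar".to_string())),
        original: "foo\\nbar".to_string(),
    };
    assert_eq!(parsed.to_string(), "'foo\nbar'");
    assert_eq!(kept.to_string(), "'foo\\nbar'");
}
```

This keeps each variant's Display unambiguous, since the variant itself says whether the stored string is escaped or not.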

What do you think ?

@coveralls coveralls commented May 17, 2023

Pull Request Test Coverage Report for Build 5603686662

  • 226 of 233 (97.0%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.06%) to 87.131%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
src/tokenizer.rs | 73 | 74 | 98.65%
tests/sqlparser_common.rs | 12 | 13 | 92.31%
tests/sqlparser_mysql.rs | 107 | 109 | 98.17%
src/ast/value.rs | 19 | 22 | 86.36%
Totals Coverage Status
Change from base Build 5592480441: 0.06%
Covered Lines: 15471
Relevant Lines: 17756

💛 - Coveralls

@canalun canalun (Contributor Author) commented May 26, 2023

@alamb Thank you for reviewing it!
I agree with your idea👶 Handling this situation via dialects is a bit tricky, but the alternative is straightforward.
I will work on it, but it may take some time...! (thank you for your patience!)

@alamb alamb (Contributor) commented May 26, 2023

No worries ! Thank you @canalun

FYI I don't plan the next sqlparser release until June 2023 so there is no time pressure from my end #887

@canalun canalun (Contributor Author) commented Jun 11, 2023

I'm working on it...! Sorry for the late response.
I'll post an update tomorrow!

@canalun canalun mentioned this pull request Jun 13, 2023
@canalun canalun (Contributor Author) commented Jun 13, 2023

Hi @alamb, sorry for the late update.

I've started working on it and found a few things,
and I'd like to let you know these points before implementing it fully, just in case...!
(btw, it would be useful to refer to PR #894, which is what I prototyped these days👶, when following the points)

First, I want to share what I discovered about the implementation details of the current strategy.

  • When parsing SQL queries, the point where the escape information (e.g. backslashes) vanishes is inside next_token(). Specifically, some functions such as tokenize_quoted_string() drop details like backslashes.
  • So we have to define and use new functions that preserve the escape information, instead of those escape-dropping functions.
  • In addition, the tokenizer will start to receive options, and a new enum of Value and Token will be defined (as you said😊).

Given this, I think the current strategy also has some costs.

  1. With the no-escape option, the AST will not carry each token's type information. This is because, for example, both Token::SingleQuotedString and Token::DoubleQuotedString would become the same Token::OriginalString under no-escape mode.
    In reality, Token::OriginalString would look like Token::OriginalString("\"example\"") or Token::OriginalString("'example'"), so the type information is never truly lost, but I think it would become a bit harder to analyze the AST programmatically.
  2. Knowledge about each token's prefix would spill over into next_token().
    For example, please see this line of the prototype.
  3. As you can see from point 2, the logic of next_token() would get messy...! I don't have a good idea here. TT

If I may, I still think the current strategy is better than the previous one, because the previous one was a bit tricky.
The fact that the same Value variant could hold strings in different escape modes would be really confusing,
especially when it comes to implementing fmt::Display for each Value variant. (https://github.com/sqlparser-rs/sqlparser-rs/pull/870/files#diff-9bcd0911924444fc7dd51971f9114291f2e6b619da2ce64ad48e91d6dd1abf8eR187)
This difficulty would affect development almost permanently.

However, it is a difficult problem; both approaches have costly points.
I'd like to hear your opinion🐤

Thank you for reading the long message...!

@alamb alamb (Contributor) commented Jun 16, 2023

Thank you for reading the long message...!

Thanks @canalun -- I agree it is a tricky problem. Maybe @ankrgyl has some other thoughts.

Basically my thinking is that if we are going to update the tokenizer in some backwards compatible way, we should also consider adding original line/offset information as well.

both Token::SingleQuotedString and Token::DoubleQuotedString will become the same Token::OriginalString under the no-escape mode.

What if we kept Token::SingleQuotedString and Token::DoubleQuotedString but made them something like

SingleQuotedString(Token)

Or something ?

I realize that would be a (very) disruptive API change 🤔

@ankrgyl ankrgyl (Contributor) commented Jun 16, 2023

👋 This may be a very dumb idea — but if no-escape is a "mode" that we pass to the tokenizer, do we need to represent the "escaped" version too? If not, why can't we use the existing structs/enums and just change how the tokenizer populates the string representations inside of them? Especially if it doesn't impact (downstream) how the parser works?

In terms of making backwards incompatible changes, I do have a specific idea on how to do this that I just haven't had time to implement, which is to use features (so users can opt into backwards incompatibility) and then eventually deprecate the old modes.

@alamb alamb (Contributor) commented Jun 16, 2023

👋 This may be a very dumb idea — but if no-escape is a "mode" that we pass to the tokenizer, do we need to represent the "escaped" version too? If not, why can't we use the existing structs/enums and just change how the tokenizer populates the string representations inside of them? Especially if it doesn't impact (downstream) how the parser works?

So in this mode, a string like

"Foo"

Which today is parsed as

Token::DoubleQuotedString(String::from("Foo"))

Would be parsed as this (the " would still be present)?:

Token::DoubleQuotedString(String::from("\"Foo\""))

That sounds like a genius (rather than dumb) idea to me ... 🤔

In terms of making backwards incompatible changes, I do have a specific idea on how to do this that I just haven't had time to implement, which is to use features (so users can opt into backwards incompatibility) and then eventually deprecate the old modes.

❤️

@ankrgyl ankrgyl (Contributor) commented Jun 16, 2023

Yes that's exactly right. I think all that matters is that you can get back strings with the same semantics as mysql's no-escape mode. It's basically like a dialect for the tokenizer.

@canalun canalun (Contributor Author) commented Jun 18, 2023

@alamb @ankrgyl
Thank you all for giving the idea! I agree with it :)

I understand that queries would be parsed and serialized like this.

phrase of original query | mode | node of AST | serialized
"A""B""A" | no-escape | DoubleQuotedString(String::from("A\"\"B\"\"A")) | "A""B""A"
"A""B""A" | default | DoubleQuotedString(String::from("A\"B\"A")) | "A""B""A"
"A\"B\"A" | no-escape | DoubleQuotedString(String::from("A\\\"B\\\"A")) | "A\"B\"A"
"A\"B\"A" | default | DoubleQuotedString(String::from("A\"B\"A")) | "A""B""A"

Assumptions:

  1. In no-escape mode, each AST node holds the exact original string.
  2. In default mode, we don't have to care whether a " comes from \" or "" in the original query (this is the current behavior of this library, I think).
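
The behavior summarized above can be illustrated with a toy reader for double-quoted strings (a self-contained sketch, not the crate's code; no_escape mirrors the proposed option, and "" escapes a quote as in the table):

```rust
// Toy reader for a double-quoted string where `""` escapes a quote.
// In no-escape mode the body is kept verbatim; otherwise `""` collapses to `"`.
fn read_double_quoted(src: &str, no_escape: bool) -> Option<String> {
    let mut chars = src.chars().peekable();
    if chars.next()? != '"' {
        return None; // not a double-quoted string
    }
    let mut out = String::new();
    while let Some(c) = chars.next() {
        if c == '"' {
            if chars.peek() == Some(&'"') {
                chars.next();
                // Keep both quotes in no-escape mode, only one otherwise.
                out.push('"');
                if no_escape {
                    out.push('"');
                }
            } else {
                return Some(out); // closing quote
            }
        } else {
            out.push(c);
        }
    }
    None // unterminated string
}

fn main() {
    let sql = r#""A""B""A""#;
    // no-escape mode keeps the exact original body.
    assert_eq!(read_double_quoted(sql, true).as_deref(), Some(r#"A""B""A"#));
    // default mode resolves the `""` escapes.
    assert_eq!(read_double_quoted(sql, false).as_deref(), Some(r#"A"B"A"#));
}
```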

@canalun canalun (Contributor Author) commented Jun 18, 2023

I will work on it!
Fortunately, only a few modifications to the present PR are needed :)

@canalun canalun (Contributor Author) commented Jun 18, 2023

I've added all the needed changes; could you please take a look? :)

@alamb alamb (Contributor) commented Jun 20, 2023

Thanks @canalun -- I am a bit behind on reviews. I plan to review this tomorrow. Sorry for the delay

@canalun canalun (Contributor Author) commented Jun 21, 2023

Thanks @canalun -- I am a bit behind on reviews. I plan to review this tomorrow. Sorry for the delay

absolutely no problem, thank you for reviewing it :)
please take your time and tell me if there was any misunderstanding between us😮

@alamb alamb (Contributor) left a comment

Thanks @canalun -- I think this approach is looking good.

The CI seems to be failing

I had some suggestions related to API design, and I think the new options need some docstrings (which I can help with) but I think this PR is quite close

src/tokenizer.rs Outdated
}

impl<'a> Tokenizer<'a> {
/// Create a new SQL tokenizer for the specified SQL statement
pub fn new(dialect: &'a dyn Dialect, query: &'a str) -> Self {
Self { dialect, query }
pub fn new(dialect: &'a dyn Dialect, query: &'a str, options: &'a TokenizerOptions) -> Self {

Contributor

Since this is one of the main interfaces to SqlParser, I think changing the signature will be a fairly large change. If we are going to do so I would prefer to combine the dialect and options into a single structure

pub struct TokenizerOptions<'a> {
    pub no_escape: bool,
    dialect: &'a dyn Dialect,
}
// ...

impl<'a> Tokenizer<'a> {
    pub fn new(options: TokenizerOptions<'a>, query: &'a str) -> Self {
        // ...
    }
}

That way the API gets more general and we won't have to change it again if we add new parameters

Contributor Author

@alamb

Since this is one of the main interfaces to SqlParser, I think changing the signature will be a fairly large change.

While I strongly agree with this idea, I think dialect and other options such as no_escape would be better kept separate. :)
There are two reasons for this.

  1. If we separate dialect from the other options (TokenizerOptions), you won't have to change the interface of Tokenizer even when you want to add new parameters, because all you have to do is add the new option to TokenizerOptions. So the signature of Tokenizer::new() itself won't change, I think.
  2. Parser has the interface below, and dialect and options are separated there.
    https://github.com/canalun/sqlparser-rs/blob/2e6502e40b9b777e0ee590e5aa172a3c1bc2a42d/src/parser.rs#L204
    This separation is also assumed in the logic of dialect_of, which is called against both Parser and Tokenizer.
    https://github.com/canalun/sqlparser-rs/blob/2e6502e40b9b777e0ee590e5aa172a3c1bc2a42d/src/dialect/mod.rs#L54
    So the two interfaces of Parser and Tokenizer should be aligned, and if point 1 holds, I think it would be better to give Tokenizer the same interface as Parser.

What do you think? :)

src/tokenizer.rs Outdated
@@ -1259,7 +1283,7 @@ mod tests {
fn tokenize_select_1() {
let sql = String::from("SELECT 1");
let dialect = GenericDialect {};
let mut tokenizer = Tokenizer::new(&dialect, &sql);
let mut tokenizer = Tokenizer::new(&dialect, &sql, &TokenizerOptions { no_escape: false });

Contributor

Could we make this look more like:

Suggested change
let mut tokenizer = Tokenizer::new(&dialect, &sql, &TokenizerOptions { no_escape: false });
let mut tokenizer = Tokenizer::new(TokenizerOptions::new(&dialect), &sql);

Contributor Author

Thanks!!!
I implemented and used TokenizerOptions::default() and TokenizerOptions::with_no_escape(bool). :)

29cb1e2
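
The builder shape mentioned here could be sketched as follows (hypothetical and simplified; the field and method names follow the comment above, not necessarily the crate's final API):

```rust
/// Hypothetical options struct following the builder names mentioned above.
#[derive(Debug, Default, Clone, Copy)]
pub struct TokenizerOptions {
    pub no_escape: bool,
}

impl TokenizerOptions {
    /// Consuming setter so calls can be chained off `default()`.
    pub fn with_no_escape(mut self, no_escape: bool) -> Self {
        self.no_escape = no_escape;
        self
    }
}

fn main() {
    // Defaults keep the current (escaping) behavior.
    assert!(!TokenizerOptions::default().no_escape);
    // Opting into no-escape mode via the builder.
    let opts = TokenizerOptions::default().with_no_escape(true);
    assert!(opts.no_escape);
}
```

A consuming builder like this lets new flags be added later without changing the Tokenizer constructor's signature, which was the motivation discussed above.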

src/tokenizer.rs Outdated
fn tokenize_quoted_identifier_with_no_escape() {
let sql = r#" "a "" b" "a """ "c """"" "#;
let dialect = GenericDialect {};
let mut tokenizer = Tokenizer::new(&dialect, sql, &TokenizerOptions { no_escape: true });

Contributor

Suggested change
let mut tokenizer = Tokenizer::new(&dialect, sql, &TokenizerOptions { no_escape: true });
let opts = TokenizerOptions::new(&dialect).with_no_escape(true);
let mut tokenizer = Tokenizer::new(opts, sql);

Contributor Author

29cb1e2

I think this commit mentioned above captures this point. ;)

@alamb alamb (Contributor) commented Jun 21, 2023

absolutely no problem, thank you for reviewing it :)
please take your time and tell me if there was any misunderstanding between us😮

Not at all! I just feel bad that I spend so little time maintaining this crate

@canalun canalun (Contributor Author) commented Jun 24, 2023

cargo test, cargo fmt, and cargo clippy now pass locally😊

@canalun canalun (Contributor Author) commented Jul 1, 2023

Thank you for running CI; I see it failed!!
The command I ran was not correct, so I'm fixing it!

@canalun canalun (Contributor Author) commented Jul 1, 2023

fixed :)

@alamb alamb (Contributor) left a comment

@canalun thank you for bearing with me

I think the core escaping / unescaping logic and tests in this PR are great. I do have concerns about the API, but rather than going back and forth with you again and delaying this PR more, I made a PR with some proposed API changes that target this branch

canalun#1

Please let me know what you think of them

@canalun canalun (Contributor Author) commented Jul 19, 2023

@alamb
Thank you so much...!!! I think your modification is a really good idea.
I commented on and approved your PR :)

Simplify setting Tokenizer options, add docs and comments
@canalun canalun (Contributor Author) commented Jul 19, 2023

@alamb
merged your PR ;)

@alamb alamb (Contributor) left a comment

Thanks @canalun !

@alamb alamb merged commit f98a2f9 into apache:main Jul 19, 2023
@alamb alamb (Contributor) commented Jul 19, 2023

😅 -- thanks again for bearing with this @canalun -- I think we got to a good place in the end

@canalun canalun (Contributor Author) commented Jul 20, 2023

@alamb Thank you so much for the continuous support...!!! I hope this feature helps someone in the future :)

@alamb alamb (Contributor) commented Jul 20, 2023

I think it is a really nice feature and well aligned with using sqlparser for sql analysis / rewriting

serprex pushed a commit to serprex/sqlparser-rs that referenced this pull request Nov 6, 2023