feat: mysql no-escape mode #870
Conversation
src/test_utils.rs

```rust
            Box::new(BigQueryDialect {}),
            Box::new(SQLiteDialect {}),
        ],
        options: None,
    }
}

pub fn all_dialects_other_than_mysqlnoescape() -> TestedDialects {
```
The functions named `XX_other_than_mysqlnoescape`, including this one, are intended for tests concerning escaping.
I newly defined this kind of function because some test utils assume that all the dialects return the same output, and these utils are widely used now.
Thanks @canalun -- I'll try and find time to review this PR over the next few days
Cargo.toml

```diff
@@ -34,6 +34,7 @@ serde = { version = "1.0", features = ["derive"], optional = true }
 # https://github.com/rust-lang/cargo/issues/1596
 serde_json = { version = "1.0", optional = true }
 sqlparser_derive = { version = "0.1.1", path = "derive", optional = true }
+duplicate = "1.0.0"
```
in general we try and keep the dependencies of this crate as minimal as possible -- can you please remove this dependency?
Got it!
Please refer to this commit: 598d9e9
Thank you @canalun for the contribution and the well worded description on #861
After reviewing this PR, I feel like "dialect" isn't quite the right abstraction for this behavior, as it doesn't change the SQL that is supported; rather, it changes the internal representation of how the data is stored after parsing.
Thus, perhaps adding it as a flag on ParserOptions might be better: https://docs.rs/sqlparser/0.33.0/sqlparser/parser/struct.ParserOptions.html
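A sketch of what such a flag might look like (the struct shape and field names here are assumptions for illustration, not the crate's actual definition):

```rust
// Hypothetical sketch of ParserOptions gaining a `no_escape` flag.
// Field names are illustrative assumptions, not sqlparser's actual API.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ParserOptions {
    /// When true, string literals keep their characters exactly as written
    /// (backslash sequences are not interpreted).
    pub no_escape: bool,
}

fn main() {
    let default_opts = ParserOptions::default();
    let no_escape_opts = ParserOptions { no_escape: true };
    assert!(!default_opts.no_escape);
    assert!(no_escape_opts.no_escape);
    println!("no_escape = {}", no_escape_opts.no_escape);
}
```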
In terms of encoding the value, I wonder if we could add another variant to Value
that could store the original, unescaped, characters. Perhaps something like the following
```rust
/// Primitive SQL values such as number and string
pub enum Value {
    // ...
    /// 'string value'
    SingleQuotedString(String),
    /// Value and its original (escaped) form
    WithOriginal {
        /// The parsed / unescaped value (so "foo\nbar" would have a newline in it)
        value: Box<Value>,
        /// Characters in the original string (so "foo\nbar" would still have the slash)
        original: String,
    },
    // ...
}
```
This would then let you isolate the parser changes to the parsing of values, as well as make it clear how to format each variant. This might make the implementation and testing more straightforward
What do you think ?
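To illustrate the idea, here is a small self-contained sketch (using a local stand-in enum, not the crate's actual `Value`) showing how a formatter could emit the original characters unchanged for such a variant:

```rust
// Local stand-in for the proposed Value variant; invented for illustration.
enum Value {
    SingleQuotedString(String),
    WithOriginal { value: Box<Value>, original: String },
}

fn to_sql(v: &Value) -> String {
    match v {
        Value::SingleQuotedString(s) => format!("'{}'", s),
        // Formatting emits the original (still-escaped) characters unchanged.
        Value::WithOriginal { original, .. } => format!("'{}'", original),
    }
}

fn main() {
    let v = Value::WithOriginal {
        value: Box::new(Value::SingleQuotedString("foo\nbar".into())),
        original: r"foo\nbar".into(),
    };
    // The backslash survives round-tripping back to SQL text.
    assert_eq!(to_sql(&v), r"'foo\nbar'");
    println!("{}", to_sql(&v));
}
```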
Pull Request Test Coverage Report for Build 5603686662
💛 - Coveralls

@alamb Thank you for reviewing it!
I'm working on it...! Sorry for the late response.

Hi @alamb, sorry for the late update. I've started working on it and found a few things. Firstly, I want to share what I discovered about the details of implementing the current strategy.
Considering this consequence, I think the current strategy also has some costs.
If I may, I still think the current strategy is better than the previous one, because the previous one was a bit tricky. However, it's a difficult problem; both have some costly points. Thank you for reading the long message...!
Thanks @canalun -- I agree it is a tricky problem. Maybe @ankrgyl has some other thoughts. Basically my thinking is that if we are going to update the tokenizer in some backwards-compatible way, we should also consider adding original line/offset information as well.
What if we kept Token::SingleQuotedString and Token::DoubleQuotedString but made them something like
Or something? I realize that would be a (very) disruptive API change 🤔
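The snippet referred to above is not shown in the thread, but the shape being described might look roughly like this (the names and fields are guesses for illustration only):

```rust
// Hypothetical shape (invented for illustration) of string tokens that
// carry both the unescaped value and the original source characters.
#[derive(Debug, PartialEq)]
enum Token {
    SingleQuotedString { value: String, original: String },
    DoubleQuotedString { value: String, original: String },
}

fn main() {
    // For 'foo\nbar' in the source: `value` holds a real newline,
    // `original` keeps the backslash exactly as written.
    let t = Token::SingleQuotedString {
        value: "foo\nbar".to_string(),
        original: r"foo\nbar".to_string(),
    };
    if let Token::SingleQuotedString { value, original } = &t {
        assert_eq!(value, "foo\nbar");
        assert_eq!(original, r"foo\nbar");
    }
    println!("{:?}", t);
}
```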
👋 This may be a very dumb idea — but if no-escape is a "mode" that we pass to the tokenizer, do we need to represent the "escaped" version too? If not, why can't we use the existing structs/enums and just change how the tokenizer populates the string representations inside of them? Especially if it doesn't impact (downstream) how the parser works? In terms of making backwards incompatible changes, I do have a specific idea on how to do this that I just haven't had time to implement, which is to use features (so users can opt into backwards incompatibility) and then eventually deprecate the old modes.
So in this mode, a string like `"Foo"`, which today is parsed as

```rust
Token::DoubleQuotedString(String::from("Foo"))
```

would instead be parsed as this:

```rust
Token::DoubleQuotedString(String::from("\"Foo\""))
```

That sounds like a genius (rather than dumb) idea to me ... 🤔
❤️

Yes, that's exactly right. I think all that matters is that you can get back strings with the same semantics as mysql's no-escape mode. It's basically like a dialect for the tokenizer.
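A minimal sketch of that idea — a single string-reading routine whose output depends on a no-escape flag. This is illustrative code for the behavior being discussed, not the crate's actual tokenizer:

```rust
// Sketch (not sqlparser's implementation) of reading the body of a
// single-quoted SQL string in normal vs. no-escape mode.
fn read_single_quoted(body: &str, no_escape: bool) -> String {
    if no_escape {
        // No-escape mode: keep characters exactly as written, backslashes included.
        return body.to_string();
    }
    // Normal mode: interpret backslash escapes (only \n and \\ shown here).
    let mut out = String::new();
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('n') => out.push('\n'),
                Some('\\') => out.push('\\'),
                Some(other) => out.push(other),
                None => out.push('\\'),
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    // Normal mode turns the two characters \n into a real newline.
    assert_eq!(read_single_quoted(r"foo\nbar", false), "foo\nbar");
    // No-escape mode preserves the backslash.
    assert_eq!(read_single_quoted(r"foo\nbar", true), r"foo\nbar");
    println!("ok");
}
```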
@alamb @ankrgyl I understood queries are parsed and serialized like this.
assumptions

I will work on it!
I've added all the needed changes -- could you please take a look? :)
Thanks @canalun -- I am a bit behind on reviews. I plan to review this tomorrow. Sorry for the delay.
Absolutely no problem, thank you for reviewing it :)
Thanks @canalun -- I think this approach is looking good.
The CI seems to be failing.
I had some suggestions related to API design, and I think the new options need some docstrings (which I can help with), but I think this PR is quite close.
src/tokenizer.rs

```diff
 }

 impl<'a> Tokenizer<'a> {
     /// Create a new SQL tokenizer for the specified SQL statement
-    pub fn new(dialect: &'a dyn Dialect, query: &'a str) -> Self {
-        Self { dialect, query }
+    pub fn new(dialect: &'a dyn Dialect, query: &'a str, options: &'a TokenizerOptions) -> Self {
```
Since this is one of the main interfaces to SqlParser, I think changing the signature will be a fairly large change. If we are going to do so I would prefer to combine the dialect and options into a single structure
```rust
pub struct TokenizerOptions<'a> {
    pub no_escape: bool,
    dialect: &'a dyn Dialect,
}

impl<'a> Tokenizer<'a> {
    pub fn new(options: TokenizerOptions<'a>, query: &'a str) -> Self {
        // ...
    }
}
```

That way the API gets more general and we won't have to change it again if we add new parameters.
> Since this is one of the main interfaces to SqlParser, I think changing the signature will be a fairly large change.

While I strongly agree with this idea, I think `dialect` and other options such as `no_escape` would be better kept separate. :)
There are two reasons for this.

1. If we separate `dialect` and the other options (`TokenizerOptions`), you won't have to change the interface of `Tokenizer` even when you want to add new parameters, because all you have to do is add the new option to `TokenizerOptions`. So the interface of `Tokenizer::new()` itself will not change, I think.
2. `Parser` has the interface below, and `dialect` and `options` are already separated there:
   https://github.com/canalun/sqlparser-rs/blob/2e6502e40b9b777e0ee590e5aa172a3c1bc2a42d/src/parser.rs#L204
   This separation is also assumed in the logic of `dialect_of`, which is called against both `Parser` and `Tokenizer`:
   https://github.com/canalun/sqlparser-rs/blob/2e6502e40b9b777e0ee590e5aa172a3c1bc2a42d/src/dialect/mod.rs#L54
   So we have to align the two interfaces of `Parser` and `Tokenizer`, and if point 1 is true, I think it would be better to give `Tokenizer` the same interface as `Parser`.

How about it?? :)
src/tokenizer.rs

```diff
@@ -1259,7 +1283,7 @@ mod tests {
     fn tokenize_select_1() {
         let sql = String::from("SELECT 1");
         let dialect = GenericDialect {};
-        let mut tokenizer = Tokenizer::new(&dialect, &sql);
+        let mut tokenizer = Tokenizer::new(&dialect, &sql, &TokenizerOptions { no_escape: false });
```
Could we make this look more like:
```diff
-let mut tokenizer = Tokenizer::new(&dialect, &sql, &TokenizerOptions { no_escape: false });
+let mut tokenizer = Tokenizer::new(TokenizerOptions::new(&dialect), &sql);
```
Thanks!!!
I implemented and used `TokenizerOptions::default()` and `TokenizerOptions::with_no_escape(bool)`. :)
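Those helpers might look something like the following sketch (an assumed shape for illustration; see the PR itself for the real definitions):

```rust
// Sketch of builder-style tokenizer options; illustrative, not the crate's code.
#[derive(Debug, Default, Clone)]
pub struct TokenizerOptions {
    pub no_escape: bool,
}

impl TokenizerOptions {
    /// Consume self and return options with the no-escape flag set.
    pub fn with_no_escape(mut self, v: bool) -> Self {
        self.no_escape = v;
        self
    }
}

fn main() {
    let opts = TokenizerOptions::default().with_no_escape(true);
    assert!(opts.no_escape);
    assert!(!TokenizerOptions::default().no_escape);
    println!("no_escape = {}", opts.no_escape);
}
```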
src/tokenizer.rs

```rust
fn tokenize_quoted_identifier_with_no_escape() {
    let sql = r#" "a "" b" "a """ "c """"" "#;
    let dialect = GenericDialect {};
    let mut tokenizer = Tokenizer::new(&dialect, sql, &TokenizerOptions { no_escape: true });
```
```diff
-let mut tokenizer = Tokenizer::new(&dialect, sql, &TokenizerOptions { no_escape: true });
+let opts = TokenizerOptions::new(&dialect).with_no_escape(true);
+let mut tokenizer = Tokenizer::new(opts, sql);
```
I think this commit mentioned above captures this point. ;)
Not at all! I just feel bad that I spend so little time maintaining this crate.
Thank you for running the CI -- it found a failure!!
fixed :)
@canalun thank you for bearing with me
I think the core escaping / unescaping logic and tests in this PR are great. I do have concerns about the API, but rather than going back and forth with you again and delaying this PR more, I made a PR with some proposed API changes that target this branch
Please let me know what you think of them
@alamb
Simplify setting Tokenizer options, add docs and comments
@alamb
Thanks @canalun !
😅 -- thanks again for bearing with this @canalun -- I think we got to a good place in the end.
@alamb Thank you so much for the continuous support...!!! Hope this feature helps someone in the future :)
I think it is a really nice feature, well aligned with using sqlparser for SQL analysis / rewriting.
Co-authored-by: Andrew Lamb <[email protected]>
for #861