-
Notifications
You must be signed in to change notification settings - Fork 612
What's the best way to merge my fork? #43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This will allow re-using it for SQLStatement in a later commit. (Also split the new struct into a separate file, other query-related types will be moved here in a follow-up commit.)
The SQLAssignment *struct* is used directly in ASTNode::SQLUpdate (will change to SQLStatement::SQLUpdate shortly).
Continuing from https://github.com/andygrove/sqlparser-rs/pull/33#issuecomment-453060427 This stops the parser from accepting (and the AST from being able to represent) SQL look-alike code that makes no sense, e.g. SELECT ... FROM (CREATE TABLE ...) foo SELECT ... FROM (1+CAST(...)) foo Generally this makes the AST less "partially typed": meaning certain parts are strongly typed (e.g. SELECT can only contain projections, relations, etc.), while everything that didn't get its own type is dumped into ASTNode, effectively untyped. After a few more fixes (yet to be implemented), `ASTNode` could become an `SQLExpression`. The Pratt-style expression parser (returning an SQLExpression) would be invoked from the top-down parser in places where a generic expression is expected (e.g. after SELECT <...>, WHERE <...>, etc.), while things like select's `projection` and `relation` could be more appropriately (narrowly) typed. Since the diff is quite large due to necessarily large number of mechanical changes, here's an overview: 1) Interface changes: - A new AST enum - `SQLStatement` - is split out of ASTNode: - The variants of the ASTNode enum, which _only_ make sense as a top level statement (INSERT, UPDATE, DELETE, CREATE, ALTER, COPY) are _moved_ to the new enum, with no other changes. - SQLSelect is _duplicated_: now available both as a variant in SQLStatement::SQLSelect (top-level SELECT) and ASTNode:: (subquery). - The main entry point (Parser::parse_sql) now expects an SQL statement as input, and returns an `SQLStatement`. 2) Parser changes: instead of detecting the top-level constructs deep down in the precedence parser (`parse_prefix`) we are able to do it just right after setting up the parser in the `parse_sql` entry point (SELECT, again, is kept in the expression parser to demonstrate how subqueries could be implemented). The rest of parser changes are mechanical ASTNode -> SQLStatement replacements resulting from the AST change. 3) Testing changes: for every test - depending on whether the input was a complete statement or an expresssion - I used an appropriate helper function: - `verified` (parses SQL, checks that it round-trips, and returns the AST) - was replaced by `verified_stmt` or `verified_expr`. - `parse_sql` (which returned AST without checking it round-tripped) was replaced by: - `parse_sql_expr` (same function, for expressions) - `one_statement_parses_to` (formerly `parses_to`), extended to deal with statements that are not expected to round-trip. The weird name is to reduce further churn when implementing multi-statement parsing. - `verified_stmt` (in 4 testcases that actually round-tripped)
…rom_projection` (The primary motivation was that it makes the tests more resilient to the upcoming changes to the SQLSelectStatement to support `AS` aliases and `UNION`.) Also start using `&'static str` literals consistently instead of String::from for the `let sql` test strings.
Parser::parse_sql() can now parse a semicolon-separated list of statements, returning them in a Vec<SQLStatement>. To support this we: - Move handling of inter-statement tokens from the end of individual statement parsers (`parse_select` and `parse_delete`; this was not implemented for other top-level statements) to the common statement-list parsing code (`parse_sql`); - Change the "Unexpected token at end of ..." error, which didn't have tests and prevented us from parsing successive statements -> "Expected end of statement" (i.e. a delimiter - currently only ";" - or the EOF); - Add PartialEq on ParserError to be able to assert_eq!() that parsing statements that do not terminate properly returns an expected error.
Before this commit there was a single `parse_expr(u8)` method, which was called both 1) from within the expression parser (to parse subexpression consisting of operators with higher priority than the current one), and 2) from the top-down parser both a) to parse true expressions (such as an item of the SELECT list or the condition after WHERE or after ON), and b) to parse sequences which are not exactly "expressions". This starts cleaning this up by renaming the `parse_expr(u8)` method to `parse_subexpr()` and using it only for (1) - i.e. usually providing a non-zero precedence parameter. The non-intuitively called `parse()` method is renamed to `parse_expr()`, which became available and is used for (2a). While reviewing the existing callers of `parse_expr`, four points to follow up on were identified (marked "TBD (#)" in the commit): 1) Do not lose parens (e.g. `(1+2)*3`) when roundtripping String->AST->String by using SQLNested. 2) Incorrect precedence of the NOT unary 3) `parse_table_factor` accepts any expression where a SELECT subquery is expected. 4) parse_delete uses parse_expr() to retrieve a table name These are dealt with in the commits to follow.
Before this change an expression like `(a+b)-(c+d)` was parsed correctly (as a Minus node with two Plus nodes as children), but when serializing back to an SQL string, it came up as a+b-c+d, since we don't store parens in AST and don't attempt to insert them when necessary during serialization. The latter would be hard, and we already had an SQLNested enum variant, so I changed the code to wrap the AST node for the parenthesized expression in it.
I checked the docs of a few of the most popular RDBMSes, and it seems there's consensus that the precedence of `NOT` is higher than `AND`, but lower than `IS NULL`. Postgresql[1], Oracle[2] and MySQL[3] docs say that explicitly. T-SQL docs[4] do mention it's higher than `AND`, and while they don't explicitly mention IS NULL, this snippet: select * from (select 1 as a)x where (not x.a) is null ...is a parsing error, while the following works like IS NOT NULL: select * from (select 1 as a)x where not x.a is null sqlite doesn't seem to mention `NOT` precedence, but I assume it works similarly. [1] https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-OPERATORS [2] https://docs.oracle.com/cd/B19306_01/server.102/b14200/conditions001.htm#i1034834 [3] https://dev.mysql.com/doc/refman/8.0/en/operator-precedence.html [4] https://docs.microsoft.com/en-us/sql/t-sql/language-elements/operator-precedence-transact-sql?view=sql-server-2017
This makes the parser more strict when handling SELECTs nested somewhere in the main statement: 1) instead of accepting SELECT anywhere in the expression where an operand was expected, we only accept it inside parens. (I've added a test for the currently supported syntax, <scalar subquery> in ANSI SQL terms) 2) instead of accepting any expression in the derived table context: `FROM ( ... )` - we only look for a SELECT subquery there. Due to #1, I had to swith the 'ansi' test from invoking the expression parser to the statement parser.
(To store "A name of a table, view, custom type, etc., possibly multi-part, i.e. db.schema.obj".) Before this change - some places used `String` for this (these are updated in this commit) - while others (notably SQLStatement::SQLDelete::relation, which is the reason for this series of commits) relied on ASTNode::SQLCompoundIdentifier (which is also backed by a Vec<SQLIdent>, but, as a variant of ASTNode enum, is not convenient to use when you know you need that specific variant).
...to match the name of the recently introduced `SQLObjectName` struct and to avoid any reservations about using it with multi-part names of objects other than tables (as in the `type_name` case).
...instead make `parse_compound_identifier()` return the underlying Vec<> directly, and rename it to `parse_list_of_ids()`, since it's used both for parsing compound identifiers and lists of identifiers.
Store (and parse) `table_name: SQLObjectName` instead of `relation: Option<Box<ASTNode>>`, which can be an arbitrary expression. Also remove the `Option<>`: the table name is not optional in any dialects I'm familiar with. While the FROM keyword itself _is_ optional in some dialects, there are more things to implement for those dialects, see https://stackoverflow.com/a/4484271/1026
Also move more things to use SQLIdent instead of String in the hope of making it a newtype eventually. Add tests that quoted identifiers round-trip parsing/serialization correctly.
ASTNode can now be renamed SQLExpression, as it represents a node in the "expression" part of the AST -- other nodes have their own types.
It was probably copied from somewhere else when most types were variants in ASTNode, and needed Box<> to prevent recursion in the ASTNode definition.
This used to be needed when it was a variant in the ASTNode enum itself.
Instead change ASTNode::SQLSubquery to be Box<SQLSelect>
(the tests affected by "unboxing" in the previous commits.)
This happens all the time when I forget to check that the keyword I wanted to use is actually listed in keywords.rs, this should help with debugging.
...by reusing `parse_column_names` instead of extracting identifiers out of the `parse_expr_list`s result.
Don't need the duplicate `columns.push()` + we advance the tokenizer, so no need to peek first.
By not swallowing the Err from parse_data_type(). Also switch to `match` to enable parsing table-level constraints in this loop later.
Widely used in MS SQL and specified in ANSI.
Some unsupported features are noted as TODOs.
Pull Request Test Coverage Report for Build 138
💛 - Coveralls |
Wow, I just ran into this crate and found this PR. No intention to spam but really thanks for your hard work. This fork works like a charm |
@zimond Thanks for commenting on this .. I had not seen this PR until now. @nickolay Apologies for missing this, and generally being unresponsive. I've been putting a lot of energy into Apache Arrow and neglecting the work here. Maybe I should just make you a collaborator on this repo since you are driving it more than me now. The only requirements I have in the short term are to not break compatibility without warning for datafusion and locustdb, the two dependent crates. The description of your changes sounds good. I'll go ahead and test against datafusion from your branch and see if there are any issues. |
Cargo.toml
Outdated
@@ -1,7 +1,7 @@ | |||
[package] | |||
name = "sqlparser" | |||
description = "Extensible SQL Lexer and Parser with support for ANSI SQL:2011" | |||
version = "0.2.2-alpha.0" | |||
version = "0.2.3-alpha.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please bump the version to 0.3.0
since these are breaking changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks good. Could you create a PR for this one?
I was trying to approve individual commits within the PR ... that doesn't work I guess |
I think the best bet is going to be to rebase this, bump the version to 0.3.0 and I will merge and then create a PR to upgrade Arrow to the new version. We should get @cswinter to take a look at this too. |
@andygrove, thanks for getting back to me; don't worry about the delays! I merged andygrove/master (after reverting the revert from #42...) back into my fork for now. As indicated, I avoid rebasing as it will make it harder for those who pulled from my fork. Bumped the version to 0.3.0, as requested. You previously noted in #42 that you found a "breaking change" in this, did you mean the (intentional) AST changes or is there a problem I missed? |
🎉 |
In my fork I have a number of new features and refactorings:
SQLStatement
enum fromASTNode
(up to 2dec65f) and support parsing multiple statements separated with a semicolon (707c58a)ast.to_string()
not produce invalid SQL (e0ceacd)[foo bar]
), national string literals (N'...'
), comments.CHAR
synonym and supportNUMERIC
not followed by(p,s)
.Initially I intended to submit this as one PR per item, and deal with the conflicts arising from reviews by rebasing. Now that the pile grew quite large and others indicated they pulled from my fork, I'm not sure rebasing is a good idea. I can still submit multiple PRs to make reviewing easier.
What do you think we should do?