What's the best way to merge my fork? #43

nickolay · 2019-03-08T14:33:46Z

In my fork I have a number of new features and refactorings:

Separate SQLStatement enum from ASTNode (up to 2dec65f) and support parsing multiple statements separated with a semicolon (707c58a)
Only use parse_expr() for expressions (b57c60a) and related fixes (up to 07790fe)
Assorted refactorings and changes to the AST (up to 346d1ff), notably:
- Store the way an identifier was quoted, as the easiest way to make ast.to_string() not produce invalid SQL (e0ceacd)
- Move the remaining ASTNode variant which doesn't make sense in the context of an expression (9967031)
Support CREATE VIEW (0c0cbca)
A CLI program (2e9da53)
Tokenizer fixes (up to f958e9d): dialect-specific identifier quoting (e.g. MS SQL's [foo bar]), national string literals (N'...'), comments.
Support for more SQL constructs (up to 0621f8d): WITH, aliases and qualified wildcards in SELECT, IN, BETWEEN, unary +/-, UNION/EXCEPT/INTERSECT, data types tweaks: the CHAR synonym and support NUMERIC not followed by (p,s).
CREATE MATERIALIZED VIEW from PR Support CREATE MATERIALIZED VIEW nickolay/sqlparser-rs#1

Initially I intended to submit this as one PR per item, and deal with the conflicts arising from reviews by rebasing. Now that the pile grew quite large and others indicated they pulled from my fork, I'm not sure rebasing is a good idea. I can still submit multiple PRs to make reviewing easier.

What do you think we should do?

This will allow re-using it for SQLStatement in a later commit. (Also split the new struct into a separate file, other query-related types will be moved here in a follow-up commit.)

The SQLAssignment *struct* is used directly in ASTNode::SQLUpdate (will change to SQLStatement::SQLUpdate shortly).

Continuing from https://github.com/andygrove/sqlparser-rs/pull/33#issuecomment-453060427 This stops the parser from accepting (and the AST from being able to represent) SQL look-alike code that makes no sense, e.g. SELECT ... FROM (CREATE TABLE ...) foo SELECT ... FROM (1+CAST(...)) foo Generally this makes the AST less "partially typed": meaning certain parts are strongly typed (e.g. SELECT can only contain projections, relations, etc.), while everything that didn't get its own type is dumped into ASTNode, effectively untyped. After a few more fixes (yet to be implemented), `ASTNode` could become an `SQLExpression`. The Pratt-style expression parser (returning an SQLExpression) would be invoked from the top-down parser in places where a generic expression is expected (e.g. after SELECT <...>, WHERE <...>, etc.), while things like select's `projection` and `relation` could be more appropriately (narrowly) typed. Since the diff is quite large due to necessarily large number of mechanical changes, here's an overview: 1) Interface changes: - A new AST enum - `SQLStatement` - is split out of ASTNode: - The variants of the ASTNode enum, which _only_ make sense as a top level statement (INSERT, UPDATE, DELETE, CREATE, ALTER, COPY) are _moved_ to the new enum, with no other changes. - SQLSelect is _duplicated_: now available both as a variant in SQLStatement::SQLSelect (top-level SELECT) and ASTNode:: (subquery). - The main entry point (Parser::parse_sql) now expects an SQL statement as input, and returns an `SQLStatement`. 2) Parser changes: instead of detecting the top-level constructs deep down in the precedence parser (`parse_prefix`) we are able to do it just right after setting up the parser in the `parse_sql` entry point (SELECT, again, is kept in the expression parser to demonstrate how subqueries could be implemented). The rest of parser changes are mechanical ASTNode -> SQLStatement replacements resulting from the AST change. 3) Testing changes: for every test - depending on whether the input was a complete statement or an expresssion - I used an appropriate helper function: - `verified` (parses SQL, checks that it round-trips, and returns the AST) - was replaced by `verified_stmt` or `verified_expr`. - `parse_sql` (which returned AST without checking it round-tripped) was replaced by: - `parse_sql_expr` (same function, for expressions) - `one_statement_parses_to` (formerly `parses_to`), extended to deal with statements that are not expected to round-trip. The weird name is to reduce further churn when implementing multi-statement parsing. - `verified_stmt` (in 4 testcases that actually round-tripped)

…rom_projection` (The primary motivation was that it makes the tests more resilient to the upcoming changes to the SQLSelectStatement to support `AS` aliases and `UNION`.) Also start using `&'static str` literals consistently instead of String::from for the `let sql` test strings.

Parser::parse_sql() can now parse a semicolon-separated list of statements, returning them in a Vec<SQLStatement>. To support this we: - Move handling of inter-statement tokens from the end of individual statement parsers (`parse_select` and `parse_delete`; this was not implemented for other top-level statements) to the common statement-list parsing code (`parse_sql`); - Change the "Unexpected token at end of ..." error, which didn't have tests and prevented us from parsing successive statements -> "Expected end of statement" (i.e. a delimiter - currently only ";" - or the EOF); - Add PartialEq on ParserError to be able to assert_eq!() that parsing statements that do not terminate properly returns an expected error.

Before this commit there was a single `parse_expr(u8)` method, which was called both 1) from within the expression parser (to parse subexpression consisting of operators with higher priority than the current one), and 2) from the top-down parser both a) to parse true expressions (such as an item of the SELECT list or the condition after WHERE or after ON), and b) to parse sequences which are not exactly "expressions". This starts cleaning this up by renaming the `parse_expr(u8)` method to `parse_subexpr()` and using it only for (1) - i.e. usually providing a non-zero precedence parameter. The non-intuitively called `parse()` method is renamed to `parse_expr()`, which became available and is used for (2a). While reviewing the existing callers of `parse_expr`, four points to follow up on were identified (marked "TBD (#)" in the commit): 1) Do not lose parens (e.g. `(1+2)*3`) when roundtripping String->AST->String by using SQLNested. 2) Incorrect precedence of the NOT unary 3) `parse_table_factor` accepts any expression where a SELECT subquery is expected. 4) parse_delete uses parse_expr() to retrieve a table name These are dealt with in the commits to follow.

Before this change an expression like `(a+b)-(c+d)` was parsed correctly (as a Minus node with two Plus nodes as children), but when serializing back to an SQL string, it came up as a+b-c+d, since we don't store parens in AST and don't attempt to insert them when necessary during serialization. The latter would be hard, and we already had an SQLNested enum variant, so I changed the code to wrap the AST node for the parenthesized expression in it.

I checked the docs of a few of the most popular RDBMSes, and it seems there's consensus that the precedence of `NOT` is higher than `AND`, but lower than `IS NULL`. Postgresql[1], Oracle[2] and MySQL[3] docs say that explicitly. T-SQL docs[4] do mention it's higher than `AND`, and while they don't explicitly mention IS NULL, this snippet: select * from (select 1 as a)x where (not x.a) is null ...is a parsing error, while the following works like IS NOT NULL: select * from (select 1 as a)x where not x.a is null sqlite doesn't seem to mention `NOT` precedence, but I assume it works similarly. [1] https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-OPERATORS [2] https://docs.oracle.com/cd/B19306_01/server.102/b14200/conditions001.htm#i1034834 [3] https://dev.mysql.com/doc/refman/8.0/en/operator-precedence.html [4] https://docs.microsoft.com/en-us/sql/t-sql/language-elements/operator-precedence-transact-sql?view=sql-server-2017

This makes the parser more strict when handling SELECTs nested somewhere in the main statement: 1) instead of accepting SELECT anywhere in the expression where an operand was expected, we only accept it inside parens. (I've added a test for the currently supported syntax, <scalar subquery> in ANSI SQL terms) 2) instead of accepting any expression in the derived table context: `FROM ( ... )` - we only look for a SELECT subquery there. Due to #1, I had to swith the 'ansi' test from invoking the expression parser to the statement parser.

(To store "A name of a table, view, custom type, etc., possibly multi-part, i.e. db.schema.obj".) Before this change - some places used `String` for this (these are updated in this commit) - while others (notably SQLStatement::SQLDelete::relation, which is the reason for this series of commits) relied on ASTNode::SQLCompoundIdentifier (which is also backed by a Vec<SQLIdent>, but, as a variant of ASTNode enum, is not convenient to use when you know you need that specific variant).

...to match the name of the recently introduced `SQLObjectName` struct and to avoid any reservations about using it with multi-part names of objects other than tables (as in the `type_name` case).

...instead make `parse_compound_identifier()` return the underlying Vec<> directly, and rename it to `parse_list_of_ids()`, since it's used both for parsing compound identifiers and lists of identifiers.

Store (and parse) `table_name: SQLObjectName` instead of `relation: Option<Box<ASTNode>>`, which can be an arbitrary expression. Also remove the `Option<>`: the table name is not optional in any dialects I'm familiar with. While the FROM keyword itself _is_ optional in some dialects, there are more things to implement for those dialects, see https://stackoverflow.com/a/4484271/1026

Also move more things to use SQLIdent instead of String in the hope of making it a newtype eventually. Add tests that quoted identifiers round-trip parsing/serialization correctly.

ASTNode can now be renamed SQLExpression, as it represents a node in the "expression" part of the AST -- other nodes have their own types.

It was probably copied from somewhere else when most types were variants in ASTNode, and needed Box<> to prevent recursion in the ASTNode definition.

This used to be needed when it was a variant in the ASTNode enum itself.

Instead change ASTNode::SQLSubquery to be Box<SQLSelect>

(the tests affected by "unboxing" in the previous commits.)

This happens all the time when I forget to check that the keyword I wanted to use is actually listed in keywords.rs, this should help with debugging.

...by reusing `parse_column_names` instead of extracting identifiers out of the `parse_expr_list`s result.

Don't need the duplicate `columns.push()` + we advance the tokenizer, so no need to peek first.

By not swallowing the Err from parse_data_type(). Also switch to `match` to enable parsing table-level constraints in this loop later.

The dialect information is from https://en.wikibooks.org/wiki/SQL_Dialects_Reference/Data_structure_definition/Delimited_identifiers

Widely used in MS SQL and specified in ANSI.

Some unsupported features are noted as TODOs.

https://jakewheat.github.io/sql-overview/sql-2011-foundation-grammar.html#exact-numeric-type

https://jakewheat.github.io/sql-overview/sql-2011-foundation-grammar.html#character-string-type

coveralls · 2019-03-08T14:49:56Z

Pull Request Test Coverage Report for Build 138

1371 of 1484 (92.39%) changed or added relevant lines in 11 files are covered.
23 unchanged lines in 4 files lost coverage.
Overall coverage increased (+5.4%) to 95.378%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
src/sqlast/table_key.rs	3	6	50.0%
src/sqlast/query.rs	156	161	96.89%
src/sqlast/sqltype.rs	1	6	16.67%
src/sqltokenizer.rs	171	183	93.44%
src/sqlast/mod.rs	78	91	85.71%
tests/sqlparser_generic.rs	465	499	93.19%
src/sqlparser.rs	423	464	91.16%

Files with Coverage Reduction	New Missed Lines	%
src/sqltokenizer.rs	1	89.83%
src/sqlast/mod.rs	3	77.1%
tests/sqlparser_generic.rs	9	91.79%
src/sqlparser.rs	10	79.79%

Totals
Change from base Build 137:	5.4%
Covered Lines:	227
Relevant Lines:	238

💛 - Coveralls

zimond · 2019-04-01T03:40:07Z

Wow, I just ran into this crate and found this PR. No intention to spam but really thanks for your hard work. This fork works like a charm

andygrove · 2019-04-01T04:24:32Z

@zimond Thanks for commenting on this .. I had not seen this PR until now.

@nickolay Apologies for missing this, and generally being unresponsive. I've been putting a lot of energy into Apache Arrow and neglecting the work here.

Maybe I should just make you a collaborator on this repo since you are driving it more than me now.

The only requirements I have in the short term are to not break compatibility without warning for datafusion and locustdb, the two dependent crates.

The description of your changes sounds good. I'll go ahead and test against datafusion from your branch and see if there are any issues.

andygrove · 2019-04-01T05:00:30Z

Cargo.toml

@@ -1,7 +1,7 @@
 [package]
 name = "sqlparser"
 description = "Extensible SQL Lexer and Parser with support for ANSI SQL:2011"
-version = "0.2.2-alpha.0"
+version = "0.2.3-alpha.0"


Please bump the version to 0.3.0 since these are breaking changes

andygrove

This change looks good. Could you create a PR for this one?

andygrove · 2019-04-01T05:19:46Z

@nickolay

This change looks good. Could you create a PR for this one?

I was trying to approve individual commits within the PR ... that doesn't work I guess

andygrove · 2019-04-01T05:21:55Z

I think the best bet is going to be to rebase this, bump the version to 0.3.0 and I will merge and then create a PR to upgrade Arrow to the new version. We should get @cswinter to take a look at this too.

nickolay · 2019-04-02T06:52:31Z

@andygrove, thanks for getting back to me; don't worry about the delays!

I merged andygrove/master (after reverting the revert from #42...) back into my fork for now. As indicated, I avoid rebasing as it will make it harder for those who pulled from my fork.

Bumped the version to 0.3.0, as requested.

You previously noted in #42 that you found a "breaking change" in this, did you mean the (intentional) AST changes or is there a problem I missed?

cswinter · 2019-04-03T03:28:21Z

🎉

nickolay added 30 commits January 31, 2019 03:57

Extract ASTNode::SQLSelect to a separate struct (1/5)

d8173d4

This will allow re-using it for SQLStatement in a later commit. (Also split the new struct into a separate file, other query-related types will be moved here in a follow-up commit.)

Move SQLOrderByExpr and Join* to query.rs (2/5)

45a5f84

Remove unused ASTNode::SQLAssignment variant (3/5)

7b86f5c

The SQLAssignment *struct* is used directly in ASTNode::SQLUpdate (will change to SQLStatement::SQLUpdate shortly).

Rename parse_tablename -> parse_object_name (4.2/4.4)

39e98cb

...to match the name of the recently introduced `SQLObjectName` struct and to avoid any reservations about using it with multi-part names of objects other than tables (as in the `type_name` case).

Simplify by avoiding SQLCompoundIdentifier (4.3/4.4)

f5bd9c3

...instead make `parse_compound_identifier()` return the underlying Vec<> directly, and rename it to `parse_list_of_ids()`, since it's used both for parsing compound identifiers and lists of identifiers.

Store original, quoted form in SQLIdent

e0ceacd

Also move more things to use SQLIdent instead of String in the hope of making it a newtype eventually. Add tests that quoted identifiers round-trip parsing/serialization correctly.

Move TableFactor to be a separate enum

9967031

ASTNode can now be renamed SQLExpression, as it represents a node in the "expression" part of the AST -- other nodes have their own types.

Remove Box<> from SQLOrderByExpr

3619e89

It was probably copied from somewhere else when most types were variants in ASTNode, and needed Box<> to prevent recursion in the ASTNode definition.

Don't Box<ASTNode> in SQLStatement

c5bbfc3

This used to be needed when it was a variant in the ASTNode enum itself.

Don't Box<ASTNode> in SQLSelect

e3b981a

Instead change ASTNode::SQLSubquery to be Box<SQLSelect>

Switch some tests to verified_select_stmt

6b10706

(the tests affected by "unboxing" in the previous commits.)

Assert when an unknown keyword was passed to parse_keyword()

05a70a3

This happens all the time when I forget to check that the keyword I wanted to use is actually listed in keywords.rs, this should help with debugging.

Simplify JOIN USING (columns)

b716ffb

...by reusing `parse_column_names` instead of extracting identifiers out of the `parse_expr_list`s result.

Fix a typo in parse_value error message

89602dc

Simplify parse_create() a little

a0f625b

Don't need the duplicate `columns.push()` + we advance the tokenizer, so no need to peek first.

Improve error messages in parse_create()

346d1ff

By not swallowing the Err from parse_data_type(). Also switch to `match` to enable parsing table-level constraints in this loop later.

Support basic CREATE VIEW

0c0cbca

Update README to the recent changes in the AST

577e634

Small CLI app that can be used to test parsing an external SQL file

2e9da53

Simplify quoted identifier tokenization

b3693bf

Support different quoting styles for delimited identifiers

b9f4b50

The dialect information is from https://en.wikibooks.org/wiki/SQL_Dialects_Reference/Data_structure_definition/Delimited_identifiers

nickolay and others added 15 commits February 7, 2019 05:34

Support national string literals (N'...')

35dd934

Widely used in MS SQL and specified in ANSI.

Support comments in the tokenizer

028c613

TBD fixup multiline comment tokenization

f958e9d

Support basic CTEs (WITH)

bf0c07b

Some unsupported features are noted as TODOs.

Support AS and qualified wildcards in SELECT

bed03ab

Support IN

2643193

Support BETWEEN

786b1cf

Support unary + / -

54c9ca8

Support NUMERIC without precision or scale

23a0d03

https://jakewheat.github.io/sql-overview/sql-2011-foundation-grammar.html#exact-numeric-type

Support CHAR synonym for CHARACTER

533775c

https://jakewheat.github.io/sql-overview/sql-2011-foundation-grammar.html#character-string-type

Support UNION/EXCEPT/INTERSECT

52e0f55

Support CREATE MATERIALIZED VIEW

23a0fc7

Re-run cargo fmt

f30ab89

Merge #1 from benesch/materialized

0621f8d

Merge remote-tracking branch 'upstream/master'

5ba9912

andygrove reviewed Apr 1, 2019

View reviewed changes

andygrove approved these changes Apr 1, 2019

View reviewed changes

nickolay added 3 commits April 2, 2019 09:15

reapply the "Rework keyword/identifier parsing" PR

6dfe7c2

Merge branch 'upstream'

5de23b1

bump version to 0.3.0, as requested in the PR

311f2ab

andygrove merged commit d1b5668 into apache:master Apr 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

What's the best way to merge my fork? #43

What's the best way to merge my fork? #43

Uh oh!

nickolay commented Mar 8, 2019

Uh oh!

coveralls commented Mar 8, 2019 •

edited

Loading

Uh oh!

zimond commented Apr 1, 2019

Uh oh!

andygrove commented Apr 1, 2019

Uh oh!

andygrove Apr 1, 2019

Uh oh!

andygrove left a comment

Uh oh!

andygrove commented Apr 1, 2019 •

edited

Loading

Uh oh!

andygrove commented Apr 1, 2019

Uh oh!

nickolay commented Apr 2, 2019

Uh oh!

cswinter commented Apr 3, 2019

Uh oh!

Uh oh!

What's the best way to merge my fork? #43

What's the best way to merge my fork? #43

Uh oh!

Conversation

nickolay commented Mar 8, 2019

Uh oh!

coveralls commented Mar 8, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Test Coverage Report for Build 138

💛 - Coveralls

Uh oh!

zimond commented Apr 1, 2019

Uh oh!

andygrove commented Apr 1, 2019

Uh oh!

andygrove Apr 1, 2019

Choose a reason for hiding this comment

Uh oh!

andygrove left a comment

Choose a reason for hiding this comment

Uh oh!

andygrove commented Apr 1, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andygrove commented Apr 1, 2019

Uh oh!

nickolay commented Apr 2, 2019

Uh oh!

cswinter commented Apr 3, 2019

Uh oh!

Uh oh!

coveralls commented Mar 8, 2019 •

edited

Loading

andygrove commented Apr 1, 2019 •

edited

Loading