Skip to content

What's the best way to merge my fork? #43

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 48 commits into from
Apr 3, 2019
Merged

Conversation

nickolay
Copy link
Contributor

@nickolay nickolay commented Mar 8, 2019

In my fork I have a number of new features and refactorings:

  • Separate SQLStatement enum from ASTNode (up to 2dec65f) and support parsing multiple statements separated with a semicolon (707c58a)
  • Only use parse_expr() for expressions (b57c60a) and related fixes (up to 07790fe)
  • Assorted refactorings and changes to the AST (up to 346d1ff), notably:
    • Store the way an identifier was quoted, as the easiest way to make ast.to_string() not produce invalid SQL (e0ceacd)
    • Move the remaining ASTNode variant which doesn't make sense in the context of an expression (9967031)
  • Support CREATE VIEW (0c0cbca)
  • A CLI program (2e9da53)
  • Tokenizer fixes (up to f958e9d): dialect-specific identifier quoting (e.g. MS SQL's [foo bar]), national string literals (N'...'), comments.
  • Support for more SQL constructs (up to 0621f8d): WITH, aliases and qualified wildcards in SELECT, IN, BETWEEN, unary +/-, UNION/EXCEPT/INTERSECT, data types tweaks: the CHAR synonym and support NUMERIC not followed by (p,s).
  • CREATE MATERIALIZED VIEW from PR Support CREATE MATERIALIZED VIEW nickolay/sqlparser-rs#1

Initially I intended to submit this as one PR per item, and deal with the conflicts arising from reviews by rebasing. Now that the pile grew quite large and others indicated they pulled from my fork, I'm not sure rebasing is a good idea. I can still submit multiple PRs to make reviewing easier.

What do you think we should do?

This will allow re-using it for SQLStatement in a later commit.

(Also split the new struct into a separate file, other query-related
types will be moved here in a follow-up commit.)
The SQLAssignment *struct* is used directly in ASTNode::SQLUpdate (will
change to SQLStatement::SQLUpdate shortly).
Continuing from https://github.com/andygrove/sqlparser-rs/pull/33#issuecomment-453060427

This stops the parser from accepting (and the AST from being able to
represent) SQL look-alike code that makes no sense, e.g.

    SELECT ... FROM (CREATE TABLE ...) foo
    SELECT ... FROM (1+CAST(...)) foo

Generally this makes the AST less "partially typed": meaning certain
parts are strongly typed (e.g. SELECT can only contain projections,
relations, etc.), while everything that didn't get its own type is
dumped into ASTNode, effectively untyped. After a few more fixes (yet
to be implemented), `ASTNode` could become an `SQLExpression`. The
Pratt-style expression parser (returning an SQLExpression) would be
invoked from the top-down parser in places where a generic expression
is expected (e.g. after SELECT <...>, WHERE <...>, etc.), while things
like select's `projection` and `relation` could be more appropriately
(narrowly) typed.


Since the diff is quite large due to necessarily large number of
mechanical changes, here's an overview:

1) Interface changes:

   - A new AST enum - `SQLStatement` - is split out of ASTNode:

     - The variants of the ASTNode enum, which _only_ make sense as a top
       level statement (INSERT, UPDATE, DELETE, CREATE, ALTER, COPY) are
       _moved_ to the new enum, with no other changes.
     - SQLSelect is _duplicated_: now available both as a variant in
       SQLStatement::SQLSelect (top-level SELECT) and ASTNode:: (subquery).

   - The main entry point (Parser::parse_sql) now expects an SQL statement
     as input, and returns an `SQLStatement`.

2) Parser changes: instead of detecting the top-level constructs deep
down in the precedence parser (`parse_prefix`) we are able to do it
just right after setting up the parser in the `parse_sql` entry point

(SELECT, again, is kept in the expression parser to demonstrate how
subqueries could be implemented).

The rest of parser changes are mechanical ASTNode -> SQLStatement
replacements resulting from the AST change.

3) Testing changes: for every test - depending on whether the input was
a complete statement or an expresssion -  I used an appropriate helper
function:

   - `verified` (parses SQL, checks that it round-trips, and returns
     the AST) - was replaced by `verified_stmt` or `verified_expr`.

   - `parse_sql` (which returned AST without checking it round-tripped)
     was replaced by:

     - `parse_sql_expr` (same function, for expressions)

     - `one_statement_parses_to` (formerly `parses_to`), extended to
       deal with statements that are not expected to round-trip.
       The weird name is to reduce further churn when implementing
       multi-statement parsing.

     - `verified_stmt` (in 4 testcases that actually round-tripped)
…rom_projection`

(The primary motivation was that it makes the tests more resilient to
the upcoming changes to the SQLSelectStatement to support `AS` aliases
and `UNION`.)

Also start using `&'static str` literals consistently instead of
String::from for the `let sql` test strings.
Parser::parse_sql() can now parse a semicolon-separated list of
statements, returning them in a Vec<SQLStatement>.

To support this we:

  - Move handling of inter-statement tokens from the end of individual
    statement parsers (`parse_select` and `parse_delete`; this was not
    implemented for other top-level statements) to the common
    statement-list parsing code (`parse_sql`);

  - Change the "Unexpected token at end of ..." error, which didn't have
    tests and prevented us from parsing successive statements  ->
    "Expected end of statement" (i.e. a delimiter - currently only ";" -
    or the EOF);

  - Add PartialEq on ParserError to be able to assert_eq!() that parsing
    statements that do not terminate properly returns an expected error.
Before this commit there was a single `parse_expr(u8)` method, which
was called both

1) from within the expression parser (to parse subexpression consisting
   of operators with higher priority than the current one), and

2) from the top-down parser both

   a) to parse true expressions (such as an item of the SELECT list or
      the condition after WHERE or after ON), and 
   b) to parse sequences which are not exactly "expressions".


This starts cleaning this up by renaming the `parse_expr(u8)` method to
`parse_subexpr()` and using it only for (1) - i.e. usually providing a
non-zero precedence parameter.

The non-intuitively called `parse()` method is renamed to `parse_expr()`,
which became available and is used for (2a).


While reviewing the existing callers of `parse_expr`, four points to
follow up on were identified (marked "TBD (#)" in the commit):

1) Do not lose parens (e.g. `(1+2)*3`) when roundtripping
   String->AST->String by using SQLNested.
2) Incorrect precedence of the NOT unary
3) `parse_table_factor` accepts any expression where a SELECT subquery
   is expected.
4) parse_delete uses parse_expr() to retrieve a table name

These are dealt with in the commits to follow.
Before this change an expression like `(a+b)-(c+d)` was parsed correctly
(as a Minus node with two Plus nodes as children), but when serializing
back to an SQL string, it came up as a+b-c+d, since we don't store
parens in AST and don't attempt to insert them when necessary during
serialization. The latter would be hard, and we already had an SQLNested
enum variant, so I changed the code to wrap the AST node for the
parenthesized expression in it.
I checked the docs of a few of the most popular RDBMSes, and it seems
there's consensus that the precedence of `NOT` is higher than `AND`,
but lower than `IS NULL`.

Postgresql[1], Oracle[2] and MySQL[3] docs say that explicitly.

T-SQL docs[4] do mention it's higher than `AND`, and while they don't
explicitly mention IS NULL, this snippet:

    select * from (select 1 as a)x
    where (not x.a) is null

...is a parsing error, while the following works like IS NOT NULL:

    select * from (select 1 as a)x
    where not x.a is null

sqlite doesn't seem to mention `NOT` precedence, but I assume it works
similarly.

[1] https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-OPERATORS
[2] https://docs.oracle.com/cd/B19306_01/server.102/b14200/conditions001.htm#i1034834
[3] https://dev.mysql.com/doc/refman/8.0/en/operator-precedence.html
[4] https://docs.microsoft.com/en-us/sql/t-sql/language-elements/operator-precedence-transact-sql?view=sql-server-2017
This makes the parser more strict when handling SELECTs nested
somewhere in the main statement:

1) instead of accepting SELECT anywhere in the expression where an
   operand was expected, we only accept it inside parens. (I've added a
   test for the currently supported syntax, <scalar subquery> in ANSI
   SQL terms)

2) instead of accepting any expression in the derived table context:
   `FROM ( ... )` - we only look for a SELECT subquery there.

Due to #1, I had to swith the 'ansi' test from invoking the expression
parser to the statement parser.
(To store "A name of a table, view, custom type, etc., possibly
multi-part, i.e. db.schema.obj".)

Before this change

  - some places used `String` for this (these are updated in this commit)

  - while others (notably SQLStatement::SQLDelete::relation, which is
    the reason for this series of commits) relied on
    ASTNode::SQLCompoundIdentifier (which is also backed by a 
    Vec<SQLIdent>, but, as a variant of ASTNode enum, is not convenient
    to use when you know you need that specific variant).
...to match the name of the recently introduced `SQLObjectName` struct
and to avoid any reservations about using it with multi-part names of
objects other than tables (as in the `type_name` case).
...instead make `parse_compound_identifier()` return the underlying
Vec<> directly, and rename it to `parse_list_of_ids()`, since it's
used both for parsing compound identifiers and lists of identifiers.
Store (and parse) `table_name: SQLObjectName` instead of
`relation: Option<Box<ASTNode>>`, which can be an arbitrary expression.

Also remove the `Option<>`: the table name is not optional in any dialects
I'm familiar with. While the FROM keyword itself _is_ optional in some
dialects, there are more things to implement for those dialects, see
https://stackoverflow.com/a/4484271/1026
Also move more things to use SQLIdent instead of String in the hope of
making it a newtype eventually.

Add tests that quoted identifiers round-trip parsing/serialization correctly.
ASTNode can now be renamed SQLExpression, as it represents a node in
the "expression" part of the AST -- other nodes have their own types.
It was probably copied from somewhere else when most types were variants
in ASTNode, and needed Box<> to prevent recursion in the ASTNode definition.
This used to be needed when it was a variant in the ASTNode enum itself.
Instead change ASTNode::SQLSubquery to be Box<SQLSelect>
(the tests affected by "unboxing" in the previous commits.)
This happens all the time when I forget to check that the keyword I wanted
to use is actually listed in keywords.rs, this should help with debugging.
...by reusing `parse_column_names` instead of extracting identifiers
out of the `parse_expr_list`s result.
Don't need the duplicate `columns.push()` + we advance the tokenizer,
so no need to peek first.
By not swallowing the Err from parse_data_type().

Also switch to `match` to enable parsing table-level constraints in this
loop later.
@coveralls
Copy link

coveralls commented Mar 8, 2019

Pull Request Test Coverage Report for Build 138

  • 1371 of 1484 (92.39%) changed or added relevant lines in 11 files are covered.
  • 23 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+5.4%) to 95.378%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/sqlast/table_key.rs 3 6 50.0%
src/sqlast/query.rs 156 161 96.89%
src/sqlast/sqltype.rs 1 6 16.67%
src/sqltokenizer.rs 171 183 93.44%
src/sqlast/mod.rs 78 91 85.71%
tests/sqlparser_generic.rs 465 499 93.19%
src/sqlparser.rs 423 464 91.16%
Files with Coverage Reduction New Missed Lines %
src/sqltokenizer.rs 1 89.83%
src/sqlast/mod.rs 3 77.1%
tests/sqlparser_generic.rs 9 91.79%
src/sqlparser.rs 10 79.79%
Totals Coverage Status
Change from base Build 137: 5.4%
Covered Lines: 227
Relevant Lines: 238

💛 - Coveralls

@zimond
Copy link

zimond commented Apr 1, 2019

Wow, I just ran into this crate and found this PR. No intention to spam but really thanks for your hard work. This fork works like a charm

@andygrove
Copy link
Member

@zimond Thanks for commenting on this .. I had not seen this PR until now.

@nickolay Apologies for missing this, and generally being unresponsive. I've been putting a lot of energy into Apache Arrow and neglecting the work here.

Maybe I should just make you a collaborator on this repo since you are driving it more than me now.

The only requirements I have in the short term are to not break compatibility without warning for datafusion and locustdb, the two dependent crates.

The description of your changes sounds good. I'll go ahead and test against datafusion from your branch and see if there are any issues.

Cargo.toml Outdated
@@ -1,7 +1,7 @@
[package]
name = "sqlparser"
description = "Extensible SQL Lexer and Parser with support for ANSI SQL:2011"
version = "0.2.2-alpha.0"
version = "0.2.3-alpha.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please bump the version to 0.3.0 since these are breaking changes

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks good. Could you create a PR for this one?

@andygrove
Copy link
Member

andygrove commented Apr 1, 2019

@nickolay

This change looks good. Could you create a PR for this one?

I was trying to approve individual commits within the PR ... that doesn't work I guess

@andygrove
Copy link
Member

I think the best bet is going to be to rebase this, bump the version to 0.3.0 and I will merge and then create a PR to upgrade Arrow to the new version. We should get @cswinter to take a look at this too.

@nickolay
Copy link
Contributor Author

nickolay commented Apr 2, 2019

@andygrove, thanks for getting back to me; don't worry about the delays!

I merged andygrove/master (after reverting the revert from #42...) back into my fork for now. As indicated, I avoid rebasing as it will make it harder for those who pulled from my fork.

Bumped the version to 0.3.0, as requested.

You previously noted in #42 that you found a "breaking change" in this, did you mean the (intentional) AST changes or is there a problem I missed?

@andygrove andygrove merged commit d1b5668 into apache:master Apr 3, 2019
@cswinter
Copy link
Contributor

cswinter commented Apr 3, 2019

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants