Improve parsing of JSON accesses on Postgres and Snowflake #1215

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged · 9 commits into apache:main · Apr 30, 2024

Conversation

@jmhain (Contributor) commented Apr 13, 2024

This PR fixes a few issues I found in the parsing of JSON accesses on these two dialects.

  • Snowflake allows any unquoted keyword as the key in its dot notation for querying semi-structured data. There have been several PRs explicitly adding special cases and tests for specific keywords; this PR addresses the problem in general. I also think the Value::UnQuotedString produced by parse_value is problematic and likely to lead to misparses, so I got rid of it.
  • Postgres allows any expression to be passed to its various JSON operators, and some of these operators are actually overloaded to do other things, such as text search. As such, it makes sense to just represent these as binary operators.

I also added several tests, most of which were previously failing.
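As an illustration of the first bullet, here is a minimal standalone sketch (invented names, not the sqlparser-rs implementation) of splitting a Snowflake-style variant access such as col:select[0].from into path elements, where any identifier, including reserved keywords like select and from, is accepted as a dot key:

```rust
// Hypothetical sketch only: a toy splitter for Snowflake-style variant
// paths, showing that dot keys are taken verbatim, so reserved keywords
// need no special-casing.
#[derive(Debug, PartialEq)]
enum PathElem {
    Dot(String),
    Bracket(usize),
}

fn parse_variant_path(input: &str) -> Option<(String, Vec<PathElem>)> {
    // The base value is everything before the first colon.
    let (value, rest) = input.split_once(':')?;
    let mut path = Vec::new();
    for segment in rest.split('.') {
        match segment.split_once('[') {
            // A segment like `select[0]` yields a dot key plus an index.
            Some((key, idx)) => {
                if !key.is_empty() {
                    path.push(PathElem::Dot(key.to_string()));
                }
                let idx: usize = idx.strip_suffix(']')?.parse().ok()?;
                path.push(PathElem::Bracket(idx));
            }
            // A bare segment -- even a keyword like `from` -- is a dot key.
            None => path.push(PathElem::Dot(segment.to_string())),
        }
    }
    Some((value.to_string(), path))
}
```

The real parser works on tokens rather than strings, but the key point carries over: once inside a variant path, the next token is read as a key rather than re-entering keyword-sensitive expression parsing.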

@coveralls commented Apr 13, 2024

Pull Request Test Coverage Report for Build 8896826786

Details

  • 198 of 204 (97.06%) changed or added relevant lines in 6 files are covered.
  • 2 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.05%) to 89.196%

Changes Missing Coverage       Covered Lines  Changed/Added Lines  %
src/parser/mod.rs              46             49                   93.88%
tests/sqlparser_postgres.rs    50             53                   94.34%

Files with Coverage Reduction  New Missed Lines  %
tests/sqlparser_common.rs      1                 89.58%
src/parser/mod.rs              1                 92.81%

Totals: change from base Build 8896470747: +0.05% · Covered Lines: 23802 · Relevant Lines: 26685

💛 - Coveralls

@iffyio (Contributor) left a comment

Thanks @jmhain! The changes look good to me overall. I left some comments regarding the parsing setup and potentially reusing the map syntax to express this case.

Comment on lines 2618 to 2620
Token::Colon if path.is_empty() => {
    path.push(self.parse_variant_object_key()?);
}
@iffyio (Contributor):

Would it make sense to instead only call this function when we expect a variant access? That would let us avoid this special case as well as the path.is_empty checks since we can 'expect' the colon (and potentially the first part) before entering the loop to parse the remaining parts.
i.e. at the caller we would peek_token and only call this if we see a :, otherwise we may parse it as a map access

@jmhain (Contributor, Author) commented Apr 14, 2024

A variant access can begin with either bracket or "dot" syntax (the colon) corresponding to the two sites where this function is called, so the : token is not necessarily required initially.

@iffyio (Contributor):

Yeah I was thinking rather that the bracket syntax is identical to the existing map expr so it would be reasonable to call that function instead of this one (hence the peek_token beforehand)

src/ast/mod.rs Outdated
Comment on lines 390 to 394
VariantAccess {
    /// The VARIANT being queried.
    value: Box<Expr>,
    /// The path to the data inside the VARIANT to extract.
    path: Vec<VariantPathElem>,
@iffyio (Contributor):

Wondering: would it be possible to reuse the existing MapAccess expr variant to represent this scenario? Thinking that, e.g. with an example like the one below, the only difference would be an extension to the existing syntax enum:

enum MapAccessSyntax {
    // ...
    /// Access using colon notation. `mymap:mykey`
    Colon,
}
@jmhain (Contributor, Author):

Thanks for this comment, I appreciate your review! So I did consider that, but my interpretation of this syntax is that : is an operator that takes a VARIANT expression and a JSON path expression as a single syntactic element (VARIANT : JSONPATH).

Another SQL dialect with the same syntax explicitly says this is the case:
https://docs.databricks.com/en/sql/language-manual/functions/colonsign.html
https://docs.databricks.com/en/sql/language-manual/sql-ref-json-path-expression.html

(If we do keep this, I should actually rename it back to JsonAccess since I do plan to add Databricks as a dialect here in the future...)

Tangentially, there are also some things I'm not happy with in MapAccess, particularly MapAccessSyntax::Dot:

  • MapAccessSyntax::Dot overlaps with CompositeAccess.
  • MapAccessKey contains an Expr even when using dot syntax, which can only legally contain Identifier or DoubleQuotedString in that scenario. It would make more sense to use an Ident in that case.

So my broader preference here is:

  • Use CompositeAccess for all non-jsonpath dot accesses.
  • Use MapAccess for all non-jsonpath bracketed accesses (i.e. eliminate MapAccessSyntax).
  • Use JsonAccess for all jsonpath accesses.

What do you think?

@iffyio (Contributor) commented Apr 14, 2024

So I did consider that, but my interpretation of this syntax is that : is an operator that takes a VARIANT expression and a JSON path expression as a single syntactic element (VARIANT : JSONPATH).

Hmm, I'm not sure that matters a lot from a parsing/AST perspective, in that we're not required to represent it literally, so we have some leeway to be flexible if it lets us simplify the code/representation?

So my broader preference here is:

I think a reasonable goal to aim for could be to reuse the same solution for both the map and the variant expressions - since beyond the leading : special case, they are identical? I think this probably means merging both MapAccess and JsonAccess based on your proposal? If that seems reasonable then yeah feel free to make changes to that effect if it means revamping the current MapAccess setup.

src/ast/mod.rs Outdated
Comment on lines 1176 to 1177
write!(f, "{value}")?;
for (i, elem) in path.iter().enumerate() {
@iffyio (Contributor):

If the variant remains, it might be good to instead impl Display on VariantPathElem and use the display_separated function here, so that we avoid looping and the same logic can be reused in other references to VariantPathElem in the future.

@jmhain (Contributor, Author) commented Apr 14, 2024

The trouble with that is that an individual VariantPathElem needs to know whether it is the initial path segment in order to render as a colon (:) or a dot (.). I don't want to add VariantPathElem::Colon because (a) the two current variants map to the two documented Snowflake syntaxes, and (b) it would make more invalid AST states representable (e.g. a user manually constructing an AST to render would have to remember this detail). Another option, if we wanted to factor this out, would be to wrap the path elements in a VariantPath type, though that didn't really seem worth it to me at first. Update: I went ahead and made this change.
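A minimal self-contained sketch of the rendering question being discussed (illustrative stand-ins only, not the actual sqlparser-rs types): the path element itself cannot know whether to print : or ., so the container renders the separator based on position:

```rust
use std::fmt;

// Illustrative stand-ins for the AST types under discussion.
enum VariantPathElem {
    Dot { key: String },
    Bracket { index: i64 },
}

struct VariantAccess {
    value: String, // stands in for Box<Expr>
    path: Vec<VariantPathElem>,
}

impl fmt::Display for VariantAccess {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.value)?;
        for (i, elem) in self.path.iter().enumerate() {
            match elem {
                // The first dot-style segment renders with a colon; later
                // ones render with a dot -- positional context that a
                // standalone Display impl on the element would not have.
                VariantPathElem::Dot { key } if i == 0 => write!(f, ":{key}")?,
                VariantPathElem::Dot { key } => write!(f, ".{key}")?,
                VariantPathElem::Bracket { index } => write!(f, "[{index}]")?,
            }
        }
        Ok(())
    }
}
```

Wrapping the elements in a container type (the VariantPath idea mentioned above) is what lets the positional colon/dot choice live in one Display impl instead of being duplicated at each call site.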

@iffyio (Contributor) commented Apr 27, 2024

@jmhain just a heads up re conflicts on the PR
@alamb I think this looks good for a review!

@alamb (Contributor) left a comment

Thank you @jmhain and @iffyio

While this will result in non-trivial downstream API churn, I think it is a good improvement and sets the stage elegantly for arbitrary path access.

really nice 👏

It just needs to resolve some conflicts and I think it will be good to merge.

BTW I hope to do a release this week #1214

left: Box<Expr>,
operator: JsonOperator,
right: Box<Expr>,
/// The value being queried.
@alamb (Contributor):

This makes much more sense to me

select.projection
);

// dot and bracket notation can be mixed (starting with : case)
@alamb (Contributor):

this is very nice (and clever)

@jmhain (Contributor, Author) commented Apr 30, 2024

@alamb Conflicts resolved!

@alamb (Contributor) commented Apr 30, 2024

🚀

@alamb alamb merged commit 4bfa399 into apache:main Apr 30, 2024
10 checks passed
@alamb (Contributor) commented Apr 30, 2024

Thanks again @jmhain

@jmhain jmhain deleted the joey/pg-sf-json-rework branch April 30, 2024 15:10
JichaoS pushed a commit to luabase/sqlparser-rs that referenced this pull request May 7, 2024
@tisonkun (Member) commented May 19, 2024

The following SQL seems to fail to parse after this patch:

select make_array(1, 2, 3)[1:2], make_array(1.0, 2.0, 3.0)[2:3], make_array('h', 'e', 'l', 'l', 'o')[2:4];
External error: query failed: DataFusion error: SQL error: ParserError("Expected variant object key name, found: 2")
[SQL] select make_array(1, 2, 3)[1:2], make_array(1.0, 2.0, 3.0)[2:3], make_array('h', 'e', 'l', 'l', 'o')[2:4];

@alamb (Contributor) commented May 20, 2024

Thank you for debugging this @tisonkun -- is this something we should create a PR to fix?

@jmhain (Contributor, Author) commented May 22, 2024

Sorry about the regression here! I was initially a bit confused since we don't appear to have explicit support or tests for this array slice syntax, but after a quick investigation I can see what happened. This is what make_array(1, 2, 3)[1:2] parsed to prior to this PR:

MapAccess {
    column: Function(
        Function {
            name: ObjectName([Ident { value: "make_array", quote_style: None, }]),
            args: [
                Unnamed(Expr(Value(Number("1", false)))),
                Unnamed(Expr(Value(Number("2", false)))),
                Unnamed(Expr(Value(Number("3", false)))),
            ],
            filter: None,
            null_treatment: None,
            over: None,
            distinct: false,
            special: false,
            order_by: [],
        },
    ),
    keys: [
        MapAccessKey {
            key: JsonAccess {
                left: Value(Number("1", false)),
                operator: Colon,
                right: Value(Number("2", false)),
            },
            syntax: Bracket,
        },
    ],
}

We were parsing the contents of the [] as an expression and then interpreting 1:2 as itself a json access of a field "2" of the value "1". So basically the fact that we were accepting this syntax was a complete fluke. I'll start thinking through what the best way is to support this syntax properly.
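One hypothetical direction for supporting the slice syntax properly (invented names, not a proposed sqlparser-rs patch): inside [...], parse the lower bound first and then check for a :, building an explicit slice node instead of letting the colon be consumed as a JSON-access operator. A toy string-level sketch of that disambiguation:

```rust
// Hypothetical sketch only: decide whether bracket contents are a plain
// index (`[3]`) or a slice (`[1:2]`) before any colon-based JSON-access
// parsing gets a chance to run.
#[derive(Debug, PartialEq)]
enum Subscript {
    Index(i64),
    Slice { lower: i64, upper: i64 },
}

fn parse_subscript(contents: &str) -> Option<Subscript> {
    match contents.split_once(':') {
        // A colon at this level means a slice, not a JSON access.
        Some((lo, hi)) => Some(Subscript::Slice {
            lower: lo.trim().parse().ok()?,
            upper: hi.trim().parse().ok()?,
        }),
        None => Some(Subscript::Index(contents.trim().parse().ok()?)),
    }
}
```

In the real token-based parser the equivalent move is to check for Token::Colon after the first bracketed expression, which is roughly what the follow-up issue (#1283) tracks.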

@alamb (Contributor) commented May 22, 2024

We were parsing the contents of the [] as an expression and then interpreting 1:2 as itself a json access of a field "2" of the value "1". So basically the fact that we were accepting this syntax was a complete fluke. I'll start thinking through what the best way is to support this syntax properly.

Thank you @jmhain

I am likely going to do another sqlparser release the week after next, though I can do a patch release sooner if we can get a fix for this. Unfortunately I don't have the time to work on a fix myself.

I did file a ticket to track the issue so it doesn't get lost in this merged PR: #1283
