Skip to content

Improve parsing of JSON accesses on Postgres and Snowflake #1215

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Apr 30, 2024
125 changes: 50 additions & 75 deletions src/ast/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,8 @@ pub use self::query::{
ValueTableMode, Values, WildcardAdditionalOptions, With,
};
pub use self::value::{
escape_quoted_string, DateTimeField, DollarQuotedString, TrimWhereField, Value,
escape_double_quote_string, escape_quoted_string, DateTimeField, DollarQuotedString,
TrimWhereField, Value,
};

use crate::ast::helpers::stmt_data_loading::{
Expand Down Expand Up @@ -266,66 +267,6 @@ impl fmt::Display for Interval {
}
}

/// JsonOperator
#[derive(Debug, Copy, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
#[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
pub enum JsonOperator {
/// -> keeps the value as json
Arrow,
/// ->> keeps the value as text or int.
LongArrow,
/// #> Extracts JSON sub-object at the specified path
HashArrow,
/// #>> Extracts JSON sub-object at the specified path as text
HashLongArrow,
/// : Colon is used by Snowflake (Which is similar to LongArrow)
Colon,
/// jsonb @> jsonb -> boolean: Test whether left json contains the right json
AtArrow,
/// jsonb <@ jsonb -> boolean: Test whether right json contains the left json
ArrowAt,
/// jsonb #- text[] -> jsonb: Deletes the field or array element at the specified
/// path, where path elements can be either field keys or array indexes.
HashMinus,
/// jsonb @? jsonpath -> boolean: Does JSON path return any item for the specified
/// JSON value?
AtQuestion,
/// jsonb @@ jsonpath → boolean: Returns the result of a JSON path predicate check
/// for the specified JSON value. Only the first item of the result is taken into
/// account. If the result is not Boolean, then NULL is returned.
AtAt,
}

impl fmt::Display for JsonOperator {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self {
JsonOperator::Arrow => {
write!(f, "->")
}
JsonOperator::LongArrow => {
write!(f, "->>")
}
JsonOperator::HashArrow => {
write!(f, "#>")
}
JsonOperator::HashLongArrow => {
write!(f, "#>>")
}
JsonOperator::Colon => {
write!(f, ":")
}
JsonOperator::AtArrow => {
write!(f, "@>")
}
JsonOperator::ArrowAt => write!(f, "<@"),
JsonOperator::HashMinus => write!(f, "#-"),
JsonOperator::AtQuestion => write!(f, "@?"),
JsonOperator::AtAt => write!(f, "@@"),
}
}
}

/// A field definition within a struct.
///
/// [bigquery]: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#struct_type
Expand Down Expand Up @@ -408,6 +349,24 @@ impl fmt::Display for MapAccessKey {
}
}

/// A element of a path to data nested in a VARIANT on Snowflake.
///
/// See <https://docs.snowflake.com/en/user-guide/querying-semistructured>.
#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
#[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
pub enum VariantPathElem {
/// Accesses an object field using dot notation, e.g. `obj:foo.bar.baz`.
///
/// See <https://docs.snowflake.com/en/user-guide/querying-semistructured#dot-notation>.
Dot { key: String, quoted: bool },
/// Accesses an object field or array element using bracket notation,
/// e.g. `obj['foo']`.
///
/// See <https://docs.snowflake.com/en/user-guide/querying-semistructured#bracket-notation>.
Bracket { key: Expr },
}

/// An SQL expression of any type.
///
/// The parser does not distinguish between expressions of different types
Expand All @@ -425,11 +384,14 @@ pub enum Expr {
Identifier(Ident),
/// Multi-part identifier, e.g. `table_alias.column` or `schema.table.col`
CompoundIdentifier(Vec<Ident>),
/// JSON access (postgres) eg: data->'tags'
JsonAccess {
left: Box<Expr>,
operator: JsonOperator,
right: Box<Expr>,
/// Access data nested in a VARIANT on Snowflake, for example `src:customer[0].name`.
///
/// See <https://docs.snowflake.com/en/user-guide/querying-semistructured>.
VariantAccess {
/// The VARIANT being queried.
value: Box<Expr>,
/// The path to the data inside the VARIANT to extract.
path: Vec<VariantPathElem>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering would it be possible to reuse the existing MapAccess expr variant to represent this scenario?
thinking that e.g with an example like below, the only difference would be an extension to the existing syntax enum

enum MapAccessSyntax {
   // ... 
   /// Access using colon notation. `mymap:mykey`
    Colon,
}


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this comment, I appreciate your review! So I did consider that, but my interpretation of this syntax is that : is an operator that takes a VARIANT expression and a JSON path expression as a single syntactic element (VARIANT : JSONPATH).

Another SQL dialect with the same syntax explicitly says this is the case:
https://docs.databricks.com/en/sql/language-manual/functions/colonsign.html
https://docs.databricks.com/en/sql/language-manual/sql-ref-json-path-expression.html

(If we do keep this, I should actually rename it back to JsonAccess since I do plan to add Databricks as a dialect here in the future...)

Tangentially, there are also some things I'm not happy with in MapAccess, particularly MapAccessSyntax::Dot:

  • MapAccessSyntax::Dot overlaps with CompositeAccess.
  • MapAccessKey contains an Expr even when using dot syntax, which can only legally contain Identifier or DoubleQuotedString in that scenario. It would make more sense to use an Ident in that case.

So my broader preference here is:

  • Use CompositeAccess for all non-jsonpath dot accesses.
  • Use MapAccess for all non-jsonpath bracketed accesses (i.e. eliminate MapAccessSyntax).
  • Use JsonAccess for all jsonpath accesss.

What do you think?

Copy link
Contributor

@iffyio iffyio Apr 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I did consider that, but my interpretation of this syntax is that : is an operator that takes a VARIANT expression and a JSON path expression as a single syntactic element (VARIANT : JSONPATH).

hmm not sure if that matters a lot from a parsing/ast perspective - in that were not required to represent that literally so we have some leeway to be flexible if it lets us simplify the code/representation?

So my broader preference here is:

I think a reasonable goal to aim for could be to reuse the same solution for both the map and the variant expressions - since beyond the leading : special case, they are identical? I think this probably means merging both MapAccess and JsonAccess based on your proposal? If that seems reasonable then yeah feel free to make changes to that effect if it means revamping the current MapAccess setup.

},
/// CompositeAccess (postgres) eg: SELECT (information_schema._pg_expandarray(array['i','i'])).n
CompositeAccess {
Expand Down Expand Up @@ -1210,16 +1172,29 @@ impl fmt::Display for Expr {
Expr::Array(set) => {
write!(f, "{set}")
}
Expr::JsonAccess {
left,
operator,
right,
} => {
if operator == &JsonOperator::Colon {
write!(f, "{left}{operator}{right}")
} else {
write!(f, "{left} {operator} {right}")
Expr::VariantAccess { value, path } => {
write!(f, "{value}")?;
for (i, elem) in path.iter().enumerate() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the variant remains it might be good to instead impl display on the VariantPathElem and here use the display_separated function so that we avoid looping and the same logic an be reused in other references to VariantPathElem in the future

Copy link
Contributor Author

@jmhain jmhain Apr 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trouble with that is that an individual VariantPathElem needs to know if it's the initial path segment to render as a : or a . (and I don't want to add VariantPathElem::Colon because (a) the two current variants map to the two documented Snowflake syntaxes and (b) it makes more AST states unrepresentable (e.g. if a user manually constructs an AST to render, they have to remember this detail). Another option if we wanted to factor this out would be to wrap the path elements in a VariantPath type, though that doesn't really seem worth it to me I went ahead and made this change.

match elem {
VariantPathElem::Dot { key, quoted } => {
if i == 0 {
write!(f, ":")?;
} else {
write!(f, ".")?;
}

if *quoted {
write!(f, "\"{}\"", escape_double_quote_string(key))?;
} else {
write!(f, "{key}")?;
}
}
VariantPathElem::Bracket { key } => {
write!(f, "[{key}]")?;
}
}
}
Ok(())
}
Expr::CompositeAccess { expr, key } => {
write!(f, "{expr}.{key}")
Expand Down
82 changes: 82 additions & 0 deletions src/ast/operator.rs
Original file line number Diff line number Diff line change
Expand Up @@ -141,6 +141,79 @@ pub enum BinaryOperator {
PGNotILikeMatch,
/// String "starts with", eg: `a ^@ b` (PostgreSQL-specific)
PGStartsWith,
/// The `->` operator.
///
/// On PostgreSQL, this operator extracts a JSON object field or array
/// element, for example `'{"a":"b"}'::json -> 'a'` or `[1, 2, 3]'::json
/// -> 2`.
///
/// See <https://www.postgresql.org/docs/current/functions-json.html>.
Arrow,
/// The `->>` operator.
///
/// On PostgreSQL, this operator that extracts a JSON object field or JSON
/// array element and converts it to text, for example `'{"a":"b"}'::json
/// ->> 'a'` or `[1, 2, 3]'::json ->> 2`.
///
/// See <https://www.postgresql.org/docs/current/functions-json.html>.
LongArrow,
/// The `#>` operator.
///
/// On PostgreSQL, this operator extracts a JSON sub-object at the specified
/// path, for example:
///
/// ```notrust
///'{"a": {"b": ["foo","bar"]}}'::json #> '{a,b,1}'
/// ```
///
/// See <https://www.postgresql.org/docs/current/functions-json.html>.
HashArrow,
/// The `#>>` operator.
///
/// A PostgreSQL-specific operator that extracts JSON sub-object at the
/// specified path, for example
///
/// ```notrust
///'{"a": {"b": ["foo","bar"]}}'::json #>> '{a,b,1}'
/// ```
///
/// See <https://www.postgresql.org/docs/current/functions-json.html>.
HashLongArrow,
/// The `@@` operator.
///
/// On PostgreSQL, this is used for JSON and text searches.
///
/// See <https://www.postgresql.org/docs/current/functions-json.html>.
/// See <https://www.postgresql.org/docs/current/functions-textsearch.html>.
AtAt,
/// The `@>` operator.
///
/// On PostgreSQL, this is used for JSON and text searches.
///
/// See <https://www.postgresql.org/docs/current/functions-json.html>.
/// See <https://www.postgresql.org/docs/current/functions-textsearch.html>.
AtArrow,
/// The `<@` operator.
///
/// On PostgreSQL, this is used for JSON and text searches.
///
/// See <https://www.postgresql.org/docs/current/functions-json.html>.
/// See <https://www.postgresql.org/docs/current/functions-textsearch.html>.
ArrowAt,
/// The `#-` operator.
///
/// On PostgreSQL, this operator is used to delete a field or array element
/// at a specified path.
///
/// See <https://www.postgresql.org/docs/current/functions-json.html>.
HashMinus,
/// The `@?` operator.
///
/// On PostgreSQL, this operator is used to check the given JSON path
/// returns an item for the JSON value.
///
/// See <https://www.postgresql.org/docs/current/functions-json.html>.
AtQuestion,
/// PostgreSQL-specific custom operator.
///
/// See [CREATE OPERATOR](https://www.postgresql.org/docs/current/sql-createoperator.html)
Expand Down Expand Up @@ -187,6 +260,15 @@ impl fmt::Display for BinaryOperator {
BinaryOperator::PGNotLikeMatch => f.write_str("!~~"),
BinaryOperator::PGNotILikeMatch => f.write_str("!~~*"),
BinaryOperator::PGStartsWith => f.write_str("^@"),
BinaryOperator::Arrow => f.write_str("->"),
BinaryOperator::LongArrow => f.write_str("->>"),
BinaryOperator::HashArrow => f.write_str("#>"),
BinaryOperator::HashLongArrow => f.write_str("#>>"),
BinaryOperator::AtAt => f.write_str("@@"),
BinaryOperator::AtArrow => f.write_str("@>"),
BinaryOperator::ArrowAt => f.write_str("<@"),
BinaryOperator::HashMinus => f.write_str("#-"),
BinaryOperator::AtQuestion => f.write_str("@?"),
BinaryOperator::PGCustomBinaryOperator(idents) => {
write!(f, "OPERATOR({})", display_separated(idents, "."))
}
Expand Down
3 changes: 0 additions & 3 deletions src/ast/value.rs
Original file line number Diff line number Diff line change
Expand Up @@ -65,8 +65,6 @@ pub enum Value {
Null,
/// `?` or `$` Prepared statement arg placeholder
Placeholder(String),
/// Add support of snowflake field:key - key should be a value
UnQuotedString(String),
}

impl fmt::Display for Value {
Expand All @@ -85,7 +83,6 @@ impl fmt::Display for Value {
Value::RawStringLiteral(v) => write!(f, "R'{v}'"),
Value::Null => write!(f, "NULL"),
Value::Placeholder(v) => write!(f, "{v}"),
Value::UnQuotedString(v) => write!(f, "{v}"),
}
}
}
Expand Down
Loading
Loading