Skip to content

Extend support for INDEX parsing #1707

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Mar 4, 2025
Merged

Conversation

LucaCappelletti94
Copy link
Contributor

@LucaCappelletti94 LucaCappelletti94 commented Feb 5, 2025

This pull request addresses issue #1706.

  • Added tests reproducing issue Failure while parsing GIN Index #1706
  • Added enumerations and keywords to properly represent operator classes
  • Extended Index struct to be able to characterize indices such as those presented in the issue

Summary of changes:

@LucaCappelletti94 LucaCappelletti94 marked this pull request as ready for review February 6, 2025 15:48
`Copy` trait derive was previously removed as a variant `Custom` with `Ident` data type was added to the enum. This variant has afterwards been removed as deemed, while necessary to fully capture custom indices, extremely rare (the Postgres documentation itself says that creating custom indices is a rather hard task) and it would be very hard to fully capture in a good manner all possible custom variants.
Co-authored-by: Ifeanyi Ubah <[email protected]>
@iffyio
Copy link
Contributor

iffyio commented Feb 19, 2025

Marking this as draft in the meantime as its no longer awaiting review, @LucaCappelletti94 please feel free to undraft and ping when ready!

@iffyio iffyio marked this pull request as draft February 19, 2025 17:48
@LucaCappelletti94
Copy link
Contributor Author

At this time I am somewhat stuck as I would not know how to parse the class operator names if not as a keyword - do you have any examples that could be applied to this use case?

@iffyio
Copy link
Contributor

iffyio commented Feb 21, 2025

At this time I am somewhat stuck as I would not know how to parse the class operator names if not as a keyword - do you have any examples that could be applied to this use case?

@LucaCappelletti94 (not sure I followed entirely) did you mean to parse and represent an operator class like gist_oid_ops? If so I imagine we could parse the next token and check that its an unquoted Token::Word(_) and represent the class in the AST as an opaque String for example

@LucaCappelletti94
Copy link
Contributor Author

While things may change in the future, to the best of my knowledge some indices can have operator classes and some cannot. Since we model the different indices, is it preferable if I continue to model this semantical difference between them? Or should I remove it?

I believe that different indices which accept different parametrizations are more similar to syntax than semantics. Do let me know what is your opinion on it.

@iffyio
Copy link
Contributor

iffyio commented Feb 22, 2025

Sorry im not sure I understood the question, could you clarify with a sql example if that's possible?

@LucaCappelletti94
Copy link
Contributor Author

Sure, so:

In a GIN index, there MAY be an operator class from a provided limited set, in the following example gin_trgm_ops

CREATE INDEX documents_content_trgm_idx ON documents USING GIN (content gin_trgm_ops);

In a BRIN index, instead, to the best of my knowledge there are no such operator classes. I base this claim on having read the actual postgres source code, as unfortunately the documentation is rather scarse in regard of which operator classes are available.

CREATE INDEX logs_event_time_brin_idx ON logs USING BRIN (event_time);

One more important thing is that the actual index type, in these cases respectively GIN and BRIN, while currently modeled as an enumeration, is actually something as customizable as a table since I could always introduce a new index via a custom extension.

If the Operator classes are to be treated as generic strings, it may be the case that also index types should be treated as such, if the goal is to make the library generically future-proofed. I am not certain it is necessarily the best way forward, as it may become exceedingly generic, but most likely this is my need to validate the semantics of the SQL that sneaks in.

Summarizing, let me know your opinion of these three questions:

  1. Should there be an enumeration of index types (plus maybe a CustomIndexType(String) variant? Or should that too be replaced with a simpler IndexType(String)?
  2. If we keep the index enum, should there be a check on whether the provided index type is known to support Operator classes (which is how I implemented it now, with the generic dispatch)?
  3. If we keep the dispatch for validating the operator classes, maybe it would be reasonable to still keep the enumeration of the operator classes PLUS a CustomOperatorClass(String) thing? Or, if we remove any of the above, switch to a Option<OperatorClass(String)> and be done with it?

@iffyio
Copy link
Contributor

iffyio commented Feb 22, 2025

Ah I see thanks for clarifying!

  1. I think introducing a Custom variant would be reasonable, it would be similar to the existing pattern of DataType::Custom, BinaryOperator::Custom etc that we have. The existing Hash and BTree I figure we can keep as is, represented explicitly, partily to avoid breaking changes more than necessary and also since those types are the most common index types across dialects
  2. I don't think we need to check that, we can look to parse the operator class if one is present in the input.
  3. I think we can keep IndexType as the enum in (1), then essentially operator class in (2) is an opaque String.

@LucaCappelletti94
Copy link
Contributor Author

I have tried to replace the parsing of keywords with the generic String Ident as proposed, but it leads to much more complex parsing.

Primarily, I don't understand how I could distinguish an operator class from a keyword, such as: CREATE UNIQUE INDEX IF NOT EXISTS idx_name ON test USING BTREE (name gin_trgm_ops,age DESC), specifically how can you tell that gin_trgm_ops is to be parsed as an operator class and DESC should instead be a keyword? Should I try to parse it as a keyword, and it it fails I assume it is an operator class?

@LucaCappelletti94
Copy link
Contributor Author

I have managed to do it, but I really dislike the approach I used, which peeks whether the parser will next encounter one of the expected keywords and if not, assumes that the next token is an operator class. I much preferred the previous approach that distinctively treated operator classes as keywords which was much stricter, but at some point it boils down to personal opinion.

Let me know your opinion on it.

@LucaCappelletti94
Copy link
Contributor Author

I tried to do the merge using the GitHub tool but it was lagging immensely, I will now fix it offline.

@LucaCappelletti94 LucaCappelletti94 marked this pull request as ready for review February 25, 2025 11:27

#[test]
fn parse_create_projects_name_description_trgm_index() {
let sql = "CREATE INDEX projects_name_description_trgm_idx ON projects USING GIN (concat_projects_name_description(name, description) gin_trgm_ops)";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test scenario looks identical to parse_create_users_name_trgm_index in terms of coverage?
Im thinking generally for the tests we can group them into the same test function, it would be sufficient with the AST assertion on only one of the test scenarios then for the rest we can rely only on verified_stmt(), that would keep the tests code smaller and easier to track what part of the syntax is covered.

Then could we add scenarios for all the introduced index types (i.e. BRIN, BLOOM etc)? As well as the custom type which probably would live in common I imagine?

@LucaCappelletti94 LucaCappelletti94 requested a review from iffyio March 2, 2025 08:00
Copy link
Contributor

@iffyio iffyio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @LucaCappelletti94!
cc @alamb

@iffyio iffyio changed the title Extending support for INDEX parsing Extend support for INDEX parsing Mar 4, 2025
@iffyio iffyio merged commit 6ec5223 into apache:main Mar 4, 2025
9 checks passed
QuenKar pushed a commit to QuenKar/datafusion-sqlparser-rs that referenced this pull request Mar 25, 2025
ayman-sigma pushed a commit to sigmacomputing/sqlparser-rs that referenced this pull request Apr 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants