Extend support for INDEX parsing #1707

LucaCappelletti94 · 2025-02-05T16:48:01Z

This pull request addresses issue #1706.

Added tests reproducing issue #1706 (Failure while parsing GIN Index)
Added enumerations and keywords to properly represent operator classes
Extended the Index struct so it can describe indices such as those presented in the issue

Summary of changes:

Added a new file, operator_classes.rs, containing the operator classes
Added all of the keywords needed to parse the operator classes
Extended the enumeration of available index types in ddl.rs
Introduced a new struct, IndexColumn
Extended the parser to handle the coupled parsing of the different index types and their associated operator classes
Updated the test suite accordingly

src/ast/ddl.rs

The `Copy` derive was previously removed because a `Custom` variant carrying an `Ident` was added to the enum. That variant has since been removed: although it is needed to fully capture custom indices, such indices are extremely rare (the Postgres documentation itself says that creating custom indices is a rather hard task), and it would be very hard to capture all possible custom variants well.

src/ast/operator_classes.rs

src/ast/spans.rs

src/ast/dml.rs

src/parser/mod.rs

tests/sqlparser_postgres.rs

Co-authored-by: Ifeanyi Ubah <[email protected]>

iffyio · 2025-02-19T17:48:11Z

Marking this as draft in the meantime as it's no longer awaiting review. @LucaCappelletti94 please feel free to undraft and ping when ready!

LucaCappelletti94 · 2025-02-20T08:08:36Z

At this time I am somewhat stuck as I would not know how to parse the class operator names if not as a keyword - do you have any examples that could be applied to this use case?

iffyio · 2025-02-21T06:47:55Z

> At this time I am somewhat stuck as I would not know how to parse the class operator names if not as a keyword - do you have any examples that could be applied to this use case?

@LucaCappelletti94 (not sure I followed entirely) did you mean to parse and represent an operator class like gist_oid_ops? If so, I imagine we could parse the next token, check that it's an unquoted Token::Word(_), and represent the class in the AST as an opaque String, for example.
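A minimal sketch of that suggestion, assuming a stand-in token type rather than the crate's real tokenizer: the `Token`, `Word`, and `parse_operator_class` definitions below are hypothetical and only mirror the shape described above. The next token is accepted as an operator class only if it is an unquoted word, and the name is kept as an opaque `String`.

```rust
/// Stand-in for a tokenizer word (hypothetical, not the crate's real type).
#[derive(Debug, Clone, PartialEq)]
struct Word {
    value: String,
    /// `Some('"')` for a quoted identifier, `None` for a bare word.
    quote_style: Option<char>,
}

/// Stand-in token type covering only what this sketch needs.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(Word),
    Comma,
    RParen,
}

/// Returns the operator class name if the token at `pos` is an unquoted
/// word, advancing `pos` past it. Otherwise `pos` is left untouched.
fn parse_operator_class(tokens: &[Token], pos: &mut usize) -> Option<String> {
    match tokens.get(*pos) {
        Some(Token::Word(w)) if w.quote_style.is_none() => {
            *pos += 1;
            Some(w.value.clone())
        }
        _ => None,
    }
}
```

On its own this check cannot tell an operator class from a trailing keyword such as DESC, since both are unquoted words.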

LucaCappelletti94 · 2025-02-22T08:14:51Z

While things may change in the future, to the best of my knowledge some indices can have operator classes and some cannot. Since we model the different indices, is it preferable if I continue to model this semantic difference between them? Or should I remove it?

I believe that different indices accepting different parametrizations is closer to syntax than to semantics. Do let me know what your opinion on it is.

iffyio · 2025-02-22T10:27:41Z

Sorry, I'm not sure I understood the question. Could you clarify with a SQL example if that's possible?

LucaCappelletti94 · 2025-02-22T10:41:06Z

Sure, so:

In a GIN index, there MAY be an operator class from a provided limited set, in the following example gin_trgm_ops

CREATE INDEX documents_content_trgm_idx ON documents USING GIN (content gin_trgm_ops);

In a BRIN index, instead, to the best of my knowledge there are no such operator classes. I base this claim on having read the actual Postgres source code, as unfortunately the documentation is rather scarce regarding which operator classes are available.

CREATE INDEX logs_event_time_brin_idx ON logs USING BRIN (event_time);

One more important thing is that the actual index type, in these cases respectively GIN and BRIN, while currently modeled as an enumeration, is actually something as customizable as a table since I could always introduce a new index via a custom extension.

If the Operator classes are to be treated as generic strings, it may be the case that also index types should be treated as such, if the goal is to make the library generically future-proofed. I am not certain it is necessarily the best way forward, as it may become exceedingly generic, but most likely this is my need to validate the semantics of the SQL that sneaks in.

Summarizing, let me know your opinion of these three questions:

1. Should there be an enumeration of index types (plus maybe a CustomIndexType(String) variant)? Or should that too be replaced with a simpler IndexType(String)?
2. If we keep the index enum, should there be a check on whether the provided index type is known to support operator classes (which is how I implemented it now, with the generic dispatch)?
3. If we keep the dispatch for validating the operator classes, maybe it would be reasonable to still keep the enumeration of the operator classes PLUS a CustomOperatorClass(String) variant? Or, if we remove any of the above, switch to an Option<OperatorClass(String)> and be done with it?

iffyio · 2025-02-22T12:44:35Z

Ah I see thanks for clarifying!

1. I think introducing a Custom variant would be reasonable; it would be similar to the existing pattern of DataType::Custom, BinaryOperator::Custom etc. that we have. I figure we can keep the existing Hash and BTree as is, represented explicitly, partly to avoid more breaking changes than necessary and also since those are the most common index types across dialects.
2. I don't think we need to check that; we can look to parse the operator class if one is present in the input.
3. I think we can keep IndexType as the enum in (1), then essentially the operator class in (2) is an opaque String.
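The shape agreed on here could be sketched roughly as follows. This is a self-contained illustration, not the crate's actual definitions: the variant set, the `from_name` helper, the use of plain `String` fields, and the `my_index_method` name used with it are all assumptions made for the example.

```rust
use std::fmt;

/// Index types named explicitly, plus a catch-all for extension-provided ones.
/// The variant set is illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
enum IndexType {
    BTree,
    Hash,
    Gin,
    Brin,
    /// Any other index method, kept as the name written in the SQL.
    Custom(String),
}

impl IndexType {
    /// Case-insensitive lookup (hypothetical helper); unknown names
    /// fall back to `Custom` instead of being rejected.
    fn from_name(name: &str) -> Self {
        match name.to_ascii_uppercase().as_str() {
            "BTREE" => IndexType::BTree,
            "HASH" => IndexType::Hash,
            "GIN" => IndexType::Gin,
            "BRIN" => IndexType::Brin,
            _ => IndexType::Custom(name.to_string()),
        }
    }
}

impl fmt::Display for IndexType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            IndexType::BTree => write!(f, "BTREE"),
            IndexType::Hash => write!(f, "HASH"),
            IndexType::Gin => write!(f, "GIN"),
            IndexType::Brin => write!(f, "BRIN"),
            IndexType::Custom(name) => write!(f, "{name}"),
        }
    }
}

/// One indexed column with an optional, unvalidated operator class.
#[derive(Debug, Clone, PartialEq, Eq)]
struct IndexColumn {
    column: String,
    operator_class: Option<String>,
}

impl fmt::Display for IndexColumn {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.column)?;
        if let Some(op) = &self.operator_class {
            write!(f, " {op}")?;
        }
        Ok(())
    }
}
```

With this shape, `IndexType::from_name("my_index_method")` yields a `Custom` value, and an `IndexColumn` is accepted with any operator class regardless of the index type, which matches the decision not to validate the combination.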

LucaCappelletti94 · 2025-02-25T10:40:17Z

I have tried to replace the parsing of keywords with the generic String Ident as proposed, but it leads to much more complex parsing.

Primarily, I don't understand how I could distinguish an operator class from a keyword, such as in: CREATE UNIQUE INDEX IF NOT EXISTS idx_name ON test USING BTREE (name gin_trgm_ops,age DESC). Specifically, how can you tell that gin_trgm_ops is to be parsed as an operator class and DESC should instead be a keyword? Should I try to parse it as a keyword, and if that fails assume it is an operator class?

LucaCappelletti94 · 2025-02-25T11:08:04Z

I have managed to do it, but I really dislike the approach I used, which peeks at whether the parser will next encounter one of the expected keywords and, if not, assumes that the next token is an operator class. I much preferred the previous approach, which distinctly treated operator classes as keywords and was much stricter, but at some point it boils down to personal opinion.
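The peek-based approach described above can be illustrated with a small self-contained sketch. It is not the code in this PR: the `EXPECTED_KEYWORDS` list, the `ColumnItem` struct, and the whitespace-based splitting are simplifying assumptions, and it only handles plain column names, not expressions containing commas.

```rust
/// Words that may follow a column in an index column list.
/// Illustrative subset, assumed for this sketch.
const EXPECTED_KEYWORDS: [&str; 3] = ["ASC", "DESC", "NULLS"];

/// Hypothetical result type for one parsed column item.
#[derive(Debug, PartialEq)]
struct ColumnItem {
    column: String,
    operator_class: Option<String>,
    /// `Some(true)` for ASC, `Some(false)` for DESC, `None` when omitted.
    asc: Option<bool>,
}

fn is_expected_keyword(word: &str) -> bool {
    EXPECTED_KEYWORDS.iter().any(|kw| kw.eq_ignore_ascii_case(word))
}

/// Parses one item such as `name gin_trgm_ops` or `age DESC`.
fn parse_index_column(item: &str) -> Option<ColumnItem> {
    let mut words = item.split_whitespace().peekable();
    let column = words.next()?.to_string();

    // Peek without consuming: if the next word is not one of the expected
    // keywords, assume it is an operator class and consume it.
    let next_is_operator_class =
        matches!(words.peek(), Some(w) if !is_expected_keyword(w));
    let operator_class = if next_is_operator_class {
        words.next().map(str::to_string)
    } else {
        None
    };

    // Whatever follows is treated as an optional sort direction.
    let asc = match words.next() {
        Some(w) if w.eq_ignore_ascii_case("ASC") => Some(true),
        Some(w) if w.eq_ignore_ascii_case("DESC") => Some(false),
        _ => None,
    };

    Some(ColumnItem { column, operator_class, asc })
}

/// Splits a simple comma-separated column list and parses each item.
fn parse_index_columns(list: &str) -> Vec<ColumnItem> {
    list.split(',').filter_map(parse_index_column).collect()
}
```

For the list `name gin_trgm_ops,age DESC`, the first item gets `gin_trgm_ops` as its operator class, while `DESC` in the second item is recognized as a keyword and no operator class is recorded. The trade-off is the one raised in the comment: any unexpected word in that position is silently accepted as an operator class.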

Let me know your opinion on it.

LucaCappelletti94 · 2025-02-25T11:13:03Z

I tried to do the merge using the GitHub tool but it was lagging immensely, I will now fix it offline.

src/parser/mod.rs

iffyio · 2025-02-26T05:57:24Z

tests/sqlparser_postgres.rs

+
+#[test]
+fn parse_create_projects_name_description_trgm_index() {
+    let sql = "CREATE INDEX projects_name_description_trgm_idx ON projects USING GIN (concat_projects_name_description(name, description) gin_trgm_ops)";


This test scenario looks identical to parse_create_users_name_trgm_index in terms of coverage?
I'm thinking that generally, for the tests, we can group them into the same test function. It would be sufficient to have the AST assertion on only one of the test scenarios; for the rest we can rely only on verified_stmt(). That would keep the test code smaller and make it easier to track which part of the syntax is covered.

Then could we add scenarios for all the introduced index types (i.e. BRIN, BLOOM etc)? As well as the custom type which probably would live in common I imagine?

Co-authored-by: Ifeanyi Ubah <[email protected]>

iffyio

LGTM! Thanks @LucaCappelletti94!
cc @alamb

Co-authored-by: Ifeanyi Ubah <[email protected]>

LucaCappelletti94 added 5 commits February 5, 2025 17:44

Added first tentative tests

da57618

Added support and test to handle operator classes for indices

49e001e

Removed

5bb6cb3

Formatted code

30c20f8

Resolved issues regarding serde derive

47ea5d4

LucaCappelletti94 marked this pull request as ready for review February 6, 2025 15:48

alamb reviewed Feb 7, 2025

LucaCappelletti94 added 2 commits February 8, 2025 10:25

Merge branch 'main' into gin_trgm_ops

2f95334

iffyio reviewed Feb 11, 2025

Update src/ast/spans.rs

8e8608c

Co-authored-by: Ifeanyi Ubah <[email protected]>

iffyio marked this pull request as draft February 19, 2025 17:48

zzzdong mentioned this pull request Feb 25, 2025

Add support column prefix index for MySQL #1732

Closed

LucaCappelletti94 added 2 commits February 25, 2025 12:02

Removed operator classes and replaced it with a simpler Ident

7bcd499

Replaced custom index object name with simpler "identifier"

1709649

Merge branch 'main' into gin_trgm_ops

9ef6db7

LucaCappelletti94 added 2 commits February 25, 2025 12:19

Fixed errors relative to improper GitHub merge

1049bc0

Fixe clippy code smell

db5c9e7

LucaCappelletti94 marked this pull request as ready for review February 25, 2025 11:27

Removed #[cfg_attr(feature = "std", derive(Debug))]

8d08bd5

LucaCappelletti94 added 2 commits February 25, 2025 12:30

Updtaed the parse_create_index_expr index description

bcdde9b

Added multi-column test

be9ec88

iffyio reviewed Feb 26, 2025

LucaCappelletti94 and others added 9 commits February 27, 2025 10:34

Update src/parser/mod.rs

71d5bf1

Co-authored-by: Ifeanyi Ubah <[email protected]>

Update src/parser/mod.rs

c8e906c

Co-authored-by: Ifeanyi Ubah <[email protected]>

Update src/parser/mod.rs

6aba425

Co-authored-by: Ifeanyi Ubah <[email protected]>

Update src/parser/mod.rs

57e5c14

Co-authored-by: Ifeanyi Ubah <[email protected]>

Update src/parser/mod.rs

e941cb0

Co-authored-by: Ifeanyi Ubah <[email protected]>

Update src/parser/mod.rs

653dba0

Co-authored-by: Ifeanyi Ubah <[email protected]>

Replaced bool const for parameter as per issue request

fd659ff

Extended operator class tests and added test for bloom index syntax

f11a614

Added a test for the BRIN index type

7651771

LucaCappelletti94 requested a review from iffyio March 2, 2025 08:00

iffyio approved these changes Mar 4, 2025

iffyio changed the title ~~Extending support for INDEX parsing~~ Extend support for INDEX parsing Mar 4, 2025

iffyio merged commit 6ec5223 into apache:main Mar 4, 2025
9 checks passed

QuenKar pushed a commit to QuenKar/datafusion-sqlparser-rs that referenced this pull request Mar 25, 2025

Extend support for INDEX parsing (apache#1707)

b07053e

Co-authored-by: Ifeanyi Ubah <[email protected]>

ayman-sigma pushed a commit to sigmacomputing/sqlparser-rs that referenced this pull request Apr 10, 2025

Extend support for INDEX parsing (apache#1707)

6f0fdd6

Co-authored-by: Ifeanyi Ubah <[email protected]>
