
Fix tokenization of qualified identifiers with numeric prefix. #1803


Merged: 5 commits, Apr 11, 2025
Changes from 2 commits
76 changes: 64 additions & 12 deletions src/tokenizer.rs
@@ -895,7 +895,7 @@ impl<'a> Tokenizer<'a> {
};

let mut location = state.location();
- while let Some(token) = self.next_token(&mut state)? {
+ while let Some(token) = self.next_token(&mut state, buf.last().map(|t| &t.token))? {
Contributor:
Could we implement this in the parser instead, as a special case for self.dialect.supports_numeric_prefix(), somewhat similar to what we do for BigQuery here? That is, when it's time to combine identifiers into a compound identifier, we would check whether each subsequent identifier part is unquoted and prefixed by ., and if so drop the prefix. I imagine that would be done here.

I'm thinking that could be a smaller change, since the behavior needs a bit of extra context around the tokens, which the tokenizer isn't well suited to handle cleanly.
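
For concreteness, a rough sketch of the parser-side check being suggested; the helper name split_numeric_prefix_part and its exact placement are hypothetical, not the crate's actual code:

// Hypothetical sketch (not the crate's actual code): when assembling a
// compound identifier under a dialect with supports_numeric_prefix(), a
// following token such as Word(".1to2") or Number(".1e2") really encodes
// a period plus an unquoted identifier part, so we would split off the
// leading dot and keep the remainder as the next part.
fn split_numeric_prefix_part(token: &Token) -> Option<Ident> {
    match token {
        Token::Word(w) if w.quote_style.is_none() && w.value.starts_with('.') => {
            Some(Ident::new(&w.value[1..]))
        }
        Token::Number(n, _) if n.starts_with('.') => Some(Ident::new(&n[1..])),
        _ => None,
    }
}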

Contributor Author (@romanb, Apr 10, 2025):
Thank you for the review and the suggestion. The latest commit implements an alternative approach within Parser#parse_compound_expr. However, I have to admit that I personally prefer the previous fix in the tokenizer, for a few reasons:

  1. The Tokenizer is part of the public API of the crate, and without fixing this in the tokenizer, it will continue to exhibit the following behavior with the Mysql dialect:

    • t.1to2 tokenizes to t (Word), .1to2 (Word)

    • t.1e2 tokenizes to t (Word), .1e2 (Number)

    • t.`1to2` tokenizes to t (Word), . (Period), 1to2 (Word)

    • t.`1e2` tokenizes to t (Word), . (Period), 1e2 (Word)

      This could be very surprising for users and could arguably be considered incorrect (when using the Mysql dialect). A minimal reproduction follows this comment.

  2. The handling of Word and Number tokens in parse_compound_expr is not the most elegant, since it has to split off the leading . and construct the correct off-by-one spans accordingly.

  3. Unqualified identifiers that start with digits are already handled in the tokenizer - and I think they have to be (see Support identifiers beginning with digits in MySQL #856). Handling the same problem of misinterpreted tokens for qualified identifiers in the parser seems disconnected, like a downstream workaround for the tokenizer producing incorrect tokens; at least that is how I currently perceive it (see point 1).

Let me know which solution you prefer (with or without the latest commit) or if you have another idea.
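
A minimal sketch that reproduces the point-1 behavior using the crate's public tokenizer API (assuming the Tokenizer::new(...).tokenize() entry point); the comments summarize the pre-fix behavior described above:

use sqlparser::dialect::MySqlDialect;
use sqlparser::tokenizer::Tokenizer;

fn main() {
    let dialect = MySqlDialect {};
    for sql in ["t.1to2", "t.1e2", "t.`1to2`", "t.`1e2`"] {
        // Before the fix: "t.1to2" lexes as [Word("t"), Word(".1to2")] and
        // "t.1e2" as [Word("t"), Number(".1e2")], while the quoted forms
        // yield the expected [Word, Period, Word] sequence.
        let tokens = Tokenizer::new(&dialect, sql).tokenize().unwrap();
        println!("{sql} -> {tokens:?}");
    }
}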

Contributor:

I think that's fair; we can revert to the tokenizer version in that case, to keep the tokenizer behavior non-surprising.

Contributor Author:
Done!

let span = location.span_to(state.location());

buf.push(TokenWithSpan { token, span });
@@ -932,7 +932,11 @@ impl<'a> Tokenizer<'a> {
}

/// Get the next token or return None
- fn next_token(&self, chars: &mut State) -> Result<Option<Token>, TokenizerError> {
+ fn next_token(
+     &self,
+     chars: &mut State,
+     prev_token: Option<&Token>,
+ ) -> Result<Option<Token>, TokenizerError> {
match chars.peek() {
Some(&ch) => match ch {
' ' => self.consume_and_return(chars, Token::Whitespace(Whitespace::Space)),
@@ -1211,17 +1215,29 @@ impl<'a> Tokenizer<'a> {
chars.next();
}

+ // If the dialect supports identifiers that start with a numeric prefix
+ // and we have now consumed a dot, check if the previous token was a Word.
+ // If so, what follows is definitely not part of a decimal number and
+ // we should yield the dot as a dedicated token so compound identifiers
+ // starting with digits can be parsed correctly.
+ if s == "." && self.dialect.supports_numeric_prefix() {
+     if let Some(Token::Word(_)) = prev_token {
+         return Ok(Some(Token::Period));
+     }
+ }

// Consume fractional digits.
s += &peeking_next_take_while(chars, |ch, next_ch| {
ch.is_ascii_digit() || is_number_separator(ch, next_ch)
});

- // No number -> Token::Period
+ // No fraction -> Token::Period
if s == "." {
return Ok(Some(Token::Period));
}

- let mut exponent_part = String::new();
+ // Parse exponent as number
+ let mut exponent_part = String::new();
if chars.peek() == Some(&'e') || chars.peek() == Some(&'E') {
let mut char_clone = chars.peekable.clone();
exponent_part.push(char_clone.next().unwrap());
@@ -1250,14 +1266,23 @@ impl<'a> Tokenizer<'a> {
}
}

- // mysql dialect supports identifiers that start with a numeric prefix,
- // as long as they aren't an exponent number.
- if self.dialect.supports_numeric_prefix() && exponent_part.is_empty() {
-     let word =
-         peeking_take_while(chars, |ch| self.dialect.is_identifier_part(ch));
-
-     if !word.is_empty() {
-         s += word.as_str();
+ // If the dialect supports identifiers that start with a numeric prefix,
+ // we need to check if the value is in fact an identifier and must thus
+ // be tokenized as a word.
+ if self.dialect.supports_numeric_prefix() {
+     if exponent_part.is_empty() {
+         // If it is not a number with an exponent, it may be
+         // an identifier starting with digits.
+         let word =
+             peeking_take_while(chars, |ch| self.dialect.is_identifier_part(ch));
+
+         if !word.is_empty() {
+             s += word.as_str();
+             return Ok(Some(Token::make_word(s.as_str(), None)));
+         }
+     } else if prev_token == Some(&Token::Period) {
+         // If the previous token was a period, thus not belonging to a number,
+         // the value we have is part of an identifier.
+         return Ok(Some(Token::make_word(s.as_str(), None)));
+     }
+ }
@@ -3960,4 +3985,31 @@ mod tests {
],
);
}

+ #[test]
+ fn test_tokenize_identifiers_numeric_prefix() {
+     all_dialects_where(|dialect| dialect.supports_numeric_prefix())
+         .tokenizes_to("123abc", vec![Token::make_word("123abc", None)]);
+
+     all_dialects_where(|dialect| dialect.supports_numeric_prefix())
+         .tokenizes_to("12e34", vec![Token::Number("12e34".to_string(), false)]);
+
+     all_dialects_where(|dialect| dialect.supports_numeric_prefix()).tokenizes_to(
+         "t.12e34",
+         vec![
+             Token::make_word("t", None),
+             Token::Period,
+             Token::make_word("12e34", None),
+         ],
+     );
+
+     all_dialects_where(|dialect| dialect.supports_numeric_prefix()).tokenizes_to(
+         "t.1two3",
+         vec![
+             Token::make_word("t", None),
+             Token::Period,
+             Token::make_word("1two3", None),
+         ],
+     );
+ }
}
100 changes: 100 additions & 0 deletions tests/sqlparser_mysql.rs
@@ -1926,6 +1926,106 @@ fn parse_select_with_numeric_prefix_column_name() {
}
}

#[test]
fn parse_qualified_identifiers_with_numeric_prefix() {
// Case 1: Qualified column name that starts with digits.
mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t");
Contributor:
Could we add a test case for the behavior of multiple accesses, e.g. t.15to29.16to30?

Contributor Author:
Good idea, I added such a test.
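
That test is not part of the two commits shown in this view; a hypothetical sketch of what such a case might look like (the actual test added in a later commit may differ):

// Hypothetical sketch: a chained access where every unquoted part after the
// first starts with digits should round-trip and parse as a three-part
// compound identifier: ["t", "15to29", "16to30"].
match mysql().verified_stmt("SELECT t.15to29.16to30 FROM my_table AS t") {
    Statement::Query(q) => match *q.body {
        SetExpr::Select(s) => match s.projection.last() {
            Some(SelectItem::UnnamedExpr(Expr::CompoundIdentifier(parts))) => {
                assert_eq!(
                    &[Ident::new("t"), Ident::new("15to29"), Ident::new("16to30")],
                    &parts[..]
                );
            }
            proj => panic!("Unexpected projection: {:?}", proj),
        },
        body => panic!("Unexpected statement body: {:?}", body),
    },
    stmt => panic!("Unexpected statement: {:?}", stmt),
}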

match mysql()
.parse_sql_statements("SELECT t.15to29 FROM my_table AS t")
.unwrap()
.pop()
{
Contributor:
Suggested change:
- mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t");
- match mysql()
-     .parse_sql_statements("SELECT t.15to29 FROM my_table AS t")
-     .unwrap()
-     .pop()
- {
+ match mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t") {

Does this form work the same, removing the second parse of the statement?

Contributor Author:
Yes, that simplifies the test; I was simply unaware of the return value. Thanks for the suggestion.

Some(Statement::Query(q)) => match *q.body {
SetExpr::Select(s) => match s.projection.last() {
Some(SelectItem::UnnamedExpr(Expr::CompoundIdentifier(parts))) => {
assert_eq!(&[Ident::new("t"), Ident::new("15to29")], &parts[..]);
}
proj => panic!("Unexpected projection: {:?}", proj),
},
body => panic!("Unexpected statement body: {:?}", body),
},
stmt => panic!("Unexpected statement: {:?}", stmt),
}

// Case 2: Qualified column name that starts with digits and on its own represents a number.
mysql().verified_stmt("SELECT t.15e29 FROM my_table AS t");
match mysql()
.parse_sql_statements("SELECT t.15e29 FROM my_table AS t")
.unwrap()
.pop()
{
Some(Statement::Query(q)) => match *q.body {
SetExpr::Select(s) => match s.projection.last() {
Some(SelectItem::UnnamedExpr(Expr::CompoundIdentifier(parts))) => {
assert_eq!(&[Ident::new("t"), Ident::new("15e29")], &parts[..]);
}
proj => panic!("Unexpected projection: {:?}", proj),
},
body => panic!("Unexpected statement body: {:?}", body),
},
stmt => panic!("Unexpected statement: {:?}", stmt),
}

// Case 3: Unqualified, the same token is parsed as a number.
match mysql()
.parse_sql_statements("SELECT 15e29 FROM my_table")
.unwrap()
.pop()
{
Some(Statement::Query(q)) => match *q.body {
SetExpr::Select(s) => match s.projection.last() {
Some(SelectItem::UnnamedExpr(Expr::Value(ValueWithSpan { value, .. }))) => {
assert_eq!(&number("15e29"), value);
}
proj => panic!("Unexpected projection: {:?}", proj),
},
body => panic!("Unexpected statement body: {:?}", body),
},
stmt => panic!("Unexpected statement: {:?}", stmt),
}

// Case 4: Quoted simple identifier.
mysql().verified_stmt("SELECT `15e29` FROM my_table");
match mysql()
.parse_sql_statements("SELECT `15e29` FROM my_table")
.unwrap()
.pop()
{
Some(Statement::Query(q)) => match *q.body {
SetExpr::Select(s) => match s.projection.last() {
Some(SelectItem::UnnamedExpr(Expr::Identifier(name))) => {
assert_eq!(&Ident::with_quote('`', "15e29"), name);
}
proj => panic!("Unexpected projection: {:?}", proj),
},
body => panic!("Unexpected statement body: {:?}", body),
},
stmt => panic!("Unexpected statement: {:?}", stmt),
}

// Case 5: Quoted compound identifier.
mysql().verified_stmt("SELECT t.`15e29` FROM my_table");
match mysql()
.parse_sql_statements("SELECT t.`15e29` FROM my_table AS t")
.unwrap()
.pop()
{
Some(Statement::Query(q)) => match *q.body {
SetExpr::Select(s) => match s.projection.last() {
Some(SelectItem::UnnamedExpr(Expr::CompoundIdentifier(parts))) => {
assert_eq!(
&[Ident::new("t"), Ident::with_quote('`', "15e29")],
&parts[..]
);
}
proj => panic!("Unexpected projection: {:?}", proj),
},
body => panic!("Unexpected statement body: {:?}", body),
},
stmt => panic!("Unexpected statement: {:?}", stmt),
}
}

// Don't run with bigdecimal as it fails like this on rust beta:
//
// 'parse_select_with_concatenation_of_exp_number_and_numeric_prefix_column'