Don't lose precision when parsing decimal fractions #130


Merged: 2 commits into master on Sep 2, 2019
Conversation

benesch (Contributor) commented Jul 8, 2019

The SQL standard requires that numeric literals with a decimal point,
like 1.23, are represented exactly, up to some precision. That means
that parsing these literals into f64s is invalid, as it is impossible
to represent many decimal numbers exactly in binary floating point (for
example, 0.3).

This commit instead parses these literals into a big decimal type that
is capable of representing every decimal fraction exactly. Downstream
consumers can lower this representation into something more efficient
but less precise (like the old f64) if desired.
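The loss described above is easy to demonstrate in plain Rust; a minimal sketch using only the standard library:

```rust
fn main() {
    // 0.3 has no exact binary floating-point representation, so the
    // familiar identity 0.1 + 0.2 == 0.3 fails for f64.
    assert_ne!(0.1_f64 + 0.2_f64, 0.3_f64);

    // Printing with extra digits exposes the nearest representable value.
    println!("{:.20}", 0.3_f64);

    // The decimal string, by contrast, carries the literal exactly.
    let literal = "0.3";
    assert_eq!(literal, "0.3");
}
```

A big decimal type sidesteps the problem by storing the digits and scale of the literal itself rather than a binary approximation.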

coveralls commented Jul 8, 2019

Pull Request Test Coverage Report for Build 449

  • 86 of 89 (96.63%) changed or added relevant lines in 5 files are covered.
  • 7 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.1%) to 92.359%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| src/parser.rs | 8 | 11 | 72.73% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| tests/sqlparser_postgres.rs | 2 | 88.59% |
| tests/sqlparser_common.rs | 5 | 90.79% |

Totals: change from base Build 440: -0.1%; covered lines: 4315; relevant lines: 4672


nickolay (Contributor) left a comment

This is great!

One thing that bothers me is that we use Value::Date(String) in the AST for DATE literals, instead of bundling chrono; while for decimals you've elected to rely on bigdecimal. Is there a reason for this discrepancy or do you see us adding chrono for date/time values later?

Is having values parsed in the AST useful in the context of an SQL engine? I haven't studied any SQL implementations, but I imagined one would have to make at least one pass over the AST (to resolve the object names), producing a different structure before inspecting the values. I'd like to know if that assumption is wrong.

I have a slight preference for adding a way to parse the literals to this crate as an optional cargo feature, but I realize that my obsession with build times can sometimes be unhealthy:)

benesch (Contributor, Author) commented Jul 10, 2019

One thing that bothers me is that we use Value::Date(String) in the AST for DATE literals, instead of bundling chrono; while for decimals you've elected to rely on bigdecimal. Is there a reason for this discrepancy or do you see us adding chrono for date/time values later?

Yes, this bothers me too! The reason I went with this is because it seemed really disruptive to remove all integer/float parsing, but I'm starting to lean towards that approach. There's a problem right now in that someone can write SELECT 1234123412341234123412341242314123431241324321412341234, which is perfectly parseable as an arbitrary-precision decimal in Postgres, and our SQL parser will fail to parse it because the literal isn't representable in a u64. Really the best thing to do is to return the literal as a string, and let downstream consumers handle it.
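The overflow failure mode described above can be reproduced with the standard library alone; a minimal sketch:

```rust
fn main() {
    // This literal is a perfectly valid arbitrary-precision number in
    // Postgres, but it is far larger than u64::MAX (about 1.8e19), so a
    // parser that insists on u64 must reject it.
    let lit = "1234123412341234123412341242314123431241324321412341234";
    assert!(lit.parse::<u64>().is_err());

    // Returning the literal as a String instead defers the choice of
    // numeric representation to the downstream consumer.
    let kept = lit.to_string();
    assert_eq!(kept, lit);
}
```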

Is having values parsed in the AST useful in the context of an SQL engine? I haven't studied any SQL implementations, but I imagined one would have to make at least one pass over the AST (to resolve the object names), producing a different structure before inspecting the values. I'd like to know if that assumption is wrong.

Nope, it's not that useful. It was useful when we (Materialize) were just prototyping our SQL engine, but as it's evolved we've found that the value parsing was too opinionated. It's fine when the parser produces exactly the literal representation we need (e.g., using this bigdecimal crate, the ordered-float crate, etc.), but that approach certainly doesn't scale as we gain additional users with different and conflicting requirements.

I have a slight preference for adding a way to parse the literals to this crate as an optional cargo feature, but I realize that my obsession with build times can sometimes be unhealthy:)

Very on board with that! How do you feel about this as the standard Value enum?

pub enum Value {
    /// Numeric literal
    Number(String),
    /// 'string value'
    SingleQuotedString(String),
    /// N'string value'
    NationalStringLiteral(String),
    /// X'hex value'
    HexStringLiteral(String),
    /// Boolean value true or false
    Boolean(bool),
    /// `DATE '...'` literals
    Date(String),
    /// `TIME '...'` literals
    Time(String),
    /// `TIMESTAMP '...'` literals
    Timestamp(String),
    /// INTERVAL literals, roughly in the following format:
    /// `INTERVAL '<value>' <leading_field> [ (<leading_precision>) ]
    /// [ TO <last_field> [ (<fractional_seconds_precision>) ] ]`,
    /// e.g. `INTERVAL '123:45.67' MINUTE(3) TO SECOND(2)`.
    ///
    /// The parser does not validate the `<value>`, nor does it ensure
    /// that the `<leading_field>` units >= the units in `<last_field>`,
    /// so the user will have to reject intervals like `HOUR TO YEAR`.
    Interval {
        value: String,
        leading_field: DateTimeField,
        leading_precision: Option<u64>,
        last_field: Option<DateTimeField>,
        /// The seconds precision can be specified in SQL source as
        /// `INTERVAL '__' SECOND(_, x)` (in which case the `leading_field`
        /// will be `Second` and the `last_field` will be `None`),
        /// or as `__ TO SECOND(x)`.
        fractional_seconds_precision: Option<u64>,
    },
    /// `NULL` value
    Null,
}

It feels a bit strange at first, but seems quite sensible once you sit with it. We could retain the existing Value variants behind a feature flag, plus perhaps this decimal variant too. Let me rework this PR accordingly.
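A downstream consumer of such an enum would lower `Number(String)` into whatever representation it needs. A minimal sketch, using a simplified stand-in for the proposed `Value` (only a few variants, illustrative names):

```rust
// Simplified stand-in for the proposed Value enum (illustrative only).
#[allow(dead_code)]
enum Value {
    Number(String),
    Boolean(bool),
    Null,
}

// Each consumer picks its own numeric representation; this one uses f64,
// accepting the precision loss that the parser itself no longer imposes.
fn lower_to_f64(v: &Value) -> Option<f64> {
    match v {
        Value::Number(s) => s.parse().ok(),
        _ => None,
    }
}

fn main() {
    let v = Value::Number("1.23".to_string());
    assert_eq!(lower_to_f64(&v), Some(1.23));
    assert_eq!(lower_to_f64(&Value::Null), None);
}
```

Another consumer could just as easily parse the same string into a `BigDecimal` or an arbitrary-precision integer; the AST stays neutral.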

nickolay (Contributor) commented
How do you feel about this as the standard Value enum?

So a single Value::Number(String) instead of Value::Long(u64) / Value::Double(OrderedFloat<f64>). That makes sense!

I'm not sure I understand: why provide an option to keep existing variants, instead of just providing an optional bigdecimal?

benesch (Contributor, Author) commented Aug 13, 2019

So a single Value::Number(String) instead of Value::Long(u64) / Value::Double(OrderedFloat). That makes sense!

Yep, this is done now! Please take a look when you get a chance.

I'm not sure I understand: why provide an option to keep existing variants, instead of just providing an optional bigdecimal?

I think we're just saying the same thing in different ways. I coded up what I meant, at least, in the second commit in this PR—let me know if that's also what you had in mind!

nickolay (Contributor) left a comment

I think we're just saying the same thing in different ways.

You're right, this is exactly what I had in mind. Thanks!

src/parser.rs (outdated)

```diff
@@ -1854,20 +1847,17 @@ impl Parser {
     }

     /// Parse a LIMIT clause
-    pub fn parse_limit(&mut self) -> Result<Option<Expr>, ParserError> {
+    pub fn parse_limit(&mut self) -> Result<Option<u64>, ParserError> {
```
nickolay (Contributor) commented:

I'm not against this change, but I didn't make it sooner because at least Postgres allows expressions here.

benesch (Contributor, Author) commented:

Oh, good point. I've backed this bit out.

src/parser.rs (outdated)

```diff
-        let value = self
-            .parse_literal_uint()
-            .map(|n| Expr::Value(Value::Long(n)))?;
+    pub fn parse_offset(&mut self) -> Result<u64, ParserError> {
```
benesch (Contributor, Author) commented:
Backed this bit out too.

```diff
@@ -1176,7 +1176,10 @@ impl Parser {
                 return parser_err!(format!("No value parser for keyword {}", k.keyword));
             }
         },
-        Token::Number(ref n) => Ok(Value::Number(n.to_string())),
+        Token::Number(ref n) => match n.parse() {
```
nickolay (Contributor) commented:

I was confused by this at first, as the IDE suggested that parse() returns a String. Perhaps add a comment that this is for feature=bigdecimal?

benesch (Contributor, Author) commented:

Sure thing, done!

benesch (Contributor, Author) left a comment

Thanks for the review! I believe I've addressed all your feedback.

The SQL standard requires that numeric literals with a decimal point,
like 1.23, are represented exactly, up to some precision. That means
that parsing these literals into f64s is invalid, as it is impossible
to represent many decimal numbers exactly in binary floating point (for
example, 0.3).

This commit parses all numeric literals into a new `Value` variant
`Number(String)`, removing the old `Long(u64)` and `Double(f64)`
variants. This is slightly less convenient for downstream consumers, but
far more flexible, as numbers that do not fit into a u64 and f64 are now
representable.

With `--features bigdecimal`, parse numbers into BigDecimals instead of
leaving them as strings.
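The feature-gated approach described in the commit message can be sketched roughly as follows. The type alias and function names here are hypothetical, not the crate's actual API; the sketch only illustrates the cfg pattern:

```rust
// Sketch of feature-gated number parsing (illustrative names).
// With the feature enabled, numbers become BigDecimals; without it,
// the literal is kept verbatim as a String.
#[cfg(feature = "bigdecimal")]
type ParsedNumber = bigdecimal::BigDecimal;
#[cfg(not(feature = "bigdecimal"))]
type ParsedNumber = String;

fn parse_number(s: &str) -> Result<ParsedNumber, String> {
    // FromStr resolves to BigDecimal::from_str or String::from_str
    // depending on which cfg branch is compiled in.
    s.parse()
        .map_err(|_| format!("could not parse {} as number", s))
}

fn main() {
    let n = parse_number("1.23").unwrap();
    // Either way, the decimal fraction survives exactly.
    assert_eq!(n.to_string(), "1.23");
}
```

Because the common code only relies on `FromStr` and `ToString`, downstream crates that never enable the feature pay no build-time cost for bigdecimal.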
nickolay (Contributor) commented Sep 1, 2019

I believe I've addressed all your feedback.

Yep, thank you!

benesch merged commit f4df340 into master on Sep 2, 2019 and deleted the decimal branch.