Don't lose precision when parsing decimal fractions #130


Merged: 2 commits into master on Sep 2, 2019
Conversation

benesch (Contributor) commented Jul 8, 2019

The SQL standard requires that numeric literals with a decimal point,
like 1.23, are represented exactly, up to some precision. That means
that parsing these literals into f64s is invalid, as it is impossible
to represent many decimal numbers exactly in binary floating point (for
example, 0.3).

This commit instead parses these literals into a big decimal type that
is capable of representing every decimal fraction exactly. Downstream
consumers can lower this representation into something more efficient
but less precise (like the old f64) if desired.
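The loss described above is easy to demonstrate in plain Rust; a minimal sketch using only the standard library:

```rust
fn main() {
    // 0.3 has no exact binary floating-point representation, so the
    // familiar identity 0.1 + 0.2 == 0.3 fails for f64.
    assert_ne!(0.1_f64 + 0.2_f64, 0.3_f64);

    // Printing with extra digits exposes the nearest representable value.
    println!("{:.20}", 0.3_f64);

    // The decimal string, by contrast, carries the literal exactly.
    let literal = "0.3";
    assert_eq!(literal, "0.3");
}
```

A big decimal type sidesteps the problem by storing the digits and scale of the literal itself rather than a binary approximation.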

coveralls commented Jul 8, 2019

Pull Request Test Coverage Report for Build 449

  • 86 of 89 (96.63%) changed or added relevant lines in 5 files are covered.
  • 7 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.1%) to 92.359%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| src/parser.rs | 8 | 11 | 72.73% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| tests/sqlparser_postgres.rs | 2 | 88.59% |
| tests/sqlparser_common.rs | 5 | 90.79% |

Totals: change from base Build 440: -0.1%; covered lines: 4315; relevant lines: 4672


nickolay (Contributor) left a comment

This is great!

One thing that bothers me is that we use Value::Date(String) in the AST for DATE literals, instead of bundling chrono; while for decimals you've elected to rely on bigdecimal. Is there a reason for this discrepancy or do you see us adding chrono for date/time values later?

Is having values parsed in the AST useful in the context of an SQL engine? I haven't studied any SQL implementations, but I imagined one would have to make at least one pass over the AST (to resolve the object names), producing a different structure before inspecting the values. I'd like to know if that assumption is wrong.

I have a slight preference for adding a way to parse the literals to this crate as an optional cargo feature, but I realize that my obsession with build times can sometimes be unhealthy:)

benesch (Contributor, Author) commented Jul 10, 2019

One thing that bothers me is that we use Value::Date(String) in the AST for DATE literals, instead of bundling chrono; while for decimals you've elected to rely on bigdecimal. Is there a reason for this discrepancy or do you see us adding chrono for date/time values later?

Yes, this bothers me too! The reason I went with this is because it seemed really disruptive to remove all integer/float parsing, but I'm starting to lean towards that approach. There's a problem right now in that someone can write SELECT 1234123412341234123412341242314123431241324321412341234, which is perfectly parseable as an arbitrary-precision decimal in Postgres, and our SQL parser will fail to parse it because the literal isn't representable in a u64. Really the best thing to do is to return the literal as a string, and let downstream consumers handle it.
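The overflow failure mode described above can be reproduced with the standard library alone; a minimal sketch:

```rust
fn main() {
    // This literal is a perfectly valid arbitrary-precision number in
    // Postgres, but it is far larger than u64::MAX (about 1.8e19), so a
    // parser that insists on u64 must reject it.
    let lit = "1234123412341234123412341242314123431241324321412341234";
    assert!(lit.parse::<u64>().is_err());

    // Returning the literal as a String instead defers the choice of
    // numeric representation to the downstream consumer.
    let kept = lit.to_string();
    assert_eq!(kept, lit);
}
```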

Is having values parsed in the AST useful in the context of an SQL engine? I haven't studied any SQL implementations, but I imagined one would have to make at least one pass over the AST (to resolve the object names), producing a different structure before inspecting the values. I'd like to know if that assumption is wrong.

Nope, it's not that useful. It was useful when we (Materialize) were just prototyping our SQL engine, but as it's evolved we've found that the value parsing was too opinionated. It's fine when the parser produces exactly the literal representation we need (e.g., using this bigdecimal crate, the ordered-float crate, etc.), but that approach certainly doesn't scale as we gain additional users with different and conflicting requirements.

I have a slight preference for adding a way to parse the literals to this crate as an optional cargo feature, but I realize that my obsession with build times can sometimes be unhealthy:)

Very on board with that! How do you feel about this as the standard Value enum?

pub enum Value {
    /// Numeric literal
    Number(String),
    /// 'string value'
    SingleQuotedString(String),
    /// N'string value'
    NationalStringLiteral(String),
    /// X'hex value'
    HexStringLiteral(String),
    /// Boolean value true or false
    Boolean(bool),
    /// `DATE '...'` literals
    Date(String),
    /// `TIME '...'` literals
    Time(String),
    /// `TIMESTAMP '...'` literals
    Timestamp(String),
    /// INTERVAL literals, roughly in the following format:
    /// `INTERVAL '<value>' <leading_field> [ (<leading_precision>) ]
    /// [ TO <last_field> [ (<fractional_seconds_precision>) ] ]`,
    /// e.g. `INTERVAL '123:45.67' MINUTE(3) TO SECOND(2)`.
    ///
    /// The parser does not validate the `<value>`, nor does it ensure
    /// that the `<leading_field>` units >= the units in `<last_field>`,
    /// so the user will have to reject intervals like `HOUR TO YEAR`.
    Interval {
        value: String,
        leading_field: DateTimeField,
        leading_precision: Option<u64>,
        last_field: Option<DateTimeField>,
        /// The seconds precision can be specified in SQL source as
        /// `INTERVAL '__' SECOND(_, x)` (in which case the `leading_field`
        /// will be `Second` and the `last_field` will be `None`),
        /// or as `__ TO SECOND(x)`.
        fractional_seconds_precision: Option<u64>,
    },
    /// `NULL` value
    Null,
}

It feels a bit strange at first, but seems quite sensible once you sit with it. We could retain the existing Value variants behind a feature flag, plus perhaps this decimal variant too. Let me rework this PR accordingly.
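A downstream consumer of such an enum would lower `Number(String)` into whatever representation it needs. A minimal sketch, using a simplified stand-in for the proposed `Value` (only a few variants, illustrative names):

```rust
// Simplified stand-in for the proposed Value enum (illustrative only).
#[allow(dead_code)]
enum Value {
    Number(String),
    Boolean(bool),
    Null,
}

// Each consumer picks its own numeric representation; this one uses f64,
// accepting the precision loss that the parser itself no longer imposes.
fn lower_to_f64(v: &Value) -> Option<f64> {
    match v {
        Value::Number(s) => s.parse().ok(),
        _ => None,
    }
}

fn main() {
    let v = Value::Number("1.23".to_string());
    assert_eq!(lower_to_f64(&v), Some(1.23));
    assert_eq!(lower_to_f64(&Value::Null), None);
}
```

Another consumer could just as easily parse the same string into a `BigDecimal` or an arbitrary-precision integer; the AST stays neutral.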

nickolay (Contributor) commented
How do you feel about this as the standard Value enum?

So a single Value::Number(String) instead of Value::Long(u64) / Value::Double(OrderedFloat<f64>). That makes sense!

I'm not sure I understand: why provide an option to keep existing variants, instead of just providing an optional bigdecimal?

benesch (Contributor, Author) commented Aug 13, 2019

So a single Value::Number(String) instead of Value::Long(u64) / Value::Double(OrderedFloat). That makes sense!

Yep, this is done now! Please take a look when you get a chance.

I'm not sure I understand: why provide an option to keep existing variants, instead of just providing an optional bigdecimal?

I think we're just saying the same thing in different ways. I coded up what I meant, at least, in the second commit in this PR—let me know if that's also what you had in mind!

nickolay (Contributor) left a comment

I think we're just saying the same thing in different ways.

You're right, this is exactly what I had in mind. Thanks!

src/parser.rs (outdated)

```diff
@@ -1854,20 +1847,17 @@ impl Parser {
     }

     /// Parse a LIMIT clause
-    pub fn parse_limit(&mut self) -> Result<Option<Expr>, ParserError> {
+    pub fn parse_limit(&mut self) -> Result<Option<u64>, ParserError> {
```
nickolay (Contributor) commented:

I'm not against this change, but I didn't make it sooner because at least Postgres allows expressions here.

benesch (Contributor, Author) commented:

Oh, good point. I've backed this bit out.

src/parser.rs (outdated)

```diff
-        let value = self
-            .parse_literal_uint()
-            .map(|n| Expr::Value(Value::Long(n)))?;
+    pub fn parse_offset(&mut self) -> Result<u64, ParserError> {
```
benesch (Contributor, Author) commented:
Backed this bit out too.

```diff
@@ -1176,7 +1176,10 @@ impl Parser {
                 return parser_err!(format!("No value parser for keyword {}", k.keyword));
             }
         },
-        Token::Number(ref n) => Ok(Value::Number(n.to_string())),
+        Token::Number(ref n) => match n.parse() {
```
nickolay (Contributor) commented:

I was confused by this at first, as the IDE suggested that parse() returns a String. Perhaps add a comment that this is for feature=bigdecimal?

benesch (Contributor, Author) commented:

Sure thing, done!

benesch (Contributor, Author) left a comment

Thanks for the review! I believe I've addressed all your feedback.

The SQL standard requires that numeric literals with a decimal point,
like 1.23, are represented exactly, up to some precision. That means
that parsing these literals into f64s is invalid, as it is impossible
to represent many decimal numbers exactly in binary floating point (for
example, 0.3).

This commit parses all numeric literals into a new `Value` variant
`Number(String)`, removing the old `Long(u64)` and `Double(f64)`
variants. This is slightly less convenient for downstream consumers, but
far more flexible, as numbers that do not fit into a u64 and f64 are now
representable.

With `--features bigdecimal`, parse numbers into BigDecimals instead of
leaving them as strings.
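The feature-gated approach described in the commit message can be sketched roughly as follows. The type alias and function names here are hypothetical, not the crate's actual API; the sketch only illustrates the cfg pattern:

```rust
// Sketch of feature-gated number parsing (illustrative names).
// With the feature enabled, numbers become BigDecimals; without it,
// the literal is kept verbatim as a String.
#[cfg(feature = "bigdecimal")]
type ParsedNumber = bigdecimal::BigDecimal;
#[cfg(not(feature = "bigdecimal"))]
type ParsedNumber = String;

fn parse_number(s: &str) -> Result<ParsedNumber, String> {
    // FromStr resolves to BigDecimal::from_str or String::from_str
    // depending on which cfg branch is compiled in.
    s.parse()
        .map_err(|_| format!("could not parse {} as number", s))
}

fn main() {
    let n = parse_number("1.23").unwrap();
    // Either way, the decimal fraction survives exactly.
    assert_eq!(n.to_string(), "1.23");
}
```

Because the common code only relies on `FromStr` and `ToString`, downstream crates that never enable the feature pay no build-time cost for bigdecimal.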
nickolay (Contributor) commented Sep 1, 2019

I believe I've addressed all your feedback.

Yep, thank you!

benesch merged commit f4df340 into master on Sep 2, 2019 and deleted the decimal branch.