Ensure accurate encoding/decoding of big and small floats #4042

Closed
wants to merge 1 commit

Conversation

trottier

ujson has trouble with very big, and especially very small, floats:

ultrajson/ultrajson#69
ultrajson/ultrajson#83
ultrajson/ultrajson#90

From: #1263 (comment)

>>> import ujson
>>> ujson.dumps(1e-40)
'0.0'
>>> ujson.dumps(1e-40, double_precision=17)
'0.0'

I believe this makes ujson inappropriate for inclusion in pandas.

This pull request adds tests that ujson should, but currently won't, completely pass. I don't yet feel sufficiently familiar with pandas' design to substitute in (e.g.) simplejson.
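
For reference, the tests boil down to round-trip checks across a range of magnitudes, roughly along these lines (a sketch only; the helper name and exact values are illustrative, not the code added in this PR):

import ujson

def assert_roundtrips(value):
    # Encode at the maximum supported precision, decode, and require that
    # the decoded float equals the original exactly.
    decoded = ujson.loads(ujson.dumps(value, double_precision=15))
    assert decoded == value, (value, decoded)

for exponent in range(-40, 41, 10):
    assert_roundtrips(1.2345678901234567 * 10.0 ** exponent)

The small-exponent cases are the ones ujson currently fails.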

@Komnomnomnom
Contributor

@trottier the bundled ujson is a custom modified fork to add built-in support for encoding and decoding pandas objects at the C-level (better performance), so it's not a simple drop-in. IIRC standard ujson is also significantly faster than simplejson.

That said, though, I do wonder if this could be fixed. It seems to handle very large numbers pretty well, and a very brief look at the code suggests it just reverts to sprintf if the value is above a threshold; I wonder whether something similar could be done for very small values. I'm not in a position to play with this now, but I'll have a deeper look when I get some time.

@ghost

ghost commented Jun 26, 2013

ujson knows how to deal with numpy arrays directly, which meshes well with pandas' underlying use of them. simplejson (or the stdlib json, same thing) does not. Crudely:

In [16]: import ujson
    ...: import simplejson
    ...: a=np.arange(10000,dtype='f')
    ...: %timeit ujson.dumps(a)
    ...: try:
    ...:      simplejson.dumps(a)
    ...: except:
    ...:      print("well, no.")
    ...: %timeit simplejson.dumps(a.tolist())
1000 loops, best of 3: 967 µs per loop
well, no.
100 loops, best of 3: 4 ms per loop

@trottier
Author

Interesting -- I didn't know that ujson had been modified. It sounds like it would be simpler to fix ujson than to switch to something else. Regardless, proper floating-point handling in ujson is critical for an application like pandas.

ujson doesn't really handle very large numbers all that well:

>>> ujson.dumps(1e20)
'1.000000000000000e+20'
>>> simplejson.dumps(1e20)
'1e+20'
>>>

... in its representation it implies precision that isn't there. Perhaps this is because it doesn't use "precision" properly:

>>> ujson.dumps(1e20, double_precision=1)
'1.000000000000000e+20' # should be '1.0e20'

@dieterv77
Contributor

Just to provide a little more color on this:
Currently, ultrajson will use sprintf("%.15e") to print values whose absolute value exceeds 1e16 - 1. In that case, the double_precision argument is unused.

I also created a branch where I did the same thing for values below 1e-16 (dieterv77/ultrajson@17d9705). This way you get:
In [3]: ujson.dumps(1e-40)
Out[3]: '9.999999999999999e-41'
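
For anyone not reading the C, the thresholding being described works roughly like this (a Python sketch only; the thresholds and format strings approximate the C code rather than quote it):

def encode_double(value, double_precision=10):
    # Exponential notation beyond the thresholds, fixed-point otherwise.
    abs_value = abs(value)
    if abs_value >= 1e16 or (abs_value != 0.0 and abs_value < 1e-16):
        return "%.15e" % value   # the sprintf("%.15e") fallback path
    return "%.*f" % (double_precision, value)  # truncated to double_precision digits

With the extra small-value branch, 1e-40 comes out in exponential form instead of collapsing to '0.0'.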

@njsmith

njsmith commented Jun 27, 2013

@trottier: I assume the actual problem is that you want to be able to dump data from pandas to json and then read it back again, and got bitten because the json routines throw away precision? That's pretty nasty.

It doesn't matter if you add extra decimal places; IEEE doubles always contain the same amount of precision. The important thing is that for any (finite) float value, you get bit-accurate round-tripping. The usual rule is to print as many decimals as are necessary to accomplish this, but if you have extra that's okay too.
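
To make the criterion concrete, here is a small stdlib-only illustration (not from the thread) of the bit-accurate round-trip test:

import json
import struct

x = 1e-40                  # far below the threshold where ujson collapses to '0.0'
s = json.dumps(x)          # the stdlib uses repr-style shortest output
y = json.loads(s)

# The requirement: the decoded value has exactly the same bits as the original.
assert struct.pack("<d", x) == struct.pack("<d", y)
print(s)                   # '1e-40' on Python 2.7+ (shortest repr)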

That said, double printing/reading is extremely hard, and from a quick skim I wouldn't trust ujson to do it at all. Obviously there's the nasty stuff where it just throws away precision by default, but even if you use a very high double_precision value, I wouldn't be surprised if it still blew up on e.g. denormalized numbers. And I bet it makes rounding errors in some cases; you can't print by just dividing by ten and ignoring rounding error and get accurate results. And even with the new precise_float option for decoding, I bet it completely blows up if it's in a non-C locale. (strtod cannot, in general, read JSON-formatted floats; it might expect a ',' as the decimal separator, etc.)

Python ships with high-quality double<->string conversion routines based on those in netlib -- PyOS_string_to_double, PyOS_double_to_string (or PyOS_ascii_strtod, PyOS_ascii_formatd in pre-2.7 versions).

It's possible that using these routines would produce some measurable slowdown... they do some mallocs internally, and the double_to_string routine in particular returns a newly-malloc'd string that you have to free, instead of writing into an existing buffer. But it'd be pretty easy for someone to patch ujson to use them and then run a benchmark to see whether they're actually slower. (And a 10% slowdown or whatever would be a pretty reasonable trade-off for, like, getting correct results, anyway...)

Implementation is here if anyone's curious about what's involved in doing this right: http://hg.python.org/cpython/file/44f455e6163d/Python/dtoa.c

@dieterv77
Contributor

FWIW: cPickle in 2.7 uses PyOS_double_to_string(x, 'g', 17, 0, NULL) when not using a binary format to encode a double, and uses PyOS_string_to_double to decode.
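
The same 17-significant-digit formatting is reachable from pure Python via format(), and a quick brute-force check (not from this thread) confirms that 17 digits always suffice to round-trip a double exactly:

import random
import struct

for _ in range(100000):
    # Draw a random 64-bit pattern and reinterpret it as a double.
    x = struct.unpack("<d", struct.pack("<Q", random.getrandbits(64)))[0]
    if x != x or x in (float("inf"), float("-inf")):
        continue           # NaN/inf are not representable in JSON anyway
    assert float(format(x, ".17g")) == x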

@trottier
Author

Some benchmarks, for what they're worth.

import time
try: import numpypy as numpy
except: import numpy

def roundtrip_ujson(val):
    return ujson.loads(ujson.dumps(val))

def roundtrip_python(val):
    return float(repr(val))

def benchmark(roundtrip_fun=roundtrip_python, n=100000):
    p = []; stored = []
    for i in range(n):
        d = time.time()
        val = roundtrip_fun(d)
        delt = time.time() - d
        p.append(delt)
        stored.append(val) # just in case pypy optimized it away

    print "Mean", numpy.mean(p)
    print "Median", sorted(p)[int(len(p)/2)]

if __name__ == '__main__':
    import sys
    if 'ujson' in sys.argv:
        import ujson
        print "ujson", ujson.__version__, "performance"
        benchmark(roundtrip_fun=roundtrip_ujson)
    else:
        print sys.version, "performance"
        benchmark()

Results:

> /usr/local/bin/python /tmp/benchmark_float.py ujson ; pypy /tmp/benchmark_float.py ; /usr/local/bin/python /tmp/benchmark_float.py 
ujson 1.33 performance
Mean 1.13488674164e-06
Median 9.53674316406e-07
2.7.3 (5acfe049a5b0, May 21 2013, 13:47:22)
[PyPy 2.0.2 with GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] performance
Mean 1.66995048523e-06
Median 1.90734863281e-06
2.7.3 (default, Mar 29 2013, 17:18:17)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.27)] performance
Mean 2.4831032753e-06
Median 2.14576721191e-06

@trottier
Author

So @wesm, in v0.12.0 what guarantee is there that users won't lose data when converting small floats to JSON? Should a warning be issued, or should the conversion fail, when, say, abs(val) < 1e-3?

@wesm
Member

wesm commented Jul 19, 2013

With the update to the newest ujson the story is better:

In [2]: json.dumps(1e-5)
Out[2]: '0.00001'

In [3]: json.dumps(1e-6)
Out[3]: '0.000001'

In [4]: json.dumps(1e-7)
Out[4]: '0.0000001'

In [5]: json.dumps(1e-8)
Out[5]: '0.00000001'

In [6]: json.dumps(1e-9)
Out[6]: '0.000000001'

In [7]: json.dumps(1e-10)
Out[7]: '0.0000000001'

In [8]: json.dumps(1e-11)
Out[8]: '0.0'

In [9]: json.dumps(1e-10)
Out[9]: '0.0000000001'

In [10]: json.dumps(1e-11)
Out[10]: '0.0'

But still not good. Can anyone give some kind of assessment of how hard this would be to fix? For example, swapping in simplejson's float stringification.

@jreback
Contributor

jreback commented Jul 19, 2013

As a workaround, this is pretty easy to test for:

- currently raising, but could do a warning instead
- better message?

In [1]: DataFrame([[1e-16,'foo',1e-8]],columns=list('ABC')).to_json()
ValueError: ujson currently cannot accurately format float data less
than 1e-15. A workaround is to multiply the data by
a large positive factor and divide on deserialization
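
A sketch of the workaround that error message suggests, for a frame of uniformly small floats (the scale factor is arbitrary, and since scaling by a power of ten is itself inexact in binary floating point, this mitigates rather than eliminates the problem):

import pandas as pd

SCALE = 1e15                                 # illustrative; any large factor works

df = pd.DataFrame([[1e-16, 1e-18]], columns=list('AB'))
encoded = (df * SCALE).to_json()             # scale up before serialising
decoded = pd.read_json(encoded) / SCALE      # scale back down after decoding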

@jreback
Contributor

jreback commented Jul 19, 2013

PR #4295

@trottier
Author

Well, part of the issue is that we don't really know how much precision users will want or expect, and they won't really be able to ask for it. E.g. 1.2345678901234567 (~16 decimal places) is a decimal representation of a double that uses all the available precision. ujson doesn't use the word "precision" properly, so if you try to represent 1.2345678901234567e-1 you'll lose one decimal place of precision, at e-2 you'll lose two, etc.

ujson.dumps(1.23456789123456e-5, double_precision=100)
==> '0.000012345678912'

ujson.dumps(1.23456789123456e-6, double_precision=100)
==> '0.000001234567891'

@Komnomnomnom
Contributor

@wesm double_precision defaults to 10, although it caps out at 15 (and should probably raise a ValueError beyond that)

In [2]: dumps(1e-10)
Out[2]: '0.0000000001'

In [3]: dumps(1e-11)
Out[3]: '0.0'

In [4]: dumps(1e-11, double_precision=15)
Out[4]: '0.00000000001'

In [5]: dumps(1e-12, double_precision=15)
Out[5]: '0.000000000001'

In [6]: dumps(1e-13, double_precision=15)
Out[6]: '0.0000000000001'

In [7]: dumps(1e-14, double_precision=15)
Out[7]: '0.00000000000001'

In [8]: dumps(1e-15, double_precision=15)
Out[8]: '0.000000000000001'

In [9]: dumps(1e-16, double_precision=15)
Out[9]: '0.0'

In [10]: dumps(1e-16, double_precision=20)
Out[10]: '0.0'

In [11]: dumps(1e-17, double_precision=20)
Out[11]: '0.0'

In [12]: dumps(1e-16, double_precision=20)
Out[12]: '0.0'

@wesm
Member

wesm commented Jul 19, 2013

Well, if it's any help, here is what Chrome does:

> JSON.stringify(1e-40)
"1e-40"
> JSON.stringify(1e-7)
"1e-7"
> JSON.stringify(1e-6)
"0.000001"
> JSON.stringify(1e-50)
"1e-50"

@Komnomnomnom
Contributor

I'm playing around with calling sprintf to get similar behaviour. I'll send through a PR if I get it to play nice.

@njsmith

njsmith commented Jul 19, 2013

Don't use sprintf; use a good-quality dedicated float/string converter, like the one provided by the Python runtime and mentioned above.

Komnomnomnom added a commit to Komnomnomnom/pandas that referenced this pull request Jul 19, 2013
@Komnomnomnom
Contributor

@njsmith in my PR sprintf is only used for very small and very large numbers, which means it will handle some of the numbers given here a bit better. It might be worth investigating the optional use of PyOS_string_to_double etc. in future, though, if you think it will provide exact round-trip precision.
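
One way to evaluate that (again just a sketch, not part of the PR) is to probe the encode/decode pair for bit-exactness across a sweep of magnitudes and see where it still falls short:

import struct
import ujson

def bits(x):
    return struct.pack("<d", x)

failing = []
for exponent in range(-300, 301, 25):
    x = 1.2345678901234567 * 10.0 ** exponent
    y = ujson.loads(ujson.dumps(x, double_precision=15))
    if bits(y) != bits(x):
        failing.append(exponent)

print("exponents that do not round-trip exactly: %r" % failing)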

@trottier
Author

For the record: I was mistaken in grouping together ujson's treatment of big and small numbers. There are issues with how it handles both, but the issues are only warning-level/not-nice for big numbers (i.e., its use of "precision" is wrong), while they give rise to incorrect behavior for the small ones.

@Komnomnomnom
Contributor

@trottier can you have a look at #4299

jreback added a commit that referenced this pull request Jul 20, 2013
ENH: ujson better handling of very large and very small numbers, throw ValueError for bad double_precision arg #4042
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Jul 25, 2013
* commit 'v0.12.0rc1-127-gec8920a':
  DOC: docs for precise_float option in read_json
  BUG: explicity change nan -> NaT when assigning to datelike dtypes
  ENH: expose ujson precise_float argument on decode
  ENH: ujson better handling of very large and very small numbers, throw ValueError for bad double_precision arg pandas-dev#4042
  minor: some trailing spaces and a pylint "pragma" to stop complaining about Series._ix defined elsewhere
  ENH: test_perf.py - use psutil to set affinity (if absent functionality - then affinity module)
  TST: print out byteorder in ci/print_versions.py
  DOC: Fix typo.
  Update CONTRIBUTING.md with note on attribution in PRs