Ensure accurate encoding/decoding of big and small floats #4042

Closed
wants to merge 1 commit

Conversation

trottier

ujson has trouble with very big, and especially very small, floats:

ultrajson/ultrajson#69
ultrajson/ultrajson#83
ultrajson/ultrajson#90

From: #1263 (comment)

>>> import ujson
>>> ujson.dumps(1e-40)
'0.0'
>>> ujson.dumps(1e-40, double_precision=17)
'0.0'

I believe this makes ujson inappropriate for inclusion in pandas.

This pull request adds tests that ujson should, but currently won't, completely pass. I don't yet feel sufficiently familiar with pandas' design to substitute in (e.g.) simplejson.
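
For reference, the tests boil down to round-trip checks across a range of magnitudes, roughly along these lines (a sketch only; the helper name and exact values are illustrative, not the code added in this PR):

import ujson

def assert_roundtrips(value):
    # Encode at the maximum supported precision, decode, and require that
    # the decoded float equals the original exactly.
    decoded = ujson.loads(ujson.dumps(value, double_precision=15))
    assert decoded == value, (value, decoded)

for exponent in range(-40, 41, 10):
    assert_roundtrips(1.2345678901234567 * 10.0 ** exponent)

The small-exponent cases are the ones ujson currently fails.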

@Komnomnomnom
Contributor

@trottier the bundled ujson is a custom modified fork to add built-in support for encoding and decoding pandas objects at the C-level (better performance), so it's not a simple drop-in. IIRC standard ujson is also significantly faster than simplejson.

That said, though, I do wonder if this could be fixed. It seems to handle very large numbers pretty well, and a very brief look at the code suggests it just reverts to sprintf if the value is above a threshold; I wonder whether something similar could be done for very small values. I'm not in a position to play with this now, but I'll have a deeper look when I get some time.

@ghost

ghost commented Jun 26, 2013

ujson knows how to deal with numpy arrays directly, which meshes well with pandas' underlying use of them. simplejson (or the stdlib json, same thing) does not. Crudely:

In [16]: import ujson
    ...: import simplejson
    ...: a=np.arange(10000,dtype='f')
    ...: %timeit ujson.dumps(a)
    ...: try:
    ...:      simplejson.dumps(a)
    ...: except:
    ...:      print("well, no.")
    ...: %timeit simplejson.dumps(a.tolist())
1000 loops, best of 3: 967 µs per loop
well, no.
100 loops, best of 3: 4 ms per loop

@trottier
Author

Interesting -- I didn't know that ujson had been modified. It sounds like it would be simpler to fix ujson than to switch to something else. Regardless, proper floating-point handling in ujson is critical for an application like pandas.

ujson doesn't really handle very large numbers all that well:

>>> ujson.dumps(1e20)
'1.000000000000000e+20'
>>> simplejson.dumps(1e20)
'1e+20'
>>>

... in its representation it implies precision that isn't there. Perhaps this is because it doesn't use "precision" properly:

>>> ujson.dumps(1e20, double_precision=1)
'1.000000000000000e+20' # should be '1.0e20'

@dieterv77
Contributor

Just to provide a little more color on this:
Currently, ultrajson will use sprintf("%.15e") to print values whose absolute value exceeds 1e16 - 1. In that case, the double_precision argument is unused.

I also created a branch where I did the same thing for values below 1e-16 (dieterv77/ultrajson@17d9705). This way you get:
In [3]: ujson.dumps(1e-40)
Out[3]: '9.999999999999999e-41'
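
For anyone not reading the C, the thresholding being described works roughly like this (a Python sketch only; the thresholds and format strings approximate the C code rather than quote it):

def encode_double(value, double_precision=10):
    # Exponential notation beyond the thresholds, fixed-point otherwise.
    abs_value = abs(value)
    if abs_value >= 1e16 or (abs_value != 0.0 and abs_value < 1e-16):
        return "%.15e" % value   # the sprintf("%.15e") fallback path
    return "%.*f" % (double_precision, value)  # truncated to double_precision digits

With the extra small-value branch, 1e-40 comes out in exponential form instead of collapsing to '0.0'.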

@njsmith

njsmith commented Jun 27, 2013

@trottier: I assume the actual problem is that you want to be able to dump data from pandas to json and then read it back again, and got bitten because the json routines throw away precision? That's pretty nasty.

It doesn't matter if you add extra decimal places; IEEE doubles always contain the same amount of precision. The important thing is that for any (finite) float value, you get bit-accurate round-tripping. The usual rule is to print as many decimals as are necessary to accomplish this, but if you have extra that's okay too.
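
To make the criterion concrete, here is a small stdlib-only illustration (not from the thread) of the bit-accurate round-trip test:

import json
import struct

x = 1e-40                  # far below the threshold where ujson collapses to '0.0'
s = json.dumps(x)          # the stdlib uses repr-style shortest output
y = json.loads(s)

# The requirement: the decoded value has exactly the same bits as the original.
assert struct.pack("<d", x) == struct.pack("<d", y)
print(s)                   # '1e-40' on Python 2.7+ (shortest repr)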

That said, double printing/reading is extremely hard, and from a quick skim I wouldn't trust ujson to do it at all. Obviously there's the nasty stuff where it just throws away precision by default, but even if you use a very high double_precision value, I wouldn't be surprised if it still blew up on e.g. denormalized numbers. And I bet it makes rounding errors in some cases; you can't print by just dividing by ten and ignoring rounding error and get accurate results. And even with the new precise_float option for decoding, I bet it completely blows up if it's in a non-C locale. (strtod cannot, in general, read JSON-formatted floats; it might expect a ',' as the decimal separator, etc.)

Python ships with high-quality double<->string conversion routines based on those in netlib -- PyOS_string_to_double, PyOS_double_to_string (or PyOS_ascii_strtod, PyOS_ascii_formatd in pre-2.7 versions).

It's possible that using these routines would produce some measurable slowdown... they do some mallocs internally, and the double_to_string routine in particular returns a newly-malloc'd string that you have to free, instead of writing into an existing buffer. But it'd be pretty easy for someone to patch ujson to use them and then run a benchmark to see whether they're actually slower. (And a 10% slowdown or whatever would be a pretty reasonable trade-off for, like, getting correct results, anyway...)

Implementation is here if anyone's curious about what's involved in doing this right: http://hg.python.org/cpython/file/44f455e6163d/Python/dtoa.c

@dieterv77
Contributor

FWIW: cPickle in 2.7 uses PyOS_double_to_string(x, 'g', 17, 0, NULL) when not using a binary format to encode a double, and uses PyOS_string_to_double to decode.
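
The same 17-significant-digit formatting is reachable from pure Python via format(), and a quick brute-force check (not from this thread) confirms that 17 digits always suffice to round-trip a double exactly:

import random
import struct

for _ in range(100000):
    # Draw a random 64-bit pattern and reinterpret it as a double.
    x = struct.unpack("<d", struct.pack("<Q", random.getrandbits(64)))[0]
    if x != x or x in (float("inf"), float("-inf")):
        continue           # NaN/inf are not representable in JSON anyway
    assert float(format(x, ".17g")) == x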

@trottier
Author

Some benchmarks, for what they're worth.

import time
try: import numpypy as numpy
except: import numpy

def roundtrip_ujson(val):
    return ujson.loads(ujson.dumps(val))

def roundtrip_python(val):
    return float(repr(val))

def benchmark(roundtrip_fun=roundtrip_python, n=100000):
    p = []; stored = []
    for i in range(n):
        d = time.time()
        val = roundtrip_fun(d)
        delt = time.time() - d
        p.append(delt)
        stored.append(val) # just in case pypy optimized it away

    print "Mean", numpy.mean(p)
    print "Median", sorted(p)[int(len(p)/2)]

if __name__ == '__main__':
    import sys
    if 'ujson' in sys.argv:
        import ujson
        print "ujson", ujson.__version__, "performance"
        benchmark(roundtrip_fun=roundtrip_ujson)
    else:
        print sys.version, "performance"
        benchmark()

Results:

> /usr/local/bin/python /tmp/benchmark_float.py ujson ; pypy /tmp/benchmark_float.py ; /usr/local/bin/python /tmp/benchmark_float.py 
ujson 1.33 performance
Mean 1.13488674164e-06
Median 9.53674316406e-07
2.7.3 (5acfe049a5b0, May 21 2013, 13:47:22)
[PyPy 2.0.2 with GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] performance
Mean 1.66995048523e-06
Median 1.90734863281e-06
2.7.3 (default, Mar 29 2013, 17:18:17)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.27)] performance
Mean 2.4831032753e-06
Median 2.14576721191e-06

@trottier
Author

So @wesm, in v0.12.0 what guarantee is there that users won't lose data when converting small floats to JSON? Should a warning be issued, or should the conversion fail, when, say, abs(val) < 1e-3?

@wesm
Member

wesm commented Jul 19, 2013

With the update to the newest ujson the story is better:

In [2]: json.dumps(1e-5)
Out[2]: '0.00001'

In [3]: json.dumps(1e-6)
Out[3]: '0.000001'

In [4]: json.dumps(1e-7)
Out[4]: '0.0000001'

In [5]: json.dumps(1e-8)
Out[5]: '0.00000001'

In [6]: json.dumps(1e-9)
Out[6]: '0.000000001'

In [7]: json.dumps(1e-10)
Out[7]: '0.0000000001'

In [8]: json.dumps(1e-11)
Out[8]: '0.0'

In [9]: json.dumps(1e-10)
Out[9]: '0.0000000001'

In [10]: json.dumps(1e-11)
Out[10]: '0.0'

But still not good. Can anyone give some kind of assessment of how hard this would be to fix? For example, swapping in simplejson's float stringification.

@jreback
Contributor

jreback commented Jul 19, 2013

As a workaround, this is pretty easy to test for:

- currently raising, but could do a warning instead
- better message?

In [1]: DataFrame([[1e-16,'foo',1e-8]],columns=list('ABC')).to_json()
ValueError: ujson currently cannot accurately format float data less
than 1e-15. A workaround is to multiply the data by
a large positive factor and divide on deserialization
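
A sketch of the workaround that error message suggests, for a frame of uniformly small floats (the scale factor is arbitrary, and since scaling by a power of ten is itself inexact in binary floating point, this mitigates rather than eliminates the problem):

import pandas as pd

SCALE = 1e15                                 # illustrative; any large factor works

df = pd.DataFrame([[1e-16, 1e-18]], columns=list('AB'))
encoded = (df * SCALE).to_json()             # scale up before serialising
decoded = pd.read_json(encoded) / SCALE      # scale back down after decoding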

@jreback
Contributor

jreback commented Jul 19, 2013

PR #4295

@trottier
Author

Well, part of the issue is that we don't really know how much precision users will want or expect, and they won't really be able to ask for it. E.g. 1.2345678901234567 (~16 decimal places) is a decimal representation of a double that uses all the available precision. ujson doesn't use the word "precision" properly, so if you try to represent 1.2345678901234567e-1 you'll lose one decimal place of precision, at e-2 you'll lose two, etc.

ujson.dumps(1.23456789123456e-5, double_precision=100)
==> '0.000012345678912'

ujson.dumps(1.23456789123456e-6, double_precision=100)
==> '0.000001234567891'

@Komnomnomnom
Contributor

@wesm double_precision defaults to 10, although it caps out at 15 (and should probably raise a ValueError beyond that)

In [2]: dumps(1e-10)
Out[2]: '0.0000000001'

In [3]: dumps(1e-11)
Out[3]: '0.0'

In [4]: dumps(1e-11, double_precision=15)
Out[4]: '0.00000000001'

In [5]: dumps(1e-12, double_precision=15)
Out[5]: '0.000000000001'

In [6]: dumps(1e-13, double_precision=15)
Out[6]: '0.0000000000001'

In [7]: dumps(1e-14, double_precision=15)
Out[7]: '0.00000000000001'

In [8]: dumps(1e-15, double_precision=15)
Out[8]: '0.000000000000001'

In [9]: dumps(1e-16, double_precision=15)
Out[9]: '0.0'

In [10]: dumps(1e-16, double_precision=20)
Out[10]: '0.0'

In [11]: dumps(1e-17, double_precision=20)
Out[11]: '0.0'

In [12]: dumps(1e-16, double_precision=20)
Out[12]: '0.0'

@wesm
Member

wesm commented Jul 19, 2013

Well, if it's any help, here is what Chrome does:

> JSON.stringify(1e-40)
"1e-40"
> JSON.stringify(1e-7)
"1e-7"
> JSON.stringify(1e-6)
"0.000001"
> JSON.stringify(1e-50)
"1e-50"

@Komnomnomnom
Contributor

I'm playing around with calling sprintf to get similar behaviour. I'll send through a PR if I get it to play nice.

@njsmith

njsmith commented Jul 19, 2013

Don't use sprintf; use a good-quality dedicated float/string converter, like the one provided by the Python runtime and mentioned above.

Komnomnomnom added a commit to Komnomnomnom/pandas that referenced this pull request Jul 19, 2013
@Komnomnomnom
Contributor

@njsmith in my PR sprintf is only used for very small and very large numbers, which means it will handle some of the numbers given here a bit better. It might be worth investigating the optional use of PyOS_string_to_double etc. in future, though, if you think it will provide exact round-trip precision.
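
One way to evaluate that (again just a sketch, not part of the PR) is to probe the encode/decode pair for bit-exactness across a sweep of magnitudes and see where it still falls short:

import struct
import ujson

def bits(x):
    return struct.pack("<d", x)

failing = []
for exponent in range(-300, 301, 25):
    x = 1.2345678901234567 * 10.0 ** exponent
    y = ujson.loads(ujson.dumps(x, double_precision=15))
    if bits(y) != bits(x):
        failing.append(exponent)

print("exponents that do not round-trip exactly: %r" % failing)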

@trottier
Author

For the record: I was mistaken in grouping together ujson's treatment of big and small numbers. There are issues with how it handles both, but the issues are only warning-level/not-nice for big numbers (i.e., its use of "precision" is wrong), while they give rise to incorrect behavior for the small ones.

@Komnomnomnom
Contributor

@trottier can you have a look at #4299

jreback added a commit that referenced this pull request Jul 20, 2013
ENH: ujson better handling of very large and very small numbers, throw ValueError for bad double_precision arg #4042
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Jul 25, 2013
* commit 'v0.12.0rc1-127-gec8920a':
  DOC: docs for precise_float option in read_json
  BUG: explicity change nan -> NaT when assigning to datelike dtypes
  ENH: expose ujson precise_float argument on decode
  ENH: ujson better handling of very large and very small numbers, throw ValueError for bad double_precision arg pandas-dev#4042
  minor: some trailing spaces and a pylint "pragma" to stop complaining about Series._ix defined elsewhere
  ENH: test_perf.py - use psutil to set affinity (if absent functionality - then affinity module)
  TST: print out byteorder in ci/print_versions.py
  DOC: Fix typo.
  Update CONTRIBUTING.md with note on attribution in PRs