Ensure accurate encoding/decoding of big and small floats #4042
Conversation
@trottier the bundled ujson is a custom modified fork that adds built-in support for encoding and decoding pandas objects at the C level (better performance), so it's not a simple drop-in. IIRC standard ujson is also significantly faster than simplejson. That said, though, I do wonder if this could be fixed. It seems to handle very large numbers pretty well, and a very brief look at the code suggests it just reverts to …
ujson knows how to deal with numpy arrays directly, which meshes well:

```python
In [16]: import ujson
    ...: import simplejson
    ...: a = np.arange(10000, dtype='f')
    ...: %timeit ujson.dumps(a)
    ...: try:
    ...:     simplejson.dumps(a)
    ...: except:
    ...:     print("well, no.")
    ...: %timeit simplejson.dumps(a.tolist())
1000 loops, best of 3: 967 µs per loop
well, no.
100 loops, best of 3: 4 ms per loop
```
Interesting -- I didn't know that ujson had been modified, etc. It sounds like it would be simpler to fix ujson than to switch to something else. Regardless, ujson handling floating point properly is critical for an application like pandas. ujson doesn't really handle very large numbers all that well: … in its representation it implies precision that isn't there. Perhaps this is because it doesn't use "precision" properly: …
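To make "implied precision that isn't there" concrete, here is a small illustration (my own, not from the thread) using plain Python formatting: a fixed-point format prints the full decimal expansion of a big double, dozens of digits beyond the ~17 that are actually significant, whereas a significant-digit format does not.

```python
x = 1e40
# '%f' prints the full decimal expansion of the nearest binary64 value:
# dozens of digits, most of them artifacts of the binary representation,
# carrying no real information beyond the first ~17.
print('%f' % x)
# '%.17g' prints only significant digits and still round-trips exactly:
s = '%.17g' % x
print(s)                 # 1e+40
assert float(s) == x
```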
Just to provide a little more color on this: I also created a branch where I did the same thing for values below 1e-16 (dieterv77/ultrajson@17d9705)
@trottier: I assume the actual problem is that you want to be able to dump data from pandas to JSON and then read it back again, and got bitten because the json routines throw away precision? That's pretty nasty.

It doesn't matter if you add extra decimal points; IEEE doubles always have the same amount of precision in them. The important thing is that for any (finite) float value, you get bit-accurate round-tripping. The usual rule is to print as many decimals as are necessary to accomplish this, but if you have extra that's okay too.

That said, double printing/reading is extremely hard, and from a quick skim I wouldn't trust ujson to do it at all. Obviously there's the nasty stuff where it just throws away precision by default, but even if you use a very high precision setting I'd be wary.

Python ships with high-quality double<->string conversion routines based on those in netlib (`PyOS_double_to_string` and `PyOS_string_to_double`). It's possible that using these routines would produce some measurable slowdown... the string_to_double routine in particular has some mallocs, and returns a newly-malloc'd pointer that you have to free instead of writing into an existing buffer. But it'd be pretty easy for someone to patch ujson to use them and then run a benchmark to see whether they're actually slower. (And a 10% slowdown or whatever would be a pretty reasonable trade-off for, like, getting correct results, anyway...)

The implementation is here if anyone's curious about what's involved in doing this right: http://hg.python.org/cpython/file/44f455e6163d/Python/dtoa.c
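To make "bit-accurate round-tripping" concrete, here is a small check (a sketch of mine, not code from the thread) that compares the raw IEEE-754 bit patterns before and after a string round trip. Python's own `repr` (CPython 2.7+/3.1+) uses exactly the kind of high-quality shortest-digits conversion described above, so it passes:

```python
import struct

def double_bits(x):
    # Raw 64-bit IEEE-754 pattern of a Python float.
    return struct.unpack('<Q', struct.pack('<d', x))[0]

x = 1.2345e-16
s = repr(x)  # shortest decimal string that round-trips (CPython 2.7+/3.1+)
assert double_bits(float(s)) == double_bits(x)
```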
FWIW: cPickle in 2.7 uses `PyOS_double_to_string(x, 'g', 17, 0, NULL)` to encode a double when not using a binary format, and uses `PyOS_string_to_double` to decode.
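For illustration (my sketch, not thread code): the same 17-significant-digit rule can be reproduced from pure Python, since `'%.17g'` formatting is what `PyOS_double_to_string(x, 'g', 17, 0, NULL)` amounts to, and 17 significant digits are always enough to round-trip a finite binary64 double:

```python
x = 0.1 + 0.2            # 0.30000000000000004, a classic inexact double
s = '%.17g' % x          # same conversion cPickle's text mode performs
assert float(s) == x     # 17 significant digits always round-trip a double
```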
Some benchmarks, for what they're worth. Results: …
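The numbers themselves did not survive in this copy of the thread. Here is a minimal sketch of the kind of comparison presumably being made (encoder speed versus round-trip fidelity; the harness, library choices, and values are assumptions of mine):

```python
import json
import timeit

values = [1.2345678901234567e-25 * i for i in range(1, 1001)]

def bench(name, dumps, loads):
    secs = timeit.timeit(lambda: dumps(values), number=100)
    exact = loads(dumps(values)) == values
    print('%-12s %8.3f ms/call  round-trips exactly: %s'
          % (name, secs * 10.0, exact))

bench('stdlib json', json.dumps, json.loads)
# If installed, the same harness works for the other encoders:
# import ujson;      bench('ujson', ujson.dumps, ujson.loads)
# import simplejson; bench('simplejson', simplejson.dumps, simplejson.loads)
```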
So @wesm, in v0.12.0 what guarantee is there that users won't lose data when converting small floats to JSON? Should a warning be issued, or should the conversion fail when, say, …?
With the update to the newest ujson the story is better: … but still not good. Can anyone give some kind of assessment of how hard this would be to fix? For example, swapping in simplejson's float stringification.
As a workaround, this is pretty easy to test for; it currently raises, but could warn instead.
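A sketch of such a guard (the function name, cutoff, and raise-vs-warn switch are all hypothetical, not pandas API): check for nonzero floats smaller than the magnitude the encoder is known to mangle before serializing.

```python
import warnings

import numpy as np

SMALL_FLOAT_LIMIT = 1e-15  # assumed cutoff; tune to the encoder's behavior

def check_json_safe(values, strict=True):
    """Raise (or warn) if any nonzero float may lose precision in JSON."""
    arr = np.asarray(values, dtype='f8')
    bad = (arr != 0) & (np.abs(arr) < SMALL_FLOAT_LIMIT)
    if bad.any():
        msg = ('%d value(s) below %g may lose precision when encoded'
               % (bad.sum(), SMALL_FLOAT_LIMIT))
        if strict:
            raise ValueError(msg)
        warnings.warn(msg)
```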
PR #4295
Well, part of the issue is that we don't really know how much precision users will want or expect, and they won't really be able to ask for it. E.g. 1.2345678901234567 (~16 decimal places) is a decimal representation of a double that uses all the available precision. ujson doesn't use the word "precision" properly: if you try to represent 1.2345678901234567e-1 you'll lose one decimal place of precision, at e-2 you'll lose two, etc.
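A quick demonstration of the distinction (my example, not from the thread): a fixed count of digits after the decimal point eats one significant digit per leading zero, whereas a count of significant digits does not.

```python
x = 1.2345678901234567e-2
# 15 digits *after the decimal point* leave only 14 significant digits:
print('%.15f' % x)       # 0.012345678901235 -- precision lost
# 17 *significant* digits round-trip the double exactly:
s = '%.17g' % x
assert float(s) == x
```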
@wesm
Well, if it's any help, here is what Chrome does: …
I'm playing around with calling …
Don't use sprintf; use a good-quality dedicated float/string converter.
…w ValueError for bad double_precision arg pandas-dev#4042
@njsmith in my PR …
For the record: I was mistaken in grouping together ujson's treatment of big and small numbers. There are issues with how it handles both, but for big numbers they are only warning-level/not-nice (i.e., its use of "precision" is wrong), while for small numbers they give rise to outright incorrect behavior.
ENH: ujson better handling of very large and very small numbers, throw ValueError for bad double_precision arg #4042
* commit 'v0.12.0rc1-127-gec8920a':
  - DOC: docs for precise_float option in read_json
  - BUG: explicitly change nan -> NaT when assigning to datelike dtypes
  - ENH: expose ujson precise_float argument on decode
  - ENH: ujson better handling of very large and very small numbers, throw ValueError for bad double_precision arg pandas-dev#4042
  - minor: some trailing spaces and a pylint "pragma" to stop complaining about Series._ix defined elsewhere
  - ENH: test_perf.py - use psutil to set affinity (if absent functionality - then affinity module)
  - TST: print out byteorder in ci/print_versions.py
  - DOC: Fix typo. Update CONTRIBUTING.md with note on attribution in PRs
ujson has trouble with very big, and especially very small, floats:
ultrajson/ultrajson#69
ultrajson/ultrajson#83
ultrajson/ultrajson#90
From: #1263 (comment)
I believe this makes ujson inappropriate for inclusion in pandas.
This pull request adds tests that ujson should, but won't, completely pass. I don't yet feel sufficiently familiar with pandas' design to substitute in (e.g.) simplejson.
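The tests themselves are not reproduced here; below is a minimal sketch of the round-trip property they check (the helper and the value list are mine, not the PR's):

```python
import json
import math

def roundtrips(dumps, loads, xs):
    """True if every float survives an encode/decode cycle exactly."""
    return all(loads(dumps(x)) == x for x in xs)

cases = [1e-10, 1e-16, 1.2345678901234567e-25, 1e40, math.pi]

assert roundtrips(json.dumps, json.loads, cases)  # stdlib json passes
# The bundled ujson of the time did not:
# import ujson
# roundtrips(ujson.dumps, ujson.loads, cases)
```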