
ENH: json_normalize() avoid loss of precision for int64 with missing values #16918


Closed
jzwinck opened this issue Jul 14, 2017 · 7 comments
Labels
Enhancement IO JSON read_json, to_json, json_normalize

Comments

@jzwinck
Contributor

jzwinck commented Jul 14, 2017

This code:

import pandas as pd

x = 1234567890123456789
x - pd.io.json.json_normalize([{'x': x}, {}]).loc[0, 'x'].astype(int)

This gives 21, when users might reasonably expect it to give 0.

This inaccuracy occurs when a field holds an int64 value but is not present in all records, which triggers conversion of the Series dtype to float64. Pandas does this conversion so that it can put NaN where no value exists.
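
For context, the root cause is that float64 has a 53-bit significand, so not every integer above 2**53 is exactly representable; a minimal plain-Python demonstration of the same 21-unit discrepancy:

x = 1234567890123456789
float(x)           # 1.2345678901234568e+18
x - int(float(x))  # 21, the same discrepancy as above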

One solution could be to add a fill_value parameter, as seen in add(), unstack(), and other Pandas functions. It would be good for this to support a dict as well as a single value, in case different fill values are required for different columns.

The usage might be like this:

pd.io.json.json_normalize([{'x': x}, {}], fill_value=-1).x
# or
pd.io.json.json_normalize([{'x': x}, {}], fill_value={'x': -1}).x

Then instead of the current result:

0    1.234568e+18
1             NaN
Name: x, dtype: float64

The result would be:

0    1234567890123456789
1                     -1
Name: x, dtype: int64

I'm using Pandas 0.20.1.

@jzwinck jzwinck changed the title ENH: json_normalize() needs a way to avoid loss of precision for int64 ENH: json_normalize() avoid loss of precision for int64 with missing values Jul 14, 2017
@gfyoung gfyoung added Enhancement IO JSON read_json, to_json, json_normalize labels Jul 14, 2017
@gfyoung
Member

gfyoung commented Jul 14, 2017

@jzwinck : Sounds pretty reasonable, given that it is consistent with our API for other functions.

I would suggest you try implementing the fill_value=-1 functionality first and save the dict input for later. That might need a larger discussion, since dict-valued fill values could apply to any non-Series (or 1-D array-like) object in pandas.

@jzwinck
Contributor Author

jzwinck commented Jul 14, 2017

@gfyoung Thanks. I just realized that this also behaves badly:

pd.DataFrame([{'x': x}, {}])

Can you suggest a way to make that work? I would leverage that solution to improve json_normalize(), but it seems the various DataFrame constructors do not support fill_value or similar.

If there is no good way to do this with the simple case of constructing a DataFrame, then we could either implement one, or we could make json_normalize() populate all missing keys. The latter seems easier to implement but likely worse for performance. Then again, json_normalize() has a comment saying the performance is "disastrous" already. :)
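
For illustration, a user-side sketch of the latter approach (assuming only plain Python and pandas; the -1 sentinel is arbitrary): fill the missing keys before construction so the float64 conversion never triggers.

import pandas as pd

records = [{'x': 1234567890123456789}, {}]

# Gather every key seen in any record, then fill gaps with a sentinel
# before pandas infers dtypes, so the column stays int64.
keys = set().union(*records)
filled = [{k: rec.get(k, -1) for k in keys} for rec in records]

pd.DataFrame(filled)['x'].dtype  # int64, no precision loss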

@jreback
Contributor

jreback commented Jul 14, 2017

type conversions from json are not unexpected; it is not a typed format.

@gfyoung
Member

gfyoung commented Jul 14, 2017

Can you suggest a way to make that work?

This is a somewhat unusual way of initializing a DataFrame, so I'm inclined to go with option two (having json_normalize() populate the missing keys). That being said, have a look at performance to see just how "disastrous" it is. 😄

type conversions from json are not unexpected; it is not a typed format.

What are your thoughts regarding this proposal to add fill_value then?
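
A rough micro-benchmark sketch for that performance question (the record shape and iteration counts here are arbitrary assumptions):

import timeit
import pandas as pd

# Half the records are missing 'x', forcing the float64 conversion.
records = [{'x': 1} if i % 2 == 0 else {} for i in range(10_000)]

t_plain = timeit.timeit(lambda: pd.DataFrame(records), number=10)
t_filled = timeit.timeit(
    lambda: pd.DataFrame([{'x': r.get('x', -1)} for r in records]),
    number=10,
)
print(f"plain: {t_plain:.3f}s  pre-filled: {t_filled:.3f}s")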

@jzwinck
Contributor Author

jzwinck commented Jul 17, 2017

@gfyoung I made it work well enough for my purposes by adding fill_value=None to the signature of nested_to_record(), then doing this near the end of the function:

if fill_value is not None:
    # make it so every new_d (flattened record) has all observed keys;
    # new_ds and keys are locals of nested_to_record()
    for new_d in new_ds:
        for key in keys:
            if key not in new_d:
                new_d[key] = fill_value

This simple solution has three significant problems:

  1. It produces bad data when field types differ, e.g. float and str columns don't need fill_value for my use case, only int64 or uint64 do (a dict-valued fill, sketched after this comment, could address this).
  2. It doesn't do anything when record_path is not None.
  3. It is not efficient. Making it efficient probably requires a DataFrame constructor supporting fill_value.

It's unlikely that I'll contribute a patch for this one, but I do think it is important to address.
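
To address problem 1, the fill step could accept a dict so that only the named columns are filled; a hypothetical sketch (fill_missing is an invented helper, not pandas API):

def fill_missing(records, fill_value):
    # Fill absent keys in each record. A dict fill_value restricts
    # filling to the named keys; any other value fills every gap.
    keys = set().union(*records)
    for rec in records:
        for key in keys:
            if key in rec:
                continue
            if isinstance(fill_value, dict):
                if key in fill_value:
                    rec[key] = fill_value[key]
            else:
                rec[key] = fill_value
    return records

# Only 'x' gets a sentinel; other missing fields are left for NaN.
fill_missing([{'x': 1, 'y': 2.5}, {}], fill_value={'x': -1})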

@gfyoung
Member

gfyoung commented Jul 17, 2017

@jzwinck : Okay, sounds good! This work will be useful for anyone who takes up this issue later.

@jzwinck
Contributor Author

jzwinck commented Apr 28, 2020

Fixed by #27335.

@jzwinck jzwinck closed this as completed Apr 28, 2020