
ENH: json_normalize() avoid loss of precision for int64 with missing values #16918


Closed
jzwinck opened this issue Jul 14, 2017 · 7 comments
Labels
Enhancement IO JSON read_json, to_json, json_normalize

Comments

@jzwinck
Contributor

jzwinck commented Jul 14, 2017

This code:

import pandas as pd

x = 1234567890123456789
x - pd.io.json.json_normalize([{'x': x}, {}]).loc[0, 'x'].astype(int)

This gives 21, when users might reasonably expect it to give 0.

This inaccuracy occurs when a field holds an int64 value but is not present in all records, which triggers conversion of the Series dtype to float64. Pandas does this conversion so that it can put NaN where no value exists.
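
For context, the root cause is that float64 has a 53-bit significand, so not every integer above 2**53 is exactly representable; a minimal plain-Python demonstration of the same 21-unit discrepancy:

x = 1234567890123456789
float(x)           # 1.2345678901234568e+18
x - int(float(x))  # 21, the same discrepancy as above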

One solution could be to add a fill_value parameter, as seen in add(), unstack(), and other Pandas functions. It would be good for this to support a dict as well as a single value, in case different fill values are required for different columns.

The usage might be like this:

pd.io.json.json_normalize([{'x': x}, {}], fill_value=-1).x
# or
pd.io.json.json_normalize([{'x': x}, {}], fill_value={'x': -1}).x

Then instead of the current result:

0    1.234568e+18
1             NaN
Name: x, dtype: float64

The result would be:

0    1234567890123456789
1                     -1
Name: x, dtype: int64

I'm using Pandas 0.20.1.

@jzwinck jzwinck changed the title ENH: json_normalize() needs a way to avoid loss of precision for int64 ENH: json_normalize() avoid loss of precision for int64 with missing values Jul 14, 2017
@gfyoung gfyoung added Enhancement IO JSON read_json, to_json, json_normalize labels Jul 14, 2017
@gfyoung
Member

gfyoung commented Jul 14, 2017

@jzwinck : Sounds pretty reasonable, given that it is consistent with our API for other functions.

I would suggest you try implementing the fill_value=-1 functionality first and save the dict input for later. That might need a larger discussion, since dict-valued fill values could apply to any non-Series (or 1-D array-like) object in pandas.

@jzwinck
Contributor Author

jzwinck commented Jul 14, 2017

@gfyoung Thanks. I just realized that this also behaves badly:

pd.DataFrame([{'x': x}, {}])

Can you suggest a way to make that work? I would leverage that solution to improve json_normalize(), but it seems the various DataFrame constructors do not support fill_value or similar.

If there is no good way to do this with the simple case of constructing a DataFrame, then we could either implement one, or we could make json_normalize() populate all missing keys. The latter seems easier to implement but likely worse for performance. Then again, json_normalize() has a comment saying the performance is "disastrous" already. :)
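
For illustration, a user-side sketch of the latter approach (assuming only plain Python and pandas; the -1 sentinel is arbitrary): fill the missing keys before construction so the float64 conversion never triggers.

import pandas as pd

records = [{'x': 1234567890123456789}, {}]

# Gather every key seen in any record, then fill gaps with a sentinel
# before pandas infers dtypes, so the column stays int64.
keys = set().union(*records)
filled = [{k: rec.get(k, -1) for k in keys} for rec in records]

pd.DataFrame(filled)['x'].dtype  # int64, no precision loss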

@jreback
Contributor

jreback commented Jul 14, 2017

type conversions from json are not unexpected; it is not a typed format.

@gfyoung
Member

gfyoung commented Jul 14, 2017

Can you suggest a way to make that work?

This is a somewhat unusual way of initializing a DataFrame, so I'm inclined to go with option two (having json_normalize() populate the missing keys). That being said, have a look at performance to see just how "disastrous" it is. 😄

type conversions from json are not unexpected; it is not a typed format.

What are your thoughts regarding this proposal to add fill_value then?
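
A rough micro-benchmark sketch for that performance question (the record shape and iteration counts here are arbitrary assumptions):

import timeit
import pandas as pd

# Half the records are missing 'x', forcing the float64 conversion.
records = [{'x': 1} if i % 2 == 0 else {} for i in range(10_000)]

t_plain = timeit.timeit(lambda: pd.DataFrame(records), number=10)
t_filled = timeit.timeit(
    lambda: pd.DataFrame([{'x': r.get('x', -1)} for r in records]),
    number=10,
)
print(f"plain: {t_plain:.3f}s  pre-filled: {t_filled:.3f}s")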

@jzwinck
Contributor Author

jzwinck commented Jul 17, 2017

@gfyoung I made it work well enough for my purposes by adding fill_value=None to the signature of nested_to_record(), then doing this near the end of the function:

if fill_value is not None:
    # make it so every new_d (flattened record) has all observed keys;
    # new_ds and keys are locals of nested_to_record()
    for new_d in new_ds:
        for key in keys:
            if key not in new_d:
                new_d[key] = fill_value

This simple solution has three significant problems:

  1. It produces bad data when field types differ, e.g. float and str columns don't need fill_value for my use case, only int64 or uint64 do (a dict-valued fill, sketched after this comment, could address this).
  2. It doesn't do anything when record_path is not None.
  3. It is not efficient. Making it efficient probably requires a DataFrame constructor supporting fill_value.

It's unlikely that I'll contribute a patch for this one, but I do think it is important to address.
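
To address problem 1, the fill step could accept a dict so that only the named columns are filled; a hypothetical sketch (fill_missing is an invented helper, not pandas API):

def fill_missing(records, fill_value):
    # Fill absent keys in each record. A dict fill_value restricts
    # filling to the named keys; any other value fills every gap.
    keys = set().union(*records)
    for rec in records:
        for key in keys:
            if key in rec:
                continue
            if isinstance(fill_value, dict):
                if key in fill_value:
                    rec[key] = fill_value[key]
            else:
                rec[key] = fill_value
    return records

# Only 'x' gets a sentinel; other missing fields are left for NaN.
fill_missing([{'x': 1, 'y': 2.5}, {}], fill_value={'x': -1})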

@gfyoung
Member

gfyoung commented Jul 17, 2017

@jzwinck : Okay, sounds good! This work will be useful for anyone who takes up this issue later.

@jzwinck
Contributor Author

jzwinck commented Apr 28, 2020

Fixed by #27335.

@jzwinck jzwinck closed this as completed Apr 28, 2020