ENH: Json fill_value for missing fields #27073

jiangyue12392 · 2019-06-27T05:40:31Z

closes ENH: json_normalize() avoid loss of precision for int64 with missing values #16918
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

A new argument, fill_value, is added to let users supply a default value for missing fields in each column. This enhancement also addresses the undesirable int-to-float type conversion caused by missing fields, as raised in #16918.

codecov · 2019-06-27T06:14:55Z

Codecov Report

Merging #27073 into master will decrease coverage by <.01%.
The diff coverage is 87.5%.

@@            Coverage Diff             @@
##           master   #27073      +/-   ##
==========================================
- Coverage   92.04%   92.03%   -0.01%     
==========================================
  Files         180      180              
  Lines       50714    50716       +2     
==========================================
- Hits        46679    46676       -3     
- Misses       4035     4040       +5

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `90.67% <87.5%> (-0.01%)` | ⬇️ |
| #single | `41.86% <50%> (-0.09%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/frame.py | `96.89% <100%> (-0.12%)` | ⬇️ |
| pandas/core/internals/construction.py | `95.95% <100%> (ø)` | ⬆️ |
| pandas/io/json/normalize.py | `96% <75%> (-0.94%)` | ⬇️ |
| pandas/io/gbq.py | `88.88% <0%> (-11.12%)` | ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e955515...7c2b71c.

WillAyd

I read through the original issue, and this looks like a way around the int-to-float conversion that happens with missing data. A lot has changed since that issue came up: we now have a nullable integer type, so I'm not sure we actually want to add this parameter to the API, or whether we just need to figure out how to return an Int64 here.

cc @gfyoung

WillAyd · 2019-06-27T12:37:21Z

pandas/_libs/lib.pyx


    k = len(columns)
    n = len(dicts)

    result = np.empty((n, k), dtype='O')
+    if fill_value:


Rather than do this as a (n x k) array can't you just assign the appropriate value down on line 336 below?

jiangyue12392

Sure, it can be done that way. I wasn't sure how dictionary value retrieval compares with index retrieval from a list; I was afraid that k x n dictionary lookups would be much slower than k x n list index lookups. Also, this is a (1 x k) array, not an (n x k) array.
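The trade-off discussed here can be sketched in plain Python. This is only an illustration, not the code in pandas/_libs/lib.pyx: the function name `dicts_to_rows`, its signature, and the convention that `fill_value` maps column names to defaults are all assumptions made for the example. The defaults are resolved once into a list of length k (the "1 x k" array), so the inner loop uses list indexing rather than a dictionary lookup per cell.

```python
import math


def dicts_to_rows(dicts, columns, fill_value=None):
    """Hypothetical helper: lay out a list of dicts as n rows of k cells.

    Missing fields get the per-column default from ``fill_value`` (a dict
    keyed by column name), or NaN when no default is given.
    """
    fill_value = fill_value or {}
    # Resolve the defaults once: one entry per column (1 x k), not per cell.
    defaults = [fill_value.get(col, math.nan) for col in columns]

    rows = []
    for record in dicts:
        row = []
        for j, col in enumerate(columns):
            # Use the record's value if the field is present, else the default.
            row.append(record[col] if col in record else defaults[j])
        rows.append(row)
    return rows


records = [{"id": 9007199254740993, "name": "a"}, {"name": "b"}]
rows = dicts_to_rows(records, ["id", "name"], fill_value={"id": -1})
print(rows)  # [[9007199254740993, 'a'], [-1, 'b']]
```

Because the "id" column never receives a NaN, every value in it stays a Python int, which is the property the fill_value proposal relies on.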

jreback

Not in favor of adding this keyword, because we already have .fillna() once this is called. Or are you suggesting something that cannot already be done? Note: I'm all in favor of solving the referenced issue, though.

gfyoung · 2019-06-27T20:39:56Z

> or are you suggesting something that cannot already be done

@jreback : By the time read_json completes, the damage is already done per se (the precision has been lost). This PR attempts to prevent that from happening via the fill_value parameter. There are two ways we could address this:

1. Don't force casting to numeric in cases like these (API change). This is the option we have in to_numeric (e.g. we keep the column as object when we risk losing precision).
2. Allow people to work around this via fill_value. This is an interesting option, but if we do this, I would expect that we also apply this to the rest of the IO interface (read_csv, read_excel, etc.)
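The point about precision being lost before .fillna() can run can be shown without pandas, since it comes down to float64 itself. A minimal sketch, assuming the integer column has already been converted to float64 to make room for NaN; the sample value is arbitrary:

```python
# float64 has a 53-bit significand, so integers above 2**53 are not all
# representable exactly.
original = 2**53 + 1          # 9007199254740993, fits comfortably in int64

as_float = float(original)    # what a float64 column would store
round_trip = int(as_float)    # what casting back to int after the fact yields

print(original)                # 9007199254740993
print(round_trip)              # 9007199254740992
print(original == round_trip)  # False
```

Filling or casting after the conversion only ever sees the rounded value, which is why both directions act before, or instead of, the cast to float.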

jiangyue12392 · 2019-07-04T04:43:20Z

@jreback @gfyoung So which direction shall I pursue to fix the precision issue due to conversion?

gfyoung · 2019-07-04T06:31:41Z

@jiangyue12392 : Sorry for the silence! We're actually in the midst of pushing a release candidate for 0.25.0. I would merge / rebase master for the time being.

@WillAyd @jreback : If either of you have bandwidth to give feedback on the two directions I gave above, that would be great here.

WillAyd · 2019-07-04T15:34:38Z

Yeah, I don't think adding this to the API is the best approach. It's certainly more complicated, but I would rather see if there's a way to return Int64 with missing data.

jiangyue12392 · 2019-07-08T07:37:07Z

@WillAyd Can you point me to more information about the nullable integer type? I may be able to come up with some other solutions with that.

WillAyd · 2019-07-09T02:16:04Z

@jiangyue12392 there is a dedicated section in the docs to it:

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

It's a relatively new feature still so anything you can do to improve integration with that would be awesome!
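For readers unfamiliar with the nullable integer type, a minimal sketch of the difference follows. It assumes a pandas version that ships the `Int64` extension dtype (0.24 or later); the sample values are made up.

```python
import pandas as pd

values = [1, None, 3]

# Default inference: the missing entry becomes NaN, forcing float64.
as_float = pd.Series(values)

# Nullable integer dtype: integers are kept, with a separate missing-value mask.
as_nullable = pd.Series(values, dtype="Int64")

print(as_float.dtype)               # float64
print(as_nullable.dtype)            # Int64
print(as_nullable.isna().tolist())  # [False, True, False]
```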

jiangyue12392 · 2019-07-11T04:58:44Z

@WillAyd @gfyoung @jreback I made another attempt to solve this issue with IntegerArray at #27335. Please advise.

WillAyd · 2019-07-15T01:06:37Z

superseded by #27335

Jiang Yue added 2 commits June 27, 2019 11:31

add optional fill_value for nan in json_normalize

b25faf7

add test cases

e53b620

jiangyue12392 changed the title from "Json fill value" to "ENH: Json fill_value for missing fields" Jun 27, 2019

jiangyue12392 force-pushed the json_fill_value branch from 798da62 to 7c2b71c June 27, 2019 05:48

add whatsnew entry

41baaa3

jiangyue12392 force-pushed the json_fill_value branch from 7c2b71c to 41baaa3 June 27, 2019 07:35

WillAyd requested changes Jun 27, 2019

WillAyd added the IO JSON (read_json, to_json, json_normalize) and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Jun 27, 2019

jreback requested changes Jun 27, 2019

jiangyue12392 mentioned this pull request Jul 11, 2019

ENH: Use IntergerArray to avoid forced conversion from integer to float #27335

Merged

5 tasks

WillAyd closed this Jul 15, 2019
