Skip to content

ENH: Json fill_value for missing fields #27073

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jiangyue12392
Copy link
Contributor

A new argument fill_value is added to allow users to supply default values for missing fields for each column. This enhancement also addresses the undesirable type conversion from int to float due to missing fields as raised in #16918

@jiangyue12392 jiangyue12392 changed the title Json fill value ENH: Json fill_value for missing fields Jun 27, 2019
@codecov
Copy link

codecov bot commented Jun 27, 2019

Codecov Report

Merging #27073 into master will decrease coverage by <.01%.
The diff coverage is 87.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #27073      +/-   ##
==========================================
- Coverage   92.04%   92.03%   -0.01%     
==========================================
  Files         180      180              
  Lines       50714    50716       +2     
==========================================
- Hits        46679    46676       -3     
- Misses       4035     4040       +5
Flag Coverage Δ
#multiple 90.67% <87.5%> (-0.01%) ⬇️
#single 41.86% <50%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 96.89% <100%> (-0.12%) ⬇️
pandas/core/internals/construction.py 95.95% <100%> (ø) ⬆️
pandas/io/json/normalize.py 96% <75%> (-0.94%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e955515...7c2b71c. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 27, 2019

Codecov Report

Merging #27073 into master will decrease coverage by <.01%.
The diff coverage is 87.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #27073      +/-   ##
==========================================
- Coverage   92.04%   92.03%   -0.01%     
==========================================
  Files         180      180              
  Lines       50714    50716       +2     
==========================================
- Hits        46679    46676       -3     
- Misses       4035     4040       +5
Flag Coverage Δ
#multiple 90.67% <87.5%> (-0.01%) ⬇️
#single 41.86% <50%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 96.89% <100%> (-0.12%) ⬇️
pandas/core/internals/construction.py 95.95% <100%> (ø) ⬆️
pandas/io/json/normalize.py 96% <75%> (-0.94%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e955515...41baaa3. Read the comment docs.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read through the original issue so it seems like this is a way around conversions from int to float with missing data. A lot has changed since this came up and we now have a nullable integer type, so I'm not sure we actually want to add this parameter to the API or if we just need to figure out how to return an Int64 here

cc @gfyoung


k = len(columns)
n = len(dicts)

result = np.empty((n, k), dtype='O')
if fill_value:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than do this as a (n x k) array can't you just assign the appropriate value down on line 336 below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it can be done that way. I wasn't very sure about how good dictionary value retrieval compares to index retrieval from a list. I was afraid that (k x n) times of dictionary access is much slower than k x n times of index retrieval from a list. Also, this is a (1 x k) array instead of (n x k) array.

@WillAyd WillAyd added IO JSON read_json, to_json, json_normalize Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jun 27, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not in favor of adding this keyword because we already have .fillna() once this is called; or are you suggesting something that cannot already be done? note - all in favor of solving the referenced issue though

@gfyoung
Copy link
Member

gfyoung commented Jun 27, 2019

or are you suggesting something that cannot already be done

@jreback : By the time read_json completes, the damage is already done per se (the precision has been lost). This PR attempts to prevent that from happening via the fill_value parameter. There are two ways we could address this:

  • Don't force casting to numeric in cases like these (API change). This is the option we have in to_numeric (e.g. we keep the column as object when we risk losing precision).

  • Allow people to work around this via fill_value. This is an interesting option, but if we do this, I would expect that we also apply this to the rest of the IO interface (read_csv, read_excel, etc.)

@jiangyue12392
Copy link
Contributor Author

@jreback @gfyoung So which direction shall I pursue to fix the precision issue due to conversion?

@gfyoung
Copy link
Member

gfyoung commented Jul 4, 2019

@jiangyue12392 : Sorry for the silence! We're actually in the midst of pushing a release candidate for 0.25.0. I would merge / rebase master for the time being.

@WillAyd @jreback : If either of you have bandwidth to give feedback on the two directions I gave above, that would be great here.

@WillAyd
Copy link
Member

WillAyd commented Jul 4, 2019 via email

@jiangyue12392
Copy link
Contributor Author

@WillAyd Can you point me to more information about the nullable integer type? I may be able to come up with some other solutions with that.

@WillAyd
Copy link
Member

WillAyd commented Jul 9, 2019

@jiangyue12392 there is a dedicated section in the docs to it:

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

It's a relatively new feature still so anything you can do to improve integration with that would be awesome!

@jiangyue12392
Copy link
Contributor Author

@WillAyd @gfyoung @jreback I made another attempt to solve this issue with IntegerArray at #27335. Please advise.

@WillAyd
Copy link
Member

WillAyd commented Jul 15, 2019

superseded by #27335

@WillAyd WillAyd closed this Jul 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: json_normalize() avoid loss of precision for int64 with missing values
4 participants