Skip to content

BUG: Fix nested_to_record with None values in nested levels #21164

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 7, 2018

Conversation

ssikdar1
Copy link
Contributor

@ssikdar1 ssikdar1 commented May 22, 2018

Continue after pop so you dont pop with the same key twice in a row

@ssikdar1
Copy link
Contributor Author

git diff upstream/master -u -- ".py" | flake8 --diff
$ git diff upstream/master -u -- "
.py" | flake8 --diff
$

@pep8speaks
Copy link

pep8speaks commented May 22, 2018

Hello @ssikdar1! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 07, 2018 at 12:16 Hours UTC

@codecov
Copy link

codecov bot commented May 22, 2018

Codecov Report

Merging #21164 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21164      +/-   ##
==========================================
+ Coverage   91.84%   91.84%   +<.01%     
==========================================
  Files         153      153              
  Lines       49505    49506       +1     
==========================================
+ Hits        45466    45467       +1     
  Misses       4039     4039
Flag Coverage Δ
#multiple 90.23% <100%> (ø) ⬆️
#single 41.88% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/json/normalize.py 96.96% <100%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 172ab7a...34c745e. Read the comment docs.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change - you can add a whatsnew for v0.23.1

@@ -80,6 +80,7 @@ def nested_to_record(ds, prefix="", sep=".", level=0):
if level != 0: # so we skip copying for top level, common case
v = new_d.pop(k)
new_d[newkey] = v
continue
if v is None: # pop the key if the value is None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps instead of continue this should just be elif?

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label May 22, 2018
@WillAyd WillAyd added this to the 0.23.1 milestone May 22, 2018
@ssikdar1
Copy link
Contributor Author

Can you check my latest commits? I changed it to not use a continue.

Also changed the whatsnew as well.

@codecov
Copy link

codecov bot commented May 23, 2018

Codecov Report

Merging #21164 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21164   +/-   ##
=======================================
  Coverage   91.84%   91.84%           
=======================================
  Files         153      153           
  Lines       49505    49505           
=======================================
  Hits        45466    45466           
  Misses       4039     4039
Flag Coverage Δ
#multiple 90.23% <100%> (ø) ⬆️
#single 41.88% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/json/normalize.py 96.93% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 172ab7a...34c745e. Read the comment docs.

@codecov-io
Copy link

codecov-io commented May 23, 2018

Codecov Report

Merging #21164 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21164   +/-   ##
=======================================
  Coverage   91.85%   91.85%           
=======================================
  Files         153      153           
  Lines       49555    49555           
=======================================
  Hits        45518    45518           
  Misses       4037     4037
Flag Coverage Δ
#multiple 90.25% <100%> (ø) ⬆️
#single 41.87% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/json/normalize.py 96.93% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cea0a81...92a6263. Read the comment docs.

@@ -80,8 +80,9 @@ def nested_to_record(ds, prefix="", sep=".", level=0):
if level != 0: # so we skip copying for top level, common case
v = new_d.pop(k)
new_d[newkey] = v
if v is None: # pop the key if the value is None
new_d.pop(k)
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use elif instead of a separate else and if

@ssikdar1
Copy link
Contributor Author

kk, changed to elif

@@ -45,6 +45,7 @@ Bug Fixes
~~~~~~~~~

- tab completion on :class:`Index` in IPython no longer outputs deprecation warnings (:issue:`21125`)
- fix normalize to not except when json_normalize is fed a key with null value on inner most key (:issue:`21158`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to IO section, use a ref to json_normalize. see if you can describe this is more plain english.

@ssikdar1
Copy link
Contributor Author

Moved into the I/O section in whatsnew. Changed the message hopefully its clearer.

@@ -81,7 +81,7 @@ Indexing
I/O
^^^

-
- bug in :meth:`nested_to_record` when :meth:`json_normalize` was called with certain nested jsons (:issue:`21158`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps better said "Bug in ... with None values in nested levels"

@@ -375,3 +375,27 @@ def test_nonetype_dropping(self):
'info.last_updated': '26/05/2012'}]

assert result == expected

def test_nonetype_inner_most_level(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if we have a test case to handle None values that show up in the initial level?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking that we had a test to cover None at the top-most level. I suppose it makes sense to test None at the deepest level (which you have) and also somewhere in the middle, to make sure it doesn't mess up subsequent un-nesting operations.

Can you at least add a test with None in one of the middle layers? You can probably simplify the example below (doesn't need to be a direct copy/paste from the issue) and use parametrization to re-use the logic for different inputs (i.e. one with None at the deepest level, one with None somewhere in the middle). Depending on how that looks maybe you want to even move the example with None at the top level here as well

@WillAyd WillAyd changed the title Change nested_to_record to have a continue BUG: Fix nested_to_record with None values in nested levels May 24, 2018
@ssikdar1
Copy link
Contributor Author

@WillAyd

Can you check if we have a test case to handle None values that show up in the initial level?

These test case below have a json list with a key as None in the initial level of a json.

https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/json/test_normalize.py#L354

data = [
{'info': None,
'author_name':
{'first': 'Smith', 'last_name': 'Appleseed'}
},
{'info':
{'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
'author_name':
{'first': 'Jane', 'last_name': 'Doe'}
}
]

https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/json/test_normalize.py#L58

[
{'info': None},
{'info':
{'created ...

@@ -81,7 +81,7 @@ Indexing
I/O
^^^

-
- bug in :meth:`nested_to_record` when :meth:`json_normalize` was called with None values in nested levels (:issue:`21158`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nested_to_record is not public, don't use it here

use double backticks around None

say nested levels in JSON

@@ -81,7 +81,7 @@ Indexing
I/O
^^^

-
- bug when :meth:`json_normalize` was called with `None` values in nested levels in JSON (:issue:`21158`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalize "Bug" and as Jeff mentioned use double backticks around None, not just single

@@ -375,3 +375,27 @@ def test_nonetype_dropping(self):
'info.last_updated': '26/05/2012'}]

assert result == expected

def test_nonetype_inner_most_level(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking that we had a test to cover None at the top-most level. I suppose it makes sense to test None at the deepest level (which you have) and also somewhere in the middle, to make sure it doesn't mess up subsequent un-nesting operations.

Can you at least add a test with None in one of the middle layers? You can probably simplify the example below (doesn't need to be a direct copy/paste from the issue) and use parametrization to re-use the logic for different inputs (i.e. one with None at the deepest level, one with None somewhere in the middle). Depending on how that looks maybe you want to even move the example with None at the top level here as well

@ssikdar1
Copy link
Contributor Author

ssikdar1 commented May 26, 2018

Can you at least add a test with None in one of the middle layers?

  • I changed the test to be a test with None at multiple levels.
+        data = {
 +            "id": None,
 +            "location": {
 +                "id": None,
 +                "country": {
 +                    "id": None,
 +                    "state": {
 +                        "id": None,
 +                        "town.info": {
 +                            "region": None,
 +                            "x": 49.151580810546875,
 +                            "y": -33.148521423339844,
 +                            "z": 27.572303771972656}}}
 +            }
 +        }

Whatsnew:
Capitalized the b in Bug and put None in ticks

}
result = nested_to_record(data)
expected = {
'location.id': None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was reading through the commentary of the issue and noticed there was some confusion on the topic, but I don't understand why we would want to drop the 'id': None record here - is that solely driven by the elif statement? If so, perhaps we don't need that at all?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I'm late replying to this, but I agree. I don't understand the intention of dropping the top level 'id': None. I would much rather know explicitly that the value for a field is None than to be guessing or writing extra checks for whether the field exists.

@ssikdar1
Copy link
Contributor Author

ssikdar1 commented May 28, 2018

@WillAyd

I was reading through the commentary of the issue and noticed there was some confusion on the topic, but I don't understand why we would want to drop the 'id': None record here - is that solely driven by the elif statement? If so, perhaps we don't need that at all?

So I took an example from here #20030

import pandas as pd
data_partial_fail = \
        [{'info': None, 
         'author_name': 
         {'first': 'Smith', 'last_name': 'Appleseed'}
        }, 
        {'info': 
         {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
        'author_name': 
         {'first': 'Jane', 'last_name': 'Doe'}
        }]
p = pd.io.json.json_normalize(data_partial_fail)
print(p.columns)

With the my current pull request as is it is, it prints:

Index(['author_name.first', 'author_name.last_name', 'info.created_at',
       'info.last_updated'],
      dtype='object')

Now commenting out the elif, it prints:

(test) shan@shan-ThinkPad-T530:~/test$ python foo.py 
Index(['author_name.first', 'author_name.last_name', 'info', 'info.created_at',
       'info.last_updated'],
      dtype='object')

The issue here being the added column info.
The bug issue fix to 20030 added the original change in an attempt to get rid of the of the extra info column:
https://github.com/pandas-dev/pandas/pull/20399/files#diff-9c654764f5f21c8e9d58d9ebf14de86dR83

The change just didnt consider that a pop would occur twice in a row if the level > 0 and was also value was also None.

If we get rid of the elif we would undo issue 20030 from that perspective?

@@ -81,7 +81,7 @@ Indexing
I/O
^^^

-
- Bug when :meth:`json_normalize` was called with `None` values in nested levels in JSON (:issue:`21158`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this wont' render, you need pandas.io.json.json_normalize I think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double back ticks around None

# GH21158: If inner level json has a key with a null value
# make sure it doesnt do a new_d.pop twice and except
data = {
"id": None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have the same result if id is NOT repeated (with None), just missing at various levels? e.g. try with a single id with None at the top and bottom levels (in another case, leave this one as well)

@jreback
Copy link
Contributor

jreback commented May 29, 2018

pls rebase as well

@ssikdar1 ssikdar1 force-pushed the json_normalize_KeyError_21158 branch from f3b35a7 to fa9ecd5 Compare May 29, 2018 03:01
@ssikdar1 ssikdar1 force-pushed the json_normalize_KeyError_21158 branch from fa9ecd5 to ab24d02 Compare May 29, 2018 03:09
@ssikdar1
Copy link
Contributor Author

  • Whatsnew changed to have pandas.io.json.json_normalize and None
  • another testcase added to test the id: None at various levels
  • rebased w/ master

@jreback
Copy link
Contributor

jreback commented Jun 4, 2018

looks ok to me. @WillAyd pls merge when you are satisifed.

@WillAyd
Copy link
Member

WillAyd commented Jun 4, 2018

@jreback somewhat conflicted on this. This PR certainly fixes the issue at hand, but what is the point of us dropping None values at the top level of the JSON yet maintaining it at deeper levels? Seems like it would be more consistent from an end user perspective to maintain at all levels and leave it to them to explicitly drop if they wanted.

Could probably go forward with this for 0.23.1 as the bug fix but then revisit top level None-handling for 0.24.0?

@jreback
Copy link
Contributor

jreback commented Jun 4, 2018

@WillAyd hmm, that's a good point. shouldn't id be in the expected? why are we dropping it at all? yeah that sees wrong

@WillAyd
Copy link
Member

WillAyd commented Jun 4, 2018

As @ssikdar1 points out it does stem from #20030, though in reading through that and the subsequent PR I think it is confusing what was actually being addressed. The PR would imply that automatically dropping the top-level None value is a feature, though I would argue that's not desirable and is in fact a bug.

@jreback
Copy link
Contributor

jreback commented Jun 7, 2018

@ssikdar1 can you rebase

@WillAyd ok with this?

@WillAyd
Copy link
Member

WillAyd commented Jun 7, 2018

Yep we can go with this for now. I'll open a separate issue for the top-level None parsing

@WillAyd WillAyd merged commit ab6aaf7 into pandas-dev:master Jun 7, 2018
@WillAyd
Copy link
Member

WillAyd commented Jun 7, 2018

Thanks @ssikdar1 !

@ssikdar1
Copy link
Contributor Author

ssikdar1 commented Jun 7, 2018

No problem!

@ssikdar1 ssikdar1 deleted the json_normalize_KeyError_21158 branch June 7, 2018 15:59
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jun 12, 2018
TomAugspurger pushed a commit that referenced this pull request Jun 12, 2018
david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

json_normalize gives KeyError in 0.23
7 participants