
Avoids exception when pandas.io.json.json_normalize contains items in… #14505


Closed
wants to merge 30 commits into from

Conversation

dickreuter
Contributor

@dickreuter dickreuter commented Oct 26, 2016

Continued in #14583


When using pandas.io.json.json_normalize to parse nested JSON into a DataFrame, the meta parameter selects fields to attach as metadata to each record in the resulting table. In some cases, not every item contains all of the specified meta fields. With this change, a missing meta field no longer raises an error; np.nan is output instead.

… meta parameter that don't always occur in every item of the list

@codecov-io

codecov-io commented Oct 26, 2016

Current coverage is 85.27% (diff: 90.00%)

Merging #14505 into master will increase coverage by <.01%

@@             master     #14505   diff @@
==========================================
  Files           140        140          
  Lines         50670      50698    +28   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43205      43233    +28   
  Misses         7465       7465          
  Partials          0          0          

Powered by Codecov. Last update e3d943d...33495bc

@jreback
Contributor

jreback commented Oct 26, 2016

what exactly is this fixing? would need a test.

@jreback jreback added the IO JSON read_json, to_json, json_normalize label Oct 26, 2016
jreback and others added 4 commits October 26, 2016 18:18
…wers

xref numpy/numpy#8127    closes #14489

Author: Jeff Reback <[email protected]>

Closes #14498 from jreback/compat and squashes the following commits:

882872e [Jeff Reback] COMPAT/TST: fix test for range testing of negative integers to neg powers
Title is self-explanatory.  Affects Python 2.x only.  Closes #14477.

Author: gfyoung <[email protected]>

Closes #14492 from gfyoung/quotechar-unicode-2.x and squashes the following commits:

ec9f59a [gfyoung] BUG: Accept unicode quotechars again in pd.read_csv
@dickreuter
Contributor Author

dickreuter commented Oct 26, 2016

from pandas.io.json import json_normalize

# Two trades; only the first one carries "trade_version".
j = {
    "Trades": [{
        "general": {
            "tradeid": 100,
            "trade_version": 1,
            "stocks": [{
                "symbol": "AAPL",
                "name": "Apple",
                "price": "0"
            }, {
                "symbol": "GOOG",
                "name": "Google",
                "price": "0"
            }]
        }
    }, {
        "general": {
            "tradeid": 100,
            "stocks": [{
                "symbol": "AAPL",
                "name": "Apple",
                "price": "0"
            }, {
                "symbol": "GOOG",
                "name": "Google",
                "price": "0"
            }]
        }
    }]
}
json_normalize(data=j['Trades'], record_path=[['general', 'stocks']],
               meta=[['general', 'tradeid'], ['general', 'trade_version']])

The above will fail because trade_version is present in only one of the two items in the list, and json_normalize currently requires every meta field to exist in every item. With my change, the missing value is output as nan instead of an error being thrown.
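To make the intended behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not the pandas implementation: the helper names `pull_field` and `flatten_with_meta` are hypothetical, and the dotted column names are chosen only for illustration. Each record under `general.stocks` gets the requested meta fields attached, and a meta field missing from an item becomes NaN.

```python
import math


def pull_field(obj, path, default=math.nan):
    """Follow a list of keys into nested dicts; return `default` if a key is missing."""
    for key in path:
        try:
            obj = obj[key]
        except KeyError:
            return default
    return obj


def flatten_with_meta(items, record_path, meta):
    """Return one flat dict per record, with the meta fields attached to each."""
    rows = []
    for item in items:
        # Hypothetical naming scheme: join the meta path with dots.
        meta_values = {".".join(p): pull_field(item, p) for p in meta}
        for record in pull_field(item, record_path, default=[]):
            row = dict(record)
            row.update(meta_values)
            rows.append(row)
    return rows


# Reduced version of the data above; only the first trade has "trade_version".
trades = [
    {"general": {"tradeid": 100, "trade_version": 1,
                 "stocks": [{"symbol": "AAPL"}, {"symbol": "GOOG"}]}},
    {"general": {"tradeid": 100,
                 "stocks": [{"symbol": "AAPL"}, {"symbol": "GOOG"}]}},
]

rows = flatten_with_meta(
    trades,
    record_path=["general", "stocks"],
    meta=[["general", "tradeid"], ["general", "trade_version"]],
)
```

With this data, `rows` holds four entries; the two that come from the second trade carry NaN under `general.trade_version`.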

jorisvandenbossche and others added 5 commits October 27, 2016 09:11
When the driver was not installed, but sqlalchemy itself was, when passing a URI string, you got an error indicating that SQLAlchemy was not installed, instead of the driver not being installed. This was because the import error for the driver was captured as import error for sqlalchemy.
@TomAugspurger
Contributor

@dickreuter Can you add that as a test and add a release note? See http://pandas.pydata.org/pandas-docs/stable/contributing.html#contributing-to-the-code-base

@dickreuter
Contributor Author

dickreuter commented Oct 29, 2016

This would be the test, but it's unclear where I should put it. Any suggestions?

from unittest import TestCase
from pandas.io.json import json_normalize


class Tester(TestCase):
    def test_json_normalise_fix(self):
        j = {
            "Trades": [{
                "general": {
                    "tradeid": 100,
                    "trade_version": 1,
                    "stocks": [{

                        "symbol": "AAPL",
                        "name": "Apple",
                        "price": "0"

                    }, {

                        "symbol": "GOOG",
                        "name": "Google",
                        "price": "0"

                    }
                    ]
                },
            }, {
                "general": {
                    "tradeid": 100,
                    "stocks": [{

                        "symbol": "AAPL",
                        "name": "Apple",
                        "price": "0"

                    }, {
                        "symbol": "GOOG",
                        "name": "Google",
                        "price": "0"

                    }
                    ]
                },
            }
            ]
        }
        j = json_normalize(data=j['Trades'], record_path=[['general', 'stocks']],
                           meta=[['general', 'tradeid'], ['general', 'trade_version']])
        self.assertEqual(len(j), 4)

@TomAugspurger
Contributor

Looks like those tests are all in https://github.com/pandas-dev/pandas/blob/master/pandas/io/tests/json/test_json_norm.py

You could add it as a test method under TestJSONNormalize

@dickreuter
Contributor Author

Test and documentation are now added.

@@ -78,3 +78,4 @@ Bug Fixes


- Bug in ``pd.pivot_table`` may raise ``TypeError`` or ``ValueError`` when ``index`` or ``columns`` is not scalar and ``values`` is not specified (:issue:`14380`)
- Bug in ``pandas.io.json.json_normalize``When parsing a nested json and convert it to a dataframe, the meta parameter can be used to use fields as metadata for each record in resulting table. In some cases, not all items may contain all of the specified meta fields. This change will avoid throwing an error and output np.nan instead. (:issue '14505')
Contributor

pls simplify. a user just wants to know does this issue pertain to them, and a short expl.

make the issue

(:issue:`14505`)

-    meta_val = _pull_field(obj, val[level:])
+    try:
+        meta_val = _pull_field(obj, val[level:])
+    except:
Contributor

don't use a bare except, list a specific exception KeyError?
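A small illustration of why the narrower clause matters, using a hypothetical `lookup` helper rather than the actual `_pull_field`: catching only `KeyError` still turns a missing key into NaN, while an unrelated problem, such as indexing into a value that is not a dict, keeps propagating instead of being silently hidden.

```python
import math


def lookup(obj, path):
    """Walk nested dicts along `path`; only a missing key is converted to NaN."""
    try:
        for key in path:
            obj = obj[key]
        return obj
    except KeyError:
        return math.nan


item = {"general": {"tradeid": 100}}

# The key is absent, so the KeyError is caught and NaN is returned.
missing = lookup(item, ["general", "trade_version"])

# A malformed item (an int where a dict was expected) is a different kind of
# error; with `except KeyError` it is not swallowed.
try:
    lookup({"general": 5}, ["general", "tradeid"])
    malformed_raised = False
except TypeError:
    malformed_raised = True
```

A bare `except:` would have returned NaN for the malformed item as well, which hides a data problem the caller probably wants to see.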

@@ -225,6 +225,51 @@ def test_nested_flattens(self):

        self.assertEqual(result, expected)

    def test_json_normalise_fix(self):
        j = {
Contributor

add the issue number as a comment


}
]
},
Contributor

this prob does not pass linting. make sure it does.

}
        j = json_normalize(data=j['Trades'], record_path=[['general', 'stocks']],
                           meta=[['general', 'tradeid'], ['general', 'trade_version']])
        self.assertEqual(len(j), 4)
Contributor

construct the expected frame and use assert_frame_equal

@@ -792,7 +792,10 @@ def _recursive_extract(data, path, seen_meta, level=0):
if level + 1 > len(val):
    meta_val = seen_meta[key]
else:
    meta_val = _pull_field(obj, val[level:])
Contributor

I think this should be a keyword, call it errors='raise'|'ignore'. You are defining ignore. Please leave the default as raise (which is the current behavior).
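One way the requested keyword could look, as a sketch only: `extract_meta` is a hypothetical stand-alone helper, not the signature that ended up in pandas. The default stays `'raise'`, so existing callers see no change, and `'ignore'` opts in to the NaN fill.

```python
import math


def extract_meta(item, path, errors="raise"):
    """Return the value found at `path` in nested dicts.

    errors='raise'  -> a missing key raises KeyError (default, current behaviour)
    errors='ignore' -> a missing key yields NaN instead
    """
    if errors not in ("raise", "ignore"):
        raise ValueError("errors must be 'raise' or 'ignore', got %r" % (errors,))
    try:
        for key in path:
            item = item[key]
        return item
    except KeyError:
        if errors == "ignore":
            return math.nan
        raise  # default: keep the original exception


item = {"general": {"tradeid": 100}}

# Opting in: the missing field is filled with NaN.
value = extract_meta(item, ["general", "trade_version"], errors="ignore")

# Default: the missing field still raises, as before.
try:
    extract_meta(item, ["general", "trade_version"])
    default_raised = False
except KeyError:
    default_raised = True
```

Validating `errors` up front means a typo such as `errors='ingore'` fails loudly instead of silently behaving like `'raise'`.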

@jreback jreback added the Error Reporting Incorrect or improved errors from pandas label Oct 31, 2016
Added documenation
Shortened what's new
Removed commas in dictionary for linting compatibility
@dickreuter
Contributor Author

Added keyword errors {'raise'|'ignore'}
Added documentation
Shortened what's new
Removed commas in dictionary for linting compatibility

jreback and others added 13 commits November 2, 2016 05:59
pandas.core.common.array_equivalent was removed without deprecation warning.
This commits adds it back to the core.common namespace with deprecation warning
* BUG/API: Index.append with mixed object/Categorical indices

* Only coerce to object if the calling index is not categorical

* Add test for the df.info() case (GH14298)
… meta parameter that don't always occur in every item of the list
Added documenation
Shortened what's new
Removed commas in dictionary for linting compatibility
# Conflicts:
#	doc/source/whatsnew/v0.19.1.txt
@jreback
Contributor

jreback commented Nov 3, 2016

you need to rebase on master

git rebase -i origin/master

@dickreuter
Contributor Author

dickreuter commented Nov 3, 2016

I did that earlier today. It now says: "This branch is 8 commits ahead of pandas-dev:master.". There should currently be no more conflicts.

@jreback
Contributor

jreback commented Nov 3, 2016

maybe you didn't push it
it's not about conflicts

@dickreuter
Contributor Author

My fork on github seems up to date with what I have locally, so I assume it has been pushed. Are there any further changes you expected me to implement that are not present?

@jreback
Contributor

jreback commented Nov 3, 2016

@dickreuter it's impossible to see until you rebase on master. this should have just your commits

https://github.com/pandas-dev/pandas/pull/14505/commits

@dickreuter
Contributor Author

I see. Isn't it showing those commits from others only because I rebased my fork (and then rebased my local copy instead of merging)? If that's a problem, I could delete my fork and create a new one, then make the changes again and create a new pull request, unless you have a better suggestion.

@jreback
Contributor

jreback commented Nov 3, 2016

you prob just need something like

git fetch origin
git rebase -i origin/master
git push yourremote thisbranchname -f

@dickreuter
Contributor Author

git fetch origin --> doesn't fetch anything, as my local copy is in sync with my remote origin
git rebase -i origin/master --> noop, no commits to pick

I think all my changes can be seen here: 9848837

What may be confusing is that I also ran PyCharm's automatic reformatting to make the file conform to PEP 8, but those changes only affect whitespace.


@jreback
Contributor

jreback commented Nov 3, 2016

could also be called upstream

(pandas) [Thu Nov 03 19:52:46 ~/pandas]$ git remote -v|grep origin
origin  https://github.com/pandas-dev/pandas.git (fetch)
origin  https://github.com/pandas-dev/pandas.git (push)

it doesn't matter if your branch is in sync with YOUR upstream, rather it needs to be in sync with pandas master (and on top of it), that's what a rebase is.

you need to rebase to remove all of the merges of master. you shouldn't merge master into your branch; rebase instead.

@dickreuter
Contributor Author

This seems to be the problem:
http://stackoverflow.com/questions/40413071/after-rebasing-my-github-fork-commits-from-others-are-in-my-pull-request/40413455#40413455

Will try to fix it, or if it's too complicated just delete and redo.

@jorisvandenbossche
Member

@jreback Regarding your comment above (#14505 (comment)): pandas-dev repo is typically called 'upstream', and your own fork 'origin' (that's how our contributor guide also says it), so you need git fetch upstream and git rebase -i upstream/master instead of git fetch origin and git rebase -i origin/master to rebase properly.

@jorisvandenbossche
Member

Follow-up in #14583

Labels
Error Reporting Incorrect or improved errors from pandas IO JSON read_json, to_json, json_normalize