Skip to content

BUG: Fix pd.json_normalize to not skip the first element of a generator input #38698

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Dec 30, 2020
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.3.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -253,6 +253,8 @@ I/O
- Allow custom error values for parse_dates argument of :func:`read_sql`, :func:`read_sql_query` and :func:`read_sql_table` (:issue:`35185`)
- Bug in :func:`to_hdf` raising ``KeyError`` when trying to apply
for subclasses of ``DataFrame`` or ``Series`` (:issue:`33748`).
- Bug in :func:`json_normalize` resulting in the first element of a generator object not being included in the returned ``DataFrame`` (:issue:`35923`)


Period
^^^^^^
Expand Down
5 changes: 5 additions & 0 deletions pandas/io/json/_normalize.py
Original file line number Diff line number Diff line change
Expand Up @@ -267,6 +267,11 @@ def _pull_records(js: Dict[str, Any], spec: Union[List, str]) -> List:
data = [data]

if record_path is None:
if np.ndim(data) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about inserting a call to next(data) to skip past the first element?

It's always good to be as lazy as possible but tbf I don't know how much that matters here

Copy link
Contributor Author

@avinashpancham avinashpancham Dec 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arw2019 wdym with skipping past the first line with "next(data)"? Cause we do not want to skip the first element.

Below I've written down my understanding and solution to the problem. But lmk if you see this differently!

The problem here is that in the old situation the first line of an input generator was consumed and not part of the output dataframe. This was caused by this line: https://github.com/pandas-dev/pandas/blob/master/pandas/io/json/_normalize.py#L270.

By consuming the whole generator and putting it in a list we don't have that issue anymore

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay right your solution looks good

(There's actually already a TODO there about expanding the generator with nested records, as well)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah great. Wrt to the TODO, which one do you mean? I see this one: TODO: handle record value which are lists, at least error reasonably, but IMO that would justify its own PR since it is related to a different issue

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, that TODO is beyond the scope of this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just put a list(data) on L264 (you could even limit this to a Iterable type if you want)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this entire function needs to be split up into module level functions and cleaned up. I believe we have several open issues about this (but orthogonal to this PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put list(data) on L264, but limited it to type Iterator and not Iterable since we use different conversion logic for dicts

Wrt to the splitting of the modules. I will add it to my list, but first I want to address some other PRs for dtypes.

# GH35923 Fix pd.json_normalize to not skip the first element of a
# generator input
data = list(data)

if any([isinstance(x, dict) for x in y.values()] for y in data):
# naive normalization, this is idempotent for flat records
# and potentially will inflate the data considerably for
Expand Down
11 changes: 11 additions & 0 deletions pandas/tests/io/json/test_normalize.py
Original file line number Diff line number Diff line change
Expand Up @@ -518,6 +518,17 @@ def test_meta_non_iterable(self):
)
tm.assert_frame_equal(result, expected)

def test_generator(self, state_data):
# GH35923 Fix pd.json_normalize to not skip the first element of a
# generator input
def generator_data():
yield from state_data[0]["counties"]

result = json_normalize(generator_data())
expected = DataFrame(state_data[0]["counties"])

tm.assert_frame_equal(result, expected)


class TestNestedToRecord:
def test_flat_stays_flat(self):
Expand Down