Skip to content

BUG: pd.read_json May Not Maintain Numeric String Index #38727

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 4, 2021

Conversation

theoniko
Copy link
Contributor

@theoniko theoniko commented Dec 27, 2020

Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @theoniko for the PR!

Some comments. We'll also need a whatsnew for this (1.3)

and obj._get_axis(axis_name).dtype == "object"
and obj.index.inferred_type == "string"
):
index_list = list(obj._get_axis(axis_name).values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have to convert to a list here? We should be able to inspect values directly I think

and obj.index.inferred_type == "string"
):
index_list = list(obj._get_axis(axis_name).values)
are_all_strings = all(type(element) == str for element in index_list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have helpers got these kinds of checks in pandas/core/dtypes/common.py, may as well use those

are_all_int_or_floats = all(
are_int_or_floats is True for element in are_int_or_floats
)
if are_all_int_or_floats and are_all_strings:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this ever hit? only one of those can be true at any time, no?

@@ -913,7 +933,6 @@ def _try_convert_data(self, name, data, use_dtypes=True, convert_dates=True):
result = True
except (TypeError, ValueError):
pass

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls revert this to minimize the diff

@@ -192,11 +192,14 @@ def test_roundtrip_str_axes(self, orient, convert_axes, numpy, dtype):
# to disambiguate whether those keys actually were strings or numeric
# beforehand and numeric wins out.
# TODO: Split should be able to support this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we delete this TODO now?

@@ -850,6 +850,26 @@ def _convert_axes(self):
obj = self.obj
assert obj is not None # for mypy
for axis_name in obj._AXIS_ORDERS:
if (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if condition seems too specific. Is there a more generic way to handle this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as i understood the minimum preconditions to keep string index should in read_json orient equal to split or index and the index of json format to be a list of int or float strings. Am i right? Should i move the checks in _try_convert_data?
Do you have a more conrete and robust suggestion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a slightly different way. Could you please re-review?

@rhshadrach rhshadrach added Bug Dtype Conversions Unexpected or buggy dtype conversions IO JSON read_json, to_json, json_normalize labels Dec 28, 2020
@theoniko theoniko force-pushed the read-json-numeric-string-index branch from dd46945 to 42f912e Compare December 28, 2020 09:57
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to be less specialized (as indicated by comments)

@theoniko theoniko force-pushed the read-json-numeric-string-index branch 2 times, most recently from 28b31fb to bb21639 Compare December 29, 2020 12:20
@pep8speaks
Copy link

pep8speaks commented Dec 29, 2020

Hello @theoniko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-01-04 22:32:49 UTC

@theoniko theoniko requested a review from mroeschke December 29, 2020 12:21
@theoniko theoniko force-pushed the read-json-numeric-string-index branch 4 times, most recently from 01488e4 to bfd8e89 Compare December 29, 2020 13:26
expected.columns = expected.columns.astype(np.int64)
expected.index = expected.index.astype(np.int64)
elif orient == "records" and convert_axes:
expected.columns = expected.columns.astype(np.int64)
elif convert_axes and orient == "split":
expected.columns = expected.columns.astype(np.int64)
expected.index = expected.index.astype(str)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this conversion? I think the problem is that we are converting when we don't need to

@theoniko theoniko force-pushed the read-json-numeric-string-index branch from bfd8e89 to db10b73 Compare January 1, 2021 09:08
@theoniko
Copy link
Contributor Author

theoniko commented Jan 1, 2021

All requested changes were made. Please re-review the pr.

@@ -267,7 +267,7 @@ I/O
- Bug in :func:`to_hdf` raising ``KeyError`` when trying to apply
for subclasses of ``DataFrame`` or ``Series`` (:issue:`33748`).
- Bug in :func:`json_normalize` resulting in the first element of a generator object not being included in the returned ``DataFrame`` (:issue:`35923`)

- Bug in :func:`read_json` does not maintan numeric string index (:issue:`28556`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not specific enough to the actual case, pls update

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

and self.orient == "split"
and len(data)
):
result = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of using result at all, just return on a conversion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the condition to the end and removed the result variable.


if data.dtype == "object":

if (
isinstance(data, Index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think you need to test if an Index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jreback
Copy link
Contributor

jreback commented Jan 1, 2021

merge master as well

@theoniko theoniko force-pushed the read-json-numeric-string-index branch 6 times, most recently from 3ba5e89 to 0762111 Compare January 2, 2021 13:17
@theoniko theoniko requested a review from jreback January 2, 2021 16:24
except (TypeError, ValueError):
pass

return data, result
if name == "index" and self.orient == "split" and len(data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok looks a lot better. can you add a comment here and structure this more like

# if we have an index, we want to preserve dtypes
if name == "index" and len(data):
    if self.orient == "split":
        return data, False

return data, True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@theoniko theoniko force-pushed the read-json-numeric-string-index branch 2 times, most recently from 9234406 to 2e5e743 Compare January 3, 2021 17:33
@@ -272,6 +272,7 @@ I/O
- Bug in :func:`json_normalize` resulting in the first element of a generator object not being included in the returned ``DataFrame`` (:issue:`35923`)
- Bug in :func:`read_excel` forward filling :class:`MultiIndex` names with multiple header and index columns specified (:issue:`34673`)
- :func:`pandas.read_excel` now respects :func:``pandas.set_option`` (:issue:`34252`)
- Bug in :func:`read_json` when ``orient="split"`` does not maintan numeric string index (:issue:`28556`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug in :func:read_json when orient="split" does not maintain numeric string index (:issue:28556)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@theoniko theoniko force-pushed the read-json-numeric-string-index branch from 2e5e743 to f2a2785 Compare January 4, 2021 18:11
@jreback jreback added this to the 1.3 milestone Jan 4, 2021
@jreback
Copy link
Contributor

jreback commented Jan 4, 2021

lgtm pushed a small doc change and can merge on green

@jreback jreback merged commit 9bc1364 into pandas-dev:master Jan 4, 2021
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.read_json May Not Maintain Numeric String Index
7 participants