Round trip json serialization issues #22525

Closed
pazzarpj opened this issue Aug 28, 2018 · 4 comments
Labels
IO JSON read_json, to_json, json_normalize Needs Info Clarification about behavior needed to assess issue

Comments

@pazzarpj

pazzarpj commented Aug 28, 2018

Problem description

I would hope there would be a way for each of the serialization methods to store and retrieve enough metadata to reconstruct the original object.
Reading through #4889, it seems that most of the problems associated should have been resolved with 0.20 with the introduction of orient='table'.

When writing tests for this, it became apparent that several circumstances could produce issues.

tl;dr The schema as dumped and read is inconsistent, producing problematic behavior

Example 1: Empty DataFrame

import pandas
from pandas.util.testing import assert_frame_equal
df = pandas.DataFrame()
assert_frame_equal(df, pandas.read_json(df.to_json(orient='table'), orient='table'))
AssertionError: DataFrame.index are different

DataFrame.index classes are not equivalent
[left]:  Float64Index([], dtype='float64')
[right]: Index([], dtype='object')

Inspecting the to_json output, we get:

df.to_json(orient='table')
'{"schema": {
    "fields":[{"name":"index","type":"string"}],
    "primaryKey":["index"],"pandas_version":"0.20.0"}, "data": []}'

According to http://pandas-docs.github.io/pandas-docs-travis/io.html#io-table-schema
the "type":"string" should indicate the pandas dtype "object". DataFrame.to_json is not respecting the in-memory type and is converting the index type from float64 to object.
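To make the comparison concrete, here is a minimal sketch that reads the schema back with the standard library and translates each declared type to a pandas dtype name. The `SCHEMA_TO_DTYPE` mapping is an assumption taken from the linked documentation and is deliberately partial; it is not pandas' own lookup table.

```python
import json

# Output of df.to_json(orient='table') for the empty DataFrame, copied from above.
payload = (
    '{"schema": {"fields":[{"name":"index","type":"string"}],'
    '"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": []}'
)

# Partial, assumed mapping from Table Schema types to pandas dtype names.
# Only the types that appear in this issue are listed.
SCHEMA_TO_DTYPE = {
    "integer": "int64",
    "number": "float64",
    "boolean": "bool",
    "string": "object",
}

schema = json.loads(payload)["schema"]

# Translate each declared field type to the dtype a reader would be told to use.
declared = {field["name"]: SCHEMA_TO_DTYPE[field["type"]] for field in schema["fields"]}
print(declared)
```

Under that mapping the schema declares the index as object, which is the dtype reported for one side of the failed assertion.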

Example 2: Int DataFrame

import pandas
from pandas.util.testing import assert_frame_equal
df = pandas.DataFrame([1])
assert_frame_equal(df, pandas.read_json(df.to_json(orient='table'), orient='table'))

ValueError: Cannot convert non-finite values (NA or inf) to integer

I'm not sure exactly what is going on here.
If I drop the orient='table', it creates a valid DataFrame with equal values, but the assertion still fails because of the inferred index data types.
Inspecting the to_json output, we get:

df.to_json(orient='table')
'{"schema": {
    "fields":[{"name":"index","type":"integer"},{"name":0,"type":"integer"}],
    "primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"0":1}]}'

So now we correctly encode the type as integer but we can't decode it.
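One detail in the output above may be relevant: the schema names the column 0 (an integer), while the data record uses the key "0" (a string). JSON object keys are always strings, so an integer column label cannot survive encoding unchanged. The sketch below shows this with the standard library only; whether this mismatch is what trips up read_json is an assumption, not something confirmed here.

```python
import json

# A record keyed by the integer column label 0, as in pandas.DataFrame([1]).
record = {0: 1}

encoded = json.dumps(record)   # JSON has no integer keys, so 0 is written as "0"
decoded = json.loads(encoded)

print(encoded)
print(0 in decoded)     # the integer label no longer matches any key
print("0" in decoded)   # only the string form is present
```

A reader that looks up the schema name 0 in the decoded record would find nothing, which would be consistent with a missing value being produced for that column.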

Example 3: Float DataFrame

import pandas
from pandas.util.testing import assert_frame_equal
df = pandas.DataFrame([1.0])
assert_frame_equal(df, pandas.read_json(df.to_json(orient='table'), orient='table'))
AssertionError: DataFrame.iloc[:, 0] are different

DataFrame.iloc[:, 0] values are different (100.0 %)
[left]:  [1.0]
[right]: [nan]

Now the float value 1.0 is silently converted to nan. This is probably related to the issue in Example 2, but in this case the column can hold nan because it is a float column instead of an integer one.
Inspecting the JSON output:

df.to_json(orient='table')
'{"schema": {
    "fields":[{"name":"index","type":"integer"},{"name":0,"type":"number"}],
    "primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"0":1.0}]}'

Code to replicate most cases that could fail

from hypothesis import given, strategies as st
from hypothesis.extra.pandas import data_frames, column
from pandas.util.testing import assert_frame_equal
import pandas

# Generate dataframe columns
st_columns = st.builds(column, st.text(min_size=1), dtype=st.sampled_from([int, float, str, bool]))

# Generate a list of columns but ensure each column name is unique
st_lst_of_columns = st.lists(st_columns, unique_by=lambda x: x.name, min_size=1)

# Pass the list of random columns into the dataframe strategy
st_random_df = st_lst_of_columns.flatmap(data_frames)

@given(st_random_df)
def test_dataframe(df):
    assert_frame_equal(df, df)

@given(st_random_df)
def test_round_trip_json_dataframe(df):
    assert_frame_equal(df, pandas.read_json(df.to_json(orient='table'), orient='table'))

Output of pd.show_versions()

INSTALLED VERSIONS ------------------

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.7.2
pip: 18.0
setuptools: 40.2.0
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.7.7
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@pazzarpj
Author

I have just noticed this note in io.json.json.read_json:

Notes
-----
Specific to ``orient='table'``, if a :class:`DataFrame` with a literal
:class:`Index` name of `index` gets written with :func:`to_json`, the
subsequent read operation will incorrectly set the :class:`Index` name to
``None``. This is because `index` is also used by :func:`DataFrame.to_json`
to denote a missing :class:`Index` name, and the subsequent
:func:`read_json` operation cannot distinguish between the two. The same
limitation is encountered with a :class:`MultiIndex` and any names
beginning with ``'level_'``.

From that I can see that this has been noted during development.
Assuming that to be true, does this also cover the case of the numbers being converted to nan?

@WillAyd
Member

WillAyd commented Aug 28, 2018

A smattering of issues here but generally I don't think any of these can be supported due to having a numeric column name (see #19129). Can you try assigning a non-numeric column name and see if that resolves?

Error messages can certainly be improved; aforementioned issue is still open if you want to take a look
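A quick way to check whether a given payload is affected by the numeric column name problem is to look for schema field names that never occur as a key in the data records. The helper below, `unmatched_fields`, is a hypothetical name introduced for this sketch and is not part of pandas; it uses only the standard library and the Example 2 output from this issue.

```python
import json

def unmatched_fields(table_json):
    """Return schema field names that never appear as a key in any data record."""
    table = json.loads(table_json)
    names = [field["name"] for field in table["schema"]["fields"]]
    records = table["data"]
    return [name for name in names if not any(name in rec for rec in records)]

# Example 2 from this issue: the column label is the integer 0.
numeric_payload = (
    '{"schema": {"fields":[{"name":"index","type":"integer"},'
    '{"name":0,"type":"integer"}],'
    '"primaryKey":["index"],"pandas_version":"0.20.0"}, '
    '"data": [{"index":0,"0":1}]}'
)

# Same shape, but with a non-numeric (string) column name "a".
named_payload = (
    '{"schema": {"fields":[{"name":"index","type":"integer"},'
    '{"name":"a","type":"integer"}],'
    '"primaryKey":["index"],"pandas_version":"0.20.0"}, '
    '"data": [{"index":0,"a":1}]}'
)

print(unmatched_fields(numeric_payload))
print(unmatched_fields(named_payload))
```

Note that a payload with no data records reports every field as unmatched, so the check is only meaningful for non-empty frames.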

@WillAyd WillAyd added IO JSON read_json, to_json, json_normalize Needs Info Clarification about behavior needed to assess issue labels Aug 28, 2018
@pazzarpj
Author

Thanks Will,
I was in the middle of investigating that possibility. I hadn't seen #19129; it seems that they are closely related, if not the same issue.

Give me some time to investigate whether all of the issues stem from #19129 or whether any new bugs have been discovered here.

@mroeschke
Member

I agree that this looks like a duplicate of #19129. If not, we can reopen or create a new issue.
