Round trip json serialization issues #22525

pazzarpj · 2018-08-28T06:02:29Z

Problem description

I would hope that each of the serialization methods could store and retrieve enough metadata to reconstruct the original object.
Reading through #4889, it seems that most of the associated problems should have been resolved in 0.20 with the introduction of orient='table'.

When writing tests for this, it became apparent that several circumstances could produce issues.

tl;dr: The schema as dumped and as read is inconsistent, which produces problematic behavior.

Example 1: Empty DataFrame

import pandas
from pandas.util.testing import assert_frame_equal
df = pandas.DataFrame()
assert_frame_equal(df, pandas.read_json(df.to_json(orient='table'), orient='table'))
AssertionError: DataFrame.index are different

DataFrame.index classes are not equivalent
[left]:  Float64Index([], dtype='float64')
[right]: Index([], dtype='object')

Inspecting the to_json output, we get:

df.to_json(orient='table')
'{"schema": {
    "fields":[{"name":"index","type":"string"}],
    "primaryKey":["index"],"pandas_version":"0.20.0"}, "data": []}'

According to http://pandas-docs.github.io/pandas-docs-travis/io.html#io-table-schema
"type":"string" should indicate the pandas dtype object. DataFrame.to_json is not respecting the type in memory and is converting the type from float64 to object.
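
To see exactly what the writer declared, the schema can be read back with the standard library alone. The following is a minimal sketch that parses the output shown above (copied into a string) and lists each field's declared type. The names `payload` and `declared_types` are illustrative and not part of pandas.

```python
import json

# The orient='table' output for the empty DataFrame, copied from above.
payload = (
    '{"schema": {"fields":[{"name":"index","type":"string"}],'
    '"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": []}'
)

document = json.loads(payload)

# Map each schema field name to its declared Table Schema type.
declared_types = {
    field["name"]: field["type"] for field in document["schema"]["fields"]
}

print(declared_types)    # {'index': 'string'}
print(document["data"])  # []
```

The only field is the index, and it is declared as "string" even though there are no rows from which a type could be inferred.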

Example 2: Int DataFrame

import pandas
from pandas.util.testing import assert_frame_equal
df = pandas.DataFrame([1])
assert_frame_equal(df, pandas.read_json(df.to_json(orient='table'), orient='table'))

ValueError: Cannot convert non-finite values (NA or inf) to integer

I'm not sure exactly what is going on here.
If I drop orient='table', it creates a valid DataFrame with equal values, but the assertion still fails because of the inferred index data types.
Inspecting the to_json output, we get:

df.to_json(orient='table')
'{"schema": {
    "fields":[{"name":"index","type":"integer"},{"name":0,"type":"integer"}],
    "primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"0":1}]}'

So now we correctly encode the type as integer but we can't decode it.

Example 3: Float DataFrame

import pandas
from pandas.util.testing import assert_frame_equal
df = pandas.DataFrame([1.0])
assert_frame_equal(df, pandas.read_json(df.to_json(orient='table'), orient='table'))
AssertionError: DataFrame.iloc[:, 0] are different

DataFrame.iloc[:, 0] values are different (100.0 %)
[left]:  [1.0]
[right]: [nan]

Now the float value 1.0 is silently converted to nan. This is probably related to the Example 2 issue, but in this case the index can handle nan as it is a float index instead of integer.
Inspecting the JSON output:

df.to_json(orient='table')
'{"schema": {
    "fields":[{"name":"index","type":"integer"},{"name":0,"type":"number"}],
    "primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"0":1.0}]}'
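
One detail stands out in the outputs of Examples 2 and 3: the schema records the column name as the integer 0, while the data rows use the string key "0", because JSON object keys are always strings. Whether this mismatch is what causes the failures is an assumption at this point, not something established here. A minimal standard-library sketch, using the Example 3 output with its braces and brackets balanced (`schema_names`, `row_keys` and `unmatched` are illustrative names):

```python
import json

# Example 3 output, with balanced braces and brackets so that it parses.
payload = (
    '{"schema": {"fields":[{"name":"index","type":"integer"},'
    '{"name":0,"type":"number"}],'
    '"primaryKey":["index"],"pandas_version":"0.20.0"}, '
    '"data": [{"index":0,"0":1.0}]}'
)
document = json.loads(payload)

# Field names as declared in the schema, and keys as they appear in a row.
schema_names = [field["name"] for field in document["schema"]["fields"]]
row_keys = list(document["data"][0])

print(schema_names)  # ['index', 0]
print(row_keys)      # ['index', '0']

# Schema names with no identically typed key in the data row.
unmatched = [name for name in schema_names if name not in row_keys]
print(unmatched)     # [0]
```

A reader that looks up columns by the schema's names would not find a column 0 among the parsed keys, which would be consistent with the nan values seen above.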

Code to replicate most cases that could fail

from hypothesis import given, strategies as st
from hypothesis.extra.pandas import data_frames, column
from pandas.util.testing import assert_frame_equal
import pandas

# Generate dataframe columns
st_columns = st.builds(column, st.text(min_size=1), dtype=st.sampled_from([int, float, str, bool]))

# Generate a list of columns but ensure the column name is unique
st_lst_of_columns = st.lists(st_columns, unique_by=lambda x: x.name, min_size=1)

# Pass a list of random columns into the dataframe strategy
st_random_df = st_lst_of_columns.flatmap(data_frames)

@given(st_random_df)
def test_dataframe(df):
    assert_frame_equal(df, df)

@given(st_random_df)
def test_round_trip_json_dataframe(df):
    assert_frame_equal(df, pandas.read_json(df.to_json(orient='table'), orient='table'))

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.7.2
pip: 18.0
setuptools: 40.2.0
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.7.7
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

pazzarpj · 2018-08-28T08:31:37Z

I have just noticed this note in the io.json.json.read_json docstring:

Notes
-----
Specific to ``orient='table'``, if a :class:`DataFrame` with a literal
:class:`Index` name of `index` gets written with :func:`to_json`, the
subsequent read operation will incorrectly set the :class:`Index` name to
``None``. This is because `index` is also used by :func:`DataFrame.to_json`
to denote a missing :class:`Index` name, and the subsequent
:func:`read_json` operation cannot distinguish between the two. The same
limitation is encountered with a :class:`MultiIndex` and any names
beginning with ``'level_'``.

From that I can see that this has been noted during development.
Assuming that to be true, does this also cover the case where the numbers are converted to nan?

WillAyd · 2018-08-28T16:41:41Z

A smattering of issues here, but generally I don't think any of these can be supported due to having a numeric column name (see #19129). Can you try assigning a non-numeric column name and see if that resolves it?

Error messages can certainly be improved; aforementioned issue is still open if you want to take a look
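
The suggestion above can be tried with a short sketch like the following. It assumes a pandas version whose `read_json` accepts a file-like object; the column name `value` is arbitrary, and the result has not been checked against 0.23.4 specifically.

```python
import io

import pandas

# Same data as Example 3, but with a string column name instead of 0.
df = pandas.DataFrame({"value": [1.0]})

# Write with the table schema, then read it back from a file-like object.
serialized = df.to_json(orient="table")
restored = pandas.read_json(io.StringIO(serialized), orient="table")

# Compare column names and values before and after the round trip.
print(list(df.columns), list(restored.columns))
print(df["value"].tolist(), restored["value"].tolist())
```

If the numeric column name is the cause, the values should survive the round trip here, whereas Example 3 produced nan.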

pazzarpj · 2018-08-28T23:30:22Z

Thanks Will,
I was in the middle of investigating that possibility. I hadn't seen #19129. It seems that they are closely related, if not the same issue.

Give me some time to investigate whether all of the issues stem from #19129 or whether new bugs have been discovered here.

mroeschke · 2020-01-21T04:20:31Z

I agree that this looks like a duplicate of #19129. If not, we can reopen or create a new issue.

WillAyd added the IO JSON (read_json, to_json, json_normalize) and Needs Info (Clarification about behavior needed to assess issue) labels Aug 28, 2018

albertvillanova pushed a commit to albertvillanova/pandas that referenced this issue Feb 28, 2019

Fix JSON orient='table' issues for numeric column names (pandas-dev#1…

c88affc

…9129) (pandas-dev#22525)

albertvillanova mentioned this issue Feb 28, 2019

Fix JSON orient='table' issues with numeric column names #25488

Closed

3 tasks

mroeschke closed this as completed Jan 21, 2020
