BUG: pandas to_json with orient "table" returns wrong schema & data string #38256

tgrandje · 2020-12-03T08:06:11Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

test = pd.DataFrame([[1,2,3],[4,5,6]], columns=[1, 2, 3])
s = test.to_json(orient="table")
print(s)
# wrong string :
# '{"schema":{"fields":[{"name":"index","type":"integer"},{"name":1,"type":"integer"},{"name":2,"type":"integer"},{"name":3,"type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"1":1,"2":2,"3":3},{"index":1,"1":4,"2":5,"3":6}]}'
pd.read_json(s, orient="table")
#ValueError: Cannot convert non-finite values (NA or inf) to integer

Problem description

When the initial columns are integers, the schema dict returns correct names (that unquoted integers), but the data dict identifies columns as string (quoted integers). Therefore, any loaded dataframe from this json format will return a dataframe full of empty (NaN) values or fail with an exception (I don't know which triggers which ; this minimal example here will trigger an exception ; my original dataset with multiindexes in stackoverflow returned an empty dataframe...

Expected Output

This output for pandas.to_json(orient="table") could be read (though it is losing the "int" label key and transforming it to strings) :

'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"1","type":"integer"},{"name":"2","type":"integer"},{"name":"3","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.0.20"},"data":[{"index":0,"1":1,"2":2,"3":3},{"index":1,"1":4,"2":5,"3":6}]}'

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 67a3d42
python : 3.6.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : fr
LOCALE : None.None

pandas : 1.1.4
numpy : 1.18.4
pytz : 2017.2
dateutil : 2.8.1
pip : 20.2.4
setuptools : 36.6.0
Cython : 0.27.2
pytest : 3.2.3
hypothesis : None
sphinx : 1.6.5
blosc : 1.5.1
feather : 0.4.0
xlsxwriter : 1.0.2
lxml.etree : 4.1.0
html5lib : 0.9999999
pymysql : None
psycopg2 : None
jinja2 : 2.9.6
IPython : 6.2.1
pandas_datareader: None
bs4 : 4.6.0
bottleneck : 1.2.1
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 2.2.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.7.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.1.14
tables : None
tabulate : 0.8.5
xarray : 0.9.6
xlrd : 1.1.0
xlwt : None
numba : 0.35.0

The text was updated successfully, but these errors were encountered:

jreback · 2020-12-03T11:45:47Z

show the code for reading

the pandas_version is correct ; it's the last time this format changed (this is documented)

tgrandje · 2020-12-03T11:57:53Z

This will work (all strings) :
pd.read_json('{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"1","type":"integer"},{"name":"2","type":"integer"},{"name":"3","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.0.20"},"data":[{"index":0,"1":1,"2":2,"3":3},{"index":1,"1":4,"2":5,"3":6}]}', orient="table")

This won't work (all ints, I get ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.) :
pd.read_json('{"schema":{"fields":[{"name":"index","type":"integer"},{"name":1,"type":"integer"},{"name":2,"type":"integer"},{"name":3,"type":"integer"}],"primaryKey":["index"],"pandas_version":"1.1.4"},"data":[{"index":0,1:1,2:2,3:3},{"index":1,1:4,2:5,3:6}]}')

(Note that I made a mistake : this last code was the one I posted before as "expected output" but there are obviously some things about the 'mixing dicts' exception which are out of my level.)

This won't work either (with the string worked out by pandas from the to_json as posted before) :
pd.read_json('{"schema":{"fields":[{"name":"index","type":"integer"},{"name":1,"type":"integer"},{"name":2,"type":"integer"},{"name":3,"type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"1":1,"2":2,"3":3},{"index":1,"1":4,"2":5,"3":6}]}', orient="table")

The last one raise an exception (this was intended to be a minimal reproduceable example) ; the one I tested at first (see this post in stackoverflow) returned a dataframe with empty values.

jreback · 2020-12-03T12:03:32Z

pls edit the top post and show only a complete round trip - and where it fails

this is not designed to be hand edited - integer keys are also not allowed (so puzzled where that is coming from)

tgrandje · 2020-12-03T13:00:42Z

Just edited the post. Though the error here is not the exception raised when reading the json, it is the way pandas writes it (so the exception is perfectly right and not related to this problem).

The int keys (which are not usual in pandas either, I know) came from a 'melt' method ; if key integers are strictly forbidden by the json format, maybe we should insert something like this in the to_json method :

if any([isinstance(x, int) for x in df.columns]):
  #raise Exception("int columns are not allowed") / warnings.warn("int columns will be transformed to strings")

arw2019 · 2020-12-09T05:03:15Z

Is this the minimal reproducer that's supposed to roundtrip?

In [10]: import pandas as pd
    ...: import pandas._testing as tm
    ...: 
    ...: df = pd.DataFrame([[1,2,3],[4,5,6]], columns=[1, 2, 3])
    ...: json = test.to_json(orient="table")
    ...: res = pd.read_json(json, orient="table")
    ...: tm.assert_frame_equal(res, df)

Currently it throws:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-85dad6d132f6> in <module>
      4 df = pd.DataFrame([[1,2,3],[4,5,6]], columns=[1, 2, 3])
      5 json = test.to_json(orient="table")
----> 6 res = pd.read_json(json, orient="table")
      7 tm.assert_frame_equal(res, df)

~/repos/pandas/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200 
    201         return cast(F, wrapper)

~/repos/pandas/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    297                 )
    298                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 299             return func(*args, **kwargs)
    300 
    301         return wrapper

~/repos/pandas/pandas/io/json/_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression, nrows, storage_options)
    561 
    562     with json_reader:
--> 563         return json_reader.read()
    564 
    565 

~/repos/pandas/pandas/io/json/_json.py in read(self)
    692                 obj = self._get_object_parser(self._combine_lines(data_lines))
    693         else:
--> 694             obj = self._get_object_parser(self.data)
    695         self.close()
    696         return obj

~/repos/pandas/pandas/io/json/_json.py in _get_object_parser(self, json)
    714         obj = None
    715         if typ == "frame":
--> 716             obj = FrameParser(json, **kwargs).parse()
    717 
    718         if typ == "series" or obj is None:

~/repos/pandas/pandas/io/json/_json.py in parse(self)
    829 
    830         else:
--> 831             self._parse_no_numpy()
    832 
    833         if self.obj is None:

~/repos/pandas/pandas/io/json/_json.py in _parse_no_numpy(self)
   1093             )
   1094         elif orient == "table":
-> 1095             self.obj = parse_table_schema(json, precise_float=self.precise_float)
   1096         else:
   1097             self.obj = DataFrame(

~/repos/pandas/pandas/io/json/_table_schema.py in parse_table_schema(json, precise_float)
    330         )
    331 
--> 332     df = df.astype(dtypes)
    333 
    334     if "primaryKey" in table["schema"]:

~/repos/pandas/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5855                 if col_name in dtype:
   5856                     results.append(
-> 5857                         col.astype(dtype=dtype[col_name], copy=copy, errors=errors)
   5858                     )
   5859                 else:

~/repos/pandas/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5870         else:
   5871             # else, only a single dtype is given
-> 5872             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5873             return self._constructor(new_data).__finalize__(self, method="astype")
   5874 

~/repos/pandas/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    629         self, dtype, copy: bool = False, errors: str = "raise"
    630     ) -> "BlockManager":
--> 631         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    632 
    633     def convert(

~/repos/pandas/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    425                     applied = b.apply(f, **kwargs)
    426                 else:
--> 427                     applied = getattr(b, f)(**kwargs)
    428             except (TypeError, NotImplementedError):
    429                 if not ignore_failures:

~/repos/pandas/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    671             vals1d = values.ravel()
    672             try:
--> 673                 values = astype_nansafe(vals1d, dtype, copy=True)
    674             except (ValueError, TypeError):
    675                 # e.g. astype_nansafe can fail on object-dtype of strings

~/repos/pandas/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1062 
   1063         if not np.isfinite(arr).all():
-> 1064             raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
   1065 
   1066     elif is_object_dtype(arr):

ValueError: Cannot convert non-finite values (NA or inf) to integer

tgrandje · 2020-12-10T09:52:01Z

@arw2019
Yes and no : yes this is the whole minimal example (I added the read_json as aked by @jreback).
Though as I said, the exception thrown is quite normal as the (silent) error occurs in "to_json". In fact, it casts the columns' labels to string in data, but those stay as integer in schema. So it won't be decoded after that.

EDIT :
I'd say the problem lies in file pandas/pandas/io/json/_table_schema.py and is still there in current master (though I haven't tested it):
new_field["name"] = name (line 256)
name = arr.name (line 109)

I can't figure how the objToJSON command from pandas._libs.json is working as I don't know C (best update of the code might be to mimic it as it seems to be used to parse the data).

Wolf-Byte · 2021-11-29T20:50:18Z

Is there any update on this? I ran into the same issue, heres a simple round trip to replicate the issue.

import pandas as pd

# List
arr = ["123"]

# Create the dataframe
dataframe = pd.DataFrame(arr)
print(dataframe)

# Get the table as a schema
dataframe_schema = dataframe.to_json(orient='table')
print(dataframe_schema)

# Load the DataFrame from the json object
dataframe = pd.read_json(dataframe_schema, orient='table')
print(dataframe)

Output

     0
0  123

{"schema": {"fields":[{"name":"index","type":"integer"},{"name":0,"type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"0":"123"}]}

     0
0  NaN

For anyone else with the same issue, as a workaround I am casting columns names in the schema output to strings.

import pandas as pd

# List
arr = ["123"]

# Create the dataframe
dataframe = pd.DataFrame(arr)
print(dataframe)

# Get the table as a schema
dataframe_schema = json.loads(dataframe.to_json(orient='table'))

# BUG FIX: Loop over the schema fields
for field in dataframe_schema.get("schema").get("fields"): 
    # Cast the column name to a string
    field["name"] = str(field.get("name"))
    
# Dump the object to a string
dataframe_schema_str = json.dumps(dataframe_schema)
print(dataframe_schema_str)

dataframe = pd.read_json(dataframe_schema_str, orient='table')
print(dataframe)

jreback · 2021-11-29T23:05:06Z

@Wolf-Byte updates happen when community folks push PRs

the core team can provide review

tgrandje added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 3, 2020

arw2019 added IO JSON read_json, to_json, json_normalize and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 9, 2020

jmg-duarte mentioned this issue Sep 10, 2022

BUG: pd.read_json sets wrong value for numeric column names #40674

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: pandas to_json with orient "table" returns wrong schema & data string #38256

BUG: pandas to_json with orient "table" returns wrong schema & data string #38256

tgrandje commented Dec 3, 2020 •

edited

Loading

INSTALLED VERSIONS

jreback commented Dec 3, 2020

tgrandje commented Dec 3, 2020

jreback commented Dec 3, 2020

tgrandje commented Dec 3, 2020

arw2019 commented Dec 9, 2020

tgrandje commented Dec 10, 2020 •

edited

Loading

Wolf-Byte commented Nov 29, 2021

jreback commented Nov 29, 2021

BUG: pandas to_json with orient "table" returns wrong schema & data string #38256

BUG: pandas to_json with orient "table" returns wrong schema & data string #38256

Comments

tgrandje commented Dec 3, 2020 • edited Loading

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jreback commented Dec 3, 2020

tgrandje commented Dec 3, 2020

jreback commented Dec 3, 2020

tgrandje commented Dec 3, 2020

arw2019 commented Dec 9, 2020

tgrandje commented Dec 10, 2020 • edited Loading

Wolf-Byte commented Nov 29, 2021

jreback commented Nov 29, 2021

tgrandje commented Dec 3, 2020 •

edited

Loading

Output of `pd.show_versions()`

tgrandje commented Dec 10, 2020 •

edited

Loading