Skip to content

BUG: pandas to_json with orient "table" returns wrong schema & data string #38256

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
2 of 3 tasks
tgrandje opened this issue Dec 3, 2020 · 8 comments
Open
2 of 3 tasks
Labels
Bug IO JSON read_json, to_json, json_normalize

Comments

@tgrandje
Copy link

tgrandje commented Dec 3, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

test = pd.DataFrame([[1,2,3],[4,5,6]], columns=[1, 2, 3])
s = test.to_json(orient="table")
print(s)
# wrong string :
# '{"schema":{"fields":[{"name":"index","type":"integer"},{"name":1,"type":"integer"},{"name":2,"type":"integer"},{"name":3,"type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"1":1,"2":2,"3":3},{"index":1,"1":4,"2":5,"3":6}]}'
pd.read_json(s, orient="table")
#ValueError: Cannot convert non-finite values (NA or inf) to integer

Problem description

When the initial columns are integers, the schema dict returns correct names (that unquoted integers), but the data dict identifies columns as string (quoted integers). Therefore, any loaded dataframe from this json format will return a dataframe full of empty (NaN) values or fail with an exception (I don't know which triggers which ; this minimal example here will trigger an exception ; my original dataset with multiindexes in stackoverflow returned an empty dataframe...

Expected Output

This output for pandas.to_json(orient="table") could be read (though it is losing the "int" label key and transforming it to strings) :

'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"1","type":"integer"},{"name":"2","type":"integer"},{"name":"3","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.0.20"},"data":[{"index":0,"1":1,"2":2,"3":3},{"index":1,"1":4,"2":5,"3":6}]}'

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 67a3d42
python : 3.6.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : fr
LOCALE : None.None

pandas : 1.1.4
numpy : 1.18.4
pytz : 2017.2
dateutil : 2.8.1
pip : 20.2.4
setuptools : 36.6.0
Cython : 0.27.2
pytest : 3.2.3
hypothesis : None
sphinx : 1.6.5
blosc : 1.5.1
feather : 0.4.0
xlsxwriter : 1.0.2
lxml.etree : 4.1.0
html5lib : 0.9999999
pymysql : None
psycopg2 : None
jinja2 : 2.9.6
IPython : 6.2.1
pandas_datareader: None
bs4 : 4.6.0
bottleneck : 1.2.1
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 2.2.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.7.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.1.14
tables : None
tabulate : 0.8.5
xarray : 0.9.6
xlrd : 1.1.0
xlwt : None
numba : 0.35.0

@tgrandje tgrandje added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 3, 2020
@jreback
Copy link
Contributor

jreback commented Dec 3, 2020

show the code for reading

the pandas_version is correct ; it's the last time this format changed (this is documented)

@tgrandje
Copy link
Author

tgrandje commented Dec 3, 2020

This will work (all strings) :
pd.read_json('{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"1","type":"integer"},{"name":"2","type":"integer"},{"name":"3","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.0.20"},"data":[{"index":0,"1":1,"2":2,"3":3},{"index":1,"1":4,"2":5,"3":6}]}', orient="table")

This won't work (all ints, I get ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.) :
pd.read_json('{"schema":{"fields":[{"name":"index","type":"integer"},{"name":1,"type":"integer"},{"name":2,"type":"integer"},{"name":3,"type":"integer"}],"primaryKey":["index"],"pandas_version":"1.1.4"},"data":[{"index":0,1:1,2:2,3:3},{"index":1,1:4,2:5,3:6}]}')

(Note that I made a mistake : this last code was the one I posted before as "expected output" but there are obviously some things about the 'mixing dicts' exception which are out of my level.)

This won't work either (with the string worked out by pandas from the to_json as posted before) :
pd.read_json('{"schema":{"fields":[{"name":"index","type":"integer"},{"name":1,"type":"integer"},{"name":2,"type":"integer"},{"name":3,"type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"1":1,"2":2,"3":3},{"index":1,"1":4,"2":5,"3":6}]}', orient="table")

The last one raise an exception (this was intended to be a minimal reproduceable example) ; the one I tested at first (see this post in stackoverflow) returned a dataframe with empty values.

@jreback
Copy link
Contributor

jreback commented Dec 3, 2020

pls edit the top post and show only a complete round trip - and where it fails

this is not designed to be hand edited - integer keys are also not allowed (so puzzled where that is coming from)

@tgrandje
Copy link
Author

tgrandje commented Dec 3, 2020

Just edited the post. Though the error here is not the exception raised when reading the json, it is the way pandas writes it (so the exception is perfectly right and not related to this problem).

The int keys (which are not usual in pandas either, I know) came from a 'melt' method ; if key integers are strictly forbidden by the json format, maybe we should insert something like this in the to_json method :

if any([isinstance(x, int) for x in df.columns]):
  #raise Exception("int columns are not allowed") / warnings.warn("int columns will be transformed to strings")

@arw2019
Copy link
Member

arw2019 commented Dec 9, 2020

Is this the minimal reproducer that's supposed to roundtrip?

In [10]: import pandas as pd
    ...: import pandas._testing as tm
    ...: 
    ...: df = pd.DataFrame([[1,2,3],[4,5,6]], columns=[1, 2, 3])
    ...: json = test.to_json(orient="table")
    ...: res = pd.read_json(json, orient="table")
    ...: tm.assert_frame_equal(res, df)

Currently it throws:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-85dad6d132f6> in <module>
      4 df = pd.DataFrame([[1,2,3],[4,5,6]], columns=[1, 2, 3])
      5 json = test.to_json(orient="table")
----> 6 res = pd.read_json(json, orient="table")
      7 tm.assert_frame_equal(res, df)

~/repos/pandas/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200 
    201         return cast(F, wrapper)

~/repos/pandas/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    297                 )
    298                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 299             return func(*args, **kwargs)
    300 
    301         return wrapper

~/repos/pandas/pandas/io/json/_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression, nrows, storage_options)
    561 
    562     with json_reader:
--> 563         return json_reader.read()
    564 
    565 

~/repos/pandas/pandas/io/json/_json.py in read(self)
    692                 obj = self._get_object_parser(self._combine_lines(data_lines))
    693         else:
--> 694             obj = self._get_object_parser(self.data)
    695         self.close()
    696         return obj

~/repos/pandas/pandas/io/json/_json.py in _get_object_parser(self, json)
    714         obj = None
    715         if typ == "frame":
--> 716             obj = FrameParser(json, **kwargs).parse()
    717 
    718         if typ == "series" or obj is None:

~/repos/pandas/pandas/io/json/_json.py in parse(self)
    829 
    830         else:
--> 831             self._parse_no_numpy()
    832 
    833         if self.obj is None:

~/repos/pandas/pandas/io/json/_json.py in _parse_no_numpy(self)
   1093             )
   1094         elif orient == "table":
-> 1095             self.obj = parse_table_schema(json, precise_float=self.precise_float)
   1096         else:
   1097             self.obj = DataFrame(

~/repos/pandas/pandas/io/json/_table_schema.py in parse_table_schema(json, precise_float)
    330         )
    331 
--> 332     df = df.astype(dtypes)
    333 
    334     if "primaryKey" in table["schema"]:

~/repos/pandas/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5855                 if col_name in dtype:
   5856                     results.append(
-> 5857                         col.astype(dtype=dtype[col_name], copy=copy, errors=errors)
   5858                     )
   5859                 else:

~/repos/pandas/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5870         else:
   5871             # else, only a single dtype is given
-> 5872             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5873             return self._constructor(new_data).__finalize__(self, method="astype")
   5874 

~/repos/pandas/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    629         self, dtype, copy: bool = False, errors: str = "raise"
    630     ) -> "BlockManager":
--> 631         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    632 
    633     def convert(

~/repos/pandas/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    425                     applied = b.apply(f, **kwargs)
    426                 else:
--> 427                     applied = getattr(b, f)(**kwargs)
    428             except (TypeError, NotImplementedError):
    429                 if not ignore_failures:

~/repos/pandas/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    671             vals1d = values.ravel()
    672             try:
--> 673                 values = astype_nansafe(vals1d, dtype, copy=True)
    674             except (ValueError, TypeError):
    675                 # e.g. astype_nansafe can fail on object-dtype of strings

~/repos/pandas/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1062 
   1063         if not np.isfinite(arr).all():
-> 1064             raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
   1065 
   1066     elif is_object_dtype(arr):

ValueError: Cannot convert non-finite values (NA or inf) to integer

@arw2019 arw2019 added IO JSON read_json, to_json, json_normalize and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 9, 2020
@tgrandje
Copy link
Author

tgrandje commented Dec 10, 2020

@arw2019
Yes and no : yes this is the whole minimal example (I added the read_json as aked by @jreback).
Though as I said, the exception thrown is quite normal as the (silent) error occurs in "to_json". In fact, it casts the columns' labels to string in data, but those stay as integer in schema. So it won't be decoded after that.

EDIT :
I'd say the problem lies in file pandas/pandas/io/json/_table_schema.py and is still there in current master (though I haven't tested it):
new_field["name"] = name (line 256)
name = arr.name (line 109)

I can't figure how the objToJSON command from pandas._libs.json is working as I don't know C (best update of the code might be to mimic it as it seems to be used to parse the data).

@Wolf-Byte
Copy link

Is there any update on this? I ran into the same issue, heres a simple round trip to replicate the issue.

import pandas as pd

# List
arr = ["123"]

# Create the dataframe
dataframe = pd.DataFrame(arr)
print(dataframe)

# Get the table as a schema
dataframe_schema = dataframe.to_json(orient='table')
print(dataframe_schema)

# Load the DataFrame from the json object
dataframe = pd.read_json(dataframe_schema, orient='table')
print(dataframe)

Output

     0
0  123

{"schema": {"fields":[{"name":"index","type":"integer"},{"name":0,"type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"0":"123"}]}

     0
0  NaN

For anyone else with the same issue, as a workaround I am casting columns names in the schema output to strings.

import pandas as pd

# List
arr = ["123"]

# Create the dataframe
dataframe = pd.DataFrame(arr)
print(dataframe)

# Get the table as a schema
dataframe_schema = json.loads(dataframe.to_json(orient='table'))

# BUG FIX: Loop over the schema fields
for field in dataframe_schema.get("schema").get("fields"): 
    # Cast the column name to a string
    field["name"] = str(field.get("name"))
    
# Dump the object to a string
dataframe_schema_str = json.dumps(dataframe_schema)
print(dataframe_schema_str)

dataframe = pd.read_json(dataframe_schema_str, orient='table')
print(dataframe)

@jreback
Copy link
Contributor

jreback commented Nov 29, 2021

@Wolf-Byte updates happen when community folks push PRs

the core team can provide review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

No branches or pull requests

4 participants