Raise ValueError for read_json and orient='table' With Numeric Column Names #19129

Open
WillAyd opened this issue Jan 8, 2018 · 20 comments
Labels
Enhancement Error Reporting Incorrect or improved errors from pandas good first issue IO JSON read_json, to_json, json_normalize

Comments

@WillAyd
Member

WillAyd commented Jan 8, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=[5, 6, 7, 8])
df.to_json('test.json', orient='table')
pd.read_json('test.json', orient='table')

KeyError: '[5 6 7 8] not in index'

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 16d0262
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0.dev0+79.g16d026212
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.0
IPython: 6.2.1
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.1
openpyxl: 2.5.0b1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.1.13
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.2
fastparquet: 0.1.3
pandas_gbq: None
pandas_datareader: None

@WillAyd WillAyd changed the title read_json and table='orient' Fails With Numeric Column Names read_json and table='orient' Fails With Numeric Column Names Jan 8, 2018
@TomAugspurger TomAugspurger changed the title read_json and table='orient' Fails With Numeric Column Names read_json and orient='table' Fails With Numeric Column Names Jan 8, 2018
@TomAugspurger
Contributor

I thought that https://frictionlessdata.io/specs/table-schema/ required the field names to be strings, but that may not be the case.

@TomAugspurger TomAugspurger added the IO JSON read_json, to_json, json_normalize label Jan 8, 2018
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Jan 8, 2018
@robmarkcole
Contributor

There is an error with string column/index names too:

import pandas as pd

df = pd.DataFrame([['a', 'b'], ['c', 'd']],
                  index=['row 1', 'row 2'],
                  columns=['col 1', 'col 2'])

json_str = df.to_json(orient='table')
pd.read_json(json_str, orient='table')
...
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.

This issue should be renamed to read_json and orient='table' Fails

@WillAyd
Member Author

WillAyd commented Mar 10, 2018

@robmarkcole what version are you using? Your example ran fine for me on master.

INSTALLED VERSIONS

commit: 28e7a9498457e8cccc105a6261958197889325fa
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0.dev0+487.g28e7a9498
pytest: 3.4.1
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.0
IPython: 6.2.1
sphinx: 1.7.0
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.2
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None

@robmarkcole
Contributor

@WillAyd error on 0.22.0. No error on 0.23.0.dev0

@WillAyd
Member Author

WillAyd commented Mar 11, 2018

General support for read_json with orient='table' was only just added for the v0.23 release, so it makes sense that it doesn't work on 0.22. See #19039

@MichaMucha

Hi!
Captain Obvious here, just wanted to say I'm getting this in 0.23.0 as well

import pandas as pd
breaking_case = pd.DataFrame({
    1: [1,2], 
    2: [3,4]}
)
pd.read_json(breaking_case.to_json(orient='table'), orient='table')

KeyError: '[1 2] not in index'

@WillAyd
Member Author

WillAyd commented May 17, 2018

@MichaMucha that's right, this was never implemented, though I'm not sure it's valid JSON either. If you're interested, an investigation of the schema linked above to confirm or deny that is welcome!

@MichaMucha

MichaMucha commented May 19, 2018

Thanks for the quick reply!

I read through the page and it seems that:

  • a descriptor specifying columns is an "ordered dict" equivalent
  • order of column descriptions in that dict implies column order
  • each column is described by a few properties, one of those is name.
  • technically speaking, { "name" : 1 } is valid JSON

I would reason that you can have numerical column names.

direct quotes from the spec (a "field" is a column in our case):
A Table Schema is represented by a descriptor. The descriptor MUST be a JSON object (JSON is defined in RFC 4627).
It MUST contain a property fields. fields MUST be an array where each entry in the array is a field descriptor (as defined below). The order of elements in fields array MUST be the order of fields in the CSV file.
A field descriptor MUST be a JSON object that describes a single field.

Let me know if there is anything more I can do to help.

@WillAyd
Member Author

WillAyd commented May 20, 2018

Thanks for the review! The table implementation is located in pandas/pandas/io/json/table_schema.py so if you poke around there you should see where this could be implemented. Assuming this is your first time, also be sure to read through the contributing guide:

https://pandas.pydata.org/pandas-docs/stable/contributing.html

Hope that helps but let me know if you have any questions

@MichaMucha

Thank you! Sorry it took me a while to find the time.
Thanks for the links! I looked at the table implementation. Turns out table_schema.py is not to blame!
The assignment happens at line 96 and leaves the type intact.

I dug a little deeper and it seems that this guy over here -
pandas/pandas/io/json/json.py:214 - JSONTableWriter._write
wants to write data in a row-oriented fashion.
Every row becomes a dict, this makes every column a key, and JSON needs keys to be strings.
When you read it back, you get strings obviously.
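
The key coercion described here can be reproduced with the standard-library json module alone, without pandas. This is only a minimal sketch; the names `row` and `schema_field` are illustrative and do not come from the pandas source.

```python
import json

# One DataFrame row written row-oriented: each column label becomes a dict key.
row = {5: 1, 6: 2, 7: 3, 8: 4}

# json.dumps turns the integer keys into strings, because JSON object keys
# must be strings.
encoded = json.dumps(row)

# After the round trip the keys come back as str, not int.
decoded = json.loads(encoded)
decoded_keys = list(decoded)

# A label stored as a JSON *value* (as in a field descriptor) keeps its type.
schema_field = json.loads(json.dumps({"name": 5, "type": "integer"}))
```

So the schema can still carry the integer label, while the row dicts only carry its string form, which would be consistent with the KeyError reported above.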

One workaround I can think of is to match field['name'] to the stringified columns found in data here, and cast them to the type of field['name'].
This will cause ambiguity trouble if you have a data frame with a column "1" and 1 for example.
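
A rough sketch of that workaround in plain Python, assuming the schema fields and the row dicts have already been parsed from JSON. `restore_column_labels` is a hypothetical helper, not a pandas function, and it chooses to raise on the ambiguous case rather than guess.

```python
def restore_column_labels(fields, rows):
    """Map stringified row keys back to the labels recorded in the schema.

    Hypothetical helper: `fields` is the schema's list of field descriptors,
    `rows` is the list of row dicts read back from JSON.
    """
    by_string = {}
    for field in fields:
        label = field["name"]
        key = str(label)
        if key in by_string:
            # e.g. the labels "1" and 1 both stringify to "1"
            raise ValueError(
                f"ambiguous column labels: {by_string[key]!r} and {label!r}"
            )
        by_string[key] = label
    # Keys with no matching field are left untouched.
    return [
        {by_string.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


fields = [{"name": 1, "type": "integer"}, {"name": 2, "type": "integer"}]
rows = [{"1": 1, "2": 3}, {"1": 2, "2": 4}]
restored = restore_column_labels(fields, rows)
```

Here `restored` has integer keys again, while a schema containing both "1" and 1 is rejected instead of being silently merged.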

Another solution I can imagine is an error asking the user to provide string columns.

Let me know what you think.
Thanks again for the contributing guide!

@WillAyd
Member Author

WillAyd commented Jun 1, 2018

@MichaMucha IIUC that supports the argument that numeric column names should not be allowed, given the column names are the keys in the JSON table schema and JSON keys need to be strings.

If that's the case and you are looking to contribute, then I'd suggest raising a more descriptive ValueError when trying to write to the JSON table schema using non-string column names.
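
Such a check might look roughly like the sketch below. The function name `check_string_labels` and the message wording are assumptions for illustration only; this is not pandas code, and where the check would live in the writer is left open.

```python
def check_string_labels(labels):
    """Raise a descriptive error if any column label is not a string.

    Hypothetical validation step; name and message are illustrative.
    """
    bad = [label for label in labels if not isinstance(label, str)]
    if bad:
        raise ValueError(
            "orient='table' requires string column names; "
            f"got non-string labels: {bad!r}"
        )


check_string_labels(["col 1", "col 2"])  # string labels pass silently

try:
    check_string_labels([5, 6, 7, 8])
except ValueError as exc:
    message = str(exc)
```

Failing at write time means the user sees the cause directly, instead of a KeyError when reading the file back.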

@WillAyd WillAyd changed the title read_json and orient='table' Fails With Numeric Column Names Raise ValueError for read_json and orient='table' With Numeric Column Names Jun 18, 2018
@albertvillanova
Contributor

albertvillanova commented Feb 20, 2019

@WillAyd I think the problem is with the JSON serialization of the DataFrame (the 'data' for orient='table') and not with 'schema' (TableSchema spec).

IIUC, the aim of orient='table' is to support round-trip JSON serialization and deserialization of pandas objects. Since a pandas DataFrame can have non-string column names (indeed, that is the default if column names are not passed explicitly at instantiation), the column names SHOULD NOT be used as keys in the JSON object (the JSON spec requires keys to be strings).

It is also the case for index names: they can be non-strings. Therefore, index names should not be used as keys in the JSON object.
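
One way to picture a serialization that avoids labels as keys is to write each row as an array and keep the labels only as values in the schema. The layout and both function names below are hypothetical, a sketch of the idea rather than the format pandas writes or a proposal from this thread.

```python
import json


def to_table_json_with_arrays(labels, rows):
    """Serialise rows as arrays so labels never become JSON object keys.

    Hypothetical layout, not what to_json(orient='table') produces.
    """
    payload = {
        "schema": {"fields": [{"name": label} for label in labels]},
        "data": [list(row) for row in rows],
    }
    return json.dumps(payload)


def from_table_json_with_arrays(text):
    """Read the labels back from the schema and the rows from the arrays."""
    payload = json.loads(text)
    labels = [field["name"] for field in payload["schema"]["fields"]]
    return labels, payload["data"]


text = to_table_json_with_arrays([5, 6, 7, 8], [[1, 2, 3, 4]])
labels, data = from_table_json_with_arrays(text)
```

Because the labels travel as JSON values, integer labels survive the round trip, and column order is carried by the order of the `fields` array.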

@WillAyd
Member Author

WillAyd commented Feb 20, 2019

#19129 should already cover that; it just needs a community PR.

@albertvillanova
Contributor

Yes, but I was suggesting not just "raising a more descriptive ValueError" (sic), but changing the implementation of the JSON serialization for orient='table'.

@mroeschke mroeschke added Enhancement Error Reporting Incorrect or improved errors from pandas labels May 8, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@IsaacG

IsaacG commented Jul 12, 2023

take

@chandra-teajunkie

Hi,
take
I would like to try working on this if this is still up.

@IsaacG

IsaacG commented Feb 17, 2025

release

@chandra-teajunkie

chandra-teajunkie commented Feb 17, 2025

Hi Isaac,
I am unsure about what the comment means.
Sorry if it's something obvious.

@IsaacG

IsaacG commented Feb 20, 2025

I am not working on this issue and am happy to no longer be assigned to it.

I cannot comment on the importance/relevance/value of this issue at this time.

@chandra-teajunkie

got it.
thank you for the response.

10 participants