ENH: simple, compact and reversible JSON interface #53252

loco-philippe · 2023-05-16T07:41:43Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

The data type is not explicitly taken into account in the current JSON interface.

To work around this problem, a data schema must be associated with the json file.

Nevertheless, the current Json interface is not reversible.
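
To make the reversibility problem concrete, here is a minimal sketch (not taken from the linked notebook). It assumes a one-column DataFrame with a categorical dtype and the default `orient`; the column name `label` is only illustrative. The values survive the round trip but the dtype does not.

```python
from io import StringIO

import pandas as pd

# A categorical column: the dtype carries information the plain JSON text lacks.
df = pd.DataFrame({"label": pd.Categorical(["a", "b", "a"])})

# Serialize with the default orient, then read the text back.
restored = pd.read_json(StringIO(df.to_json()))

print(df.dtypes["label"])        # dtype before the round trip
print(restored.dtypes["label"])  # dtype after the round trip
```

The two printed dtypes differ, so `to_json` followed by `read_json` does not return the initial object here.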

Feature Description

proposal

To have a simple, compact and reversible solution, I propose to use the JSON-NTV format (Named and Typed Value) - which integrates the notion of type - and its JSON-TAB variation for tabular data.
This solution allows a large number of types to be included (not necessarily pandas dtypes).

examples

Several examples are provided in the linked NoteBook

references

Alternative Solutions

The alternative solution is to describe each piece of data in the form {'type': xxx, 'value': xxx}, which significantly weighs down the format used.
I don't know of an alternative Json format that integrates the notion of type (please let me know if you know any!)
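
A rough way to see the overhead of the per-value form is to serialize the same data both ways and compare lengths. This is only a sketch: the `dates::date` key is an illustrative stand-in for the idea of carrying the type in the field name, not a validated JSON-NTV payload, and the sample values are made up.

```python
import json

values = ["2023-01-01", "2023-01-02", "2023-01-03"]

# Per-value form: every item repeats its own type.
verbose = {"dates": [{"type": "date", "value": v} for v in values]}

# Hypothetical compact form: the type is stated once, in the field name.
compact = {"dates::date": values}

verbose_len = len(json.dumps(verbose))
compact_len = len(json.dumps(compact))
print(verbose_len, compact_len)
```

The gap grows with the number of values, since the per-value form repeats the type for each one.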

Additional Context

-> NTV repository : https://github.com/loco-philippe/NTV

lithomas1 · 2023-05-16T18:19:59Z

Does orient="table" not do what you are proposing already?

It has a schema like so
https://specs.frictionlessdata.io/table-schema/#descriptor
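
For reference, a small sketch of what `orient="table"` emits, assuming a DataFrame with one integer and one float column (the column names are arbitrary). The payload has a `schema` part describing each field and a `data` part with the records.

```python
import json

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# orient="table" wraps the records together with a Table Schema descriptor.
payload = json.loads(df.to_json(orient="table"))

for field in payload["schema"]["fields"]:
    print(field["name"], field["type"])
```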

loco-philippe · 2023-05-17T09:56:43Z

Thank you Thomas for your answer.

In principle, yes, this option takes into account the notion of type.

But this is very limited (see examples added in the Notebook) :

Types and Json interface
- the only way to keep the types in the json interface is to use the orient='table' option
- only few types are allowed in json-table interface : int64, float64, bool, datetime64, timedelta64, categorical
- allowed types are not always kept in json-table interface
- data with 'object' dtype is kept only if data is string
- with categorical dtype, the underlying dtype is not included in json interface
Data compactness
- json-table interface is not compact (in this example the size is double or triple the size of the compact format)
Reversibility
- Interface is reversible only with json dtype
External types
- the interface does not accept external types
- to integrate external types, it is necessary to first create ExtensionArray and ExtensionDtype objects

The proposal addresses these limitations without complex code.
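
Some of the type limitations listed above can be inspected without writing any JSON, using `pandas.io.json.build_table_schema`. The sketch below is an illustration with made-up column names, not an excerpt from the notebook: it assumes an integer column, a column of `datetime.date` objects (stored with `object` dtype) and a categorical column built from integers.

```python
import datetime

import pandas as pd
from pandas.io.json import build_table_schema

df = pd.DataFrame(
    {
        "n": [1, 2],
        "day": [datetime.date(2023, 5, 16), datetime.date(2023, 5, 17)],
        "cat": pd.Categorical([10, 20]),
    }
)

# Build only the schema part that orient="table" would emit.
schema = build_table_schema(df, index=False, version=False)
types = {field["name"]: field["type"] for field in schema["fields"]}
print(types)
```

The `day` column is not described with the Table Schema `date` type, and the `cat` field records the category values but not the dtype of the underlying data.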

lithomas1 · 2023-05-18T23:35:36Z

In general, I think we should only have 1 "table" format for pandas in read_json/to_json.
There is also the issue of backwards compatibility if we do change the format.

Note that some of the issues are bugs, while others (ExtensionDtypes not being handled properly) are issues with the format itself.

Personally, I don't think the size issue is a big issue, since if the output size matters to you, you should probably pick a binary format like parquet anyway.

Can the existing format be adapted in a way that fixes the type issues/issues with roundtripping?

loco-philippe · 2023-05-19T22:29:39Z

Hello Thomas,

Thank you for that answer.

I will add two additional remarks:

- the types defined in Table Schema are only partially taken into account (examples of types not taken into account in the interface: string-uri, array, date, time, year, geopoint)
- the read_json() interface works with the following data: {'simple': [1,2,3] } (contrary to what is indicated in the documentation), but it is impossible to recreate this JSON with to_json(), basic as it is
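
The second remark can be checked with a short sketch. It reads the `{'simple': [1, 2, 3]}` document from the remark above, then tries every documented `orient` value of `to_json` and collects those that reproduce the original document. The variable names are only illustrative.

```python
from io import StringIO
import json

import pandas as pd

source = {"simple": [1, 2, 3]}

# Reading the plain "column name -> list of values" form works.
df = pd.read_json(StringIO(json.dumps(source)))

# Try each orient and keep those whose output equals the source document.
orients = ["split", "records", "index", "columns", "values", "table"]
outputs = {o: json.loads(df.to_json(orient=o)) for o in orients}
matching = [o for o, out in outputs.items() if out == source]
print(matching)
```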

I think that the problem cannot be limited to bug fixes and that a clear strategy must be defined for the Json interface in particular with the gradual abandonment in open-data solutions of the obsolete CSV format in favor of a Json format.

As stated, the proposed solution addresses several shortcomings of the current interface and could simply fit into the pandas environment (the other option would be to consider that the Json interface is a peripheral function of pandas and can remain external to pandas) regardless of the orient='table' option.

If the Notebook is not sufficient to assess the interest of the proposed solution, do not hesitate to challenge me !

Note: For the needs of open-data and sharing with the greatest number, the parquet format cannot be retained (but I agree on the interest of this format for performance and size requirements)

martinfleis · 2023-05-26T23:43:38Z

As far as I can tell, JSON NTV is not in any form a standardised JSON format, there is no community movement behind it and @loco-philippe is the sole author of https://github.com/loco-philippe/NTV, which is where the specification seems to live.

I believe that pandas (and geopandas, which is where I came from to this issue) should try to follow either de facto or de jure standards and not opt for a file format that does not have any community support at this moment. This can obviously change in the future and that is when this proposal should be revisited.

That does not mean that the JSON NTV is a wrong file format or that there's anything wrong with it but starting with its adoption by implementing it directly in pandas JSON IO feels wrong.

loco-philippe · 2023-05-28T15:25:34Z

Hello Martin,

As indicated in the issue (and detailed in the attached Notebook), the json interface is not reversible (to_json then read_json does not always return the initial object) and several shortcomings and bugs are present. The main cause of this problem is that the data type is not taken into account in the JSON format (or very partially with the orient='table' option).

The proposal made answers this problem (the example at the beginning of Notebook simply and clearly illustrates the interest of the proposal).

However, your answer is focused on the JSON-NTV format and does not address the identified problem, which can be interpreted in two different ways:

- either the shortcomings of the current interface are not considered critical (non-priority improvements)
- or the json interface is considered a peripheral function on which it is not important to work (outsourcing possible)

Regarding the underlying JSON-NTV format, its impact is quite low for tabular data (it is limited to adding the type in the field name).
Nevertheless, your remark is relevant: the JSON-NTV format is a shared, documented, supported and implemented format, but its community support is limited for the moment and is only waiting to grow!

To conclude, I therefore remain interested in:

- understanding what strategy or action plan is envisaged to improve the JSON interface
- as indicated in the issue, identifying the possible alternative solutions

Dr-Irv · 2023-06-05T14:43:18Z

IMHO, there is nothing wrong with proposing a new value of orient in to_json(). I did this with to_dict() and orient="tight" where I wanted a concise dictionary representation that preserved the names of the indices. But we most likely wouldn't make it the default value.
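
As an illustration of the `orient="tight"` precedent mentioned here, a short sketch (assuming pandas 1.4 or later, with made-up index and column names) showing that the tight dictionary keeps the index names and can be read back with `DataFrame.from_dict`:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("a", 1), ("b", 2)], names=["letter", "number"]
)
df = pd.DataFrame({"value": [10, 20]}, index=index)

# "tight" stores index, columns, data, index_names and column_names.
tight = df.to_dict(orient="tight")
print(sorted(tight))

# The same orient is accepted by from_dict, so the frame can be rebuilt.
restored = pd.DataFrame.from_dict(tight, orient="tight")
print(restored.index.names)
```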

See also related discussion in #4889

loco-philippe · 2023-06-05T22:06:27Z

Thank you Irv for your support !

I agree with the proposal to set a new value for orient (even if it's not the default...).
For example "typed" (to indicate that the type is included) or "NTV" to indicate the reference to the format or another proposal (?)

Furthermore, what is the workflow for validating this enhancement :

- does the Needs Discussion label first need to be removed ?
- are additional elements necessary for the core team to decide ?
- does the proposal need to be clarified or supplemented (for example, the multi-index is not yet addressed) ?
- do I assign the issue to myself now via the take command ?

Dr-Irv · 2023-06-05T22:27:28Z

One thing that would be a concern is that we'd create a dependency between your NTV library and pandas, and I'm not sure if the core team would be OK with that.

My suggestion is that you attend the pandas developer meeting on June 14 at 1800 UTC (see https://pandas.pydata.org/docs/development/community.html#community-meeting for info) to bring this idea up and we can get more feedback there.

loco-philippe · 2023-06-06T09:19:53Z

Yes, I understand the dependency problem especially since this NTV library is not a recognized standard.

Your suggestion is a very good idea, unfortunately I will not be able to be available on June 14th. Would it be inconvenient to attend the meeting on the 28th instead?

I also propose to share before the meeting a note on the json interface presenting the general problem (current situation, defects, limits), the key points (consideration of non-pandas types, json exchange format, json / pandas object conversion) and the various possible implementation options (with/without dependency, scope limited or not).

Dr-Irv · 2023-06-06T12:05:25Z

> I also propose to share before the meeting a note on the json interface presenting the general problem (current situation, defects, limits), the key points (consideration of non-pandas types, json exchange format, json / pandas object conversion) and the various possible implementation options (with/without dependency, scope limited or not).

You could consider doing a PDEP that would gain more visibility. See https://pandas.pydata.org/pdeps/0001-purpose-and-guidelines.html

loco-philippe · 2023-06-07T08:44:10Z

Thanks for the advice !

A first project will be submitted before June 18.

loco-philippe · 2023-06-18T21:12:43Z

The first project is submitted (PR #53714)

loco-philippe · 2023-06-27T09:44:39Z

Following the PDEP-12 proposal #53714, do I have to prepare something else ?

Dr-Irv · 2023-06-27T13:30:53Z

> Following the PDEP-12 proposal #53714, do I have to prepare something else ?

I need to take a look at this and provide some feedback. Thanks for the reminder

loco-philippe · 2023-07-11T21:39:55Z

I added two complements to the attached notebook:

- example of difference in behavior between to_csv and to_json
- example of JSON-TAB format for multidimensional data and compatibility between Xarray-DataArray structure and pandas-DataFrame

I am preparing an issue equivalent to this one for Xarray, followed in a second step by an issue for a reversible conversion between DataFrame and DataArray (transformation of a tabular structure into a multidimensional structure).

loco-philippe · 2023-07-22T21:27:44Z

FAQ is added to the PDEP document

loco-philippe · 2023-08-04T13:14:11Z

Today, the orient='table' interface only partially meets the Table-Schema specification (only 5 data types out of 20 are taken into account - table here) and without addition to the notion of dtype, it will not be possible to respect the specification.

Thus, if the interface must comply with the Table-schema specification, it will be necessary to implement a solution similar to the one proposed (this has moreover been partially achieved with the notion of extDtype found in the interface for several formats).

loco-philippe · 2023-08-23T21:36:22Z

The JSON-NTV format and the NTV structure are now included in the IETF data-base as an Internet-Draft.

Comments are welcome !

loco-philippe added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 16, 2023

lithomas1 added IO JSON read_json, to_json, json_normalize Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 18, 2023

loco-philippe mentioned this issue May 26, 2023

ENH: simple, compact and reversible JSON interface geopandas/geopandas#2904

Closed

loco-philippe mentioned this issue Jun 18, 2023

PDEP-12: compact-and-reversible-JSON-interface.md #53714

Merged

1 task

loco-philippe mentioned this issue Sep 6, 2023

ENH: Extending the orient="table" option to all Table Schema types #55038

Open

3 tasks
