
ENH: simple, compact and reversible JSON interface #53252

Open
1 of 3 tasks
loco-philippe opened this issue May 16, 2023 · 19 comments
Labels
Enhancement IO JSON read_json, to_json, json_normalize Needs Discussion Requires discussion from core team before further action

Comments

@loco-philippe
Contributor

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The data type is not explicitly taken into account in the current JSON interface.

To work around this problem, a data schema must be associated with the JSON file.

Even so, the current JSON interface is not reversible.
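
To make the reversibility point concrete, here is a minimal sketch (an illustration added for readers, not taken from the linked notebook) that writes a small frame with `to_json` and reads it back with `read_json`. The column name `count` and the `int32` dtype are arbitrary choices for the example.

```python
import io

import pandas as pd

# A column with a non-default integer width (illustrative choice).
original = pd.DataFrame({"count": pd.Series([1, 2, 3], dtype="int32")})

# Default orient: the JSON text carries the values but no dtype information.
payload = original.to_json()

# Reading back relies on type inference, so the dtype is guessed, not restored.
restored = pd.read_json(io.StringIO(payload))

print("before:", original["count"].dtype)
print("after: ", restored["count"].dtype)
```

The values survive the round trip, but the dtype of the restored column is whatever the reader infers rather than the one that was written.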

Feature Description

proposal

To have a simple, compact and reversible solution, I propose to use the JSON-NTV format (Named and Typed Value), which integrates the notion of type, and its JSON-TAB variation for tabular data.
This solution allows a large number of types to be included (not necessarily pandas dtypes).

examples

Several examples are provided in the linked NoteBook

references

Alternative Solutions

The alternative solution is to describe each piece of data in the form {'type': xxx, 'value': xxx}, which significantly weighs down the format used.
I don't know of an alternative JSON format that integrates the notion of type (please let me know if you know of any!)
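
A rough sketch of the size argument, using only the standard library. The `name::type` key used for the compact variant is a hypothetical convention chosen for this illustration; it is not claimed to be the exact JSON-NTV syntax.

```python
import json

values = [1.5, 2.5, 3.5]

# Per-value description: every element repeats its own type label.
verbose = json.dumps(
    {"price": [{"type": "float64", "value": v} for v in values]}
)

# Type stated once, in the field name (hypothetical "name::type" convention).
compact = json.dumps({"price::float64": values})

print(len(verbose), "characters with per-value types")
print(len(compact), "characters with the type in the field name")
```

The overhead of the per-value form grows with the number of values, while the field-name form pays for the type only once per column.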

Additional Context

-> NTV repository : https://github.com/loco-philippe/NTV

@loco-philippe loco-philippe added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 16, 2023
@lithomas1
Member

Does orient="table" not do what you are proposing already?

It has a schema like so
https://specs.frictionlessdata.io/table-schema/#descriptor
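
For readers unfamiliar with this option, a minimal sketch of what `orient="table"` emits. The column names are illustrative; the example only inspects the schema part of the output.

```python
import json

import pandas as pd

df = pd.DataFrame({"n": [1, 2], "x": [0.5, 1.5], "flag": [True, False]})

# orient="table" produces a document with a "schema" and a "data" member.
document = json.loads(df.to_json(orient="table"))

# Each schema field records a name and a Table Schema type.
field_types = {f["name"]: f["type"] for f in document["schema"]["fields"]}
print(field_types)
```

The schema records a Table Schema type per field, which is coarser than the pandas dtype it was derived from.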

@loco-philippe
Contributor Author

Thank you Thomas for your answer.

In principle, yes, this option takes into account the notion of type.

But this is very limited (see examples added in the Notebook) :

  • Types and Json interface
    • the only way to keep the types in the JSON interface is to use the orient='table' option
    • only a few types are allowed in the json-table interface: int64, float64, bool, datetime64, timedelta64, categorical
    • allowed types are not always kept in the json-table interface
    • data with 'object' dtype is kept only if the data is a string
    • with categorical dtype, the underlying dtype is not included in the JSON interface
  • Data compactness
    • the json-table interface is not compact (in this example the size is double or triple the size of the compact format)
  • Reversibility
    • Interface is reversible only with json dtype
  • External types
    • the interface does not accept external types
    • to integrate external types, it is necessary to first create ExtensionArray and ExtensionDtype objects

The proposal addresses these limitations without complex code.

@lithomas1
Member

In general, I think we should only have 1 "table" format for pandas in read_json/to_json.
There is also the issue of backwards compatibility if we do change the format.

Note that some of the issues are bugs, while others (ExtensionDtypes not being handled properly) are issues with the format itself.

Personally, I don't think the size issue is a big issue since if the output size matters to you, you should probably pick a binary format like parquet anyway.

Can the existing format be adapted in a way that fixes the type issues/issues with roundtripping?

@lithomas1 lithomas1 added IO JSON read_json, to_json, json_normalize Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 18, 2023
@loco-philippe
Contributor Author

Hello Thomas,

Thank you for that answer.

I will add two additional remarks:

  • the types defined in Table Schema are only partially taken into account (examples of types not taken into account in the interface: string-uri, array, date, time, year, geopoint).
  • the read_json() interface works with the following data: {'simple': [1,2,3]} (contrary to what is indicated in the documentation), but it is impossible to recreate this JSON with to_json(), even though it is very basic.
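
The second remark can be sketched as follows. The detour through `to_dict(orient="list")` and `json.dumps` is a workaround added here for illustration, not something proposed in the issue; it is shown because, as far as I know, none of the `to_json` orient values yields this column-to-list shape directly.

```python
import io
import json

import pandas as pd

text = '{"simple": [1, 2, 3]}'

# read_json accepts the column -> list-of-values shape.
df = pd.read_json(io.StringIO(text))

# Workaround (an assumption of this sketch): rebuild the same shape by going
# through a plain dict instead of DataFrame.to_json.
rebuilt = json.dumps(df.to_dict(orient="list"))

print(rebuilt)
```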

I think the problem cannot be limited to bug fixes: a clear strategy must be defined for the JSON interface, in particular given the gradual abandonment, in open-data solutions, of the obsolete CSV format in favor of JSON.

As stated, the proposed solution addresses several shortcomings of the current interface and could simply fit into the pandas environment (the other option would be to consider that the Json interface is a peripheral function of pandas and can remain external to pandas) regardless of the orient='table' option.

If the Notebook is not sufficient to assess the interest of the proposed solution, do not hesitate to challenge me !

Note: For the needs of open-data and sharing with the greatest number, the parquet format cannot be retained (but I agree on the interest of this format for performance and size requirements)

@martinfleis
Contributor

As far as I can tell, JSON NTV is not in any form a standardised JSON format, there is no community movement behind it and @loco-philippe is the sole author of https://github.com/loco-philippe/NTV, which is where the specification seems to live.

I believe that pandas (and geopandas, which is where I came from to this issue) should try to follow either de facto or de jure standards and not opt in to a file format that does not have any community support at this moment. This can obviously change in the future, and that is when this proposal should be revisited.

That does not mean that the JSON NTV is a wrong file format or that there's anything wrong with it but starting with its adoption by implementing it directly in pandas JSON IO feels wrong.

@loco-philippe
Contributor Author

Hello Martin,

As indicated in the issue (and detailed in the attached Notebook), the json interface is not reversible (to_json then read_json does not always return the initial object) and several shortcomings and bugs are present. The main cause of this problem is that the data type is not taken into account in the JSON format (or very partially with the orient='table' option).

The proposal made answers this problem (the example at the beginning of Notebook simply and clearly illustrates the interest of the proposal).

However, your answer is focused on the JSON-NTV format and does not address the identified problem, which can be interpreted in two different ways:

  • either the shortcomings of the current interface are not considered critical (non-priority improvements)
  • or the JSON interface is considered a peripheral function on which it is not important to work (outsourcing possible)

Regarding the underlying JSON-NTV format, its impact is quite low for tabular data (it is limited to adding the type in the field name).
Nevertheless, your remark is relevant: the JSON-NTV format is shared, documented, supported and implemented, but its community support is limited for the moment and can only grow!
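
As an illustration of "adding the type in the field name", here is a small standard-library sketch. The `name::type` key syntax, the two converter tables and both function names are hypothetical choices for this example, not the NTV library's API or the exact JSON-NTV grammar.

```python
import json
from datetime import date

# Hypothetical converters: how each named type is written to and read from JSON.
ENCODERS = {"date": date.isoformat, "int": int}
DECODERS = {"date": date.fromisoformat, "int": int}


def encode_columns(columns):
    """Serialize {(name, type_name): values} with the type in the field name."""
    return json.dumps(
        {
            f"{name}::{type_name}": [ENCODERS[type_name](v) for v in values]
            for (name, type_name), values in columns.items()
        }
    )


def decode_columns(text):
    """Rebuild {(name, type_name): values} from the typed field names."""
    result = {}
    for key, values in json.loads(text).items():
        name, type_name = key.rsplit("::", 1)
        result[(name, type_name)] = [DECODERS[type_name](v) for v in values]
    return result


columns = {
    ("birth", "date"): [date(1964, 1, 1), date(1985, 2, 5)],
    ("rank", "int"): [1, 2],
}
text = encode_columns(columns)
print(text)
print(decode_columns(text) == columns)
```

Because the type travels with the field name, the decoder does not need to infer anything, which is what makes the round trip reversible in this sketch.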

To conclude, I therefore remain interested in:

  • understanding what strategy or action plan is envisaged to improve the JSON interface
  • as indicated in the issue, identifying the possible alternative solutions

@Dr-Irv
Contributor

Dr-Irv commented Jun 5, 2023

IMHO, there is nothing wrong with proposing a new value of orient in to_json(). I did this with to_dict() and orient="tight" where I wanted a concise dictionary representation that preserved the names of the indices. But we most likely wouldn't make it the default value.

See also related discussion in #4889
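
For context, a minimal sketch of the `to_dict(orient="tight")` precedent mentioned above, assuming a pandas version that provides it (1.4 or later). The frame contents are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [10, 20]},
    index=pd.Index(["a", "b"], name="key"),
)

# "tight" keeps index and column names alongside the data.
tight = df.to_dict(orient="tight")
print(sorted(tight))

# The same orient is accepted by from_dict, so the index name survives.
restored = pd.DataFrame.from_dict(tight, orient="tight")
print(restored.index.name)
```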

@loco-philippe
Contributor Author

Thank you Irv for your support !

I agree with the proposal to set a new value for orient (even if it's not the default...).
For example "typed" (to indicate that the type is included) or "NTV" to indicate the reference to the format or another proposal (?)

Furthermore, what is the workflow for validating this enhancement:

  • does the Needs Discussion label first need to be removed?
  • are additional elements necessary for the core team to decide?
  • does the proposal need to be clarified or supplemented (for example, the multi-index is not yet addressed)?
  • do I assign the issue to myself now via take?

@Dr-Irv
Contributor

Dr-Irv commented Jun 5, 2023

One thing that would be a concern is that we'd create a dependency between your NTV library and pandas, and I'm not sure if the core team would be OK with that.

My suggestion is that you attend the pandas developer meeting on June 14 at 1800 UTC (see https://pandas.pydata.org/docs/development/community.html#community-meeting for info) to bring this idea up and we can get more feedback there.

@loco-philippe
Contributor Author

Yes, I understand the dependency problem especially since this NTV library is not a recognized standard.

Your suggestion is a very good idea, unfortunately I will not be able to be available on June 14th. Would it be inconvenient to attend the meeting on the 28th instead?

I also propose to share before the meeting a note on the json interface presenting the general problem (current situation, defects, limits), the key points (consideration of non-pandas types, json exchange format, json / pandas object conversion) and the various possible implementation options (with/without dependency, scope limited or not).

@Dr-Irv
Contributor

Dr-Irv commented Jun 6, 2023

> I also propose to share before the meeting a note on the json interface presenting the general problem (current situation, defects, limits), the key points (consideration of non-pandas types, json exchange format, json / pandas object conversion) and the various possible implementation options (with/without dependency, scope limited or not).

You could consider doing a PDEP that would gain more visibility. See https://pandas.pydata.org/pdeps/0001-purpose-and-guidelines.html

@loco-philippe
Contributor Author

Thanks for the advice !

A first project will be submitted before June 18.

@loco-philippe
Contributor Author

The first project is submitted (PR #53714)

@loco-philippe
Contributor Author

Following the PDEP-12 proposal #53714, do I have to prepare something else ?

@Dr-Irv
Contributor

Dr-Irv commented Jun 27, 2023

> Following the PDEP-12 proposal #53714, do I have to prepare something else ?

I need to take a look at this and provide some feedback. Thanks for the reminder

@loco-philippe
Contributor Author

I added two complements to the attached notebook:

  • example of difference in behavior between to_csv and to_json
  • example of JSON-TAB format for multidimensional data and compatibility between Xarray-DataArray structure and pandas-DataFrame.

I am preparing an equivalent issue for Xarray, to be followed in a second step by an issue on a reversible conversion between DataFrame and DataArray (transformation of a tabular structure into a multidimensional structure).

@loco-philippe
Contributor Author

An FAQ has been added to the PDEP document.

@loco-philippe
Contributor Author

Today, the orient='table' interface only partially meets the Table Schema specification (only 5 data types out of 20 are taken into account - table here), and without extending the notion of dtype it will not be possible to comply with the specification.

Thus, if the interface must comply with the Table Schema specification, it will be necessary to implement a solution similar to the one proposed (this has, moreover, been partially achieved with the notion of extDtype found in the interface for several formats).

@loco-philippe
Contributor Author

The JSON-NTV format and the NTV structure are now included in the IETF data-base as an Internet-Draft.

Comments are welcome !


4 participants