ENH: simple, compact and reversible JSON interface #53252
Comments
Does It has a schema like so |
Thank you Thomas for your answer. In principle, yes, this option takes the notion of type into account. But it is very limited (see the examples added in the Notebook):
The proposal addresses these limitations without complex code. |
In general, I think we should only have 1 "table" format for pandas in Note that some of the issues are bugs, while others (ExtensionDtypes not being handled properly) are issues with the format itself. Personally, I don't think the size issue is a big one, since if the output size matters to you, you should probably pick a binary format like parquet anyway. Can the existing format be adapted in a way that fixes the type and round-tripping issues? |
Hello Thomas, Thank you for that answer. I will add two additional remarks:
I think that the problem cannot be limited to bug fixes and that a clear strategy must be defined for the JSON interface, in particular given the gradual abandonment in open-data solutions of the obsolete CSV format in favor of a JSON format. As stated, the proposed solution addresses several shortcomings of the current interface and could simply fit into the pandas environment (the other option would be to consider that the JSON interface is a peripheral function of pandas and can remain external to pandas) regardless of the If the Notebook is not sufficient to assess the interest of the proposed solution, do not hesitate to challenge me! Note: for the needs of open data and sharing with the widest audience, the parquet format cannot be retained (but I agree on the value of this format where performance and size matter). |
As far as I can tell, JSON NTV is not in any form a standardised JSON format, there is no community movement behind it and @loco-philippe is the sole author of https://github.com/loco-philippe/NTV, which is where the specification seems to live. I believe that pandas (and geopandas, which is where I came from to this issue) should try to follow either de facto or de jure standards and not opt for a file format that does not have any community support at this moment. This can obviously change in the future, at which point this PR should be revisited. That does not mean that JSON NTV is a wrong file format or that there's anything wrong with it, but starting its adoption by implementing it directly in pandas JSON IO feels wrong. |
Hello Martin, As indicated in the issue (and detailed in the attached Notebook), the json interface is not reversible ( The proposal made answers this problem (the example at the beginning of Notebook simply and clearly illustrates the interest of the proposal). However, your answer is focused on the JSON-NTV format and does not address the identified problem, which can be interpreted in two different ways:
Regarding the underlying JSON-NTV format, its impact is quite low for tabular data (it is limited to adding the type in the field name). To conclude, I therefore remain interested in:
|
IMHO, there is nothing wrong with proposing a new value of See also related discussion in #4889 |
Thank you Irv for your support ! I agree with the proposal to set a new value for Furthermore, what is the workflow for validating this enhancement :
|
One thing that would be a concern is that we'd create a dependency between your NTV library and pandas, and I'm not sure if the core team would be OK with that. My suggestion is that you attend the pandas developer meeting on June 14 at 1800 UTC (see https://pandas.pydata.org/docs/development/community.html#community-meeting for info) to bring this idea up and we can get more feedback there. |
Yes, I understand the dependency problem, especially since this NTV library is not a recognized standard. Your suggestion is a very good idea; unfortunately, I will not be available on June 14th. Would it be possible to attend the meeting on the 28th instead? I also propose to share, before the meeting, a note on the JSON interface presenting the general problem (current situation, defects, limits), the key points (consideration of non-pandas types, JSON exchange format, JSON / pandas object conversion) and the various possible implementation options (with/without dependency, scope limited or not). |
You could consider doing a PDEP that would gain more visibility. See https://pandas.pydata.org/pdeps/0001-purpose-and-guidelines.html |
Thanks for the advice! A first draft will be submitted before June 18. |
The first draft has been submitted (PR #53714) |
Following the PDEP-12 proposal #53714, do I need to prepare anything else? |
I need to take a look at this and provide some feedback. Thanks for the reminder |
I made two additions to the attached notebook:
I am preparing an issue equivalent to this one for |
A FAQ has been added to the PDEP document |
Today, the Thus, if the interface must comply with the Table-schema specification, it will be necessary to implement a solution similar to the one proposed (this has moreover been partially achieved with the notion of |
The JSON-NTV format and the NTV structure are now included in the IETF database as an Internet-Draft. Comments are welcome! |
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
The data type is not explicitly taken into account in the current JSON interface.
To work around this problem, a data schema must be associated with the JSON file.
Even so, the current JSON interface is not reversible.
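The type loss described above can be illustrated without pandas. The following is a minimal sketch using only the Python standard library; the table contents and column names are made up for illustration, and it shows the general JSON limitation rather than the exact behaviour of the pandas interface.

```python
import json
from datetime import date

# A small column-oriented table with a date column and an integer column
# (illustrative data, not taken from the Notebook).
table = {"when": [date(2023, 5, 16), date(2023, 6, 14)], "count": [1, 2]}

# json cannot serialize date objects directly, so they are written as ISO strings.
text = json.dumps(table, default=lambda value: value.isoformat())

# Reading the text back yields plain strings: the date type is gone.
restored = json.loads(text)

print(type(table["when"][0]).__name__)     # date
print(type(restored["when"][0]).__name__)  # str
print(restored == table)                   # False
```

Without extra information (a schema, or a type carried in the data itself), the reader has no way to know that the `when` column held dates.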
Feature Description
proposal
To have a simple, compact and reversible solution, I propose to use the JSON-NTV format (Named and Typed Value) - which integrates the notion of type - and its JSON-TAB variation for tabular data.
This solution makes it possible to include a large number of types (not necessarily pandas dtypes).
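To make the idea concrete, here is a minimal sketch of a reversible, typed encoding in which the type is attached to the field name. It is not the NTV library's API: the `::` separator, the `CODECS` registry and the `dumps_typed` / `loads_typed` helpers are assumptions made for illustration, and the sketch only handles non-empty, homogeneous columns.

```python
import json
from datetime import date

SEP = "::"  # assumed separator between column name and type name

# Hypothetical registry: type name -> (Python type, encoder, decoder)
CODECS = {
    "date": (date, date.isoformat, date.fromisoformat),
    "int": (int, int, int),
    "string": (str, str, str),
}

def type_name_of(values):
    """Return the registry name matching every value of a homogeneous column."""
    for name, (py_type, _, _) in CODECS.items():
        if all(type(v) is py_type for v in values):
            return name
    raise TypeError("unsupported or mixed column type")

def dumps_typed(columns):
    """Serialize {name: values} to JSON, storing the type in the field name."""
    payload = {}
    for name, values in columns.items():
        tname = type_name_of(values)
        encode = CODECS[tname][1]
        payload[f"{name}{SEP}{tname}"] = [encode(v) for v in values]
    return json.dumps(payload)

def loads_typed(text):
    """Rebuild {name: values}, using the type found in each field name."""
    columns = {}
    for key, values in json.loads(text).items():
        name, tname = key.rsplit(SEP, 1)
        decode = CODECS[tname][2]
        columns[name] = [decode(v) for v in values]
    return columns

table = {"when": [date(2023, 5, 16), date(2023, 6, 14)], "count": [1, 2]}
text = dumps_typed(table)
print(text)
print(loads_typed(text) == table)  # True
```

Because the type is stated once per column, the round trip restores the original Python objects while the JSON text stays close to a plain column-oriented dump.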
examples
Several examples are provided in the linked NoteBook
references
Alternative Solutions
The alternative solution is to describe each piece of data in the form {'type': xxx, 'value': xxx}, which makes the format significantly heavier.
I don't know of an alternative JSON format that integrates the notion of type (please let me know if you know of one!)
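The size difference between the two approaches can be checked with a short sketch. The column name, the sample values and the `::` separator in the field name are assumptions made for illustration only.

```python
import json

values = ["2023-05-16", "2023-06-14", "2023-06-28"]

# Alternative described above: every value carries its own type.
per_value = {"when": [{"type": "date", "value": v} for v in values]}

# Type stated once, in the field name (the "::" separator is an assumption).
per_column = {"when::date": values}

verbose = json.dumps(per_value)
compact = json.dumps(per_column)

# The per-value form repeats the "type" and "value" keys for every row,
# so its overhead grows with the number of rows.
print(len(verbose), len(compact))
```

The overhead of the per-value form is proportional to the number of values, whereas the per-column form adds a fixed cost per field.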
Additional Context
-> NTV repository : https://github.com/loco-philippe/NTV