From 049b7b098587a9b20763d2e074f25fcae16fc547 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Sun, 18 Jun 2023 15:09:53 +0200 Subject: [PATCH 01/22] Create 0007-compact-and-reversible-JSON-interface.md --- ...7-compact-and-reversible-JSON-interface.md | 287 ++++++++++++++++++ 1 file changed, 287 insertions(+) create mode 100644 web/pandas/pdeps/0007-compact-and-reversible-JSON-interface.md diff --git a/web/pandas/pdeps/0007-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0007-compact-and-reversible-JSON-interface.md new file mode 100644 index 0000000000000..82374eca1ac5a --- /dev/null +++ b/web/pandas/pdeps/0007-compact-and-reversible-JSON-interface.md @@ -0,0 +1,287 @@ +# PDEP-7: Compact and reversible JSON interface + +- Created: 16 June 2023 +- Status: Under discussions +- Discussion: [#53252](https://github.com/pandas-dev/pandas/issues/53252) +- Author: [Philippe THOMY](https://github.com/loco-philippe) +- Revision: 1 + + +#### Summary +- [Abstract](./pandas_PDEP.md/#Abstract) + - [Problem description](./pandas_PDEP.md/#Problem-description) + - [Feature Description](./pandas_PDEP.md/#Feature-Description) +- [Scope](./pandas_PDEP.md/#Scope) +- [Motivation](./pandas_PDEP.md/#Motivation) + - [Why is it important to have a compact and reversible JSON interface ?](./pandas_PDEP.md/#Why-is-it-important-to-have-a-compact-and-reversible-JSON-interface-?) + - [Is it relevant to take an extended type into account ?](./pandas_PDEP.md/#Is-it-relevant-to-take-an-extended-type-into-account-?) 
+- [Description](./pandas_PDEP.md/#Description) + - [data typing](./pandas_PDEP.md/#Data-typing) + - [JSON format](./pandas_PDEP.md/#JSON-format) + - [Conversion](./pandas_PDEP.md/#Conversion) +- [Usage and impact](./pandas_PDEP.md/#Usage-and-impact) + - [Usage](./pandas_PDEP.md/#Usage) + - [Compatibility](./pandas_PDEP.md/#Compatibility) + - [Impacts on the pandas framework](./pandas_PDEP.md/#Impacts-on-the-pandas-framework) + - [Risk to do / risk not to do](./pandas_PDEP.md/#Risk-to-do-/-risk-not-to-do) +- [Implementation](./pandas_PDEP.md/#Implementation) + - [Modules](./pandas_PDEP.md/#Modules) + - [Implementation options](./pandas_PDEP.md/#Implementation-options) +- [F.A.Q.](./pandas_PDEP.md/#F.A.Q.) +- [Core team decision](./pandas_PDEP.md/#Core-team-decision) +- [Timeline](./pandas_PDEP.md/#Timeline) +- [PDEP history](./pandas_PDEP.md/#PDEP-history) +------------------------- +## Abstract + +### Problem description +The `dtype` is not explicitely taken into account in the current JSON interface. +To work around this problem, a data schema (e.g. `TableSchema`) must be associated with the JSON file. + +Nevertheless, the current JSON interface is not reversible and has inconsistencies related to the consideration of the `dtype`. + +Some JSON-interface problems are detailed in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#1---Current-Json-interface) + + +### Feature Description +To have a simple, compact and reversible solution, I propose to use the [JSON-NTV format (Named and Typed Value)](https://github.com/loco-philippe/NTV#readme) - which integrates the notion of type - and its JSON-TAB variation for tabular data. + +This solution allows to include a large number of types (not necessarily pandas `dtype`). + +In the example below, a DataFrame with several data types is converted to JSON. +The DataFrame resulting from this JSON is identical to the initial DataFrame (reversibility). 
+With the existing JSON interface, this conversion is not possible. + +*data example* +```python +In [1]: from shapely.geometry import Point + from datetime import date + +In [2]: data = {'index': [100, 200, 300, 400, 500, 600], + 'dates::date': pd.Series([date(1964,1,1), date(1985,2,5), date(2022,1,21), date(1964,1,1), date(1985,2,5), date(2022,1,21)]), + 'value': [10, 10, 20, 20, 30, 30], + 'value32': pd.Series([12, 12, 22, 22, 32, 32], dtype='int32'), + 'res': [10, 20, 30, 10, 20, 30], + 'coord::point': pd.Series([Point(1,2), Point(3,4), Point(5,6), Point(7,8), Point(3,4), Point(5,6)]), + 'names': pd.Series(['john', 'eric', 'judith', 'mila', 'hector', 'maria'], dtype='string'), + 'unique': True } + +In [3]: df = pd.DataFrame(data).set_index('index') + +In [4]: df +Out[4]: + dates::date value value32 res coord::point names unique + index + 100 1964-01-01 10 12 10 POINT (1 2) john True + 200 1985-02-05 10 12 20 POINT (3 4) eric True + 300 2022-01-21 20 22 30 POINT (5 6) judith True + 400 1964-01-01 20 22 10 POINT (7 8) mila True + 500 1985-02-05 30 32 20 POINT (3 4) hector True + 600 2022-01-21 30 32 30 POINT (5 6) maria True +``` + +*JSON representation* + +```python +In [5]: df_json = Ntv.obj(df) + pprint(df_json.to_obj(), compact=True, width=120) +Out[5]: + {':tab': {'coord::point': [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [3.0, 4.0], [5.0, 6.0]], + 'dates::date': ['1964-01-01', '1985-02-05', '2022-01-21', '1964-01-01', '1985-02-05', '2022-01-21'], + 'index': [100, 200, 300, 400, 500, 600], + 'names::string': ['john', 'eric', 'judith', 'mila', 'hector', 'maria'], + 'res': [10, 20, 30, 10, 20, 30], + 'unique': [True, True, True, True, True, True], + 'value': [10, 10, 20, 20, 30, 30], + 'value32::int32': [12, 12, 22, 22, 32, 32]}} +``` + + +*Reversibility* + +```python +In [5]: df_from_json = df_json.to_obj(format='obj') + print('df created from JSON is equal to initial df ? 
', df_from_json.equals(df)) + +Out[5]: df created from JSON is equal to initial df ? True +``` +Several other examples are provided in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#2---Series) + +## Scope +The objective is to make available the proposed JSON interface for any type of data. + +The proposed interface is compatible with existing data. + +## Motivation + +### Why is it important to have a compact and reversible JSON interface ? +- a reversible interface provides an exchange format. +- a textual exchange format facilitates exchanges between platforms (e.g. OpenData) +- a JSON exchange format can be used at API level + +### Is it relevant to take an extended type into account ? +- it avoids the addition of an additional data schema +- it increases the semantic scope of the data processed by pandas +- the use of a complementary type avoids having to modify the pandas data model + + +## Description + +The proposed solution is based on several key points: +- data typing +- JSON format for tabular data +- conversion to and from JSON format + +### Data typing +Data types are defined and managed in the NTV project (name, JSON encoder and decoder). 
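In the example DataFrame and its JSON representation, a column name may carry a type after a `::` separator (`dates::date`, `value32::int32`, `names::string`). A minimal sketch of how such a field name could be split is given here. The helper `split_field_name` is hypothetical and is not part of pandas or of the NTV package; it only handles the `name::type` form seen in the example, not the complete JSON-NTV naming rules.

```python
from typing import Optional, Tuple


def split_field_name(field: str) -> Tuple[str, Optional[str]]:
    """Split 'name::type' into (name, type).

    Hypothetical helper for illustration only: the type part is None
    when the field name carries no '::' separator.
    """
    name, sep, ntv_type = field.partition("::")
    if not sep:
        # No explicit type: the whole string is the column name.
        return field, None
    return name, ntv_type


# Field names taken from the example in this document.
fields = ["index", "dates::date", "value32::int32", "names::string", "coord::point"]
parsed = {field: split_field_name(field) for field in fields}

for field, (name, ntv_type) in parsed.items():
    print(f"{field!r}: name={name!r}, type={ntv_type!r}")
```

Keeping the type inside the field name is what lets the format stay compact: no separate schema object is needed to recover `int32` or `date` when reading the data back.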
+ +Pandas `dtype` are compatible with NTV types : + +| **pandas dtype** | **NTV type** | +|--------------------|------------| +| intxx | intxx | +| uintxx | uintxx | +| floatxx | floatxx | +| datetime[ns] | datetime | +| datetime[ns, ] | datetimetz | +| timedelta[ns] | durationiso| +| string | string | +| boolean | boolean | + +Note: +- datetime with timezone is a single NTV type (string ISO8601) +- `CategoricalDtype` and `SparseDtype` are included in the tabular JSON format +- `object` `dtype` is depending on the context (see below) +- `PeriodDtype` and `IntervalDtype` are to be defined + +JSON types (implicit or explicit) are converted in `dtype` following pandas JSON interface: + +| **JSON type** | **pandas dtype** | +|----------------|-------------------| +| number | int64 / float64 | +| string | string / object | +| array | object | +| object | object | +| true, false | boolean | +| null | NaT / NaN / None | + +Note: +- if an NTV type is defined, the `dtype` is ajusted accordingly +- the consideration of null type data needs to be clarified + +The other NTV types are associated with `object` `dtype`. + +### JSON format +The JSON format is defined in [JSON-TAB](https://github.com/loco-philippe/NTV/blob/main/documentation/JSON-TAB-standard.pdf) specification. +It includes the naming rules originally defined in the [JSON-ND project](https://github.com/glenkleidon/JSON-ND) and support for categorical data. +The specification have to be updated to include sparse data. + +### Conversion +When data is associated with a non-`object` `dtype`, pandas conversion methods are used. +Otherwise, NTV conversion is used. 
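The routing rules of this Conversion section (listed under "pandas -> JSON" and "JSON -> pandas") can be sketched as two small functions. This is an illustration under stated assumptions, not the proposed implementation: `writer_for`, `reader_for` and `is_dtype_compatible` are hypothetical names, the returned strings merely label which converter would be used, and routing untyped JSON data to `read_json()` is an assumption drawn from the JSON type table.

```python
from typing import Optional

# NTV types that the mapping table of this document pairs with a pandas dtype.
# The sized numeric types (intxx, uintxx, floatxx) are matched by prefix below.
DTYPE_COMPATIBLE_NTV_TYPES = {"datetime", "datetimetz", "durationiso", "string", "boolean"}
NUMERIC_PREFIXES = ("int", "uint", "float")


def is_dtype_compatible(ntv_type: str) -> bool:
    """Return True when the NTV type has a pandas dtype in the mapping table."""
    if ntv_type in DTYPE_COMPATIBLE_NTV_TYPES:
        return True
    for prefix in NUMERIC_PREFIXES:
        width = ntv_type[len(prefix):]
        if ntv_type.startswith(prefix) and width.isdigit():
            return True
    return False


def writer_for(ntv_type: Optional[str], dtype_name: str) -> str:
    """pandas -> JSON: label the converter used for one column."""
    # No NTV type, or a non-object dtype: the pandas writer is used.
    if ntv_type is None or dtype_name != "object":
        return "to_json"
    # NTV type defined and object dtype: NTV conversion is used.
    return "ntv"


def reader_for(ntv_type: Optional[str]) -> str:
    """JSON -> pandas: label the converter used for one field."""
    # Assumption: untyped JSON data follows the existing pandas JSON interface.
    if ntv_type is None or is_dtype_compatible(ntv_type):
        return "read_json"
    return "ntv"


print(writer_for("int32", "int32"), writer_for("point", "object"))
print(reader_for("int32"), reader_for("point"))
```

With the example DataFrame, `value32::int32` would stay on the pandas path in both directions, while `coord::point` (stored with `object` dtype) would go through the NTV conversion.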
+ +#### pandas -> JSON +- `NTV type` is not defined : use `to_json()` +- `NTV type` is defined and `dtype` is not `object` : use `to_json()` +- `NTV type` is defined and `dtype` is `object` : use NTV conversion + +#### JSON -> pandas +- `NTV type` is compatible with a `dtype` : use `read_json()` +- `NTV type` is not compatible with a `dtype` : use NTV conversion + +## Usage and Impact + +### Usage +It seems to me that this proposal responds to important issues: +- having an efficient text format for data exchange + + The alternative CSV format is not reversible and obsolete (last revision in 2005). Current CSV tools do not comply with the standard. + +- taking into account "semantic" data in pandas objects + +### Compatibility +Interface can be used without NTV type (compatibility with existing data - [see examples](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#4---Annexe-:-Series-tests)) + +If the interface is available, throw a new `orient` option in the JSON interface, the use of the feature is decoupled from the other features. + +### Impacts on the pandas framework +Initially, the impacts are very limited: +- modification of the `name` of `Series` or `DataFrame columns` (no functional impact), +- added an option in the Json interface (e.g. `orient='ntv'`) and added associated methods (no functionnal interference with the other methods) + +In later stages, several developments could be considered: +- validation of the `name` of `Series` or `DataFrame columns` , +- management of the NTV type as a "complementary-object-dtype" +- functional extensions depending on the NTV type + +### Risk to do / risk not to do +The JSON-NTV format and the JSON-TAB format are not (yet) recognized and used formats. The risk for pandas is that this function is not used (no functional impacts). 
+ +On the other hand, the early use by pandas will allow a better consideration of the expectations and needs of pandas as well as a reflection on the evolution of the types supported by pandas. + +## Implementation + +### Modules +Two modules are defined for NTV: + +- json-ntv + + this module manages NTV data without dependency to another module + +- ntvconnector + + those modules manage the conversion between objects and JSON data. They have dependency with objects modules (e.g. connectors with shapely location have dependency with shapely). + +The pandas integration of the JSON interface requires importing only the json-ntv module. + +### Implementation options +The interface can be implemented as NTV connector (`SeriesConnector` and `DataFrameConnector`) and as a new pandas JSON interface `orient` option. +```mermaid +flowchart TB + subgraph H [conversion pandas-NTV] + direction LR + K(mapping NTV type / dtype\nNTV / pandas) ~~~ I(object dtype conversion\nNTV Connector) + I ~~~ J(non-object dtype conversion\npandas) + end + + direction TB + D{{pandas}} ~~~ F(interface JSON\npandas) + E{{NTV}} ~~~ G(pandas NTV Connector\nNTV) + F --> H + G --> H +``` +Several pandas implementations are possible: + +1. External: + + In this implementation, the interface is available only in the NTV side. + This option means that this evolution of the JSON interface is not useful or strategic for pandas. + +2. NTV side: + + In this implementation, the interface is available in the both sides and the conversion is located inside NTV. + This option is the one that minimizes the impacts on the pandas side + +3. pandas side: + + In this implementation, the interface is available in the both sides and the conversion is located inside pandas. + This option allows pandas to keep control of this evolution + +4. pandas restricted: + + In this implementation, the pandas interface and the conversion are located inside pandas and only for non-object `dtype`. 
+ This option makes it possible to offer a compact and reversible interface while prohibiting the introduction of types incompatible with the existing `dtype` + +## F.A.Q. + +Tbd + +## Core team decision +Implementation option : xxxx + +## Timeline +Tbd + +## PDEP History + +- 16 June 2023: Initial draft From 09ed538b59249c64cb452a0c251963294bf81397 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Mon, 19 Jun 2023 10:56:38 +0200 Subject: [PATCH 02/22] change PDEP number (7 -> 12) --- ...terface.md => 0012-compact-and-reversible-JSON-interface.md} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename web/pandas/pdeps/{0007-compact-and-reversible-JSON-interface.md => 0012-compact-and-reversible-JSON-interface.md} (99%) diff --git a/web/pandas/pdeps/0007-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md similarity index 99% rename from web/pandas/pdeps/0007-compact-and-reversible-JSON-interface.md rename to web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index 82374eca1ac5a..1b9633c099859 100644 --- a/web/pandas/pdeps/0007-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -1,4 +1,4 @@ -# PDEP-7: Compact and reversible JSON interface +# PDEP-12: Compact and reversible JSON interface - Created: 16 June 2023 - Status: Under discussions From f4d1f5e2a6f011e8fa2a8cb014b1c31d74fcbdf3 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Sat, 22 Jul 2023 23:20:52 +0200 Subject: [PATCH 03/22] Add FAQ to the PDEPS 0012 --- ...2-compact-and-reversible-JSON-interface.md | 50 ++++++++++++++++++- 1 file changed, 49 insertions(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index 1b9633c099859..58def62e63372 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ 
b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -15,6 +15,7 @@ - [Motivation](./pandas_PDEP.md/#Motivation) - [Why is it important to have a compact and reversible JSON interface ?](./pandas_PDEP.md/#Why-is-it-important-to-have-a-compact-and-reversible-JSON-interface-?) - [Is it relevant to take an extended type into account ?](./pandas_PDEP.md/#Is-it-relevant-to-take-an-extended-type-into-account-?) + - [Is this only useful for pandas ?](./pandas_PDEP.md/#Is-this-only-useful-for-pandas-?) - [Description](./pandas_PDEP.md/#Description) - [data typing](./pandas_PDEP.md/#Data-typing) - [JSON format](./pandas_PDEP.md/#JSON-format) @@ -124,6 +125,9 @@ The proposed interface is compatible with existing data. - it increases the semantic scope of the data processed by pandas - the use of a complementary type avoids having to modify the pandas data model +### Is this only useful for pandas ? +- the JSON-TAB format is applicable to tabular data and multi-dimensional data. +- this JSON interface can therefore be used for any application using tabular or multi-dimensional data. This would allow for example reversible data exchanges between pandas - DataFrame and Xarray - DataArray (Xarray issue under construction) [see example DataFrame / DataArray](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#Multidimensional-data). ## Description @@ -274,7 +278,50 @@ Several pandas implementations are possible: ## F.A.Q. -Tbd +**Q: Does `orient="table"` not do what you are proposing already?** + +**A**: In principle, yes, this option takes into account the notion of type. 
+ +But this is very limited (see examples added in the [Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb)) : +- **Types and Json interface** + - the only way to keep the types in the json interface is to use the orient='table' option + - only few types are allowed in json-table interface : int64, float64, bool, datetime64, timedelta64, categorical + - allowed types are not always kept in json-table interface + - data with 'object' dtype is kept only id data is string + - with categorical dtype, the underlying dtype is not included in json interface +- **Data compactness** + - json-table interface is not compact (in this example the size is double or triple the size of the compact format +- **Reversibility** + - Interface is reversible only with json dtype +- **External types** + - the interface does not accept external types + - to integrate external types, it is necessary to first create ExtensionArray and ExtensionDtype objects + +The proposal made meets these limitations whithout complex code. + +**Q: In general, we should only have 1 `"table"` format for pandas in read_json/to_json. There is also the issue of backwards compatibility if we do change the format. Can the existing format be adapted in a way that fixes the type issues/issues with roundtripping?** + +**A**: I will add two additional remarks: +- the types defined in Tableschema are only partially taken into account (examples of types not taken into account in the interface: string-uri, array, date, time, year, geopoint): +- the `read_json()` interface works with the following data: `{'simple': [1,2,3] }` (contrary to what is indicated in the documentation) but it is impossible with `to_json()` to recreate this json ( yet basic). 
+ +I think that the problem cannot be limited to bug fixes and that a clear strategy must be defined for the Json interface in particular with the gradual abandonment in open-data solutions of the obsolete CSV format in favor of a Json format. + +As stated, the proposed solution addresses several shortcomings of the current interface and could simply fit into the pandas environment (the other option would be to consider that the Json interface is a peripheral function of pandas and can remain external to pandas) regardless of the `orient='table'` option. + +**Q: As far as I can tell, JSON NTV is not in any form a standardised JSON format. I believe that pandas (and geopandas, which is where I came from to this issue) should try to follow either de facto or de jure standards and do not opt in for a file format that does not have any community support at this moment. This can obviously change in the future and that is where this PR should be revised. Why would pandas use this standard?** + +**A**: As indicated in the issue (and detailed in [the attached Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb)), the json interface is not reversible (`to_json` then `read_json` does not always return the initial object) and several shortcomings and bugs are present. The main cause of this problem is that the data type is not taken into account in the JSON format (or very partially with the `orient='table'` option). + +The proposal made answers this problem ([the example at the beginning of Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#0---Simple-example) simply and clearly illustrates the interest of the proposal). + +Regarding the underlying JSON-NTV format, its impact is quite low for tabular data (it is limited to adding the type in the field name). 
+Nevertheless, the question is relevant: The JSON-NTV format is indeed a shared, documented, supported and implemented format, but indeed the community support is for the moment reduced but it only asks to expand!! + +To conclude, +- if it is important (or strategic) to have a reversible JSON interface for any type of data, the proposal can be allowed, +- if not, a third-party package that reads/writes this format to/from pandas DataFrames listed in the [ecosystem](https://pandas.pydata.org/community/ecosystem.html) should be considered + ## Core team decision Implementation option : xxxx @@ -285,3 +332,4 @@ Tbd ## PDEP History - 16 June 2023: Initial draft +- 22 July 2023: Add F.A.Q. From 8d0f2f4912fa45291eaae927f76f9b5b7843daf7 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Fri, 4 Aug 2023 15:06:50 +0200 Subject: [PATCH 04/22] Update 0012-compact-and-reversible-JSON-interface.md --- ...012-compact-and-reversible-JSON-interface.md | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index 58def62e63372..e4647b40bb053 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -285,29 +285,32 @@ Several pandas implementations are possible: But this is very limited (see examples added in the [Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb)) : - **Types and Json interface** - the only way to keep the types in the json interface is to use the orient='table' option - - only few types are allowed in json-table interface : int64, float64, bool, datetime64, timedelta64, categorical + - few dtypes are not allowed in json-table interface : period, timedelta64, interval - allowed types are not always kept in json-table interface - data with 'object' dtype is kept only id data is 
string - with categorical dtype, the underlying dtype is not included in json interface - **Data compactness** - - json-table interface is not compact (in this example the size is double or triple the size of the compact format + - json-table interface is not compact (in the example in the [Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#data-compactness))the size is triple or quadruple the size of the compact format - **Reversibility** - - Interface is reversible only with json dtype + - Interface is reversible only with few dtypes : int64, float64, bool, string, datetime64 and partially categorical - **External types** - the interface does not accept external types + - Table-schema defines 20 data types but the `orient="table"` interface takes into account 5 data types (see [table](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#Converting-table-schema-type-to-pandas-dtype)) - to integrate external types, it is necessary to first create ExtensionArray and ExtensionDtype objects -The proposal made meets these limitations whithout complex code. +The current interface is not compatible with the data structure defined by table-schema. For this to be possible, it is necessary to integrate a "type extension" like the one proposed (this has moreover been partially achieved with the notion of `extDtype` found in the interface for several formats). -**Q: In general, we should only have 1 `"table"` format for pandas in read_json/to_json. There is also the issue of backwards compatibility if we do change the format. Can the existing format be adapted in a way that fixes the type issues/issues with roundtripping?** +**Q: In general, we should only have 1 `"table"` format for pandas in read_json/to_json. There is also the issue of backwards compatibility if we do change the format. The fact that the table interface is buggy is not a reason to add a new interface (I'd rather fix those bugs). 
Can the existing format be adapted in a way that fixes the type issues/issues with roundtripping?** **A**: I will add two additional remarks: -- the types defined in Tableschema are only partially taken into account (examples of types not taken into account in the interface: string-uri, array, date, time, year, geopoint): -- the `read_json()` interface works with the following data: `{'simple': [1,2,3] }` (contrary to what is indicated in the documentation) but it is impossible with `to_json()` to recreate this json ( yet basic). +- the types defined in Tableschema are partially (only 5 out of 20) taken into account (examples of types not taken into account in the interface: string-uri, array, date, time, year, geopoint): +- the `read_json()` interface works too with the following data: `{'simple': [1,2,3] }` (contrary to what is indicated in the documentation) but it is impossible with `to_json()` to recreate this simple json. I think that the problem cannot be limited to bug fixes and that a clear strategy must be defined for the Json interface in particular with the gradual abandonment in open-data solutions of the obsolete CSV format in favor of a Json format. As stated, the proposed solution addresses several shortcomings of the current interface and could simply fit into the pandas environment (the other option would be to consider that the Json interface is a peripheral function of pandas and can remain external to pandas) regardless of the `orient='table'` option. + +It is nevertheless possible to merge the proposed format and the `orient='table'` format in order to have an explicit management of the notion of `extDtype` **Q: As far as I can tell, JSON NTV is not in any form a standardised JSON format. I believe that pandas (and geopandas, which is where I came from to this issue) should try to follow either de facto or de jure standards and do not opt in for a file format that does not have any community support at this moment. 
This can obviously change in the future and that is where this PR should be revised. Why would pandas use this standard?** From 3f3aae0f24c7035f0af0be6b59bb81d6a1b1f3b9 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Fri, 4 Aug 2023 15:25:00 +0200 Subject: [PATCH 05/22] Update 0012-compact-and-reversible-JSON-interface.md --- ...2-compact-and-reversible-JSON-interface.md | 102 +++++++++--------- 1 file changed, 51 insertions(+), 51 deletions(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index e4647b40bb053..be2674dd0f111 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -3,7 +3,7 @@ - Created: 16 June 2023 - Status: Under discussions - Discussion: [#53252](https://github.com/pandas-dev/pandas/issues/53252) -- Author: [Philippe THOMY](https://github.com/loco-philippe) +- Author: [Philippe THOMY](https://github.com/loco-philippe) - Revision: 1 @@ -32,13 +32,13 @@ - [Core team decision](./pandas_PDEP.md/#Core-team-decision) - [Timeline](./pandas_PDEP.md/#Timeline) - [PDEP history](./pandas_PDEP.md/#PDEP-history) -------------------------- +------------------------- ## Abstract ### Problem description The `dtype` is not explicitely taken into account in the current JSON interface. To work around this problem, a data schema (e.g. `TableSchema`) must be associated with the JSON file. - + Nevertheless, the current JSON interface is not reversible and has inconsistencies related to the consideration of the `dtype`. 
Some JSON-interface problems are detailed in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#1---Current-Json-interface) @@ -46,10 +46,10 @@ Some JSON-interface problems are detailed in the [linked NoteBook](https://nbvie ### Feature Description To have a simple, compact and reversible solution, I propose to use the [JSON-NTV format (Named and Typed Value)](https://github.com/loco-philippe/NTV#readme) - which integrates the notion of type - and its JSON-TAB variation for tabular data. - + This solution allows to include a large number of types (not necessarily pandas `dtype`). -In the example below, a DataFrame with several data types is converted to JSON. +In the example below, a DataFrame with several data types is converted to JSON. The DataFrame resulting from this JSON is identical to the initial DataFrame (reversibility). With the existing JSON interface, this conversion is not possible. @@ -59,7 +59,7 @@ In [1]: from shapely.geometry import Point from datetime import date In [2]: data = {'index': [100, 200, 300, 400, 500, 600], - 'dates::date': pd.Series([date(1964,1,1), date(1985,2,5), date(2022,1,21), date(1964,1,1), date(1985,2,5), date(2022,1,21)]), + 'dates::date': pd.Series([date(1964,1,1), date(1985,2,5), date(2022,1,21), date(1964,1,1), date(1985,2,5), date(2022,1,21)]), 'value': [10, 10, 20, 20, 30, 30], 'value32': pd.Series([12, 12, 22, 22, 32, 32], dtype='int32'), 'res': [10, 20, 30, 10, 20, 30], @@ -70,9 +70,9 @@ In [2]: data = {'index': [100, 200, 300, 400, 500, 600], In [3]: df = pd.DataFrame(data).set_index('index') In [4]: df -Out[4]: +Out[4]: dates::date value value32 res coord::point names unique - index + index 100 1964-01-01 10 12 10 POINT (1 2) john True 200 1985-02-05 10 12 20 POINT (3 4) eric True 300 2022-01-21 20 22 30 POINT (5 6) judith True @@ -86,7 +86,7 @@ Out[4]: ```python In [5]: df_json = Ntv.obj(df) pprint(df_json.to_obj(), compact=True, width=120) -Out[5]: +Out[5]: 
{':tab': {'coord::point': [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [3.0, 4.0], [5.0, 6.0]], 'dates::date': ['1964-01-01', '1985-02-05', '2022-01-21', '1964-01-01', '1985-02-05', '2022-01-21'], 'index': [100, 200, 300, 400, 500, 600], @@ -110,7 +110,7 @@ Several other examples are provided in the [linked NoteBook](https://nbviewer.or ## Scope The objective is to make available the proposed JSON interface for any type of data. - + The proposed interface is compatible with existing data. ## Motivation @@ -126,7 +126,7 @@ The proposed interface is compatible with existing data. - the use of a complementary type avoids having to modify the pandas data model ### Is this only useful for pandas ? -- the JSON-TAB format is applicable to tabular data and multi-dimensional data. +- the JSON-TAB format is applicable to tabular data and multi-dimensional data. - this JSON interface can therefore be used for any application using tabular or multi-dimensional data. This would allow for example reversible data exchanges between pandas - DataFrame and Xarray - DataArray (Xarray issue under construction) [see example DataFrame / DataArray](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#Multidimensional-data). ## Description @@ -174,16 +174,16 @@ Note: - the consideration of null type data needs to be clarified The other NTV types are associated with `object` `dtype`. - + ### JSON format The JSON format is defined in [JSON-TAB](https://github.com/loco-philippe/NTV/blob/main/documentation/JSON-TAB-standard.pdf) specification. -It includes the naming rules originally defined in the [JSON-ND project](https://github.com/glenkleidon/JSON-ND) and support for categorical data. +It includes the naming rules originally defined in the [JSON-ND project](https://github.com/glenkleidon/JSON-ND) and support for categorical data. The specification have to be updated to include sparse data. 
### Conversion -When data is associated with a non-`object` `dtype`, pandas conversion methods are used. +When data is associated with a non-`object` `dtype`, pandas conversion methods are used. Otherwise, NTV conversion is used. - + #### pandas -> JSON - `NTV type` is not defined : use `to_json()` - `NTV type` is defined and `dtype` is not `object` : use `to_json()` @@ -192,22 +192,22 @@ Otherwise, NTV conversion is used. #### JSON -> pandas - `NTV type` is compatible with a `dtype` : use `read_json()` - `NTV type` is not compatible with a `dtype` : use NTV conversion - + ## Usage and Impact - + ### Usage It seems to me that this proposal responds to important issues: -- having an efficient text format for data exchange - +- having an efficient text format for data exchange + The alternative CSV format is not reversible and obsolete (last revision in 2005). Current CSV tools do not comply with the standard. - + - taking into account "semantic" data in pandas objects ### Compatibility Interface can be used without NTV type (compatibility with existing data - [see examples](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#4---Annexe-:-Series-tests)) - + If the interface is available, throw a new `orient` option in the JSON interface, the use of the feature is decoupled from the other features. - + ### Impacts on the pandas framework Initially, the impacts are very limited: - modification of the `name` of `Series` or `DataFrame columns` (no functional impact), @@ -219,23 +219,23 @@ In later stages, several developments could be considered: - functional extensions depending on the NTV type ### Risk to do / risk not to do -The JSON-NTV format and the JSON-TAB format are not (yet) recognized and used formats. The risk for pandas is that this function is not used (no functional impacts). - +The JSON-NTV format and the JSON-TAB format are not (yet) recognized and used formats. 
The risk for pandas is that this function is not used (no functional impacts). + On the other hand, the early use by pandas will allow a better consideration of the expectations and needs of pandas as well as a reflection on the evolution of the types supported by pandas. - + ## Implementation ### Modules Two modules are defined for NTV: - + - json-ntv - + this module manages NTV data without dependency to another module - + - ntvconnector - - those modules manage the conversion between objects and JSON data. They have dependency with objects modules (e.g. connectors with shapely location have dependency with shapely). - + + those modules manage the conversion between objects and JSON data. They have dependency with objects modules (e.g. connectors with shapely location have dependency with shapely). + The pandas integration of the JSON interface requires importing only the json-ntv module. ### Implementation options @@ -247,33 +247,33 @@ flowchart TB K(mapping NTV type / dtype\nNTV / pandas) ~~~ I(object dtype conversion\nNTV Connector) I ~~~ J(non-object dtype conversion\npandas) end - + direction TB D{{pandas}} ~~~ F(interface JSON\npandas) - E{{NTV}} ~~~ G(pandas NTV Connector\nNTV) + E{{NTV}} ~~~ G(pandas NTV Connector\nNTV) F --> H G --> H ``` Several pandas implementations are possible: - + 1. External: - - In this implementation, the interface is available only in the NTV side. + + In this implementation, the interface is available only in the NTV side. This option means that this evolution of the JSON interface is not useful or strategic for pandas. - + 2. NTV side: - - In this implementation, the interface is available in the both sides and the conversion is located inside NTV. + + In this implementation, the interface is available in the both sides and the conversion is located inside NTV. This option is the one that minimizes the impacts on the pandas side - + 3. 
pandas side: - - In this implementation, the interface is available in the both sides and the conversion is located inside pandas. + + In this implementation, the interface is available in the both sides and the conversion is located inside pandas. This option allows pandas to keep control of this evolution - + 4. pandas restricted: - - In this implementation, the pandas interface and the conversion are located inside pandas and only for non-object `dtype`. + + In this implementation, the pandas interface and the conversion are located inside pandas and only for non-object `dtype`. This option makes it possible to offer a compact and reversible interface while prohibiting the introduction of types incompatible with the existing `dtype` ## F.A.Q. @@ -304,13 +304,13 @@ The current interface is not compatible with the data structure defined by table **A**: I will add two additional remarks: - the types defined in Tableschema are partially (only 5 out of 20) taken into account (examples of types not taken into account in the interface: string-uri, array, date, time, year, geopoint): -- the `read_json()` interface works too with the following data: `{'simple': [1,2,3] }` (contrary to what is indicated in the documentation) but it is impossible with `to_json()` to recreate this simple json. +- the `read_json()` interface works too with the following data: `{'simple': [1,2,3] }` (contrary to what is indicated in the documentation) but it is impossible with `to_json()` to recreate this simple json. + +I think that the problem cannot be limited to bug fixes and that a clear strategy must be defined for the Json interface in particular with the gradual abandonment in open-data solutions of the obsolete CSV format in favor of a Json format. 
-I think that the problem cannot be limited to bug fixes and that a clear strategy must be defined for the Json interface in particular with the gradual abandonment in open-data solutions of the obsolete CSV format in favor of a Json format. - As stated, the proposed solution addresses several shortcomings of the current interface and could simply fit into the pandas environment (the other option would be to consider that the Json interface is a peripheral function of pandas and can remain external to pandas) regardless of the `orient='table'` option. - -It is nevertheless possible to merge the proposed format and the `orient='table'` format in order to have an explicit management of the notion of `extDtype` + +It is nevertheless possible to merge the proposed format and the `orient='table'` format in order to have an explicit management of the notion of `extDtype` **Q: As far as I can tell, JSON NTV is not in any form a standardised JSON format. I believe that pandas (and geopandas, which is where I came from to this issue) should try to follow either de facto or de jure standards and do not opt in for a file format that does not have any community support at this moment. This can obviously change in the future and that is where this PR should be revised. 
Why would pandas use this standard?** @@ -328,7 +328,7 @@ To conclude, ## Core team decision Implementation option : xxxx - + ## Timeline Tbd From 82b1992583cb7a68beb1a1b59e34c307af47b720 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Fri, 4 Aug 2023 16:27:40 +0200 Subject: [PATCH 06/22] Update 0012-compact-and-reversible-JSON-interface.md --- web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index be2674dd0f111..5d1488ff0e9f8 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -1,7 +1,7 @@ # PDEP-12: Compact and reversible JSON interface - Created: 16 June 2023 -- Status: Under discussions +- Status: Under discussion - Discussion: [#53252](https://github.com/pandas-dev/pandas/issues/53252) - Author: [Philippe THOMY](https://github.com/loco-philippe) - Revision: 1 From a051d9cc5d24028d3ef785cc62e88248befd1cd6 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Fri, 4 Aug 2023 16:38:56 +0200 Subject: [PATCH 07/22] pre-commit codespell --- .../pdeps/0012-compact-and-reversible-JSON-interface.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index 5d1488ff0e9f8..9059a9f1c68fe 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -36,7 +36,7 @@ ## Abstract ### Problem description -The `dtype` is not explicitely taken into account in the current JSON interface. +The `dtype` is not explicitly taken into account in the current JSON interface. To work around this problem, a data schema (e.g. 
`TableSchema`) must be associated with the JSON file. Nevertheless, the current JSON interface is not reversible and has inconsistencies related to the consideration of the `dtype`. @@ -170,7 +170,7 @@ JSON types (implicit or explicit) are converted in `dtype` following pandas JSON | null | NaT / NaN / None | Note: -- if an NTV type is defined, the `dtype` is ajusted accordingly +- if an NTV type is defined, the `dtype` is adjusted accordingly - the consideration of null type data needs to be clarified The other NTV types are associated with `object` `dtype`. @@ -211,7 +211,7 @@ If the interface is available, throw a new `orient` option in the JSON interface ### Impacts on the pandas framework Initially, the impacts are very limited: - modification of the `name` of `Series` or `DataFrame columns` (no functional impact), -- added an option in the Json interface (e.g. `orient='ntv'`) and added associated methods (no functionnal interference with the other methods) +- added an option in the Json interface (e.g. `orient='ntv'`) and added associated methods (no functional interference with the other methods) In later stages, several developments could be considered: - validation of the `name` of `Series` or `DataFrame columns` , From 63d92ec39e4242804b49b36a575954d43389c9b2 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Fri, 4 Aug 2023 22:40:24 +0200 Subject: [PATCH 08/22] Update 0012-compact-and-reversible-JSON-interface.md --- web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index 9059a9f1c68fe..027d2427af596 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -335,4 +335,4 @@ Tbd ## PDEP History - 16 June 2023: Initial draft -- 22 July 2023: Add F.A.Q. 
+- 22 July 2023: Add F.A.Q. \ No newline at end of file From aca4a47eb344afa7224f36ec0954bdc14cd033a6 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Fri, 4 Aug 2023 22:53:42 +0200 Subject: [PATCH 09/22] Update 0012-compact-and-reversible-JSON-interface.md --- web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index 027d2427af596..9059a9f1c68fe 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -335,4 +335,4 @@ Tbd ## PDEP History - 16 June 2023: Initial draft -- 22 July 2023: Add F.A.Q. \ No newline at end of file +- 22 July 2023: Add F.A.Q. From d0b41a6791399c3de78c1262292751937520cc1f Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Sat, 5 Aug 2023 22:42:07 +0200 Subject: [PATCH 10/22] delete summary --- ...2-compact-and-reversible-JSON-interface.md | 26 +------------------ 1 file changed, 1 insertion(+), 25 deletions(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index 9059a9f1c68fe..e5269b6423e23 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -7,31 +7,7 @@ - Revision: 1 -#### Summary -- [Abstract](./pandas_PDEP.md/#Abstract) - - [Problem description](./pandas_PDEP.md/#Problem-description) - - [Feature Description](./pandas_PDEP.md/#Feature-Description) -- [Scope](./pandas_PDEP.md/#Scope) -- [Motivation](./pandas_PDEP.md/#Motivation) - - [Why is it important to have a compact and reversible JSON interface ?](./pandas_PDEP.md/#Why-is-it-important-to-have-a-compact-and-reversible-JSON-interface-?) 
- - [Is it relevant to take an extended type into account ?](./pandas_PDEP.md/#Is-it-relevant-to-take-an-extended-type-into-account-?) - - [Is this only useful for pandas ?](./pandas_PDEP.md/#Is-this-only-useful-for-pandas-?) -- [Description](./pandas_PDEP.md/#Description) - - [data typing](./pandas_PDEP.md/#Data-typing) - - [JSON format](./pandas_PDEP.md/#JSON-format) - - [Conversion](./pandas_PDEP.md/#Conversion) -- [Usage and impact](./pandas_PDEP.md/#Usage-and-impact) - - [Usage](./pandas_PDEP.md/#Usage) - - [Compatibility](./pandas_PDEP.md/#Compatibility) - - [Impacts on the pandas framework](./pandas_PDEP.md/#Impacts-on-the-pandas-framework) - - [Risk to do / risk not to do](./pandas_PDEP.md/#Risk-to-do-/-risk-not-to-do) -- [Implementation](./pandas_PDEP.md/#Implementation) - - [Modules](./pandas_PDEP.md/#Modules) - - [Implementation options](./pandas_PDEP.md/#Implementation-options) -- [F.A.Q.](./pandas_PDEP.md/#F.A.Q.) -- [Core team decision](./pandas_PDEP.md/#Core-team-decision) -- [Timeline](./pandas_PDEP.md/#Timeline) -- [PDEP history](./pandas_PDEP.md/#PDEP-history) + ------------------------- ## Abstract From 3f7135adf79c9663803998a8c2dcec8f71a82c1a Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Sat, 5 Aug 2023 22:53:03 +0200 Subject: [PATCH 11/22] delete mermaid flowchart --- .../0012-compact-and-reversible-JSON-interface.md | 15 +-------------- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index e5269b6423e23..278c94dad98ca 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -216,20 +216,7 @@ The pandas integration of the JSON interface requires importing only the json-nt ### Implementation options The interface can be implemented as NTV connector (`SeriesConnector` and `DataFrameConnector`) and as a new pandas 
JSON interface `orient` option. -```mermaid -flowchart TB - subgraph H [conversion pandas-NTV] - direction LR - K(mapping NTV type / dtype\nNTV / pandas) ~~~ I(object dtype conversion\nNTV Connector) - I ~~~ J(non-object dtype conversion\npandas) - end - - direction TB - D{{pandas}} ~~~ F(interface JSON\npandas) - E{{NTV}} ~~~ G(pandas NTV Connector\nNTV) - F --> H - G --> H -``` + Several pandas implementations are possible: 1. External: From 16d720146da83fe8832814f69a32f150840195cb Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Sat, 5 Aug 2023 23:06:44 +0200 Subject: [PATCH 12/22] with summary, without mermaid flowchart --- ...2-compact-and-reversible-JSON-interface.md | 26 ++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index 278c94dad98ca..5596dcd9eea85 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -7,7 +7,31 @@ - Revision: 1 - +#### Summary +- [Abstract](./pandas_PDEP.md/#Abstract) + - [Problem description](./pandas_PDEP.md/#Problem-description) + - [Feature Description](./pandas_PDEP.md/#Feature-Description) +- [Scope](./pandas_PDEP.md/#Scope) +- [Motivation](./pandas_PDEP.md/#Motivation) + - [Why is it important to have a compact and reversible JSON interface ?](./pandas_PDEP.md/#Why-is-it-important-to-have-a-compact-and-reversible-JSON-interface-?) + - [Is it relevant to take an extended type into account ?](./pandas_PDEP.md/#Is-it-relevant-to-take-an-extended-type-into-account-?) + - [Is this only useful for pandas ?](./pandas_PDEP.md/#Is-this-only-useful-for-pandas-?) 
+- [Description](./pandas_PDEP.md/#Description) + - [data typing](./pandas_PDEP.md/#Data-typing) + - [JSON format](./pandas_PDEP.md/#JSON-format) + - [Conversion](./pandas_PDEP.md/#Conversion) +- [Usage and impact](./pandas_PDEP.md/#Usage-and-impact) + - [Usage](./pandas_PDEP.md/#Usage) + - [Compatibility](./pandas_PDEP.md/#Compatibility) + - [Impacts on the pandas framework](./pandas_PDEP.md/#Impacts-on-the-pandas-framework) + - [Risk to do / risk not to do](./pandas_PDEP.md/#Risk-to-do-/-risk-not-to-do) +- [Implementation](./pandas_PDEP.md/#Implementation) + - [Modules](./pandas_PDEP.md/#Modules) + - [Implementation options](./pandas_PDEP.md/#Implementation-options) +- [F.A.Q.](./pandas_PDEP.md/#F.A.Q.) +- [Core team decision](./pandas_PDEP.md/#Core-team-decision) +- [Timeline](./pandas_PDEP.md/#Timeline) +- [PDEP history](./pandas_PDEP.md/#PDEP-history) ------------------------- ## Abstract From 08cf17b6ce1c6291f94340bbdef2a55671befa5b Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Fri, 11 Aug 2023 17:03:42 +0200 Subject: [PATCH 13/22] rename Annexe -> Appendix --- web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index 5596dcd9eea85..7430347893c15 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -204,7 +204,7 @@ It seems to me that this proposal responds to important issues: - taking into account "semantic" data in pandas objects ### Compatibility -Interface can be used without NTV type (compatibility with existing data - [see examples](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#4---Annexe-:-Series-tests)) +Interface can be used without NTV type (compatibility with existing data - [see 
examples](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#4---Appendix-:-Series-tests)) If the interface is available, throw a new `orient` option in the JSON interface, the use of the feature is decoupled from the other features. From ec31662f0ef010c6c581b4eee6dcf29b2156a571 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Tue, 5 Sep 2023 23:51:44 +0200 Subject: [PATCH 14/22] add tableschema specification --- ...2-compact-and-reversible-JSON-interface.md | 37 ++++++++++++++++++- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index 7430347893c15..d84c9dcedc039 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -47,7 +47,9 @@ Some JSON-interface problems are detailed in the [linked NoteBook](https://nbvie ### Feature Description To have a simple, compact and reversible solution, I propose to use the [JSON-NTV format (Named and Typed Value)](https://github.com/loco-philippe/NTV#readme) - which integrates the notion of type - and its JSON-TAB variation for tabular data. -This solution allows to include a large number of types (not necessarily pandas `dtype`). +This solution allows to include a large number of types (not necessarily pandas `dtype`) which allows to have: +- a JSON `orient="table"` interface which respects the Table Schema specification (going from 5 types to 20 types), +- a JSON interface for all pandas data formats. In the example below, a DataFrame with several data types is converted to JSON. The DataFrame resulting from this JSON is identical to the initial DataFrame (reversibility). @@ -109,12 +111,15 @@ Out[5]: df created from JSON is equal to initial df ? 
True Several other examples are provided in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#2---Series) ## Scope -The objective is to make available the proposed JSON interface for any type of data. +The objective is to make available the proposed JSON interface for any type of data and for `orient="table"` option. The proposed interface is compatible with existing data. ## Motivation +### Why extend the `orient=table` option to other data types? +- The Table Schema specification defines 24 data types, 6 are taken into account in the pandas interface + ### Why is it important to have a compact and reversible JSON interface ? - a reversible interface provides an exchange format. - a textual exchange format facilitates exchanges between platforms (e.g. OpenData) @@ -133,6 +138,7 @@ The proposed interface is compatible with existing data. The proposed solution is based on several key points: - data typing +- correspondence between TableSchema and pandas - JSON format for tabular data - conversion to and from JSON format @@ -175,6 +181,33 @@ Note: The other NTV types are associated with `object` `dtype`. +### correspondence between TableSchema and pandas +The TableSchema typing is carried by two attributes `format` and `type`. 
+ +The table below shows the correspondence between TableSchema format / type and pandas dtype / NTVtype: + +| **format / type** | **NTV type / dtype** | +|--------------------|----------------------| +| default / datetime | / datetime64[ns] | +| default / number | / float64 | +| default / integer | / int64 | +| default / boolean | / bool | +| default / string | / object | +| default / duration | / timedelta64[ns] | +| email / string | email / string | +| uri / string | uri / string | +| default / object | object / object | +| default / array | array / object | +| default / date | date / object | +| default / time | time / object | +| default / year | year / int64 | +| default / yearmonth| month / int64 | +| array / geopoint | point / object | +| default / geojson | geojson / object | + +Note: +- other TableSchema format are defined and are to be studied (uuid, binary, topojson, specific format for geopoint and datation) + ### JSON format The JSON format is defined in [JSON-TAB](https://github.com/loco-philippe/NTV/blob/main/documentation/JSON-TAB-standard.pdf) specification. It includes the naming rules originally defined in the [JSON-ND project](https://github.com/glenkleidon/JSON-ND) and support for categorical data. From 4dbb82240920ab2dedc41384c377cf228ce11eb3 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Wed, 6 Sep 2023 17:07:40 +0200 Subject: [PATCH 15/22] add orient="table" --- .../0012-compact-and-reversible-JSON-interface.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index d84c9dcedc039..d52c1e0c7d0f8 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -36,11 +36,12 @@ ## Abstract ### Problem description -The `dtype` is not explicitly taken into account in the current JSON interface. 
-To work around this problem, a data schema (e.g. `TableSchema`) must be associated with the JSON file. - -Nevertheless, the current JSON interface is not reversible and has inconsistencies related to the consideration of the `dtype`. - +The `dtype` and "Python type" are not explicitly taken into account in the current JSON interface. + +So, the current JSON interface is not allways reversible and has inconsistencies related to the consideration of the `dtype`. + +Another consequence is the partial application of the TableSchema specification in the `orient="table"` option (6 data types are taken into account out of the 24 defined). + Some JSON-interface problems are detailed in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#1---Current-Json-interface) @@ -184,7 +185,7 @@ The other NTV types are associated with `object` `dtype`. ### correspondence between TableSchema and pandas The TableSchema typing is carried by two attributes `format` and `type`. 
-The table below shows the correspondence between TableSchema format / type and pandas dtype / NTVtype: +The table below shows the correspondence between TableSchema format / type and pandas NTVtype / dtype: | **format / type** | **NTV type / dtype** | |--------------------|----------------------| From 38e92b2259eda74e92615fa8f8190cf7fd45f0f7 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Wed, 6 Sep 2023 23:24:30 +0200 Subject: [PATCH 16/22] Add Table Schema extension --- ...2-compact-and-reversible-JSON-interface.md | 131 ++++++++++++------ 1 file changed, 90 insertions(+), 41 deletions(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index d52c1e0c7d0f8..a711f25c3e7a8 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -2,56 +2,61 @@ - Created: 16 June 2023 - Status: Under discussion -- Discussion: [#53252](https://github.com/pandas-dev/pandas/issues/53252) +- Discussion: + [#53252](https://github.com/pandas-dev/pandas/issues/53252) + [#55038](https://github.com/pandas-dev/pandas/issues/55038) - Author: [Philippe THOMY](https://github.com/loco-philippe) -- Revision: 1 +- Revision: 2 #### Summary -- [Abstract](./pandas_PDEP.md/#Abstract) - - [Problem description](./pandas_PDEP.md/#Problem-description) - - [Feature Description](./pandas_PDEP.md/#Feature-Description) -- [Scope](./pandas_PDEP.md/#Scope) -- [Motivation](./pandas_PDEP.md/#Motivation) - - [Why is it important to have a compact and reversible JSON interface ?](./pandas_PDEP.md/#Why-is-it-important-to-have-a-compact-and-reversible-JSON-interface-?) - - [Is it relevant to take an extended type into account ?](./pandas_PDEP.md/#Is-it-relevant-to-take-an-extended-type-into-account-?) - - [Is this only useful for pandas ?](./pandas_PDEP.md/#Is-this-only-useful-for-pandas-?) 
-- [Description](./pandas_PDEP.md/#Description) - - [data typing](./pandas_PDEP.md/#Data-typing) - - [JSON format](./pandas_PDEP.md/#JSON-format) - - [Conversion](./pandas_PDEP.md/#Conversion) -- [Usage and impact](./pandas_PDEP.md/#Usage-and-impact) - - [Usage](./pandas_PDEP.md/#Usage) - - [Compatibility](./pandas_PDEP.md/#Compatibility) - - [Impacts on the pandas framework](./pandas_PDEP.md/#Impacts-on-the-pandas-framework) - - [Risk to do / risk not to do](./pandas_PDEP.md/#Risk-to-do-/-risk-not-to-do) -- [Implementation](./pandas_PDEP.md/#Implementation) - - [Modules](./pandas_PDEP.md/#Modules) - - [Implementation options](./pandas_PDEP.md/#Implementation-options) -- [F.A.Q.](./pandas_PDEP.md/#F.A.Q.) -- [Core team decision](./pandas_PDEP.md/#Core-team-decision) -- [Timeline](./pandas_PDEP.md/#Timeline) -- [PDEP history](./pandas_PDEP.md/#PDEP-history) +- [Abstract](./0012-compact-and-reversible-JSON-interface.md/#Abstract) + - [Problem description](./0012-compact-and-reversible-JSON-interface.md/#Problem-description) + - [Feature Description](./0012-compact-and-reversible-JSON-interface.md/#Feature-Description) +- [Scope](./0012-compact-and-reversible-JSON-interface.md/#Scope) +- [Motivation](./0012-compact-and-reversible-JSON-interface.md/#Motivation) + - [Why is it important to have a compact and reversible JSON interface ?](./0012-compact-and-reversible-JSON-interface.md/#Why-is-it-important-to-have-a-compact-and-reversible-JSON-interface-?) + - [Is it relevant to take an extended type into account ?](./0012-compact-and-reversible-JSON-interface.md/#Is-it-relevant-to-take-an-extended-type-into-account-?) + - [Is this only useful for pandas ?](./0012-compact-and-reversible-JSON-interface.md/#Is-this-only-useful-for-pandas-?) 
+- [Description](./0012-compact-and-reversible-JSON-interface.md/#Description) + - [Data typing](./0012-compact-and-reversible-JSON-interface.md/#Data-typing) + - [Correspondence between TableSchema and pandas](./panda0012-compact-and-reversible-JSON-interfaces_PDEP.md/#Correspondence-between-TableSchema-and-pandas) + - [JSON format](./0012-compact-and-reversible-JSON-interface.md/#JSON-format) + - [Conversion](./0012-compact-and-reversible-JSON-interface.md/#Conversion) +- [Usage and impact](./0012-compact-and-reversible-JSON-interface.md/#Usage-and-impact) + - [Usage](./0012-compact-and-reversible-JSON-interface.md/#Usage) + - [Compatibility](./0012-compact-and-reversible-JSON-interface.md/#Compatibility) + - [Impacts on the pandas framework](./0012-compact-and-reversible-JSON-interface.md/#Impacts-on-the-pandas-framework) + - [Risk to do / risk not to do](./0012-compact-and-reversible-JSON-interface.md/#Risk-to-do-/-risk-not-to-do) +- [Implementation](./0012-compact-and-reversible-JSON-interface.md/#Implementation) + - [Modules](./0012-compact-and-reversible-JSON-interface.md/#Modules) + - [Implementation options](./0012-compact-and-reversible-JSON-interface.md/#Implementation-options) +- [F.A.Q.](./0012-compact-and-reversible-JSON-interface.md/#F.A.Q.) +- [Synthesis](./0012-compact-and-reversible-JSON-interface.md/Synthesis) +- [Core team decision](./0012-compact-and-reversible-JSON-interface.md/#Core-team-decision) +- [Timeline](./0012-compact-and-reversible-JSON-interface.md/#Timeline) +- [PDEP history](./0012-compact-and-reversible-JSON-interface.md/#PDEP-history) ------------------------- ## Abstract ### Problem description The `dtype` and "Python type" are not explicitly taken into account in the current JSON interface. -So, the current JSON interface is not allways reversible and has inconsistencies related to the consideration of the `dtype`. 
+So, the JSON interface is not allways reversible and has inconsistencies related to the consideration of the `dtype`. -Another consequence is the partial application of the TableSchema specification in the `orient="table"` option (6 data types are taken into account out of the 24 defined). +Another consequence is the partial application of the Table Schema specification in the `orient="table"` option (6 Table Schema data types are taken into account out of the 24 defined). Some JSON-interface problems are detailed in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#1---Current-Json-interface) ### Feature Description -To have a simple, compact and reversible solution, I propose to use the [JSON-NTV format (Named and Typed Value)](https://github.com/loco-philippe/NTV#readme) - which integrates the notion of type - and its JSON-TAB variation for tabular data. +To have a simple, compact and reversible solution, I propose to use the [JSON-NTV format (Named and Typed Value)](https://github.com/loco-philippe/NTV#readme) - which integrates the notion of type - and its JSON-TAB variation for tabular data (the JSON-NTV format is defined in an [IETF Internet-Draft](https://datatracker.ietf.org/doc/draft-thomy-json-ntv/) (not yet an RFC !!) ). This solution allows to include a large number of types (not necessarily pandas `dtype`) which allows to have: -- a JSON `orient="table"` interface which respects the Table Schema specification (going from 5 types to 20 types), -- a JSON interface for all pandas data formats. +- a Table Schema JSON interface (`orient="table"`) which respects the Table Schema specification (going from 6 types to 20 types), +- a global JSON interface for all pandas data formats. +#### Global JSON interface example In the example below, a DataFrame with several data types is converted to JSON. The DataFrame resulting from this JSON is identical to the initial DataFrame (reversibility). 
With the existing JSON interface, this conversion is not possible. @@ -100,7 +105,6 @@ Out[5]: 'value32::int32': [12, 12, 22, 22, 32, 32]}} ``` - *Reversibility* ```python @@ -111,6 +115,44 @@ Out[5]: df created from JSON is equal to initial df ? True ``` Several other examples are provided in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#2---Series) +#### Table Schema JSON interface example +In the example below, a DataFrame with several Table Schema data types is converted to JSON. +The DataFrame resulting from this JSON is identical to the initial DataFrame (reversibility). +With the existing Table Schema JSON interface, this conversion is not possible. + +```python +In [1]: from shapely.geometry import Point + from datetime import date + +In [2]: df = pd.DataFrame({ + 'end february::date': ['date(2023,2,28)', 'date(2024,2,29)', 'date(2025,2,28)'], + 'coordinates::point': ['Point([2.3, 48.9])', 'Point([5.4, 43.3])', 'Point([4.9, 45.8])'], + 'contact::email': ['john.doe@table.com', 'lisa.minelli@schema.com', 'walter.white@breaking.com'] + }) + +In [3]: df +Out[3]: + end february::date coordinates::point contact::email + 0 2023-02-28 POINT (2.3 48.9) john.doe@table.com + 1 2024-02-29 POINT (5.4 43.3) lisa.minelli@schema.com + 2 2025-02-28 POINT (4.9 45.8) walter.white@breaking.com +``` + +*JSON representation* + +```python +In [4]: pprint(df.to_json(orient='table'), compact=True, width=140, sort_dicts=False) +Out[4]: + {'schema': {'fields': [{'name': 'index', 'type': 'integer'}, + {'name': 'end february', 'type': 'date'}, + {'name': 'coordinates', 'type': 'geopoint', 'format': 'array'}, + {'name': 'contact', 'type': 'string', 'format': 'email'}], + 'primaryKey': ['index'], + 'pandas_version': '1.4.0'}, + 'data': [{'index': 0, 'end february': '2023-02-28', 'coordinates': [2.3, 48.9], 'contact': 'john.doe@table.com'}, + {'index': 1, 'end february': '2024-02-29', 'coordinates': [5.4, 43.3], 'contact': 
'lisa.minelli@schema.com'}, + {'index': 2, 'end february': '2025-02-28', 'coordinates': [4.9, 45.8], 'contact': 'walter.white@breaking.com'}]} +``` ## Scope The objective is to make available the proposed JSON interface for any type of data and for `orient="table"` option. @@ -129,6 +171,7 @@ The proposed interface is compatible with existing data. ### Is it relevant to take an extended type into account ? - it avoids the addition of an additional data schema - it increases the semantic scope of the data processed by pandas +- it is an answer to several issues (e.g. #12997, #14358, #16492, #35420, #35464, #36211, #39537, #49585, #50782, #51375, #52595, #53252) - the use of a complementary type avoids having to modify the pandas data model ### Is this only useful for pandas ? @@ -182,7 +225,7 @@ Note: The other NTV types are associated with `object` `dtype`. -### correspondence between TableSchema and pandas +### Correspondence between TableSchema and pandas The TableSchema typing is carried by two attributes `format` and `type`. The table below shows the correspondence between TableSchema format / type and pandas NTVtype / dtype: @@ -208,9 +251,12 @@ The table below shows the correspondence between TableSchema format / type and p Note: - other TableSchema format are defined and are to be studied (uuid, binary, topojson, specific format for geopoint and datation) +- the first six lines correspond to the existing ### JSON format -The JSON format is defined in [JSON-TAB](https://github.com/loco-philippe/NTV/blob/main/documentation/JSON-TAB-standard.pdf) specification. +The JSON format for the TableSchema interface is the existing. + +The JSON format for the Global interface is defined in [JSON-TAB](https://github.com/loco-philippe/NTV/blob/main/documentation/JSON-TAB-standard.pdf) specification. It includes the naming rules originally defined in the [JSON-ND project](https://github.com/glenkleidon/JSON-ND) and support for categorical data. 
The specification have to be updated to include sparse data. @@ -221,11 +267,11 @@ Otherwise, NTV conversion is used. #### pandas -> JSON - `NTV type` is not defined : use `to_json()` - `NTV type` is defined and `dtype` is not `object` : use `to_json()` -- `NTV type` is defined and `dtype` is `object` : use NTV conversion +- `NTV type` is defined and `dtype` is `object` : use NTV conversion (if pandas conversion does not exist) #### JSON -> pandas - `NTV type` is compatible with a `dtype` : use `read_json()` -- `NTV type` is not compatible with a `dtype` : use NTV conversion +- `NTV type` is not compatible with a `dtype` : use NTV conversion (if pandas conversion does not exist) ## Usage and Impact @@ -237,6 +283,8 @@ It seems to me that this proposal responds to important issues: - taking into account "semantic" data in pandas objects +- having a complete Table Schema interface + ### Compatibility Interface can be used without NTV type (compatibility with existing data - [see examples](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#4---Appendix-:-Series-tests)) @@ -305,7 +353,7 @@ Several pandas implementations are possible: But this is very limited (see examples added in the [Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb)) : - **Types and Json interface** - - the only way to keep the types in the json interface is to use the orient='table' option + - the only way to keep the types in the json interface is to use the `orient='table'` option - few dtypes are not allowed in json-table interface : period, timedelta64, interval - allowed types are not always kept in json-table interface - data with 'object' dtype is kept only id data is string @@ -324,7 +372,7 @@ The current interface is not compatible with the data structure defined by table **Q: In general, we should only have 1 `"table"` format for pandas in read_json/to_json. 
There is also the issue of backwards compatibility if we do change the format. The fact that the table interface is buggy is not a reason to add a new interface (I'd rather fix those bugs). Can the existing format be adapted in a way that fixes the type issues/issues with roundtripping?** **A**: I will add two additional remarks: -- the types defined in Tableschema are partially (only 5 out of 20) taken into account (examples of types not taken into account in the interface: string-uri, array, date, time, year, geopoint): +- the types defined in Tableschema are partially taken into account (examples of types not taken into account in the interface: string-uri, array, date, time, year, geopoint, string-email): - the `read_json()` interface works too with the following data: `{'simple': [1,2,3] }` (contrary to what is indicated in the documentation) but it is impossible with `to_json()` to recreate this simple json. I think that the problem cannot be limited to bug fixes and that a clear strategy must be defined for the Json interface in particular with the gradual abandonment in open-data solutions of the obsolete CSV format in favor of a Json format. @@ -340,12 +388,12 @@ It is nevertheless possible to merge the proposed format and the `orient='table' The proposal made answers this problem ([the example at the beginning of Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#0---Simple-example) simply and clearly illustrates the interest of the proposal). Regarding the underlying JSON-NTV format, its impact is quite low for tabular data (it is limited to adding the type in the field name). -Nevertheless, the question is relevant: The JSON-NTV format is indeed a shared, documented, supported and implemented format, but indeed the community support is for the moment reduced but it only asks to expand!! 
+Nevertheless, the question is relevant: The JSON-NTV format ([IETF Internet-Draft](https://datatracker.ietf.org/doc/draft-thomy-json-ntv/)) is a shared, documented, supported and implemented format, but indeed the community support is for the moment reduced but it only asks to expand !! +## Synthesis To conclude, - if it is important (or strategic) to have a reversible JSON interface for any type of data, the proposal can be allowed, -- if not, a third-party package that reads/writes this format to/from pandas DataFrames listed in the [ecosystem](https://pandas.pydata.org/community/ecosystem.html) should be considered - +- if not, a third-party package listed in the [ecosystem](https://pandas.pydata.org/community/ecosystem.html) that reads/writes this format to/from pandas DataFrames should be considered ## Core team decision Implementation option : xxxx @@ -357,3 +405,4 @@ Tbd - 16 June 2023: Initial draft - 22 July 2023: Add F.A.Q. +- 06 September 2023: Add Table Schema extension From 1e3f793f3cfc9fd00193d05c0cfaad4599624bf7 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Wed, 6 Sep 2023 23:33:20 +0200 Subject: [PATCH 17/22] Update 0012-compact-and-reversible-JSON-interface.md --- web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index a711f25c3e7a8..ccae1e9ae8a15 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -42,7 +42,7 @@ ### Problem description The `dtype` and "Python type" are not explicitly taken into account in the current JSON interface. -So, the JSON interface is not allways reversible and has inconsistencies related to the consideration of the `dtype`. 
+So, the JSON interface is not always reversible and has inconsistencies related to the consideration of the `dtype`. Another consequence is the partial application of the Table Schema specification in the `orient="table"` option (6 Table Schema data types are taken into account out of the 24 defined). From 8dad55596af735081f210a45de15dc51c55c959a Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Wed, 6 Sep 2023 23:57:38 +0200 Subject: [PATCH 18/22] Update 0012-compact-and-reversible-JSON-interface.md --- .../0012-compact-and-reversible-JSON-interface.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index ccae1e9ae8a15..66036bd992e85 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -2,7 +2,7 @@ - Created: 16 June 2023 - Status: Under discussion -- Discussion: +- Discussion: [#53252](https://github.com/pandas-dev/pandas/issues/53252) [#55038](https://github.com/pandas-dev/pandas/issues/55038) - Author: [Philippe THOMY](https://github.com/loco-philippe) @@ -41,11 +41,11 @@ ### Problem description The `dtype` and "Python type" are not explicitly taken into account in the current JSON interface. - + So, the JSON interface is not always reversible and has inconsistencies related to the consideration of the `dtype`. - + Another consequence is the partial application of the Table Schema specification in the `orient="table"` option (6 Table Schema data types are taken into account out of the 24 defined). 
- + Some JSON-interface problems are detailed in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#1---Current-Json-interface) @@ -58,7 +58,9 @@ This solution allows to include a large number of types (not necessarily pandas #### Global JSON interface example In the example below, a DataFrame with several data types is converted to JSON. + The DataFrame resulting from this JSON is identical to the initial DataFrame (reversibility). + With the existing JSON interface, this conversion is not possible. *data example* @@ -117,7 +119,9 @@ Several other examples are provided in the [linked NoteBook](https://nbviewer.or #### Table Schema JSON interface example In the example below, a DataFrame with several Table Schema data types is converted to JSON. + The DataFrame resulting from this JSON is identical to the initial DataFrame (reversibility). + With the existing Table Schema JSON interface, this conversion is not possible. ```python @@ -254,7 +258,7 @@ Note: - the first six lines correspond to the existing ### JSON format -The JSON format for the TableSchema interface is the existing. +The JSON format for the TableSchema interface is the existing. The JSON format for the Global interface is defined in [JSON-TAB](https://github.com/loco-philippe/NTV/blob/main/documentation/JSON-TAB-standard.pdf) specification. It includes the naming rules originally defined in the [JSON-ND project](https://github.com/glenkleidon/JSON-ND) and support for categorical data. 
From 7e7d878dff7b43fc0b09c6d715c782744f1bdc6f Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Thu, 7 Sep 2023 10:22:01 +0200 Subject: [PATCH 19/22] Update 0012-compact-and-reversible-JSON-interface.md --- .../0012-compact-and-reversible-JSON-interface.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index 66036bd992e85..b20d5e04e42ac 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -67,6 +67,8 @@ With the existing JSON interface, this conversion is not possible. ```python In [1]: from shapely.geometry import Point from datetime import date + from json_ntv import read_json as read_json + from json_ntv import to_json as to_json In [2]: data = {'index': [100, 200, 300, 400, 500, 600], 'dates::date': pd.Series([date(1964,1,1), date(1985,2,5), date(2022,1,21), date(1964,1,1), date(1985,2,5), date(2022,1,21)]), @@ -94,8 +96,8 @@ Out[4]: *JSON representation* ```python -In [5]: df_json = Ntv.obj(df) - pprint(df_json.to_obj(), compact=True, width=120) +In [5]: df_to_json = to_json(df) + pprint(df_to_json, compact=True, width=120) Out[5]: {':tab': {'coord::point': [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [3.0, 4.0], [5.0, 6.0]], 'dates::date': ['1964-01-01', '1985-02-05', '2022-01-21', '1964-01-01', '1985-02-05', '2022-01-21'], @@ -110,7 +112,7 @@ Out[5]: *Reversibility* ```python -In [5]: df_from_json = df_json.to_obj(format='obj') +In [5]: df_from_json = read_json(df_to_json) print('df created from JSON is equal to initial df ? ', df_from_json.equals(df)) Out[5]: df created from JSON is equal to initial df ? True @@ -118,7 +120,7 @@ Out[5]: df created from JSON is equal to initial df ? 
True Several other examples are provided in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#2---Series) #### Table Schema JSON interface example -In the example below, a DataFrame with several Table Schema data types is converted to JSON. +In the example below (not yet implemented), a DataFrame with several Table Schema data types is converted to JSON. The DataFrame resulting from this JSON is identical to the initial DataFrame (reversibility). From fbc5fe5d7ae6aa64406756d07897963aa5f96294 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Thu, 7 Sep 2023 10:28:36 +0200 Subject: [PATCH 20/22] Update 0012-compact-and-reversible-JSON-interface.md --- web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index b20d5e04e42ac..fdd8231170b6b 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -67,7 +67,7 @@ With the existing JSON interface, this conversion is not possible. 
```python In [1]: from shapely.geometry import Point from datetime import date - from json_ntv import read_json as read_json + from json_ntv import read_json as read_json from json_ntv import to_json as to_json In [2]: data = {'index': [100, 200, 300, 400, 500, 600], From 65bee1df28306fc2a300dff710044eec4a762c18 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Sun, 1 Oct 2023 16:45:17 +0200 Subject: [PATCH 21/22] Update 0012-compact-and-reversible-JSON-interface.md --- ...2-compact-and-reversible-JSON-interface.md | 77 ++++++++++++------- 1 file changed, 50 insertions(+), 27 deletions(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index fdd8231170b6b..7e24c8ec3dafc 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -1,12 +1,12 @@ # PDEP-12: Compact and reversible JSON interface - Created: 16 June 2023 -- Status: Under discussion +- Status: Rejected - Discussion: [#53252](https://github.com/pandas-dev/pandas/issues/53252) [#55038](https://github.com/pandas-dev/pandas/issues/55038) - Author: [Philippe THOMY](https://github.com/loco-philippe) -- Revision: 2 +- Revision: 3 #### Summary @@ -46,7 +46,7 @@ So, the JSON interface is not always reversible and has inconsistencies related Another consequence is the partial application of the Table Schema specification in the `orient="table"` option (6 Table Schema data types are taken into account out of the 24 defined). 
-Some JSON-interface problems are detailed in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#1---Current-Json-interface) +Some JSON-interface problems are detailed in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_json_pandas.ipynb#Current-Json-interface) ### Feature Description @@ -63,27 +63,28 @@ The DataFrame resulting from this JSON is identical to the initial DataFrame (re With the existing JSON interface, this conversion is not possible. +This example uses `ntv_pandas` module defined in the [ntv-pandas repository](https://github.com/loco-philippe/ntv-pandas#readme). + *data example* ```python In [1]: from shapely.geometry import Point from datetime import date - from json_ntv import read_json as read_json - from json_ntv import to_json as to_json - + import pandas as pd + import ntv_pandas as npd + In [2]: data = {'index': [100, 200, 300, 400, 500, 600], - 'dates::date': pd.Series([date(1964,1,1), date(1985,2,5), date(2022,1,21), date(1964,1,1), date(1985,2,5), date(2022,1,21)]), + 'dates::date': [date(1964,1,1), date(1985,2,5), date(2022,1,21), date(1964,1,1), date(1985,2,5), date(2022,1,21)], 'value': [10, 10, 20, 20, 30, 30], 'value32': pd.Series([12, 12, 22, 22, 32, 32], dtype='int32'), 'res': [10, 20, 30, 10, 20, 30], - 'coord::point': pd.Series([Point(1,2), Point(3,4), Point(5,6), Point(7,8), Point(3,4), Point(5,6)]), + 'coord::point': [Point(1,2), Point(3,4), Point(5,6), Point(7,8), Point(3,4), Point(5,6)], 'names': pd.Series(['john', 'eric', 'judith', 'mila', 'hector', 'maria'], dtype='string'), 'unique': True } In [3]: df = pd.DataFrame(data).set_index('index') In [4]: df -Out[4]: - dates::date value value32 res coord::point names unique +Out[4]: dates::date value value32 res coord::point names unique index 100 1964-01-01 10 12 10 POINT (1 2) john True 200 1985-02-05 10 12 20 POINT (3 4) eric True @@ -96,10 +97,9 @@ Out[4]: *JSON 
representation* ```python -In [5]: df_to_json = to_json(df) - pprint(df_to_json, compact=True, width=120) -Out[5]: - {':tab': {'coord::point': [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [3.0, 4.0], [5.0, 6.0]], +In [5]: df_to_json = npd.to_json(df) + pprint(df_to_json, width=120) +Out[5]: {':tab': {'coord::point': [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [3.0, 4.0], [5.0, 6.0]], 'dates::date': ['1964-01-01', '1985-02-05', '2022-01-21', '1964-01-01', '1985-02-05', '2022-01-21'], 'index': [100, 200, 300, 400, 500, 600], 'names::string': ['john', 'eric', 'judith', 'mila', 'hector', 'maria'], @@ -112,15 +112,14 @@ Out[5]: *Reversibility* ```python -In [5]: df_from_json = read_json(df_to_json) +In [5]: df_from_json = npd.read_json(df_to_json) print('df created from JSON is equal to initial df ? ', df_from_json.equals(df)) - Out[5]: df created from JSON is equal to initial df ? True ``` -Several other examples are provided in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#2---Series) +Several other examples are provided in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_ntv_pandas.ipynb) #### Table Schema JSON interface example -In the example below (not yet implemented), a DataFrame with several Table Schema data types is converted to JSON. +In the example below, a DataFrame with several Table Schema data types is converted to JSON. The DataFrame resulting from this JSON is identical to the initial DataFrame (reversibility). 
@@ -137,8 +136,7 @@ In [2]: df = pd.DataFrame({ }) In [3]: df -Out[3]: - end february::date coordinates::point contact::email +Out[3]: end february::date coordinates::point contact::email 0 2023-02-28 POINT (2.3 48.9) john.doe@table.com 1 2024-02-29 POINT (5.4 43.3) lisa.minelli@schema.com 2 2025-02-28 POINT (4.9 45.8) walter.white@breaking.com @@ -147,9 +145,9 @@ Out[3]: *JSON representation* ```python -In [4]: pprint(df.to_json(orient='table'), compact=True, width=140, sort_dicts=False) -Out[4]: - {'schema': {'fields': [{'name': 'index', 'type': 'integer'}, +In [4]: df_to_table = npd.to_json(df, table=True) + pprint(df_to_table, width=140, sort_dicts=False) +Out[4]: {'schema': {'fields': [{'name': 'index', 'type': 'integer'}, {'name': 'end february', 'type': 'date'}, {'name': 'coordinates', 'type': 'geopoint', 'format': 'array'}, {'name': 'contact', 'type': 'string', 'format': 'email'}], @@ -159,8 +157,18 @@ Out[4]: {'index': 1, 'end february': '2024-02-29', 'coordinates': [5.4, 43.3], 'contact': 'lisa.minelli@schema.com'}, {'index': 2, 'end february': '2025-02-28', 'coordinates': [4.9, 45.8], 'contact': 'walter.white@breaking.com'}]} ``` + +*Reversibility* + +```python +In [5]: df_from_table = npd.read_json(df_to_table) + print('df created from JSON is equal to initial df ? ', df_from_table.equals(df)) +Out[5]: df created from JSON is equal to initial df ? True +``` +Several other examples are provided in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_table_pandas.ipynb) + ## Scope -The objective is to make available the proposed JSON interface for any type of data and for `orient="table"` option. +The objective is to make available the proposed JSON interface for any type of data and for `orient="table"` option or a new option `orient="ntv"`. The proposed interface is compatible with existing data. 
@@ -292,7 +300,7 @@ It seems to me that this proposal responds to important issues: - having a complete Table Schema interface ### Compatibility -Interface can be used without NTV type (compatibility with existing data - [see examples](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#4---Appendix-:-Series-tests)) +Interface can be used without NTV type (compatibility with existing data - [see examples](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_ntv_pandas.ipynb#Appendix-:-Series-tests)) If the interface is available, throw a new `orient` option in the JSON interface, the use of the feature is decoupled from the other features. @@ -402,13 +410,28 @@ To conclude, - if not, a third-party package listed in the [ecosystem](https://pandas.pydata.org/community/ecosystem.html) that reads/writes this format to/from pandas DataFrames should be considered ## Core team decision -Implementation option : xxxx +Vote was open from 11 September to 26 September 2023: +- Final tally is 0 approvals, 5 abstentions, 7 disapprove. The quorum has been met. The PDEP fails. + +**Disapprove comments** : +- 1 Given the newness of the proposed JSON NTV format, I would support (as described in the PDEP): "if not, a third-party package listed in the ecosystem that reads/writes this format to/from pandas DataFrames should be considered" +- 2 Same reason as -1-, this should be a third party package for now +- 3 Not mature enough, and not clear what the market size would be. +- 4 for the same reason I left in the PDEP: "I think this (JSON-NTV format) does not meet the bar of being a commonly used format for implementation within pandas" +- 5 agree with -4- +- 6 agree with the other core-dev responders. I think work in the existing json interface is extremely valuable. A number of the original issues raised are just bug fixes / extensions of already existing functionality. 
Trying to start anew is likely not worth the migration effort. That said if a format is well supported in the community we can reconsider in the future (obviously json is well supported but the actual specification detailed here is too new / not accepted as a standard) +- 7 while I do think having a more comprehensive JSON format would be worthwhile, making a new format part of pandas means an implicit endorsement of a standard that is still being reviewed by the broader community. + +**Decision**: +- add the `ntv-pandas` package to the [ecosystem](https://pandas.pydata.org/community/ecosystem.html) +- revisit this PDEP at a later stage, for example in six months to one year (based on the evolution of the Internet draft [JSON semantic format (JSON-NTV)](https://www.ietf.org/archive/id/draft-thomy-json-ntv-01.html) and the usage of the [ntv-pandas](https://github.com/loco-philippe/ntv-pandas#readme)) ## Timeline -Tbd +Not applicable ## PDEP History - 16 June 2023: Initial draft - 22 July 2023: Add F.A.Q. 
- 06 September 2023: Add Table Schema extension +- 01 Octobre: Add Core team decision \ No newline at end of file From c56727764119cffa313311dfa408e6de820a8728 Mon Sep 17 00:00:00 2001 From: Philippe THOMY Date: Sun, 1 Oct 2023 17:01:08 +0200 Subject: [PATCH 22/22] Update 0012-compact-and-reversible-JSON-interface.md --- .../pdeps/0012-compact-and-reversible-JSON-interface.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md index 7e24c8ec3dafc..4fe4b935f144b 100644 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md @@ -71,7 +71,7 @@ In [1]: from shapely.geometry import Point from datetime import date import pandas as pd import ntv_pandas as npd - + In [2]: data = {'index': [100, 200, 300, 400, 500, 600], 'dates::date': [date(1964,1,1), date(1985,2,5), date(2022,1,21), date(1964,1,1), date(1985,2,5), date(2022,1,21)], 'value': [10, 10, 20, 20, 30, 30], @@ -434,4 +434,4 @@ Not applicable - 16 June 2023: Initial draft - 22 July 2023: Add F.A.Q. - 06 September 2023: Add Table Schema extension -- 01 Octobre: Add Core team decision \ No newline at end of file +- 01 October 2023: Add Core team decision