BUG: Python's date and time objects do not roundtrip with JSON orient='table' #44705
Since the Table Schema specification supports date and time, it seems we can support this as well (though our date and time support is quite limited, of course, given you can only use it with the generic object dtype).

The other question is the exact format we use for the values in the JSON. The specification's default format for "date" is YYYY-MM-DD (while currently the low-level C code will convert dates to timestamps and thus end up emitting something like "2021-12-01T00:00:00.000Z"). Now, the specification seems to allow specifying the exact format being used, so that could be a workaround. Alternatively, we could convert the column to proper YYYY-MM-DD strings before passing it down to the low-level C code that serializes to JSON.
I am in favor of serializing the … Regardless, if I am to work with the C code, guidance as to what/where I should change would be much appreciated.
If we convert to string up front, then no C changes should be necessary. I think the main advantage of doing it in C is that you don't have the full converted column in memory at the same time (lower memory use), and it's potentially faster (though that might depend on how the formatting is done).
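The "convert to string up front" idea can be sketched as a small pre-processing pass. This is a hypothetical helper (`stringify_dates` is not a pandas function), assuming columns holding only `datetime.date`/`datetime.time` objects should become ISO-8601 strings before the JSON writer sees them:

```python
from datetime import date, time

import pandas as pd


def stringify_dates(df):
    # Hypothetical helper: copy the frame and turn columns of
    # datetime.date / datetime.time objects into ISO-8601 strings,
    # so the low-level JSON writer never sees raw Python objects.
    out = df.copy()
    for col in out.columns:
        if out[col].map(lambda v: isinstance(v, (date, time))).all():
            out[col] = out[col].map(lambda v: v.isoformat())
    return out


df = pd.DataFrame({"d": [date(2021, 12, 1)], "t": [time(12, 30)]})
converted = stringify_dates(df)
print(converted["d"].iloc[0])  # "2021-12-01"
print(converted["t"].iloc[0])  # "12:30:00"
```

The trade-off mentioned above applies: this materializes a fully converted copy of each column, whereas doing it inside the C serializer could format value by value.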
@jmg-duarte In #20612 (comment), you wrote:

> I don't understand why you don't just use the pandas datetime type to represent the dates and times.

That's not to say that we can't support …
You now see why I hoped that the …
I think the serialization and deserialization code is in Python, and you'd have to modify …

You might be able to do this for all …
Could you be more explicit about what you mean by "modifying …"?
To clarify why I suggested starting with special-casing "date" and "time" instead of general (extension) dtype support: the Table Schema explicitly supports date and time (while this is not explicitly the case for, e.g., decimal, although you can of course store that as a float number). Python's default JSON serialization might not handle date and time, but it's also not targeting a very specific subset of JSON, as is the case here (i.e. the Table Schema spec).
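Special-casing "date" and "time" is feasible because `pandas.api.types.infer_dtype` can already tell those Python types apart inside a generic object column. A quick illustration:

```python
from datetime import date, time

import pandas as pd
from pandas.api.types import infer_dtype

# infer_dtype inspects the values inside an object-dtype column and
# reports the Python type that the generic "object" dtype hides.
dates = pd.Series([date(2021, 12, 1), date(2021, 12, 2)])
times = pd.Series([time(9, 0), time(17, 30)])
mixed = pd.Series([date(2021, 12, 1), "not a date"])

print(infer_dtype(dates))  # "date"
print(infer_dtype(times))  # "time"
print(infer_dtype(mixed))  # "mixed"
```

A heterogeneous column comes back as "mixed", so a schema writer can restrict the special case to columns that are uniformly dates or times.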
(I think @Dr-Irv meant that suggestion of where to look for modifying the code if you were to do a PR contributing it upstream, not just for patching it on your side.)
👍 I was writing the same point and you beat me to it.
I'm happy to help, but I need more guidance w.r.t. the extension types …
Diving into the serialization code, the schema is easy enough to extend:

```diff
diff --git a/pandas/io/json/_table_schema.py b/pandas/io/json/_table_schema.py
index 75fd950cd6..8d16f668df 100644
--- a/pandas/io/json/_table_schema.py
+++ b/pandas/io/json/_table_schema.py
@@ -18,13 +18,17 @@ from pandas._typing import (
     JSONSerializable,
 )
 
+from pandas.api.types import infer_dtype
+
 from pandas.core.dtypes.common import (
     is_bool_dtype,
     is_categorical_dtype,
     is_datetime64_dtype,
     is_datetime64tz_dtype,
     is_integer_dtype,
     is_numeric_dtype,
+    is_object_dtype,
     is_period_dtype,
     is_string_dtype,
     is_timedelta64_dtype,
@@ -83,6 +87,8 @@ def as_json_table_type(x: DtypeObj) -> str:
         return "duration"
     elif is_categorical_dtype(x):
         return "any"
+    elif is_object_dtype(x):  # first because the string dtype covers this
+        return "object"
     elif is_string_dtype(x):
         return "string"
     else:
@@ -112,17 +118,19 @@ def set_default_names(data):
         data.index.name = data.index.name or "index"
     return data
 
-
 def convert_pandas_type_to_json_field(arr):
     dtype = arr.dtype
     if arr.name is None:
         name = "values"
     else:
         name = arr.name
+
     field: dict[str, JSONSerializable] = {
         "name": name,
         "type": as_json_table_type(dtype),
     }
+    if field["type"] == "object":
+        field["pytype"] = infer_dtype(arr)
 
     if is_categorical_dtype(dtype):
         cats = dtype.categories
@@ -179,7 +187,7 @@ def convert_json_field_to_pandas_type(field):
     'datetime64[ns, US/Central]'
     """
     typ = field["type"]
-    if typ == "string":
+    if typ in ["string", "object"]:
         return "object"
     elif typ == "integer":
         return "int64"
```

Which will be able to generate the following schema:

```json
{
  "schema": {
    "fields": [
      {
        "name": "index",
        "type": "integer"
      },
      {
        "name": "d",
        "type": "object",
        "pytype": "date"
      }
    ],
    "primaryKey": [
      "index"
    ],
    "pandas_version": "0.20.0"
  },
  "data": [
    {
      "index": 0,
      "d": "2021-12-02T00:00:00.000Z"
    }
  ]
}
```

However, when reading, given that …

```python
df = df.astype(dtypes)
```

… I could try to convert the …
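The reader-side conversion hinted at here could be sketched as follows. This is a hypothetical helper (`restore_pytypes` is not part of pandas), assuming the values were written as plain ISO strings and that the schema carries a `"pytype"` hint like the one in the schema above:

```python
from datetime import date, time

import pandas as pd


def restore_pytypes(df, fields):
    # Hypothetical reader-side counterpart: use the "pytype" hint from
    # the schema to turn ISO-8601 strings back into datetime.date /
    # datetime.time objects after the generic dtype conversion.
    converters = {"date": date.fromisoformat, "time": time.fromisoformat}
    for field in fields:
        conv = converters.get(field.get("pytype"))
        if conv is not None:
            df[field["name"]] = df[field["name"]].map(conv)
    return df


df = pd.DataFrame({"d": ["2021-12-02"], "t": ["12:30:00"]})
fields = [{"name": "d", "pytype": "date"}, {"name": "t", "pytype": "time"}]
df = restore_pytypes(df, fields)
print(type(df["d"].iloc[0]).__name__)  # "date"
```

Note this only works if the writer emits YYYY-MM-DD / HH:MM:SS strings rather than the timestamp form shown in the schema above; that is exactly the format question discussed earlier in the thread.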
Just submitted a PR (as you can see above); it ends up taking the approach suggested by @Dr-Irv (quick note: when deserializing, it calls …).
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
Issue Description
`datetime.date` and `datetime.time` objects are not serialized correctly when converted to and from JSON. The main issue seems to be the lack of type information when (de)serializing the JSON DataFrame.
Given that pandas makes use of the Table Schema from Frictionless Standards, it should support both, as made evident by: …
NOTE: This has been discussed in #20612 and #32037.
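A minimal sketch of the roundtrip in question (the original reproducer is not shown here; this assumes a default `RangeIndex` and an object-dtype date column):

```python
import io
import json
from datetime import date

import pandas as pd

# Serialize a datetime.date column with orient="table" and read it back.
# On affected versions the column does not come back as datetime.date
# objects, because the Table Schema loses the Python type information.
df = pd.DataFrame({"d": [date(2021, 12, 1), date(2021, 12, 2)]})
payload = df.to_json(orient="table")

# The Table Schema travels alongside the data.
schema = json.loads(payload)["schema"]
print([f["name"] for f in schema["fields"]])  # ['index', 'd']

roundtripped = pd.read_json(io.StringIO(payload), orient="table")
print(type(roundtripped["d"].iloc[0]))  # may not be datetime.date
```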
Expected Behavior
Both types make the roundtrip.
Installed Versions