JSON table orient not roundtripping extension types #32037
Comments
Thanks @navado I don't believe
|
Have you tried orient="table"? That does store metadata about the dtypes, so it is the best bet for round-tripping. That said, better inference is possible generally, so I think we would accept a PR to do this.
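To make the point about dtype metadata concrete, here is a minimal sketch that inspects the schema written by orient="table". The column names `n` and `s` are made up for illustration, and the string column is deliberately kept as object dtype so the result does not depend on how a given pandas release handles extension dtypes.

```python
import json

import pandas as pd

# Illustrative frame: an integer column and an object-dtype column of strings.
df = pd.DataFrame({"n": [1, 2], "s": pd.Series(["a", "b"], dtype=object)})

# orient="table" writes a "schema" entry next to the "data" entry.
payload = df.to_json(orient="table")
schema = json.loads(payload)["schema"]

# Map each field name to the type recorded for it in the schema.
field_types = {field["name"]: field["type"] for field in schema["fields"]}
print(field_types)
```

The other orients (split, records, values) carry no comparable schema entry, which is why they cannot recover dtypes without inference.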
@WillAyd - yes, it's the first print in the issue. @MarcoGorelli - to_parquet (and most of the other to_* functions) require dealing with storage, whereas to_json is easily applied to prepare a DataFrame for sending over the network. In general it's kind of unexpected that metadata is not stored in such a flexible format as JSON. It's not a compatibility issue (or I'm not aware of one). Additionally, there are several more issues with round-trip serialization of JSON that I realized only now:
|
Ah sorry, misread that. At the very least orient="table" should be able to preserve this, so we would certainly take a PR to fix that. Not sure on the rest just yet: for most IO right now we don't infer extension types like "string" automatically, but it is something we have talked about as a team.
As long as the string dtype is not the default, we can't infer it from a file format that has no type information (like json). There has been discussion about adding a keyword to specify you want the new dtypes, see #29752. Specifically for the JSON table format: that format indeed includes some type information. However, pandas itself doesn't specify the spec, I think, but we follow https://frictionlessdata.io/specs/table-schema/ ? (no specialist here) |
Table should work: we actually write that it is a string dtype, so something is just off with the reader.
So nullable boolean and integer might work, indeed (but only if nulls are present, otherwise we still can't know which type to use). But as explained above, I don't see how we can differentiate string dtype vs object dtype with strings. Both end up in the json schema as type "string". The types are not pandas specific, but following the table-schema specification. |
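The ambiguity described here can be checked directly by comparing the field descriptors written for an object-dtype column and a string-dtype column. This is only a sketch with made-up column names; the descriptors are printed rather than assumed, because what pandas writes for the string-dtype column may differ between releases.

```python
import json

import pandas as pd

# Same values stored two ways: plain object dtype and the "string" extension dtype.
df = pd.DataFrame(
    {
        "as_object": pd.Series(["a", "b"], dtype=object),
        "as_string": pd.Series(["a", "b"], dtype="string"),
    }
)

schema = json.loads(df.to_json(orient="table"))["schema"]
fields = {field["name"]: field for field in schema["fields"]}

# Print both descriptors side by side to see whether they can be told apart.
for name in ("as_object", "as_string"):
    print(name, fields[name])
```

If the two descriptors come out identical, a reader has nothing to distinguish the original dtypes by, which is the limitation raised in this comment.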
Probably JSON serialization should be Avro-based or resemble Avro? i.e. solve the issue by writing schema information to the file with the data?
An AVRO interface for pandas might be interesting, but I think that is a different discussion? (or would be a new format) The current json output formats do not include schema information, except for the table schema. And for that, there is a specification to follow. |
@jorisvandenbossche Avro is definitely a different topic, but saving the schema in serialized JSON should not be a big problem.
Saving schema information in the JSON outputs is something that could certainly be discussed, but I just wanted to point out that this is still something else / bigger than how the current issue is framed here as "table orient not correctly roundtripping extension types". So if we want to discuss this further, that warrants a separate issue, I think.
Or just revert renaming the issue :) |
Well, back to the problem.
|
What do you mean with "being ignored inside the DataFrame constructor"? |
No, sorry, I just looked in the wrong place. It's ignored in the case of numpy-based parsing, which as I understand it is deprecated.
|
Yes, because that is the default dtype for strings. As mentioned above, there has been discussion to add a keyword to opt in (#29752). But currently the schema information that is saved in json table schema is not detailed enough to know the original pandas dtype to do a faithful roundtrip. |
Why not something like
|
That might be a backwards-incompatible change for people using strings in object dtypes (now, I almost never use this myself, so I can't really assess the possible impact). But what might be a better way forward is to add an additional entry to the field description. The spec allows adding more entries (https://frictionlessdata.io/specs/table-schema/#field-descriptors). And we already do this for e.g. categorical (ordered keyword) and datetimes (tz keyword).
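The idea of an additional field-descriptor entry can be sketched outside pandas with a pair of wrapper functions. Everything named here is an assumption made for illustration: the key `pandas_dtype` and the helpers `to_table_json` and `read_table_json` are hypothetical and are not part of pandas or of the Table Schema spec. The sketch relies on the reader ignoring descriptor entries it does not know about.

```python
import io
import json

import pandas as pd

# Hypothetical descriptor key; not a name defined by pandas or the spec.
DTYPE_KEY = "pandas_dtype"


def to_table_json(df: pd.DataFrame) -> str:
    """Write orient="table" JSON and record each column's dtype in its field."""
    payload = json.loads(df.to_json(orient="table"))
    dtypes = {str(name): str(dtype) for name, dtype in df.dtypes.items()}
    for field in payload["schema"]["fields"]:
        if field["name"] in dtypes:
            field[DTYPE_KEY] = dtypes[field["name"]]
    return json.dumps(payload)


def read_table_json(text: str) -> pd.DataFrame:
    """Read orient="table" JSON, then cast columns using the extra entry."""
    fields = json.loads(text)["schema"]["fields"]
    wanted = {f["name"]: f[DTYPE_KEY] for f in fields if DTYPE_KEY in f}
    df = pd.read_json(io.StringIO(text), orient="table")
    return df.astype(wanted)


original = pd.DataFrame({"s": pd.array(["v1", "v2"], dtype="string")})
restored = read_table_json(to_table_json(original))
print(restored.dtypes)
```

Because the extra entry is opt-in on the reading side, files written this way stay readable by a plain read_json call, which simply returns whatever dtype it would have inferred anyway.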
I think this has been a pretty enlightening discussion. I for one didn't realize that we just wrote "string" in to_json for object dtypes. Makes sense from a historical perspective, but actually there are a lot of gaps in that design, like this:
# Writes string for the dtype, but doesn't write strings in JSON
>>> pd.Series([1], dtype=object).to_json(orient="table")
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"values","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"values":1}]}'
And this:
# Same as above, but doesn't preserve key anyway
>>> pd.Series([{1: 2}]).to_json(orient="table")
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"values","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"values":{"1":2}}]}'
Would be tough to manage backwards compat but would support using |
At least for some of the built-in extension dtypes, you can use
import pandas
print(pandas._version.get_versions())
df = pandas.DataFrame(
    {
        1: ['v1', 'v2'],
        2: ['v3', 'v4']
    }, dtype='string'
)
print(df.info())
for orient in ['table', 'split', 'records', 'values']:
    rdf = pandas.read_json(df.to_json(orient=orient), orient=orient).convert_dtypes()
    print(f'======{orient}======')
    print(rdf.info())
produces
|
@jorisvandenbossche moving part of the discussion here and addressing your last point in #20612 (comment). Not only am I interested in making this work for dates, I also need it to.
Maybe you can create a new issue specifically about supporting "date" for JSON table schema? (because this issue is still a bit more general about other extension types as well. Unless you would like to tackle the general issue for extension types, instead of first focusing on datetime.date?) |
I agree with this. I have some comments/questions in response to #20612 (comment) that would be better handled in a separate issue about dates. |
Code Sample, a copy-pastable example if possible
Problem description
The string dtype is not preserved by round-trip serialization to JSON, so DataFrames containing strings cannot be reused transparently.
Expected Output
Actual output
Output of
pd.show_versions()