Please consider switching to the standard json module #24711

Closed
zhihaoy opened this issue Jan 10, 2019 · 10 comments
Labels
IO JSON read_json, to_json, json_normalize

Comments

@zhihaoy

zhihaoy commented Jan 10, 2019

Problem description

The ujson we are currently using is not well maintained; there has been no activity and no responses for the last two years:
ultrajson/ultrajson#291
and other projects are switching away from it:
fastavro/fastavro#150
This makes it hard for us to handle new feature requests:
#12213
and it has bugs that prevent us from consuming standard JSON files:
ultrajson/ultrajson#252

@jreback
Contributor

jreback commented Jan 10, 2019

We use a vendored copy, so if you have a patch it could be taken directly into pandas. ujson is much more performant than the standard library, which is why it is used.

@jreback jreback added the IO JSON read_json, to_json, json_normalize label Jan 10, 2019
@zhihaoy
Author

zhihaoy commented Jan 10, 2019

There are benchmarks showing that on large datasets ujson doesn't have an advantage over the standard library's json: https://www.reddit.com/r/Python/comments/3mtswx/benchmark_of_pythons_alternative_json_libraries/ Moreover, we know for a fact that the json library in PyPy is the fastest (because it's optimized for the JIT). That being said, if someone really has concerns about it, we can add a parameter to the read_json family to let the user choose which json-like library they want to use.

@TomAugspurger
Contributor

It's not just performance. The stdlib JSON module doesn't serialize things like NumPy arrays or scalars.
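For illustration (not part of the original thread), the stdlib limitation can be worked around with `json.dumps`'s `default` hook; the helper name `np_default` here is hypothetical, a minimal sketch rather than what pandas or ujson actually does:

```python
import json
import numpy as np

def np_default(obj):
    """Fallback serializer for NumPy types that stdlib json rejects.

    Plain json.dumps(np.int64(1)) raises TypeError because the stdlib
    encoder only knows built-in Python types.
    """
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

payload = {"count": np.int64(3), "values": np.array([1.5, 2.5])}
encoded = json.dumps(payload, default=np_default)
```

Round-tripping `encoded` through `json.loads` yields plain Python ints, floats, and lists, which is the conversion ujson's NumPy support was doing implicitly.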

Does anyone know the status of Arrows's support for JSON (de)/serialization?

@jreback
Contributor

jreback commented Jan 10, 2019

Most folks do not use PyPy, so I'm not sure that actually matters.

If you would like to add an engine kwarg to read_json, that would be ok.
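A minimal sketch of what such an engine kwarg could look like (the dispatcher name and its signature are hypothetical; this is not pandas' actual API, which had no such parameter at the time):

```python
import json

def loads(text, engine="ujson"):
    """Hypothetical dispatcher: parse JSON text with the requested engine.

    Falls back to the stdlib json module when ujson is not installed,
    so callers get the same parsed result either way.
    """
    if engine == "json":
        return json.loads(text)
    if engine == "ujson":
        try:
            import ujson  # third-party; may not be available
        except ImportError:
            return json.loads(text)
        return ujson.loads(text)
    raise ValueError(f"unknown engine: {engine!r}")
```

Defaulting to the faster engine while falling back silently keeps existing callers working, at the cost of results depending on what is installed; raising on an unknown engine keeps typos from being misread as a fallback request.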

@zhihaoy
Author

zhihaoy commented Jan 10, 2019

It's not just performance. The stdlib JSON module doesn't serialize things like NumPy arrays or scalars.

The unreleased ujson upstream dropped this support in May 2016 (ultrajson/ultrajson@53f85b1); we still support it (buggily: ultrajson/ultrajson#221) by accident. If we want to keep this feature, we should do it properly by specializing on NumPy.

@jreback
Contributor

jreback commented Jan 10, 2019

I don't think arrow yet has support for JSON, cc @wesm @pitrou

@pitrou
Contributor

pitrou commented Jan 10, 2019

I'm not sure what you mean by support? Arrow C++ has some understanding of JSON, but it doesn't interoperate with Numpy or Pandas arrays.

@wesm
Member

wesm commented Jan 10, 2019

@pitrou I think Jeff means support for JSON sufficient to power pandas.read_json. @bkietz is working on this in https://issues.apache.org/jira/browse/ARROW-694

I expect that we'll have support for reading and writing JSON sufficient to appease most pandas users on the timeline of Arrow 0.13 or 0.14, so most likely either by end of March or end of May

@jreback
Contributor

jreback commented Jan 10, 2019

Thanks @wesm, yep that's what I mean. We have legacy C code in read_json and it would be nice to remove it without sacrificing performance.

@jreback jreback added this to the No action milestone Jan 1, 2020
@jreback
Contributor

jreback commented Jan 1, 2020

The C implementation has recently been heavily refactored; closing this as no-action.

@jreback jreback closed this as completed Jan 1, 2020