ENH: Add engine keyword to read_json to enable reading from pyarrow #48893
Comments
+1 for engine and returning arrow extension types
Hi! I would like to work on this but this will be my first time contributing to pandas. On the other hand, I use pandas a lot and I am familiar with the Thanks!
Thanks @abkosar. Just noting that for first time contributors we recommend tackling issues labeled
Thanks for the reply @mroeschke. Yeah, I looked at those too, but since it's not my first time contributing to open source I thought I could give this one a shot. Cool, I will start looking into it.
I would prefer returning numpy dtypes instead, to be consistent between engines and across IO methods.
Yeah understandable. I think #48957 would allow returning pyarrow types in a more backward compat & API consistent way
@mroeschke So I have been looking into the issue. I was mainly looking at the
Also, since this is my first time contributing to pandas, do you expect a fully functional PR, or should I make a PR with what I have (which is not much so far, since I wanted to confirm the two questions above) and we iterate back and forth? Thanks!
Not too familiar with the json code base, so it would be easier to discuss those questions in a PR just to see what it would look like. Generally I would (hope) it would be something like
Where PyArrowParser and ExistingParser might be able to share some code. It's okay to open up a PR at any state, especially since GitHub has a draft PR status.
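A minimal sketch of what an engine dispatch along these lines might look like. This is an illustration under assumptions, not pandas' actual implementation: the class bodies, the `_PARSERS` mapping, and `read_json_sketch` are hypothetical, and both parsers use the standard-library `json` module as a placeholder so the example runs without extra dependencies.

```python
import io
import json


class ExistingParser:
    """Hypothetical stand-in for the current (ujson-based) parsing path."""

    def __init__(self, handle):
        self.handle = handle

    def _lines(self):
        # Shared helper: yield non-empty lines of line-delimited JSON.
        for line in self.handle:
            if line.strip():
                yield line

    def parse(self):
        # Placeholder: the real path would call the ujson-based reader.
        return [json.loads(line) for line in self._lines()]


class PyArrowParser(ExistingParser):
    """Hypothetical stand-in for a pyarrow-backed parsing path.

    It reuses the shared line handling from ExistingParser; a real
    version would presumably call pyarrow.json.read_json in parse().
    """

    def parse(self):
        # Placeholder only: still uses stdlib json in this sketch.
        return [json.loads(line) for line in self._lines()]


# Hypothetical mapping from the proposed engine names to parser classes.
_PARSERS = {"ujson": ExistingParser, "pyarrow": PyArrowParser}


def read_json_sketch(handle, engine="ujson"):
    """Pick a parser class based on the engine keyword and run it."""
    try:
        parser_cls = _PARSERS[engine]
    except KeyError:
        raise ValueError(f"Unknown engine: {engine!r}") from None
    return parser_cls(handle).parse()


records = read_json_sketch(io.StringIO('{"a": 1}\n{"a": 2}\n'), engine="pyarrow")
```

Keeping the dispatch in one mapping means an unsupported engine name fails early with a clear error, and the two parser classes can share helpers through inheritance.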
Sounds good then. I will add a couple more things and create a PR.
pyarrow has a `read_json` function that could be used as an alternative parser for `pd.read_json`: https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html#pyarrow.json.read_json

Like we have for `read_csv` and `read_parquet`, I would like to propose an `engine` keyword argument to allow users to pick the parsing backend: `engine="ujson"|"pyarrow"`

One change compared to `read_csv` and `read_parquet` with `engine="pyarrow"` that I would like to propose would be to return `ArrowExtensionArray`s instead of converting the result to numpy dtypes, such that the `pyarrow.Table` returned by `read_json` still propagates pyarrow objects underneath.

Thoughts?