ENH: Add engine keyword to read_json to enable reading from pyarrow #48893

Closed
mroeschke opened this issue Sep 30, 2022 · 9 comments · Fixed by #49249
Assignees
Labels
Arrow pyarrow functionality Enhancement IO JSON read_json, to_json, json_normalize

Comments

@mroeschke
Member

pyarrow has a read_json function that could be used as an alternative parser for pd.read_json https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html#pyarrow.json.read_json

As we have for read_csv and read_parquet, I would like to propose an engine keyword argument that lets users pick the parsing backend: engine="ujson"|"pyarrow"

One change I would like to propose, compared to read_csv and read_parquet with engine="pyarrow", is to return ArrowExtensionArrays instead of converting the result to numpy dtypes, so that the pyarrow.Table returned by read_json still propagates pyarrow objects underneath.

Thoughts?

@mroeschke mroeschke added Enhancement IO JSON read_json, to_json, json_normalize Arrow pyarrow functionality labels Sep 30, 2022
@jreback
Contributor

jreback commented Oct 1, 2022

+1 for engine and returning arrow extension types

@abkosar
Contributor

abkosar commented Oct 4, 2022

Hi! I would like to work on this but this will be my first time contributing to pandas. On the other hand, I use pandas a lot and I am familiar with the read_json method; however, I may need some help on the way.

Thanks!

@mroeschke
Member Author

Thanks @abkosar. Just noting that for first-time contributors we recommend tackling issues labeled "good first issue", but if you're still interested in this particular issue, a pull request would be welcome.

@abkosar
Contributor

abkosar commented Oct 4, 2022

Thanks for the reply @mroeschke. Yeah, I looked at those too, but since it's not my first time contributing to open source I thought I could give this one a shot. Cool, I will start looking into it.

@lithomas1
Member

One change compared to read_csv and read_parquet with engine="pyarrow" I would like to propose would be to return ArrowExtensionArrays instead of converting the result to numpy dtypes such that the pyarrow.Table returned by read_json still propagates pyarrow objects underneath.

Thoughts?

I would prefer returning numpy dtypes instead, to be consistent between engines and across IO methods.

@mroeschke
Member Author

I would prefer returning numpy dtypes instead, to be consistent between engines and across IO methods.

Yeah, understandable. I think #48957 would allow returning pyarrow types in a more backward-compatible and API-consistent way.

@abkosar
Contributor

abkosar commented Oct 8, 2022

@mroeschke So I have been looking into the issue, mainly at how engine is implemented for read_csv. I have a couple of questions:

  • JsonReader doesn't implement the _make_engine method that TextFileReader has, so I think I need to implement that too. Can you confirm?
  • read_csv has its own pyarrow engine wrapper in pandas.io.parsers.arrow_parser_wrapper.py. For JSON, should I extend that class or create a new wrapper for read_json?

Also, since this is my first time contributing to pandas, do you expect a fully functional PR, or should I open a PR with what I have (which is not much so far, since I wanted to confirm the two questions above) so we can iterate back and forth?

Thanks!

@mroeschke
Member Author

I'm not too familiar with the JSON code base, so it would be easier to discuss those questions in a PR, just to see what it would look like.

Generally, I would hope it would be something like

def read_json(...):
    # maybe some validation first
    if engine == "ujson":
        return ExistingParser().read()
    elif engine == "pyarrow":
        return PyArrowParser().read()

Where PyArrowParser and ExistingParser might be able to share some code.
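
To make the dispatch idea above concrete, here is a rough, self-contained sketch. It is an assumption about how such a design could look, not the pandas implementation: ExistingParser and PyArrowParser are the placeholder names from the comment above, BaseJsonParser and _ENGINES are hypothetical, and both parsers use the standard-library json module as a stand-in so the sketch runs without pandas or pyarrow.

```python
import json


class BaseJsonParser:
    """Hypothetical shared base: holds the raw text, defines read()."""

    def __init__(self, text):
        self.text = text

    def read(self):
        raise NotImplementedError


class ExistingParser(BaseJsonParser):
    # Stand-in for the current ujson-based path: parse one JSON document.
    def read(self):
        return json.loads(self.text)


class PyArrowParser(BaseJsonParser):
    # Stand-in for a pyarrow-based path: parse newline-delimited records.
    def read(self):
        lines = self.text.splitlines()
        return [json.loads(line) for line in lines if line.strip()]


# Hypothetical registry mapping engine names to parser classes.
_ENGINES = {"ujson": ExistingParser, "pyarrow": PyArrowParser}


def read_json(text, engine="ujson"):
    # Validate the engine name up front so typos fail loudly.
    if engine not in _ENGINES:
        raise ValueError(
            f"engine must be one of {sorted(_ENGINES)}, got {engine!r}"
        )
    return _ENGINES[engine](text).read()
```

A registry keeps the validation and the dispatch in one place, so adding another engine later would only mean adding one entry.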

It's okay to open a PR at any stage, especially since GitHub has a draft PR status.

@abkosar
Contributor

abkosar commented Oct 8, 2022

Sounds good then. I will add a couple more things and create a PR.

4 participants