ENH: Add `engine` keyword to `read_json` to enable reading from pyarrow #48893

mroeschke · 2022-09-30T16:59:29Z

pyarrow has a read_json function that could be used as an alternative parser for pd.read_json https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html#pyarrow.json.read_json

As we have for read_csv and read_parquet, I would like to propose an engine keyword argument that lets users pick the parsing backend: engine="ujson"|"pyarrow"

One change compared to read_csv and read_parquet with engine="pyarrow" I would like to propose would be to return ArrowExtensionArrays instead of converting the result to numpy dtypes such that the pyarrow.Table returned by read_json still propagates pyarrow objects underneath.

Thoughts?

jreback · 2022-10-01T00:35:21Z

+1 for engine and returning arrow extension types

abkosar · 2022-10-04T18:58:31Z

Hi! I would like to work on this, but it will be my first time contributing to pandas. On the other hand, I use pandas a lot and I am familiar with the read_json method; however, I may need some help along the way.

Thanks!

mroeschke · 2022-10-04T19:15:41Z

Thanks @abkosar. Just noting that for first-time contributors we recommend tackling issues labeled "good first issue", but if you're still interested in this particular issue, a pull request would be welcome.

abkosar · 2022-10-04T19:19:17Z

Thanks for the reply @mroeschke. Yeah, I looked at those too, but since it's not my first time contributing to open source I thought I could give this one a shot. Cool, I will start looking into it.

lithomas1 · 2022-10-05T00:19:44Z

> One change compared to read_csv and read_parquet with engine="pyarrow" I would like to propose would be to return ArrowExtensionArrays instead of converting the result to numpy dtypes such that the pyarrow.Table returned by read_json still propagates pyarrow objects underneath.
>
> Thoughts?

I would prefer returning numpy dtypes instead, to be consistent between engines and across IO methods.

mroeschke · 2022-10-05T18:00:39Z

> I would prefer returning numpy dtypes instead, to be consistent between engines and across IO methods.

Yeah, understandable. I think #48957 would allow returning pyarrow types in a more backward-compatible and API-consistent way.

abkosar · 2022-10-08T04:23:53Z

@mroeschke So I have been looking into the issue. I was mainly looking at read_csv to see how engine is implemented there; however, I have a couple of questions:

- JsonReader doesn't implement the _make_engine method that TextFileReader has, so I think I need to implement that too. Just confirming?
- read_csv has its own pyarrow engine wrapper in pandas.io.parsers.arrow_parser_wrapper.py. For JSON, should I extend that class or create a new wrapper for read_json?

Also, since this is my first time contributing to pandas, do you expect a fully functional PR, or should I open a PR with what I have (which is not much so far, since I wanted to confirm the two questions above) so we can iterate back and forth?

Thanks!

mroeschke · 2022-10-08T17:38:40Z

I'm not too familiar with the JSON code base, so it would be easier to discuss those questions in a PR, just to see what it would look like.

Generally, I would hope it would look something like:

def read_json(...):
    # maybe some validation first
    if engine == "ujson":
        return ExistingParser().read()
    elif engine == "pyarrow":
        return PyArrowParser().read()

Where PyArrowParser and ExistingParser might be able to share some code.
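A slightly fuller, runnable sketch of that dispatch idea follows. Everything in it is hypothetical and for illustration only: BaseJsonParser, read_json_lines, and the engine registry are not pandas names, the stdlib json module stands in for the existing ujson-based parser, and rows are returned as plain dicts instead of a DataFrame to keep the example small.

```python
import io
import json
from abc import ABC, abstractmethod


class BaseJsonParser(ABC):
    """Hypothetical shared base class holding the input and common helpers."""

    def __init__(self, text: str) -> None:
        self.text = text

    def _nonempty_lines(self) -> list:
        # Shared helper: split newline-delimited JSON and skip blank lines.
        return [line for line in self.text.splitlines() if line.strip()]

    @abstractmethod
    def read(self) -> list:
        ...


class ExistingParser(BaseJsonParser):
    """Stand-in for the current parser; uses stdlib json, not ujson."""

    def read(self) -> list:
        return [json.loads(line) for line in self._nonempty_lines()]


class PyArrowParser(BaseJsonParser):
    """Parses with pyarrow.json.read_json; pyarrow is imported lazily."""

    def read(self) -> list:
        import pyarrow.json as pa_json  # optional dependency

        table = pa_json.read_json(io.BytesIO(self.text.encode("utf-8")))
        return table.to_pylist()


# Hypothetical registry mapping engine names to parser classes.
_ENGINES = {"ujson": ExistingParser, "pyarrow": PyArrowParser}


def read_json_lines(text: str, engine: str = "ujson") -> list:
    # Validate the engine name first, then dispatch to the chosen parser.
    if engine not in _ENGINES:
        raise ValueError(f"engine must be one of {sorted(_ENGINES)}, got {engine!r}")
    return _ENGINES[engine](text).read()
```

Calling read_json_lines('{"a": 1}\n{"a": 2}\n') uses the default stand-in parser; passing engine="pyarrow" routes the same input through pyarrow and only then requires pyarrow to be installed.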

It's okay to open a PR at any stage, especially since GitHub has a draft PR status.

abkosar · 2022-10-08T21:48:32Z

Sounds good then. I will add a couple more things and create a PR.

mroeschke added the Enhancement, IO JSON (read_json, to_json, json_normalize), and Arrow (pyarrow functionality) labels Sep 30, 2022

github-actions bot assigned abkosar Oct 4, 2022

abkosar mentioned this issue Oct 11, 2022

read_json engine argument integration #49041

Closed

5 tasks

abkosar mentioned this issue Oct 22, 2022

read_json engine keyword and pyarrow integration #49249

Merged

5 tasks

abkosar added a commit to abkosar/pandas that referenced this issue Dec 13, 2022

ENH: Add engine keyword to read_json to enable reading from pyarrow p…

b85057e

…andas-dev#48893

abkosar added a commit to abkosar/pandas that referenced this issue Jan 7, 2023

ENH: Add engine keyword to read_json to enable reading from pyarrow p…

a7dd10a

…andas-dev#48893

abkosar added a commit to abkosar/pandas that referenced this issue Jan 20, 2023

ENH: Add engine keyword to read_json to enable reading from pyarrow p…

6aa72ca

…andas-dev#48893

phofl closed this as completed in #49249 Feb 10, 2023
