read_json engine argument integration #49041

Closed
wants to merge 11 commits into from

Conversation

abkosar
Contributor

@abkosar abkosar commented Oct 11, 2022

Creating the PR so we can start the discussion. It's not complete yet, and I will keep updating it; any feedback is of course appreciated.

My main question: there is an arrow_parser_wrapper.py file with an ArrowParserWrapper class, which mainly serves the engine argument of the read_csv method. I thought of extending that class's functionality so it can serve both read_json and read_csv, but wanted to get your opinion about it.

I will also add tests, but I am still reading the testing section of the contributing guidelines.

abkosar and others added 2 commits October 10, 2022 23:31
- added JSONEngine to _typing.py
- added engine to `read_json` inputs
- added engine to `read_json` docstring
- added engine logic to `JsonReader`
- added basis of the _make_engine method
@mroeschke mroeschke added the IO JSON (read_json, to_json, json_normalize) and Arrow (pyarrow functionality) labels Oct 11, 2022
@@ -780,6 +792,7 @@ def __init__(
precise_float: bool,
date_unit,
encoding,
engine,
Member

This keyword will need to go at the end of the function signature to accommodate people passing positional args.
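
A minimal sketch of why the position matters. The three functions below are hypothetical stand-ins, not the real JsonReader signature; they only show how an existing positional call binds its arguments when a new parameter is added in the middle versus at the end.

```python
# Hypothetical signatures for illustration only; not pandas' actual API.

def reader_before(path, precise_float=False, date_unit=None, encoding=None):
    """Signature before the new keyword exists."""
    return {"precise_float": precise_float, "date_unit": date_unit,
            "encoding": encoding}


def reader_middle(path, precise_float=False, engine="ujson",
                  date_unit=None, encoding=None):
    """New keyword inserted mid-signature: later positionals shift by one."""
    return {"precise_float": precise_float, "engine": engine,
            "date_unit": date_unit, "encoding": encoding}


def reader_end(path, precise_float=False, date_unit=None, encoding=None,
               engine="ujson"):
    """New keyword appended last: existing positional calls bind as before."""
    return {"precise_float": precise_float, "engine": engine,
            "date_unit": date_unit, "encoding": encoding}


# A call written against the old signature, passing everything positionally.
existing_call = ("data.json", False, "ms", "utf-8")

shifted = reader_middle(*existing_call)  # "ms" silently lands in engine
stable = reader_end(*existing_call)      # "ms" still lands in date_unit
```

With the keyword appended last, callers who never pass engine see no change in behavior.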

@@ -607,6 +615,9 @@ def read_json(

.. versionadded:: 1.3.0

engine : {{'ujson', 'pyarrow'}}
Member

This should default to "ujson"

Contributor Author

@abkosar abkosar Oct 15, 2022

I actually have a question about this. I thought the engine argument would be optional. What confused me is that read_json already has parsing logic in place, and I thought the engine keyword would provide two additional options. Or did I misunderstand?

Member

The existing parsing logic is (vendored) ujson code, so that's why the "default" should be "ujson".
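
A rough sketch of that point, under stated assumptions: read_json_sketch is a hypothetical stand-in for read_json, and the standard-library json module stands in for the vendored ujson parser. It shows that defaulting engine to "ujson" keeps the existing code path for callers who never pass the argument.

```python
import json

# The two engine values named in the docstring under review.
VALID_ENGINES = ("ujson", "pyarrow")


def read_json_sketch(text, engine="ujson"):
    """Hypothetical stand-in for read_json; returns parsed Python objects."""
    if engine not in VALID_ENGINES:
        raise ValueError(
            f"engine must be one of {VALID_ENGINES}, got {engine!r}"
        )
    if engine == "ujson":
        # Assumption: stdlib json plays the role of the vendored ujson code.
        return json.loads(text)
    # The pyarrow branch is deliberately left out of this sketch.
    raise NotImplementedError("pyarrow engine is not part of this sketch")


# Omitting engine and passing engine="ujson" take the same path.
implicit = read_json_sketch('[{"a": 1}, {"a": 2}]')
explicit = read_json_sketch('[{"a": 1}, {"a": 2}]', engine="ujson")
```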

Contributor Author

Yeah, after I made that comment, that's what I figured.

@mroeschke
Member

I thought of extending that class's functionality so it can serve both read_json and read_csv, but wanted to get your opinion about it.

The CSV and JSON reading code is more-or-less distinct so I'd recommend against this.

I recommend just creating a new file in pandas/io/json, such as arrow_parser.py, with the parsing class that can be imported and used in read_json.
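
One possible shape for such a class, as a sketch only. The class name ArrowJsonParser and its read method are assumptions made for illustration, not code from this PR. It relies on pyarrow.json.read_json, which parses newline-delimited JSON, and imports pyarrow lazily so the dependency is only needed when this engine is selected.

```python
import io


class ArrowJsonParser:
    """Hypothetical JSON parser class that could live in its own module."""

    def __init__(self, data: bytes):
        # Raw bytes of newline-delimited JSON, one object per line.
        self.data = data

    def read(self):
        # Lazy import: pyarrow is required only when this parser is used.
        try:
            import pyarrow.json as pa_json
        except ImportError as err:
            raise ImportError(
                "engine='pyarrow' requires pyarrow to be installed"
            ) from err
        table = pa_json.read_json(io.BytesIO(self.data))
        # Converting the Arrow table to a DataFrame also requires pandas.
        return table.to_pandas()


parser = ArrowJsonParser(b'{"a": 1}\n{"a": 2}\n')
```

Keeping the class separate from the CSV wrapper means read_json can import it without touching the read_csv code path.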

@abkosar
Contributor Author

abkosar commented Oct 11, 2022

Yeah that makes sense. Sounds good then, will do that. Thanks for the feedback!

@abkosar
Contributor Author

abkosar commented Oct 18, 2022

I just wanted to give an update. I didn't forget about the issue; I'm working on it :) After days of staring at (and exploring) the source code, I was finally able to implement the logic over the past two days and wrote a test for it. I have been trying to get the Docker interpreter working so I can run the tests. After that I'll update the PR.

@abkosar abkosar force-pushed the read-json-engine-argument branch from 6ff583c to 9f915d7 Compare October 22, 2022 17:47
@abkosar
Contributor Author

abkosar commented Oct 22, 2022

@mroeschke I have to close this PR: I made a mistake in my git workflow, and the history got messed up while I was trying to squash commits. I have to create a new PR from the main branch. Is there a closing process, or any commands I should run?

@abkosar
Contributor Author

abkosar commented Oct 22, 2022

Opened #49249 instead of this PR.

@mroeschke
Member

Closing in favor of #49249

@mroeschke mroeschke closed this Oct 24, 2022
Labels
Arrow (pyarrow functionality), IO JSON (read_json, to_json, json_normalize)
Development

Successfully merging this pull request may close these issues.

ENH: Add engine keyword to read_json to enable reading from pyarrow
6 participants