-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add Arrow CSV Reader #43072
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add Arrow CSV Reader #43072
Conversation
Isn't the |
Yup, that's why the StringIO benchmark is slower( Sorry for being a little ambiguous in the description. |
I think I've addressed all the comments on the previous PR. LMK otherwise. This is ready for a preliminary review now, will ping everyone involved once I get CI to green.(22 tests to go left, but some tests are getting stuck and timing out). |
as green as it gets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks really good, thanks @lithomas1 for reviving this.
some doc-comments / requests
and 1 for code.
are all of the currently raised exceptions tested?
Co-authored-by: gfyoung <[email protected]>
@pandas-dev/pandas-core if any comments. Ideally we could do these in a followup. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments
Ok, I think we should try to merge this now, if no other comments(it has been 1 week). Planning on enabling more features and doing some parser cleanup after this. |
thanks @lithomas1 very nice! pls create a follow up issue with checkboxes |
thanks @arw2019 for a lot of the initial work here! |
i created a label just now |
im seeing pyarrow io tests stalling out locally (and not KeyboardInterrupt-able), can anyone else on mac confirm? |
This reverts commit 44e8822.
Maybe try setting the env var |
Ok, looking into this right now. I think I can reproduce locally, but interested in how this doesn't reproduce on CI. |
I fixed it locally by upgrading pyarrow |
Picking up from #38370.
Perf is OK. (StringIO/Text buffer reading gets like 20% overhead b/c of BytesIOWrapper, we should probably look into more efficient ways of doing it/cythonize it. Bytes IO performance looks good though).
Benchmarks 100k rows, 5 col
Tests will still fail tho, working on that rn.
My machine is a 2019 MBP(Intel) with 6 cores. YMMV based on the number of CPUs you have. Surprised that it doesn't quite scale linearly, maybe the file is too small to feel the benefit of all 6 cores.