Skip to content

ENH: Add Arrow CSV Reader #43072

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Aug 27, 2021
Merged

ENH: Add Arrow CSV Reader #43072

merged 13 commits into from
Aug 27, 2021

Conversation

lithomas1
Copy link
Member

@lithomas1 lithomas1 commented Aug 17, 2021

Picking up from #38370.
Perf is OK. (StringIO/Text buffer reading gets like 20% overhead b/c of BytesIOWrapper, we should probably look into more efficient ways of doing it/cythonize it. Bytes IO performance looks good though).

Benchmarks 100k rows, 5 col

[  0.00%] ··· io.csv.ReadCSVEngine.time_read_bytescsv                                                                                                                             ok
[  0.00%] ··· ========= ============
                engine              
              --------- ------------
                  c       22.9±2ms  
                python    259±20ms  
               pyarrow   5.79±0.3ms 
              ========= ============

[  0.00%] ··· io.csv.ReadCSVEngine.time_read_stringcsv                                                                                                                            ok
[  0.00%] ··· ========= ============
                engine              
              --------- ------------
                  c       20.6±1ms  
                python    257±1ms   
               pyarrow   7.06±0.3ms 
              ========= ============

Tests will still fail tho, working on that rn.

My machine is a 2019 MBP(Intel) with 6 cores. YMMV based on the number of CPUs you have. Surprised that it doesn't quite scale linearly, maybe the file is too small to feel the benefit of all 6 cores.

@lithomas1 lithomas1 added Enhancement IO CSV read_csv, to_csv Performance Memory or execution speed performance labels Aug 18, 2021
@twoertwein
Copy link
Member

Picking up from #38370.
Perf is OK. (StringIO gets like 20% overhead b/c of BytesIOWrapper, we should probably look into more efficient ways of doing it/cythonize it).

Isn't the BytesIOWrapper only used when the user provides a text-handle? If the user provides something path-like, get_handle should (hopefully) directly open it in binary mode, or?

@lithomas1
Copy link
Member Author

lithomas1 commented Aug 18, 2021

Yup, that's why the StringIO benchmark is slower(bytescsv is unaffected as you can see above).. Hopefully the bytes benchmark is closer to the true speed of pyarrow, but haven't profiled. Perf there does look reasonable to me.

Sorry for being a little ambiguous in the description.

@lithomas1
Copy link
Member Author

I think I've addressed all the comments on the previous PR. LMK otherwise. This is ready for a preliminary review now, will ping everyone involved once I get CI to green.(22 tests to go left, but some tests are getting stuck and timing out).

@lithomas1 lithomas1 requested a review from jreback August 19, 2021 00:07
@lithomas1
Copy link
Member Author

as green as it gets.

@lithomas1
Copy link
Member Author

also cc @arw2019 @xhochy

@lithomas1 lithomas1 marked this pull request as ready for review August 19, 2021 16:45
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks really good, thanks @lithomas1 for reviving this.

some doc-comments / requests
and 1 for code.

are all of the currently raised exceptions tested?

@lithomas1
Copy link
Member Author

@lithomas1 lithomas1 requested a review from jreback August 20, 2021 20:51
@jreback jreback added this to the 1.4 milestone Aug 20, 2021
@jreback
Copy link
Contributor

jreback commented Aug 20, 2021

@pandas-dev/pandas-core if any comments. Ideally we could do these in a followup.

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments

@lithomas1
Copy link
Member Author

Ok, I think we should try to merge this now, if no other comments(it has been 1 week). Planning on enabling more features and doing some parser cleanup after this.

@lithomas1 lithomas1 requested a review from jreback August 27, 2021 22:46
@jreback jreback merged commit 44e8822 into pandas-dev:master Aug 27, 2021
@jreback
Copy link
Contributor

jreback commented Aug 27, 2021

thanks @lithomas1 very nice!

pls create a follow up issue with checkboxes

@jreback
Copy link
Contributor

jreback commented Aug 27, 2021

thanks @arw2019 for a lot of the initial work here!

@lithomas1
Copy link
Member Author

#38889
#38872

FWIW, is it possible to create an Arrow label for pyarrow things? Arrow issues are still filed under IO Parquet which is wrong. I think this might be relevant for things like the ArrowStringArray too.

@jreback
Copy link
Contributor

jreback commented Aug 28, 2021

i created a label just now

@jreback jreback added the Arrow pyarrow functionality label Aug 28, 2021
@jbrockmendel
Copy link
Member

im seeing pyarrow io tests stalling out locally (and not KeyboardInterrupt-able), can anyone else on mac confirm?

jbrockmendel added a commit that referenced this pull request Aug 30, 2021
@lithomas1
Copy link
Member Author

Maybe try setting the env var OMP_NUM_THREADS to 1? We should be running these tests single threaded right now.
I'll try to reproduce when I get back on my mac, but its interesting how this is not failing on CI.

@lithomas1 lithomas1 deleted the enh-arrow-csv branch August 31, 2021 14:46
@lithomas1
Copy link
Member Author

Ok, looking into this right now. I think I can reproduce locally, but interested in how this doesn't reproduce on CI.

@jbrockmendel
Copy link
Member

I fixed it locally by upgrading pyarrow

feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Enhancement IO CSV read_csv, to_csv Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: allow engine='pyarrow' in read_csv
6 participants