ENH: Add Arrow CSV Reader #43072

lithomas1 · 2021-08-17T01:01:49Z

closes ENH: allow engine='pyarrow' in read_csv #23697
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Picking up from #38370.
Perf is OK. Reading from a StringIO/text buffer incurs about 20% overhead because of BytesIOWrapper; we should probably look into more efficient ways of doing it, or cythonize it. BytesIO performance looks good, though.
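For anyone who wants to try the engine, here is a minimal sketch of selecting it through the `engine` argument of `read_csv`. It assumes a pandas build that includes this PR and an installed pyarrow; `CSV_BYTES`, `pick_engine`, and `read_sample` are illustrative names, not part of pandas.

```python
import importlib.util
import io

CSV_BYTES = b"a,b\n1,2\n3,4\n"  # tiny illustrative payload, not from the benchmark


def pick_engine() -> str:
    """Prefer the pyarrow engine when pyarrow is importable, else fall back to "c"."""
    return "pyarrow" if importlib.util.find_spec("pyarrow") is not None else "c"


def read_sample(data: bytes = CSV_BYTES):
    """Parse the sample from a binary buffer with whichever engine is available."""
    import pandas as pd  # imported lazily so pick_engine() works without pandas

    # A bytes buffer avoids the text-handle wrapper discussed in this thread.
    return pd.read_csv(io.BytesIO(data), engine=pick_engine())
```

Calling `read_sample()` should return a two-column frame with either engine. The fallback to `"c"` is a design choice of this sketch, so the same call still works where pyarrow is not installed.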

Benchmarks (100k rows, 5 columns):

[  0.00%] ··· io.csv.ReadCSVEngine.time_read_bytescsv                                                                                                                             ok
[  0.00%] ··· ========= ============
                engine              
              --------- ------------
                  c       22.9±2ms  
                python    259±20ms  
               pyarrow   5.79±0.3ms 
              ========= ============

[  0.00%] ··· io.csv.ReadCSVEngine.time_read_stringcsv                                                                                                                            ok
[  0.00%] ··· ========= ============
                engine              
              --------- ------------
                  c       20.6±1ms  
                python    257±1ms   
               pyarrow   7.06±0.3ms 
              ========= ============

Tests will still fail, though; working on that right now.

My machine is a 2019 MBP (Intel) with 6 cores. YMMV based on the number of CPUs you have. I'm surprised that it doesn't quite scale linearly; maybe the file is too small to feel the benefit of all 6 cores.
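As a quick sanity check on the numbers above, the speedups and the text-buffer overhead can be computed directly from the reported means. This is only arithmetic on the asv output (error bars dropped); the variable and function names are illustrative.

```python
# Mean timings in milliseconds, copied from the asv output above.
bytes_ms = {"c": 22.9, "python": 259.0, "pyarrow": 5.79}
text_ms = {"c": 20.6, "python": 257.0, "pyarrow": 7.06}


def speedup(timings, baseline="c", candidate="pyarrow"):
    """How many times faster `candidate` is than `baseline`."""
    return timings[baseline] / timings[candidate]


bytes_speedup = speedup(bytes_ms)  # pyarrow vs. c on a bytes buffer
text_speedup = speedup(text_ms)    # pyarrow vs. c on a text buffer
# Extra cost of the text path relative to the bytes path, for pyarrow only.
text_overhead = text_ms["pyarrow"] / bytes_ms["pyarrow"] - 1

print(f"bytes: {bytes_speedup:.1f}x, text: {text_speedup:.1f}x, "
      f"text overhead: {text_overhead:.0%}")
```

That works out to roughly 4x over the C engine for bytes input and about 2.9x for text input, below the 6x that perfect scaling across 6 cores would give, with the text path costing about 22% more than the bytes path.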

twoertwein · 2021-08-18T19:56:44Z

Picking up from #38370.
Perf is OK. (StringIO gets like 20% overhead b/c of BytesIOWrapper, we should probably look into more efficient ways of doing it/cythonize it).

Isn't the BytesIOWrapper only used when the user provides a text handle? If the user provides something path-like, get_handle should (hopefully) open it directly in binary mode, right?

lithomas1 · 2021-08-18T22:17:53Z

Yup, that's why the StringIO benchmark is slower (bytescsv is unaffected, as you can see above). Hopefully the bytes benchmark is closer to the true speed of pyarrow, but I haven't profiled. Perf there does look reasonable to me.
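To illustrate where that overhead comes from, here is a rough sketch of an adapter that exposes a text handle as a binary stream by encoding on every read. It is not the BytesIOWrapper from this PR; `TextToBytesReader` and `drain` are hypothetical names, and UTF-8 is an assumed default.

```python
import io


class TextToBytesReader:
    """Hypothetical adapter: present a text handle as a read-only binary stream."""

    def __init__(self, text_handle, encoding="utf-8"):
        self._handle = text_handle
        self._encoding = encoding
        self._leftover = b""  # encoded bytes that did not fit in the previous read

    def read(self, n=-1):
        if n is None or n < 0:
            data = self._leftover + self._handle.read().encode(self._encoding)
            self._leftover = b""
            return data
        # Each character encodes to at least one byte, so reading the missing
        # number of characters yields at least n bytes unless we hit EOF.
        need = max(n - len(self._leftover), 0)
        chunk = self._leftover + self._handle.read(need).encode(self._encoding)
        self._leftover = chunk[n:]
        return chunk[:n]


def drain(reader, n):
    """Read fixed-size chunks until the stream is exhausted."""
    chunks = []
    while True:
        chunk = reader.read(n)
        if not chunk:
            return chunks
        chunks.append(chunk)
```

Every `read` call pays for a Python-level `str.encode` plus a bytes concatenation, which is the kind of per-chunk cost a binary handle avoids.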

Sorry for being a little ambiguous in the description.

lithomas1 · 2021-08-19T00:07:08Z

I think I've addressed all the comments on the previous PR. LMK otherwise. This is ready for a preliminary review now; I will ping everyone involved once I get CI to green (22 tests left to go, but some tests are getting stuck and timing out).

lithomas1 · 2021-08-19T16:44:03Z

as green as it gets.

lithomas1 · 2021-08-19T16:45:36Z

also cc @arw2019 @xhochy

jreback

looks really good, thanks @lithomas1 for reviving this.

some doc-comments / requests
and 1 for code.

are all of the currently raised exceptions tested?

doc/source/user_guide/io.rst

doc/source/whatsnew/v1.4.0.rst

pandas/io/parsers/arrow_parser_wrapper.py

pandas/io/parsers/readers.py

lithomas1 · 2021-08-20T20:29:14Z

Raised exceptions are tested here:
https://github.com/pandas-dev/pandas/pull/43072/files#diff-2a0f93df26fa0ee1d10f216142db735ff227fcfb02f161ad165fdfcb3f767cd1R125-R146

jreback · 2021-08-20T21:16:05Z

@pandas-dev/pandas-core if any comments. Ideally we could do these in a followup.

Dr-Irv

Some minor comments

doc/source/user_guide/io.rst

pandas/io/parsers/readers.py

lithomas1 · 2021-08-27T22:44:52Z

OK, I think we should try to merge this now if there are no other comments (it has been 1 week). I'm planning on enabling more features and doing some parser cleanup after this.

jreback · 2021-08-27T23:18:38Z

thanks @lithomas1 very nice!

pls create a follow up issue with checkboxes

jreback · 2021-08-27T23:19:54Z

thanks @arw2019 for a lot of the initial work here!

lithomas1 · 2021-08-28T18:42:42Z

#38889
#38872

FWIW, is it possible to create an Arrow label for pyarrow things? Arrow issues are still filed under IO Parquet which is wrong. I think this might be relevant for things like the ArrowStringArray too.

jreback · 2021-08-28T18:58:06Z

i created a label just now

jbrockmendel · 2021-08-28T20:01:09Z

I'm seeing pyarrow IO tests stalling out locally (and not KeyboardInterrupt-able). Can anyone else on Mac confirm?

This reverts commit 44e8822.

lithomas1 · 2021-08-31T14:46:45Z

Maybe try setting the env var OMP_NUM_THREADS to 1? We should be running these tests single-threaded right now.
I'll try to reproduce when I get back on my Mac, but it's interesting that this is not failing on CI.
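For reference, a minimal sketch of launching a command with OMP_NUM_THREADS set to 1 without touching the parent environment. `run_single_threaded` is an illustrative helper, and the probe command is only a placeholder that echoes the variable back; whether this setting avoids the stall is not verified here.

```python
import os
import subprocess
import sys


def run_single_threaded(cmd):
    """Run `cmd` in a child process with OMP_NUM_THREADS forced to 1."""
    env = dict(os.environ, OMP_NUM_THREADS="1")  # copy, then override one key
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=False)


# Placeholder command: print the variable as the child process sees it.
probe = [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
result = run_single_threaded(probe)
print(result.stdout.strip())
```

To use it for the tests, swap the probe for the pytest invocation you normally run, for example `[sys.executable, "-m", "pytest", "<path to the parser tests>"]`.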

lithomas1 · 2021-09-01T00:20:26Z

Ok, looking into this right now. I think I can reproduce locally, but interested in how this doesn't reproduce on CI.

jbrockmendel · 2021-09-01T01:47:06Z

I fixed it locally by upgrading pyarrow

lithomas1 added 5 commits August 16, 2021 16:47

ENH: Add pyarrow csv reader

dfd83f8

merge master

33c2ba6

address review comments

ba0cc8b

fix some tests

b9d7f8b

xfail/skip more

b6db201

lithomas1 added the Enhancement, IO CSV (read_csv, to_csv), and Performance (memory or execution speed performance) labels Aug 18, 2021

more test xfails/skips

c9ce8ff

lithomas1 requested a review from jreback August 19, 2021 00:07

lithomas1 added 2 commits August 18, 2021 19:52

maybe green?

fa77aaf

green?

9ced2f2

lithomas1 requested review from jorisvandenbossche, TomAugspurger, WillAyd, gfyoung and simonjayhawkins August 19, 2021 16:44

lithomas1 marked this pull request as ready for review August 19, 2021 16:45

jreback requested changes Aug 19, 2021

gfyoung reviewed Aug 19, 2021

doc/source/user_guide/io.rst (outdated)

gfyoung reviewed Aug 19, 2021

doc/source/user_guide/io.rst (outdated)

gfyoung reviewed Aug 19, 2021

doc/source/user_guide/io.rst (outdated)

gfyoung reviewed Aug 19, 2021

doc/source/user_guide/io.rst (outdated)

lithomas1 and others added 2 commits August 19, 2021 12:26

Apply suggestions from code review

43495a5

Co-authored-by: gfyoung

Merge branch 'master' into enh-arrow-csv

83d0c25

code review

3a3352a

lithomas1 requested a review from jreback August 20, 2021 20:51

jreback added this to the 1.4 milestone Aug 20, 2021

jreback approved these changes Aug 20, 2021

Dr-Irv reviewed Aug 20, 2021

doc/source/user_guide/io.rst

pandas/io/parsers/readers.py

lithomas1 and others added 2 commits August 20, 2021 16:14

fix typos

3f0568f

Merge branch 'pandas-dev:master' into enh-arrow-csv

3114da5

lithomas1 requested a review from jreback August 27, 2021 22:46

jreback merged commit 44e8822 into pandas-dev:master Aug 27, 2021

jreback added the Arrow (pyarrow functionality) label Aug 28, 2021

jbrockmendel added a commit that referenced this pull request Aug 30, 2021

Revert "ENH: Add Arrow CSV Reader (#43072)"

cbfce38

This reverts commit 44e8822.

jbrockmendel mentioned this pull request Aug 30, 2021

BUG: Fix some parse dates tests for the Arrow CSV reader #43312

Merged

4 tasks

lithomas1 deleted the enh-arrow-csv branch August 31, 2021 14:46

feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021

ENH: Add Arrow CSV Reader (pandas-dev#43072)

1dcdccf
