
ENH: allow engine='pyarrow' in read_csv #23697


Closed
jreback opened this issue Nov 14, 2018 · 9 comments · Fixed by #43072
Comments

jreback (Contributor) commented Nov 14, 2018

We could conditionally use the new pyarrow CSV parser as an engine (requires pyarrow 0.11, IIRC), eventually leading to a replacement path for the existing code. There might be a number of restrictions on which options we can pass, as the current parser is more full-featured, but I suspect most of the basic options would work (a rough sketch of the idea follows below).

cc @gfyoung @pitrou @wesm @TomAugspurger
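
A minimal sketch of what such an engine could wrap, assuming the pyarrow.csv module (pyarrow.csv.read_csv plus Table.to_pandas); the keyword names here are pyarrow's own options, not read_csv's:

```python
# Sketch only: parse with pyarrow.csv and hand the result back as a pandas
# DataFrame. skip_rows and delimiter are pyarrow option names.
import pyarrow.csv as pv

def read_csv_via_pyarrow(path, delimiter=",", skip_rows=0):
    table = pv.read_csv(
        path,
        read_options=pv.ReadOptions(skip_rows=skip_rows),
        parse_options=pv.ParseOptions(delimiter=delimiter),
    )
    # pyarrow.Table -> pandas.DataFrame; the resulting dtypes may differ
    # from what the C or Python engines would produce.
    return table.to_pandas()
```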

jreback added the Enhancement, Performance (Memory or execution speed), and IO CSV (read_csv, to_csv) labels Nov 14, 2018
jreback added this to the Contributions Welcome milestone Nov 14, 2018
wesm (Member) commented Nov 14, 2018

I would expect it to be more limited in functionality in general, so maybe not a full replacement for pandas's read_csv, but it could be used as the "fast path" for well-behaved files. Also, as soon as we build support for multi-character and regex delimiters, you could deprecate engine='python'.
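
A hypothetical illustration of that "fast path" dispatch; none of these names are pandas API, and the fallback condition is deliberately crude:

```python
# Hypothetical dispatch: try the pyarrow parser for plain calls and fall
# back to pandas's own parser if pyarrow cannot handle the file.
import pandas as pd
import pyarrow as pa
import pyarrow.csv as pv

def read_csv_fast_path(path, **kwargs):
    if not kwargs:  # only the simplest, well-behaved calls take the fast path
        try:
            return pv.read_csv(path).to_pandas()
        except pa.ArrowInvalid:
            pass  # e.g. ragged rows: fall back to the full-featured parser
    return pd.read_csv(path, **kwargs)
```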

pitrou (Contributor) commented Nov 14, 2018

I would be cautious here. Parsing or conversions may be stricter, fewer datatypes may be recognized, or the resulting datatypes may be different. I think it's better to make this an explicit option (engine="arrow", perhaps?).
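
For context, the opt-in spelling that eventually landed via #43072 (pandas 1.4+) is engine="pyarrow"; usage looks roughly like this, and dtypes or error behaviour can still differ from the default C engine:

```python
import pandas as pd

# Opt-in only: the pyarrow engine is never selected automatically.
df = pd.read_csv("data.csv", engine="pyarrow")
```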

wesm (Member) commented Nov 14, 2018

Yes, I definitely agree with an "opt-in". It would be helpful if the pandas CSV test suite could be set up so that we can run those tests using the Arrow reader to check for compatibility (it would be useful to know, for example, whether say 50% of the tests pass). Compatibility with pandas.read_csv is not our objective at the moment, though.
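
One way such a compatibility run could be wired up (a sketch only; this is not how the pandas test suite is actually organized) is to parametrize existing-style CSV tests over the engine:

```python
# Sketch: run the same CSV test under every engine so that
# pyarrow-compatibility gaps show up as ordinary test failures.
import io

import pandas as pd
import pytest

@pytest.fixture(params=["c", "python", "pyarrow"])
def engine(request):
    return request.param

def test_basic_read_csv(engine):
    data = "a,b\n1,2\n3,4\n"
    result = pd.read_csv(io.StringIO(data), engine=engine)
    expected = pd.DataFrame({"a": [1, 3], "b": [2, 4]})
    pd.testing.assert_frame_equal(result, expected)
```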

jreback (Contributor, Author) commented Nov 14, 2018

Yep, this would be opt-in (as it also requires a newer pyarrow).

lithomas1 (Member) commented

take

jreback modified the milestones: Contributions Welcome, 1.3 Dec 31, 2020
lithomas1 modified the milestones: 1.3, 2.0 Apr 23, 2021
lithomas1 (Member) commented

Pushing this off of 1.3. The current implementation is too buggy to be landed, so 2.0 is a better target. I'm going to try reviving the stale PR for this, hopefully in the next month.

jorisvandenbossche (Member) commented Apr 29, 2021

@lithomas1 I was just looking at those issues/PRs with the idea to revive this as well. Can you give some explanation of what you think is buggy in the current implementation? (Assuming this is about #38370; looking at the comments there, it mostly needs an update for the latest master and a few comments addressed.)

If we think #38370 is still a good start, I can take a look at updating it for the latest master as a start.

lithomas1 (Member) commented

@jorisvandenbossche
#38370 is still a good start (it needs a non-trivial rebase due to the refactoring of the parsers in #39217).
It's buggy because it only passes ~25% of the tests (mostly due to unsupported things), and it doesn't pass a lot of tests even for supported things.

If landed individually, I would recommend disabling parse_dates and co. (which worked in my original version but was removed for simplicity in this one). There's also a lot of nasty stuff with partially supported parameters, which needs to error for unsupported cases rather than fail awkwardly (see the sketch after this comment). So I would highly recommend waiting for 2.0, as this should not be landed by itself.

Six more months of development + a pyarrow version bump (to 1.0) would do wonders. :) I can revive this right after I finish my other PR (#40687), but you are certainly welcome to beat me to it.
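
A hypothetical sketch of the up-front validation mentioned above; the option names are illustrative, not pandas's actual unsupported-option list:

```python
# Reject options the pyarrow engine cannot honour instead of letting them
# fail somewhere deep inside the parser. The set below is illustrative only.
_UNSUPPORTED_WITH_PYARROW = {"skipfooter", "chunksize", "iterator"}

def validate_pyarrow_kwargs(kwargs):
    unsupported = _UNSUPPORTED_WITH_PYARROW.intersection(kwargs)
    if unsupported:
        raise ValueError(
            f"The {sorted(unsupported)} option(s) are not supported with "
            "engine='pyarrow'"
        )
```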

lithomas1 (Member) commented

@jorisvandenbossche A little heads up that I'm going to be implementing #39383 (BytesIOWrapper), which will cause conflicts with #38370.

lithomas1 modified the milestones: 2.0, 1.4 Jul 6, 2021