
ENH: allow engine='pyarrow' in read_csv #23697


Closed
jreback opened this issue Nov 14, 2018 · 9 comments · Fixed by #43072
Comments

jreback (Contributor) commented Nov 14, 2018

We could conditionally use the new pyarrow CSV parser as an engine (requires pyarrow 0.11, IIRC), eventually leading to a replacement path for the existing code. There might be a number of restrictions on which options we can pass, as the current parser is more full-featured, but I suspect most of the basic options would work (a rough sketch of the idea follows below).

cc @gfyoung @pitrou @wesm @TomAugspurger
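
A minimal sketch of what such an engine could wrap, assuming the pyarrow.csv module (pyarrow.csv.read_csv plus Table.to_pandas); the keyword names here are pyarrow's own options, not read_csv's:

```python
# Sketch only: parse with pyarrow.csv and hand the result back as a pandas
# DataFrame. skip_rows and delimiter are pyarrow option names.
import pyarrow.csv as pv

def read_csv_via_pyarrow(path, delimiter=",", skip_rows=0):
    table = pv.read_csv(
        path,
        read_options=pv.ReadOptions(skip_rows=skip_rows),
        parse_options=pv.ParseOptions(delimiter=delimiter),
    )
    # pyarrow.Table -> pandas.DataFrame; the resulting dtypes may differ
    # from what the C or Python engines would produce.
    return table.to_pandas()
```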

jreback added the Enhancement, Performance (Memory or execution speed), and IO CSV (read_csv, to_csv) labels Nov 14, 2018
jreback added this to the Contributions Welcome milestone Nov 14, 2018
wesm (Member) commented Nov 14, 2018

I would expect it to be more limited in functionality in general, so maybe not a full replacement for pandas's read_csv, but it could be used as the "fast path" for well-behaved files. Also, as soon as we build support for multi-character and regex delimiters, you could deprecate engine='python'.
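
A hypothetical illustration of that "fast path" dispatch; none of these names are pandas API, and the fallback condition is deliberately crude:

```python
# Hypothetical dispatch: try the pyarrow parser for plain calls and fall
# back to pandas's own parser if pyarrow cannot handle the file.
import pandas as pd
import pyarrow as pa
import pyarrow.csv as pv

def read_csv_fast_path(path, **kwargs):
    if not kwargs:  # only the simplest, well-behaved calls take the fast path
        try:
            return pv.read_csv(path).to_pandas()
        except pa.ArrowInvalid:
            pass  # e.g. ragged rows: fall back to the full-featured parser
    return pd.read_csv(path, **kwargs)
```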

pitrou (Contributor) commented Nov 14, 2018

I would be cautious here. Parsing or conversions may be stricter, fewer datatypes may be recognized, or the resulting datatypes may be different. I think it's better to make this an explicit option (engine="arrow", perhaps?).
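
For context, the opt-in spelling that eventually landed via #43072 (pandas 1.4+) is engine="pyarrow"; usage looks roughly like this, and dtypes or error behaviour can still differ from the default C engine:

```python
import pandas as pd

# Opt-in only: the pyarrow engine is never selected automatically.
df = pd.read_csv("data.csv", engine="pyarrow")
```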

wesm (Member) commented Nov 14, 2018

Yes, I definitely agree with an "opt-in". It would be helpful if the pandas CSV test suite could be set up so that we can run those tests using the Arrow reader to check for compatibility (it would be useful to know, for example, whether say 50% of the tests pass). Compatibility with pandas.read_csv is not our objective at the moment, though.
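
One way such a compatibility run could be wired up (a sketch only; this is not how the pandas test suite is actually organized) is to parametrize existing-style CSV tests over the engine:

```python
# Sketch: run the same CSV test under every engine so that
# pyarrow-compatibility gaps show up as ordinary test failures.
import io

import pandas as pd
import pytest

@pytest.fixture(params=["c", "python", "pyarrow"])
def engine(request):
    return request.param

def test_basic_read_csv(engine):
    data = "a,b\n1,2\n3,4\n"
    result = pd.read_csv(io.StringIO(data), engine=engine)
    expected = pd.DataFrame({"a": [1, 3], "b": [2, 4]})
    pd.testing.assert_frame_equal(result, expected)
```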

jreback (Contributor, Author) commented Nov 14, 2018

Yep, this would be opt-in (as it also requires a newer pyarrow).

lithomas1 (Member) commented

take

jreback modified the milestones: Contributions Welcome, 1.3 Dec 31, 2020
lithomas1 modified the milestones: 1.3, 2.0 Apr 23, 2021
lithomas1 (Member) commented

Pushing this off of 1.3. The current implementation is too buggy to be landed, so 2.0 is a better target. I'm going to try reviving the stale PR for this, hopefully in the next month.

jorisvandenbossche (Member) commented Apr 29, 2021

@lithomas1 I was just looking at those issues/PRs with the idea to revive this as well. Can you give some explanation of what you think is buggy in the current implementation? (Assuming this is about #38370; looking at the comments there, it mostly needs an update for the latest master and a few comments addressed.)

If we think #38370 is still a good start, I can take a look at updating it for the latest master as a start.

lithomas1 (Member) commented

@jorisvandenbossche
#38370 is still a good start (it needs a non-trivial rebase due to the refactoring of the parsers in #39217).
It's buggy because it only passes ~25% of the tests (mostly due to unsupported things), and it doesn't pass a lot of tests even for supported things.

If landed individually, I would recommend disabling parse_dates and co. (which worked in my original version but was removed for simplicity in this one). There's also a lot of nasty stuff with partially supported parameters, which needs to error for unsupported cases rather than fail awkwardly (see the sketch after this comment). So I would highly recommend waiting for 2.0, as this should not be landed by itself.

Six more months of development + a pyarrow version bump (to 1.0) would do wonders. :) I can revive this right after I finish my other PR (#40687), but you are certainly welcome to beat me to it.
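
A hypothetical sketch of the up-front validation mentioned above; the option names are illustrative, not pandas's actual unsupported-option list:

```python
# Reject options the pyarrow engine cannot honour instead of letting them
# fail somewhere deep inside the parser. The set below is illustrative only.
_UNSUPPORTED_WITH_PYARROW = {"skipfooter", "chunksize", "iterator"}

def validate_pyarrow_kwargs(kwargs):
    unsupported = _UNSUPPORTED_WITH_PYARROW.intersection(kwargs)
    if unsupported:
        raise ValueError(
            f"The {sorted(unsupported)} option(s) are not supported with "
            "engine='pyarrow'"
        )
```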

lithomas1 (Member) commented

@jorisvandenbossche A little heads up that I'm going to be implementing #39383 (BytesIOWrapper), which will cause conflicts with #38370.

lithomas1 modified the milestones: 2.0, 1.4 Jul 6, 2021