-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: allow engine='pyarrow' in read_csv #23697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I would expect it to be more limited in functionality in general, so maybe not a full replacement for pandas's read_csv, but it could be used as the "fast path" for well behaved files. Also, as soon as we build support for multi-character and regex delimiters, you could deprecate |
I would be cautious here. Parsing or conversions may be stricter, less datatypes may be recognized, or the resulting datatypes may be different. I think it's better to make this an explicit option ( |
Yes, I definitely agree with an "opt-in". It would be helpful if the pandas CSV suite could be set up so that we can run those tests using the Arrow reader to test for compatibility (it would be helpful to know for example, if say 50% of the tests work). Compatibility with |
yep would be opt in (as also requires newer pyarrow) |
take |
Pushing this off 1.3. Current implementation is too buggy to be landed, so 2.0 is a better target. I'm going to try reviving the stale PR for this hopefully in the next month. |
@lithomas1 I was just looking at those issues/PRs with the idea to revive this as well. Can you give some explanation of what you think is buggy in the current implementation? (assuming this is about #38370, looking at the comments there, it mostly needs an update for latest master / few comments) If we think #38370 is still a good start, I can take a look at updating it for latest master as a start. |
@jorisvandenbossche If landed individually, I would recommend disabling parse_dates and co. (which worked on my original but was removed for simplicity in this). Theres also a lot of nasty stuff with partially supported parameters that needs to error for unsupported cases and not fail awkwardly. So, I would highly recommend waiting for 2.0, as this should not be landed by itself. Six more months of development + a pyarrow version bump(to 1.0) would do wonders. :) I can take revive this right after I finish my other PR(#40687) but you are certainly welcome to beat me to it. |
@jorisvandenbossche Little heads up that I'm going to be implementing #39383 (BytesIOWrapper) which will cause conflicts with #38370. |
we could conditionally use the new pyarrow csv parser as an engine (requires 0.11 IIRC). eventually leading to a replacement path for the existing code. There might be a number of restrictions on what options we can pass as the current parser is more full-featured, but I suspect most of the basic options work.
cc @gfyoung @pitrou @wesm @TomAugspurger
The text was updated successfully, but these errors were encountered: