ENH: Add engine_kwargs to read_csv #52301
Comments
Hey, thanks for your request. We should add engine_kwargs here, like for the other IO methods, which would accommodate this.
I don't think it should have or require engine_kwargs. The other existing arguments of pd.read_csv() are already translated to pa.read_csv() options, and the existing dtype argument of pd.read_csv() is a direct analogue of pa.read_csv()'s column_types, so it can be used directly.
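For illustration, here is a minimal sketch of that analogy (the file name and dtypes are made up; this is not code from the thread):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.csv as pv

# With the C/python engines, dtype is a name -> dtype mapping:
df_pandas = pd.read_csv("data.csv", dtype={"a": "int64", "b": "float64"})

# pyarrow's ConvertOptions.column_types plays the analogous role and
# disables type inference for the listed columns:
table = pv.read_csv(
    "data.csv",
    convert_options=pv.ConvertOptions(
        column_types={"a": pa.int64(), "b": pa.float64()}
    ),
)
df_arrow = table.to_pandas()
```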
This seems reasonable to me. Implementing this will be slightly tricky, though, since from what I remember, column_types doesn't play well with the names parameter. Also, I am pretty sure that I can have a look at this later this week, but if you want to take a stab at it before me, that's also fine by me.
@lithomas1 I'm not sure what you mean by column_types not playing well with names, since its documentation describes it as a mapping of column names to types. However, you might be right about column positions or a single global dtype. At the very least, it seems dtype could be passed through when provided as a dict keyed by names (or as a dict keyed by column indices if the names parameter is also provided, since we could do the mapping using that). Otherwise, if there's a cheap way to inspect the CSV file and get the column names without reading the whole thing, we could do that first before pa.read_csv(), which would add support for dtype provided as a mapping of column indices or as a single type. Unfortunately I'm overseas travelling at the moment and not in a good position to develop the capability myself right now.
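A rough sketch of the header-peek idea (the helper name and the use of the stdlib csv module are illustrative, not anything pandas does today):

```python
import csv

def positional_dtype_to_names(path, dtype_by_position):
    # Read only the header line to recover column names, then re-key a
    # position-based dtype mapping by name so it can be handed to
    # pyarrow's ConvertOptions(column_types=...).
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    return {header[i]: typ for i, typ in dtype_by_position.items()}

# e.g. {0: "int64", 2: "float64"} -> {"id": "int64", "price": "float64"}
```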
I meant the renamed column names from the names parameter in read_csv. Re: missing features in pyarrow, I think now is probably a good time to fix things upstream, since a decent chunk of people are using or interested in the arrow parser. For now I'll just special-case what pyarrow can handle and try to fix the missing cases upstream.
I'd rather use engine_kwargs here instead of special-casing things, because the arrow engine cannot deal with all dtype configurations and this will cause unnecessary confusion later on. Also, the hierarchy between the application of names and dtype should stay the same for all engines.
#34823 is the other issue related to engine_kwargs, but it's restricted to just the

I don't think we should allow users to pass options to the pyarrow read_csv call itself, since it's just going to be challenging to avoid conflicts with what we're passing in. The only thing from the read_csv options that we aren't doing ourselves that I think is probably useful for users is the

I think I agree on holding off on this for now. I think I kinda forgot about things like third-party ExtensionArrays that will be broken.
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
Regarding the pyarrow engine for read_csv():
Currently it appears that the column_types parameter is not provided to ConvertOptions of pyarrow.csv.read_csv(), even though it seems very analogous to the dtype option of pd.read_csv().
If provided, it would disable type inference for those columns and improve performance.
Currently the dtype parameter provided to pd.read_csv() is only used to convert the data types of the DataFrame after it is produced by pa.read_csv().to_pandas(), so it does not improve the performance of pa.read_csv() itself.
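Roughly speaking, the current path amounts to the following (an illustration of the observed behavior, not the actual pandas internals):

```python
import pyarrow.csv as pv

# Types are inferred by pyarrow during parsing...
df = pv.read_csv("data.csv").to_pandas()
# ...and the user-supplied dtype is only applied afterwards,
# so pa.read_csv() itself gains nothing from it.
df = df.astype({"a": "int64", "b": "float64"})
```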
Feature Description
All that would be needed is to create a mapping of pandas dtypes to PyArrow dtypes (maybe this already exists?).
This mapping could then be used to build column_types from dtype and pass it to ConvertOptions.
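A minimal sketch of what that could look like for plain numpy-backed dtypes (pa.from_numpy_dtype only covers those; extension dtypes would need extra handling, and the file name and dtype mapping are made up):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.csv as pv

dtype = {"a": "int64", "b": "float64"}  # user-supplied dtype mapping

# Translate each pandas/numpy dtype spec into a pyarrow DataType.
column_types = {
    name: pa.from_numpy_dtype(pd.api.types.pandas_dtype(spec))
    for name, spec in dtype.items()
}

table = pv.read_csv(
    "data.csv",
    convert_options=pv.ConvertOptions(column_types=column_types),
)
df = table.to_pandas()
```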
Alternative Solutions
Additional Context
No response