
ENH: Add engine_kwargs to read_csv #52301


Open
1 of 3 tasks
Finndersen opened this issue Mar 30, 2023 · 7 comments
Labels
Arrow pyarrow functionality Enhancement IO CSV read_csv, to_csv

Comments

@Finndersen

Finndersen commented Mar 30, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Regarding the pyarrow engine for read_csv():

Currently it appears that the column_types parameter is not passed to the ConvertOptions given to pyarrow.csv.read_csv(), even though it seems very analogous to the dtype option of pd.read_csv().

If provided, it would disable type inference for those columns and improve performance.

Currently the dtype parameter provided to pd.read_csv() is only used to convert the data types of the DataFrame after it is produced by pa.read_csv().to_pandas(), so it does not improve the performance of pa.read_csv().

Feature Description

All that would be needed is to create a mapping of pandas dtypes to PyArrow dtypes (maybe this already exists?).
This mapping would then be used to create column_types from dtype, which is passed to ConvertOptions.

Alternative Solutions

Additional Context

No response

@Finndersen Finndersen added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 30, 2023
@DeaMariaLeon DeaMariaLeon added IO CSV read_csv, to_csv Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 31, 2023
@phofl
Member

phofl commented Apr 3, 2023

Hey, thanks for your request. We should add engine_kwargs like for other io methods, which would accommodate this

@phofl phofl changed the title ENH: Provide column_types to ConvertOptions of pyarrow.csv.read_csv() for read_csv()'s pyarrow engine ENH: Add engine_kwargs to read_csv Apr 3, 2023
@Finndersen
Author

Hey, thanks for your request. We should add engine_kwargs like for other io methods, which would accommodate this

I don't think it should have or require engine_kwargs. Other existing arguments of pd.read_csv() are already translated to pa.read_csv() options, and the existing dtype argument of pd.read_csv() is a direct analogue of pa.read_csv()'s column_types, so it can be used directly.

@lithomas1
Member

This seems reasonable to me.

Implementing this will be slightly tricky, though, since from what I remember, the column_types parameter doesn't play too well with things such as names.

Also, I am pretty sure that column_types doesn't support passing an int for the column position whose dtype you want to specify, or just a dtype for the whole thing.

I can have a look at this later this week, but if you want to take a stab at it before me, that's also fine by me.

@Finndersen
Author

@lithomas1 I'm not sure what you mean by column_types not playing well with names, since its documentation says:
"Explicitly map column names to column types"

However you might be right about the column positions or global dtype.

At the very least, it seems dtype could be passed through if provided as a dict keyed by column names (or as a dict keyed by column indices if the "names" parameter is also provided, since we could do the mapping using that).

Otherwise, if there's a cheap way to inspect the CSV file and get its column names without reading the whole thing, we could do that before calling pa.read_csv(). That would add support for dtype provided as a mapping of column indices or as a single type.
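
A rough sketch of what that could look like, using only the standard library. Both helpers (peek_column_names and dtype_by_name) are hypothetical, and the sketch assumes the first record of the file is the header; it ignores read_csv options such as header, skiprows, or names.

```python
import csv
import io


def peek_column_names(text_stream):
    """Read only the first CSV record and return it as the list of column names."""
    return next(csv.reader(text_stream))


def dtype_by_name(dtype, column_names):
    """Expand a dtype spec into a dict keyed by column name.

    Accepts a single dtype for all columns, or a dict keyed by
    column name and/or integer column position.
    """
    if not isinstance(dtype, dict):
        return {name: dtype for name in column_names}
    resolved = {}
    for key, value in dtype.items():
        name = column_names[key] if isinstance(key, int) else key
        resolved[name] = value
    return resolved


stream = io.StringIO("id,price,code\n1,2.5,007\n2,3.0,042\n")
names = peek_column_names(stream)  # consumes the header record only

by_position = dtype_by_name({0: "int32", "code": "str"}, names)
global_dtype = dtype_by_name("float64", names)
print(names, by_position, global_dtype)
```

The resulting name-keyed dict is in the shape that column_types expects, once each dtype is translated to a pyarrow type.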

Unfortunately I'm travelling overseas at the moment and not in a good position to develop the capability myself right now.

@lithomas1
Member

I meant the renamed column names from the names parameter in read_csv.
(That's one parameter I didn't pass to pyarrow on purpose and I'll need to investigate again to find out why.)

Re: missing features in pyarrow, I think now is probably a good time to fix things upstream, since a decent chunk of people are using or interested in the arrow parser.

I think for now I'll just special case what pyarrow can handle and try to fix the missing cases upstream.

@phofl
Member

phofl commented Apr 4, 2023

I'd rather use engine_kwargs here instead of special-casing, because the arrow engine cannot deal with all dtype configurations. Special-casing will cause unnecessary confusion later on. Also, the hierarchy between applying names and dtype should stay the same for all engines.

@lithomas1
Member

#34823 is the other issue related to engine_kwargs, but it's restricted to just the to_pandas part of reading.

I don't think we should allow users to pass options to the pyarrow read_csv call itself since it's just going to be challenging to avoid conflicts with what we're passing in.
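
To make the conflict concern concrete, here is one hypothetical way a pass-through could refuse overlapping options. merge_engine_kwargs is an invented helper, not a pandas function, and the option names are only examples of ConvertOptions parameters; nothing here reflects what pandas actually sets.

```python
def merge_engine_kwargs(pandas_options: dict, engine_kwargs: dict) -> dict:
    """Combine options set by read_csv with user-supplied ones, rejecting overlaps."""
    conflicts = sorted(set(pandas_options) & set(engine_kwargs))
    if conflicts:
        raise ValueError(
            f"engine_kwargs conflicts with options already set by read_csv: {conflicts}"
        )
    return {**pandas_options, **engine_kwargs}


# Options the reader might set itself (illustrative values only).
pandas_options = {"strings_can_be_null": True, "include_columns": ["id", "code"]}

# No overlap: the user option is simply added.
merged = merge_engine_kwargs(pandas_options, {"auto_dict_encode": True})

# Overlap: the user tries to override something the reader already set.
try:
    merge_engine_kwargs(pandas_options, {"include_columns": ["price"]})
    conflict_message = None
except ValueError as exc:
    conflict_message = str(exc)
print(merged, conflict_message)
```

Raising on overlap is only one possible policy; silently preferring either side would be simpler but makes the surprising cases harder to notice.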

The only thing from the read_csv options that I think is probably useful for users, and that we aren't doing ourselves, is the auto_dict_encode feature, but I'm inclined to say that people should just make the pyarrow read_csv call themselves if they want it.

I think I agree on holding off on this for now. I think I kinda forgot about things like third-party ExtensionArrays that will be broken.

Projects
None yet
Development

No branches or pull requests

4 participants