ENH: Add engine_kwargs to read_csv #52301

Finndersen · 2023-03-30T11:56:54Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

Regarding the pyarrow engine for read_csv():

Currently it appears that the column_types parameter is not provided to ConvertOptions of pyarrow.csv.read_csv(), even though it seems very analogous to the dtype option of pd.read_csv().

If provided, it would disable type inference for those columns and improve performance.

Currently the dtype parameter provided to pd.read_csv() is only used to convert the data types of the DataFrame after it is produced by pa.read_csv().to_pandas(), so it does not improve the performance of pa.read_csv().

Feature Description

All that would be needed is a mapping from pandas dtypes to PyArrow dtypes (maybe this already exists?).
This mapping could then be used to create column_types from dtype and pass it to ConvertOptions.

Alternative Solutions

Additional Context

No response

phofl · 2023-04-03T21:41:53Z

Hey, thanks for your request. We should add engine_kwargs like for other io methods, which would accommodate this

Finndersen · 2023-04-03T22:06:03Z

> Hey, thanks for your request. We should add engine_kwargs like for other io methods, which would accommodate this

I don't think it should have or require engine_kwargs. Other existing normal args of pd.read_csv() are already translated to pa.read_csv() options, and the existing dtype arg of pd.read_csv() is a direct analogue of pa.read_csv()'s column_types, so it can be used directly.

lithomas1 · 2023-04-04T11:11:20Z

This seems reasonable to me.

Implementing this will be slightly tricky, though, since from what I remember, the column_types parameter doesn't play too well with things such as names.

Also, I am pretty sure that column_types doesn't support passing an int for the column position whose dtype you want to specify, or just a dtype for the whole thing.

I can have a look at this later this week, but if you want to take a stab at it before me, that's also fine by me.

Finndersen · 2023-04-04T11:25:52Z

@lithomas1 I'm not sure what you mean by column_types not playing well with names, since its documentation says:
"Explicitly map column names to column types"

However you might be right about the column positions or global dtype.

At the very least, it seems dtypes could be passed through if provided as a dict with names (or a dict with column indices if "names" parameter is also provided? Since we could do the mapping using that)

Otherwise, if there's a cheap way to inspect and get the column names in the CSV file without reading the whole thing, could do that first before pa.read_csv(), which would add support for dtype provided as mapping of column indices or single type.
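Both ideas can be sketched in plain Python: peek at the header row only, then normalise a dtype given by position or as a single type into a name-keyed dict. The helper names peek_column_names and resolve_dtype are hypothetical, and the sketch assumes column names are strings so that integer keys can be read as positions:

```python
import csv
import io


def peek_column_names(buffer):
    """Read only the first CSV row and return it as the list of column names."""
    return next(csv.reader(buffer))


def resolve_dtype(dtype, names):
    """Turn a dtype spec into a dict keyed by column name.

    Integer keys are treated as column positions; a non-dict
    dtype is applied to every column.
    """
    if not isinstance(dtype, dict):
        return {name: dtype for name in names}
    resolved = {}
    for key, value in dtype.items():
        if isinstance(key, int):
            resolved[names[key]] = value
        else:
            resolved[key] = value
    return resolved


# Made-up sample data, for illustration only.
sample = io.StringIO("id,price,label\n1,2.5,a\n2,3.5,b\n")
header = peek_column_names(sample)
by_position = resolve_dtype({0: "int32", "price": "float32"}, header)
broadcast = resolve_dtype("float64", header)
```

For a real file, only the first row is consumed, so the cost is small compared with parsing the whole CSV. If the names parameter is provided, it could be used in place of the peeked header.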

Unfortunately I'm overseas travelling at the moment and not in a good position to develop the capability myself right now

lithomas1 · 2023-04-04T11:34:52Z

I meant the renamed column names from the names parameter in read_csv.
(That's one parameter I didn't pass to pyarrow on purpose and I'll need to investigate again to find out why.)

Re: missing features in pyarrow, I think now is probably a good time to fix things upstream, since a decent chunk of people are using or interested in the arrow parser.

I think for now I'll just special case what pyarrow can handle and try to fix the missing cases upstream.

phofl · 2023-04-04T11:36:45Z

I'd rather use engine_kwargs here instead of special casing stuff because the arrow engine can not deal with all dtype configurations. This will cause unnecessary confusion later on. Also, hierarchy between application of names and dtype should stay the same for all engines

lithomas1 · 2023-04-04T17:27:50Z

#34823 is the other issue related to engine_kwargs, but it's restricted to just the to_pandas part of reading.

I don't think we should allow users to pass options to the pyarrow read_csv call itself since it's just going to be challenging to avoid conflicts with what we're passing in.

The only thing from the read_csv options that I think is probably useful for users, and that we aren't doing ourselves, is the auto_dict_encode feature, but I'm inclined to say that people should just make the pyarrow read_csv call themselves if they want it.

I think I agree on holding off on this for now. I think I kinda forgot about things like third-party ExtensionArrays that will be broken.

Finndersen added the Enhancement and Needs Triage labels Mar 30, 2023

DeaMariaLeon added the IO CSV and Arrow labels and removed the Needs Triage label Mar 31, 2023

phofl changed the title ~~ENH: Provide column_types to ConvertOptions of pyarrow.csv.read_csv() for read_csv()'s pyarrow engine~~ ENH: Add engine_kwargs to read_csv Apr 3, 2023

MCRE-BE mentioned this issue Apr 12, 2023

BUG: pd.read_csv does not work with nullable_dtype coercion #52594
