ENH: Add argument "multiprocessing" to pd.read_csv() method #37955
Comments
We have multi-threaded tests for this here: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/parser/test_multi_thread.py. We do have an engine= argument for many functions already, but these are deliberately not multiprocessing. This introduces a whole level of complexity that pandas does not need. So not completely averse, but I would have to be convinced that this is a good idea in pandas itself.
Thanks for your quick reply. I had a look at the link provided, but it is just a test file and not a final implementation? Also, there have been no further commits since June. The engine argument accepts "c" or "python" in read_csv(), and, like you said, those are not multithreaded. I use pandas every day, and I would say that having this option without dropping to a lower level of coding would be quite beneficial in my daily work, so it is very likely to help a lot more people. Some of the advantages of this implementation would be:
Multiprocessing has significant overhead, and multiprocessing IO can harm performance when files are on an HDD. It would be useful to know what size and number of files one would need to see performance benefits. If you want better read/write performance, I would work with parquet if possible. Not only is it much more performant, it also works better with pandas dtypes.
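For reference, a minimal sketch of the parquet round trip suggested above (assumes a parquet engine such as pyarrow is installed; the file names are only illustrative):

import pandas as pd

df = pd.read_csv("table_1.csv")              # original CSV
df.to_parquet("table_1.parquet")             # columnar format, preserves dtypes
df2 = pd.read_parquet("table_1.parquet")     # typically much faster to read back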
From personal experience, when there are many CSV files to read, I can tell there is definitely a speed benefit, since I have benchmarked this myself. I've done it with joblib and with multiprocessing, and both gave significant time savings. I am happy to generate some stats based on file size, number of files, and number of cores available. I just need to know which file sizes (intervals like 1MB, 10MB, 100MB, 1GB) and how many files you think will be required for these stats. Also, what minimum and maximum number of cores should I test? I also appreciate your comments about the performance of parquet files, thanks. I'm just trying to contribute to pandas.
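For concreteness, a rough sketch of the kind of timing comparison being proposed (the glob pattern and worker count are placeholders, not the benchmark actually run):

import glob
import time
from multiprocessing import Pool

import pandas as pd

if __name__ == "__main__":
    files = glob.glob("table_*.csv")   # placeholder pattern

    start = time.perf_counter()
    serial = [pd.read_csv(f) for f in files]
    print("serial:", time.perf_counter() - start)

    start = time.perf_counter()
    with Pool(processes=4) as pool:    # vary processes to compare core counts
        parallel = pool.map(pd.read_csv, files)
    print("4 processes:", time.perf_counter() - start)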
@nf78 you should use dask for this task; it's very well designed and will just work. Building this into pandas introduces a lot of complexity for relatively little gain.
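For illustration, a minimal sketch of the dask route (assumes dask[dataframe] is installed; the glob pattern is only an example):

import dask.dataframe as dd

ddf = dd.read_csv("table_*.csv")   # accepts glob patterns; pieces are read in parallel on compute
df = ddf.compute()                 # materialize as a regular pandas DataFrame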
Building parallelization into pandas seems like the wrong way to go here, when there are specialized tools precisely for parallelizing IO.
The whole purpose of this issue is to provide pandas users with a straightforward solution, rather than relying on a different tool or backend. @jreback, I appreciate your comment that dask is very well designed, but even if this is considered to give relatively little gain, I have made my own implementation with joblib and multiprocessing, and I would like to contribute it to pandas. For this reason I don't think it is an overly complex implementation. @AlexKirko, I also appreciate that there are other "specialised tools", but the implementation I've done is actually pretty simple and would give pandas wider independence from other tools or backends like modin or dask. Since I already have the code implemented, using dask just for the sake of reading CSV files means more modules to install in a production environment and more memory usage from extra module imports. Please let me know if it is still of interest for me to make a commit with the implementation I've done, or if there are any further suggestions or comments.
@nf78 You may be concerned with only reading CSVs, but we also need to consider API consistency. "Why does pandas multiprocess reading CSVs but not writing?" "Why does pandas multiprocess reading CSVs but not reading parquet?" "Why does pandas multiprocess IO but not other ops?" See Scope Creep. It becomes a large source of complexity when there are already other tools that can do the job.
@rhshadrach I'm not only concerned about reading CSVs, I'm just raising a single issue (Rome wasn't built in a day), and if there is engagement I can happily contribute the other multiprocessing support for reading and writing CSV/parquet/etc. However, if there is no intention to implement multiprocessing in the pandas module, I fully understand.
@nf78 we do support a So while I don't think pandas proper would want to maintain these backends at all (this is a large increase in complexity), we are not averse to providing hooks to enable backends, e.g.
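To make the "hooks" idea concrete, here is a hypothetical sketch of a registry/dispatch pattern; this is not pandas' actual backend mechanism, and the names register_read_backend and read_csv_with_backend are made up for illustration:

import pandas as pd

_READ_BACKENDS = {}  # hypothetical registry: backend name -> reader callable

def register_read_backend(name, func):
    """Let a third-party package register a parallel CSV reader."""
    _READ_BACKENDS[name] = func

def read_csv_with_backend(paths, backend=None, **kwargs):
    """Dispatch to a registered backend, falling back to the default reader."""
    if backend is None:
        return pd.read_csv(paths, **kwargs)
    return _READ_BACKENDS[backend](paths, **kwargs)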
@jreback providing hooks to enable backends seems like a good approach to me. I will look at the implementations made in those IO backends like pyarrow, and will get back to you ASAP.
I think once a library that provides "multiprocessing" IO reading has been identified, we can open a new issue to specifically discuss how to implement it in pandas. Otherwise, as mentioned, implementing native multiprocessing doesn't seem ideal. Closing for now.
Is your feature request related to a problem?
When the method pd.read_csv() is called, it unfortunately doesn't take advantage of the multiprocessing module, which makes reading multiple datasets inefficient, especially when more CPU cores are available.
Other modules like modin or dask already implement this, but I think that pandas should implement it itself, if called for.
Describe the solution you'd like
It should work with the multiprocessing module out of the box as an initial enhancement, and then in the future support other possible backends like joblib.
A list of filenames should be passed. An example of the application would be:
pd.read_csv(list_of_filenames, multiprocessing=True)
pd.read_csv(glob.glob('table_*.csv'), multiprocessing=True)
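Note that the calls above are the proposed API, not something pd.read_csv() accepts today; for comparison, a minimal sketch of obtaining the same effect with the standard library (file names are illustrative):

import glob
from multiprocessing import Pool

import pandas as pd

if __name__ == "__main__":
    filenames = glob.glob("table_*.csv")
    with Pool() as pool:                        # one worker per available core by default
        frames = pool.map(pd.read_csv, filenames)
    df = pd.concat(frames, ignore_index=True)   # drop the per-file indices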
API breaking implications
This should not change established behavior, since the default value for the "multiprocessing" argument would be None.
The total memory consumption should be the same; it is just consumed much faster.
With this option, each file keeps its own index, so indices are likely to be duplicated or out of order in the combined result, but the user can call reset_index() afterwards if needed.
Describe alternatives you've considered
Additional context
I have also considered extra backend options for future enhancements of this implementation, like joblib, ray, dask.
#NOTE: I already have a proof-of-concept for the solution, so I can work a bit further on it, commit, and make a pull request.
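A minimal sketch of what such a proof-of-concept could look like with joblib (the function name read_csvs_parallel is made up; assumes joblib is installed):

import pandas as pd
from joblib import Parallel, delayed

def read_csvs_parallel(filenames, n_jobs=-1, **kwargs):
    """Read each CSV in a separate worker and concatenate the results."""
    frames = Parallel(n_jobs=n_jobs)(
        delayed(pd.read_csv)(f, **kwargs) for f in filenames
    )
    return pd.concat(frames, ignore_index=True)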