
ENH: Add argument "multiprocessing" to pd.read_csv() method #37955


Closed
nf78 opened this issue Nov 19, 2020 · 12 comments
Labels: Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

nf78 commented Nov 19, 2020

Is your feature request related to a problem?

The pd.read_csv() method does not take advantage of the multiprocessing module, which makes reading multiple datasets inefficient, especially when spare cores are available.

Other libraries like modin and dask already implement this, but I think pandas should support it natively when asked to.

Describe the solution you'd like

As an initial enhancement, it should work with the multiprocessing module out of the box, with support for other backends such as joblib added later.

A list of filenames would be passed. An example of the application would be:

```python
pd.read_csv(list_of_filenames, multiprocessing=True)

pd.read_csv(glob.glob('table_*.csv'), multiprocessing=True)
```
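
Semantically, the option would behave roughly like the sketch below (the helper name is illustrative, not part of the proposal):

```python
import glob
from multiprocessing import Pool

import pandas as pd

def read_csvs_parallel(filenames):
    # Read each file in a separate worker process, then concatenate.
    with Pool() as pool:
        dfs = pool.map(pd.read_csv, filenames)
    return pd.concat(dfs)

if __name__ == "__main__":  # guard required when spawning worker processes
    df = read_csvs_parallel(glob.glob("table_*.csv"))
```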

API breaking implications

This should not change established behavior, since the "multiprocessing" argument would default to None.

Memory consumption should be the same; it is just consumed much faster.

With this option, the indices from each file are preserved, so the combined result will likely contain duplicate index values in a different order; the user can call reset_index() afterwards if needed.

Describe alternatives you've considered

[this should provide a description of any alternative solutions or features you've considered]

Additional context

I have also considered extra backend options for future enhancements of this implementation, such as joblib, ray, and dask.

Note: I already have a proof of concept for the solution, so I can work on it a bit further, commit it, and open a pull request.

nf78 added the Enhancement and Needs Triage labels on Nov 19, 2020
jreback (Contributor) commented Nov 19, 2020

we have multi threaded tests for this here: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/parser/test_multi_thread.py

we do have an engine= argument for many functions already - but these are designed not to be multiprocessing

this is introducing a whole level of complexity that pandas does not need

so not completely averse but would have to be convinced that this is a good idea in pandas itself

nf78 (Author) commented Nov 19, 2020

Thanks for your quick reply.

I had a look at the link provided, but isn't that just a test file rather than a final implementation? There have also been no further commits since June.

The engine argument in read_csv() selects "c" or "python", and as you said, those are not multithreaded.

I use pandas every day, and I would say that having this option without dropping to lower-level code would be quite beneficial in my daily work, so it is very likely to help a lot more people.

Some of the advantages of this implementation would be:

  • Faster loading when reading multiple CSV datasets
  • No need to use extra modules like modin, joblib, dask, etc.
  • Built-in support for reading multiple files, which doesn't exist at the moment without using glob and looping over the list of files (see the sketch below).
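
For context, the current workaround looks something like this (a minimal sketch; the file pattern is just an example):

```python
import glob

import pandas as pd

# Today: collect the matching files and read them one by one.
files = sorted(glob.glob("table_*.csv"))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
```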

rhshadrach (Member) commented Nov 29, 2020

Multiprocessing has significant overhead, and multiprocessing I/O can harm performance when files are on an HDD. It would be useful to know what size/number of files one would need to see performance benefits.

If you want better read/write performance, then I would work with parquet if possible. Not only is it much more performant, but it also works better with pandas dtypes.
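
For comparison, the parquet round trip is simple (a minimal sketch; to_parquet/read_parquet need pyarrow or fastparquet installed):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Columnar, typed storage: usually far faster than CSV to read back,
# and dtypes survive the round trip.
df.to_parquet("table.parquet")
df2 = pd.read_parquet("table.parquet")
```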

nf78 (Author) commented Dec 2, 2020

> Multiprocessing has significant overhead, and multiprocessing I/O can harm performance when files are on an HDD. It would be useful to know what size/number of files one would need to see performance benefits.
>
> If you want better read/write performance, then I would work with parquet if possible. Not only is it much more performant, but it also works better with pandas dtypes.

From personal experience, when there are many CSV files to read there is definitely a speed benefit; I have benchmarked this myself. I've done it with both joblib and multiprocessing, and both saved significant time.

Now, I am happy to generate some stats based on file size, number of files, and number of available cores. I just need to know which file sizes (intervals like 1 MB, 10 MB, 100 MB, 1 GB) and how many files you think will be required for these stats. Also, what minimum and maximum number of cores should I test?
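
For reference, the joblib pattern I benchmarked is roughly the following (a simplified sketch, not the exact benchmark code):

```python
import glob

import pandas as pd
from joblib import Parallel, delayed

# Read each file in a separate worker, then concatenate the results.
files = sorted(glob.glob("table_*.csv"))
dfs = Parallel(n_jobs=-1)(delayed(pd.read_csv)(f) for f in files)
df = pd.concat(dfs, ignore_index=True)
```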

I also appreciate your comments about the performance of parquet files, thanks. I'm just trying to contribute to pandas.

jreback (Contributor) commented Dec 2, 2020

@nf78 you should use dask for this task

it's very well designed and will just work

building this into pandas introduces a lot of complexity for relatively little gain
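
for example, the dask equivalent is roughly (a minimal sketch; dask.dataframe.read_csv accepts glob patterns directly):

```python
import dask.dataframe as dd

# dask reads the matching files in parallel as a lazy dataframe;
# .compute() materializes it as a regular pandas DataFrame.
df = dd.read_csv("table_*.csv").compute()
```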

AlexKirko (Member) commented Dec 3, 2020

Building parallelization into pandas seems like the wrong way to go here, when there are specialized tools precisely for parallelizing IO.

nf78 (Author) commented Dec 6, 2020

The whole purpose of this issue is to provide pandas users with a direct solution, rather than relying on a different tool or backend.

@jreback, I appreciate your comments that dask is very well designed, but even if this is considered to give relatively little gain, I have made my own implementation with joblib and multiprocessing, and I would like to contribute it to pandas. For this reason I don't think it is an overly complex implementation.

@AlexKirko, I also appreciate that there are other "specialized tools", but the implementation I've done is actually pretty simple and would give pandas wider independence from other tools or backends like modin or dask.

Since I already have the code implemented, using dask just for the sake of reading CSV files means more modules to install in a production environment and more memory usage from extra module imports.

Please let me know whether it is still of interest for me to commit the implementation I've done, or if there are any further suggestions or comments.

rhshadrach (Member) commented
@nf78 You may be concerned only with reading CSVs, but we also need to consider API consistency. "Why does pandas multiprocess reading CSVs but not writing them?" "Why does pandas multiprocess reading CSVs but not reading parquet?" "Why does pandas multiprocess IO but not other ops?" See Scope Creep.

It becomes a large complexity burden when there are already other tools that can do the job.

nf78 (Author) commented Dec 8, 2020

@rhshadrach I'm not only concerned about reading CSVs; I'm just raising a single issue (Rome wasn't built in a day), and if there is engagement I can happily contribute multiprocessing for the other read and write methods (CSV/parquet/etc.).

However, if there is no intention to implement multiprocessing in pandas, I fully understand.

jreback (Contributor) commented Dec 8, 2020

@nf78 we do support an engine= keyword for various computation backends, e.g. rolling, groupby, and in some io backends, e.g. python, pyarrow and so on.

So while I don't think pandas proper would want to maintain these backends at all (that is a great increase in complexity), we are not averse to providing hooks to enable them, e.g. engine='joblib|multiprocessing|dask|whatever' could be a target on a csv reader. I would expect this to be implemented in an external package, with pandas then taking a run-time dependency on it to facilitate this.
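
To illustrate, such a hook could look something like this (purely a hypothetical sketch; no such registry exists in pandas, and all names here are invented):

```python
import pandas as pd

# Hypothetical registry: nothing like this exists in pandas today.
_CSV_ENGINES = {}

def register_csv_engine(name, reader):
    """Let an external package register a parallel CSV reader."""
    _CSV_ENGINES[name] = reader

def read_csv_dispatch(paths, engine="c", **kwargs):
    # Dispatch to a registered third-party engine if one matches;
    # otherwise fall back to the built-in reader.
    if engine in _CSV_ENGINES:
        return _CSV_ENGINES[engine](paths, **kwargs)
    return pd.read_csv(paths, **kwargs)
```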

nf78 (Author) commented Dec 10, 2020

@jreback providing hooks to enable backends seems like a good approach to me. I will look at the implementations of those io backends like pyarrow, and will get back to you ASAP.

mroeschke (Member) commented
I think once a library that provides multiprocessing IO reading has been identified, we can open a new issue to discuss specifically how to integrate it into pandas. Otherwise, as mentioned, implementing native multiprocessing doesn't seem ideal. Closing for now.
