ENH: Add argument "multiprocessing" to pd.read_csv() method #37955
Comments
We have multi-threaded tests for this here: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/parser/test_multi_thread.py. We do have an engine= argument for many functions already, but these are deliberately not multiprocessing. This introduces a whole level of complexity that pandas does not need. So not completely averse, but I would have to be convinced that this is a good idea in pandas itself.
Thanks for your quick reply. I had a look at the link provided, but it is just a test file and not a final implementation? Also, there have been no further commits since June. The engine argument accepts "c" or "python" in read_csv(), and, like you said, those are not multithreaded. I use pandas every day, and I would say that having this option without dropping to a lower level of coding would be quite beneficial in my daily work, so it is very likely to help a lot more people. Some of the advantages of this implementation would be:
Multiprocessing has significant overhead, and multiprocessing IO can harm performance when files are on an HDD. It would be useful to know what size and number of files one would need to see performance benefits. If you want better read/write performance, I would work with parquet if possible. Not only is it much more performant, it also works better with pandas dtypes.
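For reference, a minimal sketch of the parquet round trip suggested above (assumes a parquet engine such as pyarrow is installed; the file names are only illustrative):

import pandas as pd

df = pd.read_csv("table_1.csv")              # original CSV
df.to_parquet("table_1.parquet")             # columnar format, preserves dtypes
df2 = pd.read_parquet("table_1.parquet")     # typically much faster to read back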
From personal experience, when there are many CSV files to read, I can tell there is definitely a speed benefit, since I have benchmarked this myself. I've done it with joblib and with multiprocessing, and both gave significant time savings. I am happy to generate some stats based on file size, number of files, and number of cores available. I just need to know which file sizes (intervals like 1MB, 10MB, 100MB, 1GB) and how many files you think will be required for these stats. Also, what minimum and maximum number of cores should I test? I also appreciate your comments about the performance of parquet files, thanks. I'm just trying to contribute to pandas.
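For concreteness, a rough sketch of the kind of timing comparison being proposed (the glob pattern and worker count are placeholders, not the benchmark actually run):

import glob
import time
from multiprocessing import Pool

import pandas as pd

if __name__ == "__main__":
    files = glob.glob("table_*.csv")   # placeholder pattern

    start = time.perf_counter()
    serial = [pd.read_csv(f) for f in files]
    print("serial:", time.perf_counter() - start)

    start = time.perf_counter()
    with Pool(processes=4) as pool:    # vary processes to compare core counts
        parallel = pool.map(pd.read_csv, files)
    print("4 processes:", time.perf_counter() - start)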
@nf78 you should use dask for this task; it's very well designed and will just work. Building this into pandas introduces a lot of complexity for relatively little gain.
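For illustration, a minimal sketch of the dask route (assumes dask[dataframe] is installed; the glob pattern is only an example):

import dask.dataframe as dd

ddf = dd.read_csv("table_*.csv")   # accepts glob patterns; pieces are read in parallel on compute
df = ddf.compute()                 # materialize as a regular pandas DataFrame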
Building parallelization into pandas seems like the wrong way to go here, when there are specialized tools precisely for parallelizing IO.
The whole purpose of this issue is to provide pandas users with a straightforward solution, rather than relying on a different tool or backend. @jreback, I appreciate your comment that dask is very well designed, but even if this is considered to give relatively little gain, I have made my own implementation with joblib and multiprocessing, and I would like to contribute it to pandas. For this reason I don't think it is an overly complex implementation. @AlexKirko, I also appreciate that there are other "specialised tools", but the implementation I've done is actually pretty simple and would give pandas wider independence from other tools or backends like modin or dask. Since I already have the code implemented, using dask just for the sake of reading CSV files means more modules to install in a production environment and more memory usage from extra module imports. Please let me know if it is still of interest for me to make a commit with the implementation I've done, or if there are any further suggestions or comments.
@nf78 You may be concerned with only reading CSVs, but we also need to consider API consistency. "Why does pandas multiprocess reading CSVs but not writing?" "Why does pandas multiprocess reading CSVs but not reading parquet?" "Why does pandas multiprocess IO but not other ops?" See Scope Creep. It becomes a large source of complexity when there are already other tools that can do the job.
@rhshadrach I'm not only concerned about reading CSVs, I'm just raising a single issue (Rome wasn't built in a day), and if there is engagement I can happily contribute the other multiprocessing support for reading and writing CSV/parquet/etc. However, if there is no intention to implement multiprocessing in the pandas module, I fully understand.
@nf78 we do support a So while I don't think pandas proper would want to maintain these backends at all (this is a large increase in complexity), we are not averse to providing hooks to enable backends, e.g.
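To make the "hooks" idea concrete, here is a hypothetical sketch of a registry/dispatch pattern; this is not pandas' actual backend mechanism, and the names register_read_backend and read_csv_with_backend are made up for illustration:

import pandas as pd

_READ_BACKENDS = {}  # hypothetical registry: backend name -> reader callable

def register_read_backend(name, func):
    """Let a third-party package register a parallel CSV reader."""
    _READ_BACKENDS[name] = func

def read_csv_with_backend(paths, backend=None, **kwargs):
    """Dispatch to a registered backend, falling back to the default reader."""
    if backend is None:
        return pd.read_csv(paths, **kwargs)
    return _READ_BACKENDS[backend](paths, **kwargs)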
@jreback providing hooks to enable backends seems like a good approach to me. I will look at the implementations made in those IO backends like pyarrow, and will get back to you ASAP.
I think once a library that provides "multiprocessing" IO reading has been identified, we can open a new issue to specifically discuss how to implement it in pandas. Otherwise, as mentioned, implementing native multiprocessing doesn't seem ideal. Closing for now.
Is your feature request related to a problem?
When the method pd.read_csv() is called, it unfortunately doesn't take advantage of the multiprocessing module, which makes reading multiple datasets inefficient, especially when more CPU cores are available.
Other modules like modin or dask already implement this, but I think that pandas should implement it itself, if called for.
Describe the solution you'd like
It should work with the multiprocessing module out of the box as an initial enhancement, and then in the future support other possible backends like joblib.
A list of filenames should be passed. An example of the application would be:
pd.read_csv(list_of_filenames, multiprocessing=True)
pd.read_csv(glob.glob('table_*.csv'), multiprocessing=True)
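Note that the calls above are the proposed API, not something pd.read_csv() accepts today; for comparison, a minimal sketch of obtaining the same effect with the standard library (file names are illustrative):

import glob
from multiprocessing import Pool

import pandas as pd

if __name__ == "__main__":
    filenames = glob.glob("table_*.csv")
    with Pool() as pool:                        # one worker per available core by default
        frames = pool.map(pd.read_csv, filenames)
    df = pd.concat(frames, ignore_index=True)   # drop the per-file indices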
API breaking implications
This should not change established behavior, since the default value for the "multiprocessing" argument would be None.
The total memory consumption should be the same; it is just consumed much faster.
With this option, each file keeps its own index, so indices are likely to be duplicated or out of order in the combined result, but the user can call reset_index() afterwards if needed.
Describe alternatives you've considered
Additional context
I have also considered extra backend options for future enhancements of this implementation, like joblib, ray, dask.
#NOTE: I already have a proof-of-concept for the solution, so I can work a bit further on it, commit, and make a pull request.
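A minimal sketch of what such a proof-of-concept could look like with joblib (the function name read_csvs_parallel is made up; assumes joblib is installed):

import pandas as pd
from joblib import Parallel, delayed

def read_csvs_parallel(filenames, n_jobs=-1, **kwargs):
    """Read each CSV in a separate worker and concatenate the results."""
    frames = Parallel(n_jobs=n_jobs)(
        delayed(pd.read_csv)(f, **kwargs) for f in filenames
    )
    return pd.concat(frames, ignore_index=True)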