DIS: Keywords for multi-threading capabilities #43313


Open · lithomas1 opened this issue on Aug 30, 2021 · 16 comments
Labels: API Design · IO Data · Multithreading · Needs Discussion

Comments

@lithomas1 (Member)

With the addition of the new pyarrow engine, we now have the option to use multiple threads to read a CSV file. (This is also controllable through the pyarrow.set_cpu_count function.)

Should we expose a keyword (such as num_threads, maybe) to the user, or just add an example in the docs (for this case, redirecting to pyarrow.set_cpu_count)? In the case of read_csv, this keyword would probably only apply to the pyarrow engine; however, it is worth noting that we have had multiple feature requests for parallel CSV reading (e.g. #37955), and it is probably worth being able to configure the number of threads used if we offer multithreading.

Personally, I would prefer having a keyword: if we decide to add more I/O engines with multithreading capabilities, it would be more convenient to control this option through a keyword.
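
For concreteness, a hypothetical sketch of the keyword variant (num_threads is not part of the current read_csv signature; the name is illustrative):

```python
import pandas as pd

# Hypothetical keyword; not part of the current read_csv signature.
df = pd.read_csv("data.csv", engine="pyarrow", num_threads=4)
```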

cc @pandas-dev/pandas-core

@lithomas1 added the API Design, IO Data, and Needs Discussion labels on Aug 30, 2021
@mroeschke (Member)

What is done for numba capabilities (engine="numba"), and is the most flexible IMO, is to pair engine with an engine_kwargs keyword that accepts a dict of engine configurations, e.g. {"set_cpu_count": 2}.

This way the pandas API isn't tied to engine-specific constructs while still catering to whatever capabilities each engine supports.
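
A minimal sketch of what that could look like for read_csv (hypothetical: read_csv does not currently accept engine_kwargs, and the "set_cpu_count" key is illustrative):

```python
import pandas as pd

# Hypothetical API: engine_kwargs would be passed through to the engine
# untouched, so each engine can document its own supported keys.
df = pd.read_csv(
    "data.csv",
    engine="pyarrow",
    engine_kwargs={"set_cpu_count": 2},
)
```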

@jreback (Contributor) commented Aug 31, 2021

We pass through n_threads in the parquet reader, so I wouldn't object to a matching keyword or @mroeschke's suggestion.

@lithomas1 (Member, Author)

Personally, I think it might be better to just use a standard keyword argument, since it could be confusing if this changes between engines. I think this option would probably be popular enough to warrant it.

Having a standard keyword to control multi-threading could also pave the way for a global option to configure this for everything.

(By the way, n_threads doesn't exist in pyarrow anymore; you have to use pyarrow.set_cpu_count.)

WDYT?
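
For reference, a minimal sketch of the current pyarrow-level workaround (real APIs; the file name is illustrative):

```python
import pyarrow as pa
import pandas as pd

# pyarrow's thread pool is process-global, so this affects all pyarrow
# work in the process, not just this one read.
pa.set_cpu_count(4)

df = pd.read_csv("data.csv", engine="pyarrow")
```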

@rhshadrach (Member)

I think we should prefer usage of engine_kwargs when available. This makes it clear to the user that the option depends on which engine they are using. It also lessens our technical debt as engines come and go and change argument names.

@brianpenghe
Has this been solved? Reading a large CSV is too slow with only one thread!

@phofl (Member) commented Jan 5, 2022

It does not seem so.

@jreback (Contributor) commented Jan 5, 2022

You can do multi-threaded reading with engine='pyarrow' (in 1.4).

@jbrockmendel (Member)

Out of curiosity, in what scenario with multi-threading available would you not want to crank it up to 11? I guess debugging?

@bashtage (Contributor)

> Out of curiosity, in what scenario with multi-threading available would you not want to crank it up to 11? I guess debugging?

When you use multi-processing, you almost always want to limit or eliminate multi-threading. For example, when using linear algebra on a Spark cluster, failing to set MKL_NUM_THREADS=1 or OMP_NUM_THREADS=1 can lead to a situation where you end up with ncore * ncore threads competing for ncore resources, causing crippling congestion.
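
As a sketch, the usual way to pin those pools is via environment variables, set before the numerical libraries initialize their thread pools:

```python
import os

# Must be set before importing numpy/scipy; the BLAS/OpenMP thread
# pools are sized when the libraries are first loaded.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # BLAS calls in this process now run single-threaded
```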

@jbrockmendel added the Multithreading label on Feb 14, 2023
@rhshadrach (Member)

> Out of curiosity, in what scenario with multi-threading available would you not want to crank it up to 11? I guess debugging?

I'll also add multi-user servers, where you want to be (somewhat) nice to other processes.

@jorisvandenbossche (Member)

In general when you have multiple libraries or functions each using multi-threading with their own thread pool (nested parallelism), you can easily get what is called "oversubscription".

A typical example is scikit-learn, where you could do several runs of a model with varying parameters (e.g. for parameter optimization) in parallel, but the individual model algorithm (e.g. using MKL or BLAS) might then again run things in parallel. To cope with this kind of issue, they developed a package, threadpoolctl (https://github.com/joblib/threadpoolctl), to better control the number of threads that the various libraries use.
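
A minimal sketch of threadpoolctl's context-manager API:

```python
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(2000, 2000)

# Cap native BLAS thread pools to one thread for this block only.
with threadpool_limits(limits=1, user_api="blas"):
    b = a @ a  # runs single-threaded regardless of the global pool size
```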

@jbrockmendel (Member)

Thanks for those explanations of use cases for disabling/limiting multi-threading. They all seem like cases where you'd want to do it across the board at a process level, so they look like a use case for pd.set_option as opposed to fine-grained keywords.
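
A hypothetical sketch of what such a global option could look like (no such option exists in pandas today; the option name is invented for illustration):

```python
import pandas as pd

# Hypothetical option name; pandas has no such setting today.
pd.set_option("io.num_threads", 1)

df = pd.read_csv("data.csv", engine="pyarrow")  # would honor the cap
```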

@WillAyd (Member) commented Feb 16, 2023

Out of curiosity, I wanted to see what callgrind thought about our read_csv timings, using the 5 GB CSV file of October data taken from here:

https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store/discussion

Here are the results:

[callgrind call-graph screenshot for read_csv]

I am by no means an expert on the topic, but my initial thought in reading these results is that I/O itself is not much of a bottleneck, and by extension I'm not sure adding threads to the mix would buy us a lot. The big bottlenecks appear to be the tokenize_bytes and convert_with_dtype functions (the latter especially with strings).

@jorisvandenbossche (Member)

> I/O itself is not much of a bottleneck, and by extension I'm not sure adding threads to the mix would buy us a lot

But also the parsing/tokenizer step can in theory be parallelized?

@Dr-Irv (Contributor) commented Feb 16, 2023

> But also the parsing/tokenizer step can in theory be parallelized?

Beware of the GIL! I was thinking you could do something by chopping up the file horizontally (e.g., if you have 1M rows, split it into 4 pieces of 250K rows each), process the splits in parallel into 4 DataFrames, then concat all of them.

But I think (although I'm not sure) that you might not get speedups, because the different threads would hit locks due to the GIL. Even if the GIL isn't involved, if the threads share a malloc() implementation way down under the hood, there will be a lock there that could end up constraining any speedup. I ran into this many years ago when writing some C code that parallelized an algorithm.
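
A rough sketch of the split-and-concat idea, using processes rather than threads to sidestep the GIL. Row-count chunking is used here for illustration (a real implementation would split on byte offsets), and the helper names are invented:

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

import pandas as pd

def read_piece(path, names, start, nrows):
    # skiprows=start + 1 skips the header plus rows owned by earlier
    # workers; note that each piece still infers dtypes independently.
    return pd.read_csv(path, skiprows=start + 1, nrows=nrows,
                       header=None, names=names)

def parallel_read_csv(path, total_rows, n_workers=4):
    names = list(pd.read_csv(path, nrows=0).columns)  # header only
    step = -(-total_rows // n_workers)  # ceiling division
    starts = list(range(0, total_rows, step))
    with ProcessPoolExecutor(n_workers) as pool:
        pieces = pool.map(partial(read_piece, path, names),
                          starts, [step] * len(starts))
        return pd.concat(pieces, ignore_index=True)
```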

@WillAyd (Member) commented Feb 16, 2023

I think it can be parallelized, but I also think we are mixing up threads and processes. Assuming the tokenizer is CPU-bound (which I believe it is; I don't see any I/O therein), adding threads within the same process isn't going to help the processing time, only hurt it.

Multiprocessing would seem a better fit, but the disadvantage I think we have is the lack of a well-defined IPC mechanism like Arrow has. The Python multiprocessing library has an Array object that might help.
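
A minimal sketch of multiprocessing.Array as a shared buffer that workers fill in place, without pickling results back to the parent (illustrative only; mapping parsed CSV columns onto such buffers is the hard part):

```python
import ctypes
from multiprocessing import Array, Process

def fill(shared, start, stop):
    # Each worker writes its own disjoint slice: no copies, no pickling.
    for i in range(start, stop):
        shared[i] = float(i)

if __name__ == "__main__":
    n = 1_000_000
    buf = Array(ctypes.c_double, n, lock=False)  # shared float64 buffer
    step = n // 4
    workers = [Process(target=fill, args=(buf, i * step, (i + 1) * step))
               for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(buf[42])  # -> 42.0
```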
