PERF: parallelize libjoin calls #51364
Comments
Do we know that our current GIL-releasing usage actually improves performance? AFAIK releasing the GIL allows for the use of multiple threads, but I'm not sure that helps most of our calculations, which are single-threaded and CPU-bound?
I've wondered this myself.
Without opening a can of worms: if you wanted concurrency here, I think you could map the two chunks to different processes and communicate the results back to the parent process. Of course that makes the code more complicated, and there's a break-even point where it may not be worth it, but that's my best guess at how to tackle it.
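A minimal sketch of that idea, assuming a hypothetical `join_chunk` kernel (a stand-in, not pandas' actual libjoin code): split the left keys across worker processes, run the kernel on each chunk, and shift the per-chunk indices back into positions in the full array.

```python
# Hypothetical sketch: run independent "join" chunks in separate processes
# and combine the results in the parent. `join_chunk` is a stand-in for a
# libjoin-style kernel, not pandas internals.
import numpy as np
from concurrent.futures import ProcessPoolExecutor


def join_chunk(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    # Stand-in kernel: indices of `left` values that appear in `right`.
    return np.nonzero(np.isin(left, right))[0]


def parallel_join(left: np.ndarray, right: np.ndarray, n_workers: int = 2) -> np.ndarray:
    halves = np.array_split(left, n_workers)
    # Starting offset of each chunk within the full `left` array.
    offsets = np.cumsum([0] + [len(h) for h in halves[:-1]])
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        parts = list(ex.map(join_chunk, halves, [right] * n_workers))
    # Shift each chunk's local indices back to positions in the full array.
    return np.concatenate([p + off for p, off in zip(parts, offsets)])
```

As the comment notes, the pickling and IPC overhead of processes means there is a break-even input size below which this is slower than the serial call.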
https://github.com/jcrist/ptime is handy for checking if stuff is being hampered by the GIL.
I'm not aware of anywhere in pandas that's currently multi-threaded, but I haven't followed things closely in a bit. GIL-releasing is crucial for libraries that use pandas in parallel, like Dask.
Just to confirm: we're probably talking about threads here, not OS-level processes. Is there a larger discussion around parallelizing things within pandas itself, rather than deferring to libraries like Dask? It is indeed a can of worms.
Nice, thanks for the insight @TomAugspurger. Yeah, sorry, I meant to say parallelism, not concurrency.
Not really. I've been looking into what we can add to EAs to make it more appealing for dask/modin/cudf/etc. to implement EAs. That's what got me looking at this chunk of code.
Interesting conversation, though. Do we think the current GIL-releasing stuff we have in pandas definitely helps Dask downstream? Is that tested/benchmarked there? There are definitely some cases where we have a strange mix of Tempita/fused types that I think trace back to a lack of support for conditional GIL releasing for object types. If someone took it up as a passion project, we might be able to revisit that and simplify a good deal.
IIRC we got rid of most of the Tempita. Most of what's left were cases where there were two separately-dtyped arguments that weren't necessarily the same dtype. The conditional GIL releasing, I think, is mostly already using fused types; there will be a wave of additional cleanup once Cython 3 is released.
xref #43313 for I/O. We should really start thinking about a keyword to control parallelism in pandas. I believe Arrow operations are parallel by default. For Cython, I wouldn't mind if we used Cython's OpenMP capabilities, which seems to be as simple as using prange instead of range (though it would probably make my life a lot harder with packaging pandas). Personally, I think it's about time we added some sort of parallel capability to pandas (not the fancy out-of-core stuff that dask does, though).
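For illustration, the "prange vs range" change looks roughly like this (a sketch, not pandas code; it requires compiling the extension with OpenMP, e.g. `-fopenmp` in the compile and link args):

```cython
# Sketch: a nogil reduction parallelized with Cython's prange.
from cython.parallel import prange

def parallel_sum(double[:] a):
    cdef Py_ssize_t i
    cdef double total = 0.0
    # prange releases the GIL and splits iterations across OpenMP threads;
    # Cython recognizes the in-place += on `total` as a reduction.
    for i in prange(a.shape[0], nogil=True):
        total += a[i]
    return total
```

The loop body must be nogil-safe (no Python objects), which is the same constraint the existing GIL-releasing kernels already satisfy.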
+1. Aside from the prange you mentioned, I don't have a clear picture of what is available.
There is an effort to standardize the parallel user-facing APIs across NumPy, scikit-learn, and pandas (this is NASA-funded; @MarcoGorelli can hopefully point to the docs that exist), so certainly experiment, but we should tread carefully here. Generally +1 on internal parallelism where we can and where it's appropriate.
@lithomas1, here's your first request for help on this topic. Trying to add prange to libalgos.nancorr, I added "-fopenmp" to extra_compile_args and extra_link_args in setup.py, and am getting
Does installing the conda-forge compilers package help? I don't think conda-forge and brew play nicely with each other.
@jbrockmendel macOS confusingly aliases gcc to clang, so you may think you are running the former when you are actually running the latter. It looks like clang might need a different command-line argument to make that work.
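A sketch of what platform-conditional flags in setup.py might look like. This is an assumption, not a tested recipe: the paths assume a Homebrew `libomp` install on Apple Silicon, since Apple's clang ships without an OpenMP runtime and rejects a bare `-fopenmp`.

```python
# Sketch: platform-dependent OpenMP flags for setup.py. The libomp paths
# below are assumptions (Homebrew on Apple Silicon) and may differ locally.
import sys

if sys.platform == "darwin":
    # Apple clang has no bundled OpenMP runtime: pass the flag through the
    # preprocessor and link Homebrew's libomp explicitly.
    extra_compile_args = [
        "-Xpreprocessor",
        "-fopenmp",
        "-I/opt/homebrew/opt/libomp/include",
    ]
    extra_link_args = ["-L/opt/homebrew/opt/libomp/lib", "-lomp"]
else:
    # GCC and non-Apple clang accept -fopenmp directly.
    extra_compile_args = ["-fopenmp"]
    extra_link_args = ["-fopenmp"]
```

The conda-forge compilers package sidesteps this by providing a real GCC (or an OpenMP-enabled clang) instead of Apple's toolchain.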
Is there a chance that with numba this might Just Work?
Yup, here you go: https://thomasjpfan.github.io/parallelism-python-libraries-design/
The chunk1/chunk2 calls are each 1.47 s ± 40 ms and the concat is 1.05 s ± 179 ms. In a pyarrow or otherwise-chunked context the concat might not be needed.
For the non-object-dtype case, I'm pretty sure we could release the GIL in inner_join_indexer. Could we then run the chunk1/chunk2 calls in parallel? If so, could we split further? cc @WillAyd
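If the kernel does release the GIL, the chunk1/chunk2 idea can use plain threads rather than processes. A minimal sketch, using `np.intersect1d` as a stand-in for `inner_join_indexer` (it is not the pandas kernel, but like many NumPy operations it spends its time outside the GIL-holding interpreter loop):

```python
# Hypothetical sketch of the chunk1/chunk2 idea with threads. Assumes the
# underlying kernel releases the GIL; `np.intersect1d` is a stand-in for
# inner_join_indexer, not the actual pandas function.
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def threaded_inner_join(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    mid = len(left) // 2
    chunk1, chunk2 = left[:mid], left[mid:]
    with ThreadPoolExecutor(max_workers=2) as ex:
        # Both halves join against `right` concurrently; with the GIL
        # released in the kernel, the two threads can run on separate cores.
        f1 = ex.submit(np.intersect1d, chunk1, right)
        f2 = ex.submit(np.intersect1d, chunk2, right)
    # For sorted, duplicate-free keys the partial results can simply be
    # concatenated; per the timing comment above, a pyarrow/chunked
    # context might avoid this concat entirely.
    return np.concatenate([f1.result(), f2.result()])
```

Splitting further (four or more chunks) follows the same pattern, subject to the thread-spawn and concat overhead noted in the timings above.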