# Parallelism

CPython has the infamous [Global Interpreter Lock](https://docs.python.org/3/glossary.html#term-global-interpreter-lock) (GIL), which prevents several threads from executing Python bytecode in parallel. This makes threading in Python a bad fit for [CPU-bound](https://en.wikipedia.org/wiki/CPU-bound) tasks and often forces developers to accept the overhead of multiprocessing. There is an experimental "free-threaded" version of CPython 3.13 that does not have a GIL; see the PyO3 docs on [free-threaded Python](./free-threading.md) for more information.
In PyO3 parallelism can be easily achieved in Rust-only code. Let's take a look at our [word-count](https://github.com/PyO3/pyo3/blob/main/examples/word-count/src/lib.rs) example, where we have a `search` function that utilizes the [rayon](https://github.com/rayon-rs/rayon) crate to count words in parallel.

```rust,no_run
// ... (the `search` function and the benchmark results are elided here) ...
```
|
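The elided `search` function splits the input text and counts matching words across several threads. As a rough, std-only sketch of the same idea (this is not the actual example code; the real example uses a rayon parallel iterator, which is replaced here by manually scoped threads):

```rust
use std::thread;

// Count how many whitespace-separated words on one line equal `needle`.
fn count_line(line: &str, needle: &str) -> usize {
    line.split_whitespace().filter(|word| *word == needle).count()
}

fn search(contents: &str, needle: &str) -> usize {
    let lines: Vec<&str> = contents.lines().collect();
    // Split the lines across up to four scoped threads and sum the partial
    // counts, mirroring what the rayon parallel iterator does for us.
    let chunk_size = ((lines.len() + 3) / 4).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = lines
            .chunks(chunk_size)
            .map(|part| {
                s.spawn(move || part.iter().map(|l| count_line(l, needle)).sum::<usize>())
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let text = "I love to write in rust\nrust is fun\n";
    assert_eq!(search(text, "rust"), 2);
    println!("count = {}", search(text, "rust")); // prints "count = 2"
}
```

Because this sketch is pure Rust with no Python objects involved, no GIL handling is needed; that changes in the next section.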
You can see that the Python threaded version is not much slower than the Rust sequential version, which means the speed roughly doubled compared to execution on a single CPU core.

## Sharing Python objects between Rust threads

In the example above we made a Python interface to a low-level Rust function, and then leveraged the Python `threading` module to run the low-level function in parallel. It is also possible to spawn threads in Rust that acquire the GIL and operate on Python objects. However, care must be taken to avoid writing code that deadlocks with the GIL in these cases.

* Note: This example is meant to illustrate how to drop and re-acquire the GIL
  to avoid deadlocks. Unless the spawned threads subsequently release the GIL,
  or you are using the free-threaded build of CPython, you will not see any
  speedup from using `rayon` to parallelize code that acquires and holds the
  GIL for the entire execution of each spawned thread.

In the example below, we share a `Vec` of User ID objects defined using the `pyclass` macro and spawn threads to process the collection of data into a `Vec` of booleans based on a predicate, using a rayon parallel iterator:

```rust,no_run
use pyo3::prelude::*;

// These traits let us use par_iter and map
use rayon::iter::{IntoParallelRefIterator, ParallelIterator};

#[pyclass]
struct UserID {
    id: i64,
}

let allowed_ids: Vec<bool> = Python::with_gil(|outer_py| {
    let instances: Vec<Py<UserID>> = (0..10).map(|x| Py::new(outer_py, UserID { id: x }).unwrap()).collect();
    outer_py.allow_threads(|| {
        instances.par_iter().map(|instance| {
            Python::with_gil(|inner_py| {
                instance.borrow(inner_py).id > 5
            })
        }).collect()
    })
});
assert!(allowed_ids.into_iter().filter(|b| *b).count() == 4);
```

It's important to note that there is an `outer_py` GIL lifetime token as well as an `inner_py` token. Sharing GIL lifetime tokens between threads is not allowed: each thread must individually acquire the GIL to access data wrapped by a Python object.

It's also important to note that this example uses [`Python::allow_threads`] to wrap the code that spawns OS threads via `rayon`. If this example didn't use `allow_threads`, a rayon worker thread would block on acquiring the GIL while the thread that owns the GIL spins forever waiting for the result of the rayon thread. Calling `allow_threads` releases the GIL in the thread collecting the results from the worker threads. You should always call `allow_threads` in situations that spawn worker threads, especially when those worker threads need to acquire the GIL, to prevent deadlocks.

[`Python::allow_threads`]: {{#PYO3_DOCS_URL}}/pyo3/marker/struct.Python.html#method.allow_threads