Backport PR #42439: DOC: Refactor Numba enhancing performance and add parallelism caveat (#42490)

meeseeksmachine · mroeschke · web-flow · commit 45aeaac6f8cb · 2021-07-11T21:27:13.000-04:00
Co-authored-by: Matthew Roeschke <emailformattr@gmail.com>
diff --git a/doc/source/user_guide/enhancingperf.rst b/doc/source/user_guide/enhancingperf.rst
@@ -302,28 +302,63 @@ For more about ``boundscheck`` and ``wraparound``, see the Cython docs on
 
 .. _enhancingperf.numba:
 
-Using Numba
------------
+Numba (JIT compilation)
+-----------------------
 
-A recent alternative to statically compiling Cython code, is to use a *dynamic jit-compiler*, Numba.
+An alternative to statically compiling Cython code is to use a dynamic just-in-time (JIT) compiler with `Numba <https://numba.pydata.org/>`__.
 
-Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
+Numba allows you to write a pure Python function which can be JIT compiled to native machine instructions, similar in performance to C, C++ and Fortran,
+by decorating your function with ``@jit``.
 
-Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
+Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool).
+Numba supports compilation of Python to run on either CPU or GPU hardware and is designed to integrate with the Python scientific software stack.
 
 .. note::
 
-    You will need to install Numba. This is easy with ``conda``, by using: ``conda install numba``, see :ref:`installing using miniconda<install.miniconda>`.
+    The ``@jit`` compilation will add overhead to the runtime of the function, so performance benefits may not be realized, especially when using small data sets.
+    Consider `caching <https://numba.readthedocs.io/en/stable/developer/caching.html>`__ your function to avoid compilation overhead each time your function is run.
 
-.. note::
+Numba can be used in 2 ways with pandas:
+
+#. Specify the ``engine="numba"`` keyword in select pandas methods
+#. Define your own Python function decorated with ``@jit`` and pass the underlying NumPy array of :class:`Series` or :class:`DataFrame` (using ``to_numpy()``) into the function
+
+pandas Numba Engine
+~~~~~~~~~~~~~~~~~~~
+
+If Numba is installed, one can specify ``engine="numba"`` in select pandas methods to execute the method using Numba.
+Methods that support ``engine="numba"`` will also have an ``engine_kwargs`` keyword that accepts a dictionary that allows one to specify
+``"nogil"``, ``"nopython"`` and ``"parallel"`` keys with boolean values to pass into the ``@jit`` decorator.
+If ``engine_kwargs`` is not specified, it defaults to ``{"nogil": False, "nopython": True, "parallel": False}`` unless otherwise specified.
+
+In terms of performance, **the first time a function is run using the Numba engine will be slow**
+as Numba will have some function compilation overhead. However, the JIT compiled functions are cached,
+and subsequent calls will be fast. In general, the Numba engine is performant with
+a larger amount of data points (e.g. 1+ million).
 
-    As of Numba version 0.20, pandas objects cannot be passed directly to Numba-compiled functions. Instead, one must pass the NumPy array underlying the pandas object to the Numba-compiled function as demonstrated below.
+.. code-block:: ipython
+
+   In [1]: data = pd.Series(range(1_000_000))  # noqa: E225
+
+   In [2]: roll = data.rolling(10)
 
-Jit
-~~~
+   In [3]: def f(x):
+      ...:     return np.sum(x) + 5
+   # Run the first time, compilation time will affect performance
+   In [4]: %timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True)
+   1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
+   # Function is cached and performance will improve
+   In [5]: %timeit roll.apply(f, engine='numba', raw=True)
+   188 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
 
-We demonstrate how to use Numba to just-in-time compile our code. We simply
-take the plain Python code from above and annotate with the ``@jit`` decorator.
+   In [6]: %timeit roll.apply(f, engine='cython', raw=True)
+   3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+Custom Function Examples
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A custom Python function decorated with ``@jit`` can be used with pandas objects by passing their NumPy array
+representations with ``to_numpy()``.
 
 .. code-block:: python
 
@@ -360,8 +395,6 @@ take the plain Python code from above and annotate with the ``@jit`` decorator.
        )
        return pd.Series(result, index=df.index, name="result")
 
-Note that we directly pass NumPy arrays to the Numba function. ``compute_numba`` is just a wrapper that provides a
-nicer interface by passing/returning pandas objects.
 
 .. code-block:: ipython
 
@@ -370,19 +403,9 @@ nicer interface by passing/returning pandas objects.
 
 In this example, using Numba was faster than Cython.
 
-Numba as an argument
-~~~~~~~~~~~~~~~~~~~~
-
-Additionally, we can leverage the power of `Numba <https://numba.pydata.org/>`__
-by calling it as an argument in :meth:`~Rolling.apply`. See :ref:`Computation tools
-<window.numba_engine>` for an extensive example.
-
-Vectorize
-~~~~~~~~~
-
 Numba can also be used to write vectorized functions that do not require the user to explicitly
 loop over the observations of a vector; a vectorized function will be applied to each row automatically.
-Consider the following toy example of doubling each observation:
+Consider the following example of doubling each observation:
 
 .. code-block:: python
 
@@ -414,25 +437,23 @@ Consider the following toy example of doubling each observation:
 Caveats
 ~~~~~~~
 
-.. note::
-
-    Numba will execute on any function, but can only accelerate certain classes of functions.
-
 Numba is best at accelerating functions that apply numerical functions to NumPy
-arrays. When passed a function that only uses operations it knows how to
-accelerate, it will execute in ``nopython`` mode.
-
-If Numba is passed a function that includes something it doesn't know how to
-work with -- a category that currently includes sets, lists, dictionaries, or
-string functions -- it will revert to ``object mode``. In ``object mode``,
-Numba will execute but your code will not speed up significantly. If you would
+arrays. If you try to ``@jit`` a function that contains unsupported `Python <https://numba.readthedocs.io/en/stable/reference/pysupported.html>`__
+or `NumPy <https://numba.readthedocs.io/en/stable/reference/numpysupported.html>`__
+code, compilation will revert to `object mode <https://numba.readthedocs.io/en/stable/glossary.html#term-object-mode>`__ which
+will most likely not speed up your function. If you would
 prefer that Numba throw an error if it cannot compile a function in a way that
 speeds up your code, pass Numba the argument
-``nopython=True`` (e.g.  ``@numba.jit(nopython=True)``). For more on
+``nopython=True`` (e.g.  ``@jit(nopython=True)``). For more on
 troubleshooting Numba modes, see the `Numba troubleshooting page
 <https://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#the-compiled-code-is-too-slow>`__.
 
-Read more in the `Numba docs <https://numba.pydata.org/>`__.
+Using ``parallel=True`` (e.g. ``@jit(parallel=True)``) may result in a ``SIGABRT`` if the threading layer leads to unsafe
+behavior. You can first `specify a safe threading layer <https://numba.readthedocs.io/en/stable/user/threading-layer.html#selecting-a-threading-layer-for-safe-parallel-execution>`__
+before running a JIT function with ``parallel=True``.
+
+Generally, if you encounter a segfault (``SIGSEGV``) while using Numba, please report the issue
+to the `Numba issue tracker <https://github.com/numba/numba/issues/new/choose>`__.
 
 .. _enhancingperf.eval:
 
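As a reading aid for the ``engine_kwargs`` paragraph added to ``enhancingperf.rst`` above, here is a minimal sketch of how a partial dictionary could be combined with the documented defaults. ``resolve_engine_kwargs`` is a hypothetical helper written for illustration, not a pandas or Numba function. Only the three keys and their default values come from the text. The per-key fallback and the rejection of unknown keys are assumptions.

```python
# Hypothetical helper for illustration only; not part of pandas or Numba.
# The three keys and their defaults are the ones stated in the documentation.
DEFAULT_ENGINE_KWARGS = {"nogil": False, "nopython": True, "parallel": False}


def resolve_engine_kwargs(engine_kwargs=None):
    """Return the jit options implied by a (possibly partial) engine_kwargs."""
    if engine_kwargs is None:
        # Nothing specified: fall back to the documented defaults.
        return dict(DEFAULT_ENGINE_KWARGS)
    # Assumption: keys other than the three documented ones are rejected.
    unknown = set(engine_kwargs) - set(DEFAULT_ENGINE_KWARGS)
    if unknown:
        raise ValueError(f"unsupported engine_kwargs keys: {sorted(unknown)}")
    # Assumption: keys the caller omits keep their default value.
    return {**DEFAULT_ENGINE_KWARGS, **engine_kwargs}


print(resolve_engine_kwargs({"parallel": True}))
```

In pandas itself the dictionary is passed directly to the method, for example ``roll.apply(f, engine="numba", engine_kwargs={"parallel": True}, raw=True)``.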
diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst
@@ -1106,11 +1106,9 @@ Numba Accelerated Routines
 .. versionadded:: 1.1
 
 If `Numba <https://numba.pydata.org/>`__ is installed as an optional dependency, the ``transform`` and
-``aggregate`` methods support ``engine='numba'`` and ``engine_kwargs`` arguments. The ``engine_kwargs``
-argument is a dictionary of keyword arguments that will be passed into the
-`numba.jit decorator <https://numba.pydata.org/numba-doc/latest/reference/jit-compilation.html#numba.jit>`__.
-These keyword arguments will be applied to the passed function. Currently only ``nogil``, ``nopython``,
-and ``parallel`` are supported, and their default values are set to ``False``, ``True`` and ``False`` respectively.
+``aggregate`` methods support ``engine='numba'`` and ``engine_kwargs`` arguments.
+See :ref:`enhancing performance with Numba <enhancingperf.numba>` for general usage of the arguments
+and performance considerations.
 
 The function signature must start with ``values, index`` **exactly** as the data belonging to each group
 will be passed into ``values``, and the group index will be passed into ``index``.
@@ -1121,52 +1119,6 @@ will be passed into ``values``, and the group index will be passed into ``index`
    data and group index will be passed as NumPy arrays to the JITed user defined function, and no
    alternative execution attempts will be tried.
 
-.. note::
-
-   In terms of performance, **the first time a function is run using the Numba engine will be slow**
-   as Numba will have some function compilation overhead. However, the compiled functions are cached,
-   and subsequent calls will be fast. In general, the Numba engine is performant with
-   a larger amount of data points (e.g. 1+ million).
-
-.. code-block:: ipython
-
-   In [1]: N = 10 ** 3
-
-   In [2]: data = {0: [str(i) for i in range(100)] * N, 1: list(range(100)) * N}
-
-   In [3]: df = pd.DataFrame(data, columns=[0, 1])
-
-   In [4]: def f_numba(values, index):
-      ...:     total = 0
-      ...:     for i, value in enumerate(values):
-      ...:         if i % 2:
-      ...:             total += value + 5
-      ...:         else:
-      ...:             total += value * 2
-      ...:     return total
-      ...:
-
-   In [5]: def f_cython(values):
-      ...:     total = 0
-      ...:     for i, value in enumerate(values):
-      ...:         if i % 2:
-      ...:             total += value + 5
-      ...:         else:
-      ...:             total += value * 2
-      ...:     return total
-      ...:
-
-   In [6]: groupby = df.groupby(0)
-   # Run the first time, compilation time will affect performance
-   In [7]: %timeit -r 1 -n 1 groupby.aggregate(f_numba, engine='numba')  # noqa: E225
-   2.14 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
-   # Function is cached and performance will improve
-   In [8]: %timeit groupby.aggregate(f_numba, engine='numba')
-   4.93 ms ± 32.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
-
-   In [9]: %timeit groupby.aggregate(f_cython, engine='cython')
-   18.6 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
-
 Other useful features
 ---------------------
 
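The ``groupby.rst`` hunk above keeps the requirement that a user-defined function start with ``values, index`` exactly. The sketch below shows what a conforming function looks like, along with one way such a requirement could be checked using the standard library. ``starts_with_values_index`` is a hypothetical checker for illustration and makes no claim about how pandas validates the signature. Both example functions are invented for this sketch.

```python
import inspect


def starts_with_values_index(func):
    """Hypothetical check: are the first two parameters values, index?"""
    names = list(inspect.signature(func).parameters)
    return names[:2] == ["values", "index"]


def group_sum_plus_five(values, index):
    # ``values`` receives one group's data, ``index`` that group's index.
    total = 0.0
    for value in values:
        total += value
    return total + 5


def wrong_order(index, values):
    # Same parameters in the wrong order; this does not meet the requirement.
    return 0.0


print(starts_with_values_index(group_sum_plus_five))  # True
print(starts_with_values_index(wrong_order))          # False
```

With pandas and Numba installed, a function shaped like ``group_sum_plus_five`` is the kind that would be passed to ``aggregate`` or ``transform`` together with ``engine="numba"``.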
diff --git a/doc/source/user_guide/window.rst b/doc/source/user_guide/window.rst
@@ -262,26 +262,24 @@ and we want to use an expanding window where ``use_expanding`` is ``True`` other
 .. code-block:: ipython
 
    In [2]: from pandas.api.indexers import BaseIndexer
-   ...:
-   ...: class CustomIndexer(BaseIndexer):
-   ...:
-   ...:    def get_window_bounds(self, num_values, min_periods, center, closed):
-   ...:        start = np.empty(num_values, dtype=np.int64)
-   ...:        end = np.empty(num_values, dtype=np.int64)
-   ...:        for i in range(num_values):
-   ...:            if self.use_expanding[i]:
-   ...:                start[i] = 0
-   ...:                end[i] = i + 1
-   ...:            else:
-   ...:                start[i] = i
-   ...:                end[i] = i + self.window_size
-   ...:        return start, end
-   ...:
-
-   In [3]: indexer = CustomIndexer(window_size=1, use_expanding=use_expanding)
-
-   In [4]: df.rolling(indexer).sum()
-   Out[4]:
+
+   In [3]: class CustomIndexer(BaseIndexer):
+      ...:     def get_window_bounds(self, num_values, min_periods, center, closed):
+      ...:         start = np.empty(num_values, dtype=np.int64)
+      ...:         end = np.empty(num_values, dtype=np.int64)
+      ...:         for i in range(num_values):
+      ...:             if self.use_expanding[i]:
+      ...:                 start[i] = 0
+      ...:                 end[i] = i + 1
+      ...:             else:
+      ...:                 start[i] = i
+      ...:                 end[i] = i + self.window_size
+      ...:         return start, end
+
+   In [4]: indexer = CustomIndexer(window_size=1, use_expanding=use_expanding)
+
+   In [5]: df.rolling(indexer).sum()
+   Out[5]:
        values
    0     0.0
    1     1.0
@@ -365,45 +363,21 @@ Numba engine
 Additionally, :meth:`~Rolling.apply` can leverage `Numba <https://numba.pydata.org/>`__
 if installed as an optional dependency. The apply aggregation can be executed using Numba by specifying
 ``engine='numba'`` and ``engine_kwargs`` arguments (``raw`` must also be set to ``True``).
+See :ref:`enhancing performance with Numba <enhancingperf.numba>` for general usage of the arguments and performance considerations.
+
 Numba will be applied in potentially two routines:
 
 #. If ``func`` is a standard Python function, the engine will `JIT <https://numba.pydata.org/numba-doc/latest/user/overview.html>`__ the passed function. ``func`` can also be a JITed function in which case the engine will not JIT the function again.
 #. The engine will JIT the for loop where the apply function is applied to each window.
 
-.. versionadded:: 1.3.0
-
-``mean``, ``median``, ``max``, ``min``, and ``sum`` also support the ``engine`` and ``engine_kwargs`` arguments.
-
 The ``engine_kwargs`` argument is a dictionary of keyword arguments that will be passed into the
 `numba.jit decorator <https://numba.pydata.org/numba-doc/latest/reference/jit-compilation.html#numba.jit>`__.
 These keyword arguments will be applied to *both* the passed function (if a standard Python function)
-and the apply for loop over each window. Currently only ``nogil``, ``nopython``, and ``parallel`` are supported,
-and their default values are set to ``False``, ``True`` and ``False`` respectively.
-
-.. note::
+and the apply for loop over each window.
 
-   In terms of performance, **the first time a function is run using the Numba engine will be slow**
-   as Numba will have some function compilation overhead. However, the compiled functions are cached,
-   and subsequent calls will be fast. In general, the Numba engine is performant with
-   a larger amount of data points (e.g. 1+ million).
-
-.. code-block:: ipython
-
-   In [1]: data = pd.Series(range(1_000_000))
-
-   In [2]: roll = data.rolling(10)
-
-   In [3]: def f(x):
-      ...:     return np.sum(x) + 5
-   # Run the first time, compilation time will affect performance
-   In [4]: %timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True)  # noqa: E225, E999
-   1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
-   # Function is cached and performance will improve
-   In [5]: %timeit roll.apply(f, engine='numba', raw=True)
-   188 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
+.. versionadded:: 1.3.0
 
-   In [6]: %timeit roll.apply(f, engine='cython', raw=True)
-   3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+``mean``, ``median``, ``max``, ``min``, and ``sum`` also support the ``engine`` and ``engine_kwargs`` arguments.
 
 .. _window.cov_corr:
 
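To make the reformatted ``CustomIndexer`` example in the ``window.rst`` hunk above easier to follow, here is a pure-Python sketch of the same loop from ``get_window_bounds``, using lists instead of NumPy arrays. The mask and the data values are assumptions chosen for illustration, since the diff does not show how ``use_expanding`` or ``df`` are defined. Treating each window as the half-open range ``[start, end)`` is likewise an assumption of this sketch.

```python
def window_bounds(use_expanding, window_size):
    """Pure-Python mirror of the loop in ``CustomIndexer.get_window_bounds``."""
    start, end = [], []
    for i, expanding in enumerate(use_expanding):
        if expanding:
            # Expanding window: everything from the first row up to row i.
            start.append(0)
            end.append(i + 1)
        else:
            # Fixed window of ``window_size`` rows starting at row i.
            start.append(i)
            end.append(i + window_size)
    return start, end


# Assumed inputs for illustration; not taken from the diff.
use_expanding = [True, False, True, False, True]
values = [0, 1, 2, 3, 4]

start, end = window_bounds(use_expanding, window_size=1)
sums = [sum(values[s:e]) for s, e in zip(start, end)]

print(start)  # [0, 1, 0, 3, 0]
print(end)    # [1, 2, 3, 4, 5]
print(sums)   # [0, 1, 3, 3, 10]
```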
diff --git a/pandas/core/window/doc.py b/pandas/core/window/doc.py
@@ -94,8 +94,8 @@ def create_section_header(header: str) -> str:
 ).replace("\n", "", 1)
 
 numba_notes = (
-    "See :ref:`window.numba_engine` for extended documentation "
-    "and performance considerations for the Numba engine.\n\n"
+    "See :ref:`window.numba_engine` and :ref:`enhancingperf.numba` for "
+    "extended documentation and performance considerations for the Numba engine.\n\n"
 )
 
 window_agg_numba_parameters = dedent(
