DOC: Refactor Numba enhancing performance and add parallelism caveat #42439

Merged
merged 10 commits into from
Jul 12, 2021
97 changes: 59 additions & 38 deletions doc/source/user_guide/enhancingperf.rst
@@ -302,28 +302,63 @@ For more about ``boundscheck`` and ``wraparound``, see the Cython docs on

.. _enhancingperf.numba:

Using Numba
-----------
Numba (JIT compilation)
-----------------------

A recent alternative to statically compiling Cython code, is to use a *dynamic jit-compiler*, Numba.
An alternative to statically compiling Cython code is to use a dynamic just-in-time (JIT) compiler with `Numba <https://numba.pydata.org/>`__.

Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
Numba allows you to write a pure Python function which can be JIT compiled to native machine instructions, similar in performance to C, C++ and Fortran,
by decorating your function with ``@jit``.

Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool).
Numba supports compilation of Python to run on either CPU or GPU hardware and is designed to integrate with the Python scientific software stack.

.. note::

You will need to install Numba. This is easy with ``conda``, by using: ``conda install numba``, see :ref:`installing using miniconda<install.miniconda>`.
The ``@jit`` compilation will add overhead to the runtime of the function, so performance benefits may not be realized, especially when using small data sets.
Consider `caching <https://numba.readthedocs.io/en/stable/developer/caching.html>`__ your function to avoid compilation overhead each time your function is run.

.. note::
Numba can be used in 2 ways with pandas:

#. Specify the ``engine="numba"`` keyword in select pandas methods
#. Define your own Python function decorated with ``@jit`` and pass the underlying NumPy array of :class:`Series` or :class:`DataFrame` (using ``to_numpy()``) into the function

pandas Numba Engine
~~~~~~~~~~~~~~~~~~~

If Numba is installed, one can specify ``engine="numba"`` in select pandas methods to execute the method using Numba.
Methods that support ``engine="numba"`` will also have an ``engine_kwargs`` keyword that accepts a dictionary that allows one to specify
``"nogil"``, ``"nopython"`` and ``"parallel"`` keys with boolean values to pass into the ``@jit`` decorator.
If ``engine_kwargs`` is not specified, it defaults to ``{"nogil": False, "nopython": True, "parallel": False}`` unless otherwise specified.

In terms of performance, **the first time a function is run using the Numba engine will be slow**
as Numba will have some function compilation overhead. However, the JIT compiled functions are cached,
and subsequent calls will be fast. In general, the Numba engine is performant with
a larger amount of data points (e.g. 1+ million).

As of Numba version 0.20, pandas objects cannot be passed directly to Numba-compiled functions. Instead, one must pass the NumPy array underlying the pandas object to the Numba-compiled function as demonstrated below.
.. code-block:: ipython

In [1]: data = pd.Series(range(1_000_000)) # noqa: E225

In [2]: roll = data.rolling(10)

Jit
~~~
In [3]: def f(x):
...:     return np.sum(x) + 5
# Run the first time, compilation time will affect performance
In [4]: %timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True)
1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# Function is cached and performance will improve
In [5]: %timeit roll.apply(f, engine='numba', raw=True)
188 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We demonstrate how to use Numba to just-in-time compile our code. We simply
take the plain Python code from above and annotate with the ``@jit`` decorator.
In [6]: %timeit roll.apply(f, engine='cython', raw=True)
3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Custom Function Examples
~~~~~~~~~~~~~~~~~~~~~~~~

A custom Python function decorated with ``@jit`` can be used with pandas objects by passing their NumPy array
representations with ``to_numpy()``.

.. code-block:: python

@@ -360,8 +395,6 @@ take the plain Python code from above and annotate with the ``@jit`` decorator.
)
return pd.Series(result, index=df.index, name="result")
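
A smaller sketch of the same pattern, assuming a hypothetical ``running_max`` helper that is not part of pandas or Numba: the jitted function only ever sees a NumPy array, and a thin wrapper converts to and from pandas objects. The no-op ``jit`` fallback is an assumption added so the sketch still runs when Numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import jit
except ImportError:
    # Assumption: fall back to a no-op decorator if Numba is unavailable.
    def jit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@jit(nopython=True)
def running_max(values):
    # Plain loop over a 1-D float array; this is the part Numba compiles.
    out = np.empty(values.shape[0], dtype=np.float64)
    current = -np.inf
    for i in range(values.shape[0]):
        if values[i] > current:
            current = values[i]
        out[i] = current
    return out


def compute_running_max(s):
    # Wrapper: pass the underlying NumPy array in, return a pandas object.
    return pd.Series(running_max(s.to_numpy()), index=s.index, name="running_max")


s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])
result = compute_running_max(s)
```

Keeping the pandas-specific code in the wrapper means the compiled function stays free of objects that Numba cannot type.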

Note that we directly pass NumPy arrays to the Numba function. ``compute_numba`` is just a wrapper that provides a
nicer interface by passing/returning pandas objects.

.. code-block:: ipython

@@ -370,19 +403,9 @@ nicer interface by passing/returning pandas objects.

In this example, using Numba was faster than Cython.

Numba as an argument
~~~~~~~~~~~~~~~~~~~~

Additionally, we can leverage the power of `Numba <https://numba.pydata.org/>`__
by calling it as an argument in :meth:`~Rolling.apply`. See :ref:`Computation tools
<window.numba_engine>` for an extensive example.

Vectorize
~~~~~~~~~

Numba can also be used to write vectorized functions that do not require the user to explicitly
loop over the observations of a vector; a vectorized function will be applied to each row automatically.
Consider the following toy example of doubling each observation:
Consider the following example of doubling each observation:

.. code-block:: python

@@ -414,25 +437,23 @@ Consider the following toy example of doubling each observation:
Caveats
~~~~~~~

.. note::

Numba will execute on any function, but can only accelerate certain classes of functions.

Numba is best at accelerating functions that apply numerical functions to NumPy
arrays. When passed a function that only uses operations it knows how to
accelerate, it will execute in ``nopython`` mode.

If Numba is passed a function that includes something it doesn't know how to
work with -- a category that currently includes sets, lists, dictionaries, or
string functions -- it will revert to ``object mode``. In ``object mode``,
Numba will execute but your code will not speed up significantly. If you would
arrays. If you try to ``@jit`` a function that contains unsupported `Python <https://numba.readthedocs.io/en/stable/reference/pysupported.html>`__
or `NumPy <https://numba.readthedocs.io/en/stable/reference/numpysupported.html>`__
code, compilation will revert to `object mode <https://numba.readthedocs.io/en/stable/glossary.html#term-object-mode>`__, which
will most likely not speed up your function. If you would
prefer that Numba throw an error if it cannot compile a function in a way that
speeds up your code, pass Numba the argument
``nopython=True`` (e.g. ``@numba.jit(nopython=True)``). For more on
``nopython=True`` (e.g. ``@jit(nopython=True)``). For more on
troubleshooting Numba modes, see the `Numba troubleshooting page
<https://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#the-compiled-code-is-too-slow>`__.

Read more in the `Numba docs <https://numba.pydata.org/>`__.
Using ``parallel=True`` (e.g. ``@jit(parallel=True)``) may result in a ``SIGABRT`` if the threading layer leads to unsafe
behavior. You can first `specify a safe threading layer <https://numba.readthedocs.io/en/stable/user/threading-layer.html#selecting-a-threading-layer-for-safe-parallel-execution>`__
before running a JIT function with ``parallel=True``.

Generally, if you encounter a segfault (``SIGSEGV``) while using Numba, please report the issue
to the `Numba issue tracker <https://github.com/numba/numba/issues/new/choose>`__.

.. _enhancingperf.eval:

54 changes: 3 additions & 51 deletions doc/source/user_guide/groupby.rst
@@ -1106,11 +1106,9 @@ Numba Accelerated Routines
.. versionadded:: 1.1

If `Numba <https://numba.pydata.org/>`__ is installed as an optional dependency, the ``transform`` and
``aggregate`` methods support ``engine='numba'`` and ``engine_kwargs`` arguments. The ``engine_kwargs``
argument is a dictionary of keyword arguments that will be passed into the
`numba.jit decorator <https://numba.pydata.org/numba-doc/latest/reference/jit-compilation.html#numba.jit>`__.
These keyword arguments will be applied to the passed function. Currently only ``nogil``, ``nopython``,
and ``parallel`` are supported, and their default values are set to ``False``, ``True`` and ``False`` respectively.
``aggregate`` methods support ``engine='numba'`` and ``engine_kwargs`` arguments.
See :ref:`enhancing performance with Numba <enhancingperf.numba>` for general usage of the arguments
and performance considerations.

The function signature must start with ``values, index`` **exactly** as the data belonging to each group
will be passed into ``values``, and the group index will be passed into ``index``.
@@ -1121,52 +1119,6 @@ will be passed into ``values``, and the group index will be passed into ``index`
data and group index will be passed as NumPy arrays to the JITed user defined function, and no
alternative execution attempts will be tried.

.. note::

In terms of performance, **the first time a function is run using the Numba engine will be slow**
as Numba will have some function compilation overhead. However, the compiled functions are cached,
and subsequent calls will be fast. In general, the Numba engine is performant with
a larger amount of data points (e.g. 1+ million).

.. code-block:: ipython

In [1]: N = 10 ** 3

In [2]: data = {0: [str(i) for i in range(100)] * N, 1: list(range(100)) * N}

In [3]: df = pd.DataFrame(data, columns=[0, 1])

In [4]: def f_numba(values, index):
...:     total = 0
...:     for i, value in enumerate(values):
...:         if i % 2:
...:             total += value + 5
...:         else:
...:             total += value * 2
...:     return total
...:

In [5]: def f_cython(values):
...:     total = 0
...:     for i, value in enumerate(values):
...:         if i % 2:
...:             total += value + 5
...:         else:
...:             total += value * 2
...:     return total
...:

In [6]: groupby = df.groupby(0)
# Run the first time, compilation time will affect performance
In [7]: %timeit -r 1 -n 1 groupby.aggregate(f_numba, engine='numba') # noqa: E225
2.14 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# Function is cached and performance will improve
In [8]: %timeit groupby.aggregate(f_numba, engine='numba')
4.93 ms ± 32.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit groupby.aggregate(f_cython, engine='cython')
18.6 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Other useful features
---------------------

72 changes: 23 additions & 49 deletions doc/source/user_guide/window.rst
@@ -262,26 +262,24 @@ and we want to use an expanding window where ``use_expanding`` is ``True`` other
.. code-block:: ipython

In [2]: from pandas.api.indexers import BaseIndexer
...:
...: class CustomIndexer(BaseIndexer):
...:
...:     def get_window_bounds(self, num_values, min_periods, center, closed):
...:         start = np.empty(num_values, dtype=np.int64)
...:         end = np.empty(num_values, dtype=np.int64)
...:         for i in range(num_values):
...:             if self.use_expanding[i]:
...:                 start[i] = 0
...:                 end[i] = i + 1
...:             else:
...:                 start[i] = i
...:                 end[i] = i + self.window_size
...:         return start, end
...:

In [3]: indexer = CustomIndexer(window_size=1, use_expanding=use_expanding)

In [4]: df.rolling(indexer).sum()
Out[4]:

In [3]: class CustomIndexer(BaseIndexer):
...:     def get_window_bounds(self, num_values, min_periods, center, closed):
...:         start = np.empty(num_values, dtype=np.int64)
...:         end = np.empty(num_values, dtype=np.int64)
...:         for i in range(num_values):
...:             if self.use_expanding[i]:
...:                 start[i] = 0
...:                 end[i] = i + 1
...:             else:
...:                 start[i] = i
...:                 end[i] = i + self.window_size
...:         return start, end

In [4]: indexer = CustomIndexer(window_size=1, use_expanding=use_expanding)

In [5]: df.rolling(indexer).sum()
Out[5]:
values
0 0.0
1 1.0
@@ -365,45 +363,21 @@ Numba engine
Additionally, :meth:`~Rolling.apply` can leverage `Numba <https://numba.pydata.org/>`__
if installed as an optional dependency. The apply aggregation can be executed using Numba by specifying
``engine='numba'`` and ``engine_kwargs`` arguments (``raw`` must also be set to ``True``).
See :ref:`enhancing performance with Numba <enhancingperf.numba>` for general usage of the arguments and performance considerations.

Numba will be applied in potentially two routines:

#. If ``func`` is a standard Python function, the engine will `JIT <https://numba.pydata.org/numba-doc/latest/user/overview.html>`__ the passed function. ``func`` can also be a JITed function in which case the engine will not JIT the function again.
#. The engine will JIT the for loop where the apply function is applied to each window.

.. versionadded:: 1.3.0

``mean``, ``median``, ``max``, ``min``, and ``sum`` also support the ``engine`` and ``engine_kwargs`` arguments.

The ``engine_kwargs`` argument is a dictionary of keyword arguments that will be passed into the
`numba.jit decorator <https://numba.pydata.org/numba-doc/latest/reference/jit-compilation.html#numba.jit>`__.
These keyword arguments will be applied to *both* the passed function (if a standard Python function)
and the apply for loop over each window. Currently only ``nogil``, ``nopython``, and ``parallel`` are supported,
and their default values are set to ``False``, ``True`` and ``False`` respectively.

.. note::
and the apply for loop over each window.

In terms of performance, **the first time a function is run using the Numba engine will be slow**
as Numba will have some function compilation overhead. However, the compiled functions are cached,
and subsequent calls will be fast. In general, the Numba engine is performant with
a larger amount of data points (e.g. 1+ million).

.. code-block:: ipython

In [1]: data = pd.Series(range(1_000_000))

In [2]: roll = data.rolling(10)

In [3]: def f(x):
...:     return np.sum(x) + 5
# Run the first time, compilation time will affect performance
In [4]: %timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True) # noqa: E225, E999
1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# Function is cached and performance will improve
In [5]: %timeit roll.apply(f, engine='numba', raw=True)
188 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
.. versionadded:: 1.3.0

In [6]: %timeit roll.apply(f, engine='cython', raw=True)
3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
``mean``, ``median``, ``max``, ``min``, and ``sum`` also support the ``engine`` and ``engine_kwargs`` arguments.

.. _window.cov_corr:

4 changes: 2 additions & 2 deletions pandas/core/window/doc.py
@@ -94,8 +94,8 @@ def create_section_header(header: str) -> str:
).replace("\n", "", 1)

numba_notes = (
"See :ref:`window.numba_engine` for extended documentation "
"and performance considerations for the Numba engine.\n\n"
"See :ref:`window.numba_engine` and :ref:`enhancingperf.numba` for "
"extended documentation and performance considerations for the Numba engine.\n\n"
)

window_agg_numba_parameters = dedent(