Commit ba4695d

Merge branch 'master' of https://github.com/pandas-dev/pandas into ops-kwargs9

2 parents e4e5d50 + bc1d027

File tree: 13 files changed (+660 −472 lines)

doc/source/enhancingperf.rst (+51 −30)
@@ -19,6 +19,13 @@
 Enhancing Performance
 *********************

+In this part of the tutorial, we will investigate how to speed up certain
+functions operating on pandas ``DataFrames`` using three different techniques:
+Cython, Numba and :func:`pandas.eval`. We will see a speed improvement of ~200
+when we use Cython and Numba on a test function operating row-wise on the
+``DataFrame``. Using :func:`pandas.eval` we will speed up a sum by an order of
+~2.
+
 .. _enhancingperf.cython:

 Cython (Writing C extensions for pandas)
@@ -29,20 +36,20 @@ computationally heavy applications however, it can be possible to achieve sizeab
 speed-ups by offloading work to `cython <http://cython.org/>`__.

 This tutorial assumes you have refactored as much as possible in Python, for example
-trying to remove for loops and making use of NumPy vectorization, it's always worth
+by trying to remove for-loops and making use of NumPy vectorization. It's always worth
 optimising in Python first.

 This tutorial walks through a "typical" process of cythonizing a slow computation.
-We use an `example from the cython documentation <http://docs.cython.org/src/quickstart/cythonize.html>`__
+We use an `example from the Cython documentation <http://docs.cython.org/src/quickstart/cythonize.html>`__
 but in the context of pandas. Our final cythonized solution is around 100 times
-faster than the pure Python.
+faster than the pure Python solution.

 .. _enhancingperf.pure:

 Pure python
 ~~~~~~~~~~~

-We have a DataFrame to which we want to apply a function row-wise.
+We have a ``DataFrame`` to which we want to apply a function row-wise.

 .. ipython:: python
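The hunk above introduces the tutorial's pure-Python baseline for a function applied row-wise. As a rough stand-in, here is a sketch in the spirit of the Cython quickstart example the section links to; the exact function bodies in the pandas docs may differ:

```python
# Pure-Python baseline, following the style of the Cython quickstart
# example; a hypothetical stand-in, not the exact pandas-docs code.

def f(x):
    return x * (x - 1)

def integrate_f(a, b, N):
    """Approximate the integral of f on [a, b] with N left-endpoint rectangles."""
    s = 0.0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

# Applying it "row-wise" over plain Python rows (a DataFrame.apply stand-in):
rows = [(0.0, 1.0, 100), (0.0, 2.0, 200)]
results = [integrate_f(a, b, N) for a, b, N in rows]
```

The per-row Python-level loop inside ``integrate_f`` is exactly the kind of hotspot the rest of the section moves into Cython.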
@@ -91,18 +98,18 @@ hence we'll concentrate our efforts cythonizing these two functions.

 .. _enhancingperf.plain:

-Plain cython
+Plain Cython
 ~~~~~~~~~~~~

-First we're going to need to import the cython magic function to ipython:
+First we're going to need to import the Cython magic function into IPython:

 .. ipython:: python
    :okwarning:

    %load_ext Cython

-Now, let's simply copy our functions over to cython as is (the suffix
+Now, let's simply copy our functions over to Cython as is (the suffix
 is here to distinguish between function versions):

 .. ipython::
@@ -177,8 +184,8 @@ in Python, so maybe we could minimize these by cythonizing the apply part.

 .. note::

-   We are now passing ndarrays into the cython function, fortunately cython plays
-   very nicely with numpy.
+   We are now passing ndarrays into the Cython function, fortunately Cython plays
+   very nicely with NumPy.

 .. ipython::

@@ -213,17 +220,17 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros arra
 .. warning::

    You can **not pass** a ``Series`` directly as a ``ndarray`` typed parameter
-   to a cython function. Instead pass the actual ``ndarray`` using the
-   ``.values`` attribute of the Series. The reason is that the cython
-   definition is specific to an ndarray and not the passed Series.
+   to a Cython function. Instead pass the actual ``ndarray`` using the
+   ``.values`` attribute of the ``Series``. The reason is that the Cython
+   definition is specific to an ndarray and not the passed ``Series``.

    So, do not do this:

    .. code-block:: python

       apply_integrate_f(df['a'], df['b'], df['N'])

-   But rather, use ``.values`` to get the underlying ``ndarray``
+   But rather, use ``.values`` to get the underlying ``ndarray``:

    .. code-block:: python
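The warning in this hunk -- pass the underlying ndarray, not the ``Series`` -- looks like this in practice. A minimal sketch assuming NumPy and pandas are installed; ``apply_integrate_f_demo`` is a hypothetical stand-in for the Cython-compiled function:

```python
import numpy as np
import pandas as pd

def apply_integrate_f_demo(col_a, col_b, col_n):
    # A Cython signature typed as ndarray would reject a Series here;
    # this assert stands in for that type check.
    assert isinstance(col_a, np.ndarray)
    return col_a + col_b * col_n

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0], 'N': [10, 20]})

# Pass the underlying ndarrays via .values, as the warning advises.
out = apply_integrate_f_demo(df['a'].values, df['b'].values, df['N'].values)
```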
@@ -255,7 +262,7 @@ More advanced techniques
 ~~~~~~~~~~~~~~~~~~~~~~~~

 There is still hope for improvement. Here's an example of using some more
-advanced cython techniques:
+advanced Cython techniques:

 .. ipython::

@@ -289,33 +296,35 @@ advanced cython techniques:
    In [4]: %timeit apply_integrate_f_wrap(df['a'].values, df['b'].values, df['N'].values)
    1000 loops, best of 3: 987 us per loop

-Even faster, with the caveat that a bug in our cython code (an off-by-one error,
+Even faster, with the caveat that a bug in our Cython code (an off-by-one error,
 for example) might cause a segfault because memory access isn't checked.
-
+For more about ``boundscheck`` and ``wraparound``, see the Cython docs on
+`compiler directives <http://cython.readthedocs.io/en/latest/src/reference/compilation.html?highlight=wraparound#compiler-directives>`__.

 .. _enhancingperf.numba:

-Using numba
+Using Numba
 -----------

-A recent alternative to statically compiling cython code, is to use a *dynamic jit-compiler*, ``numba``.
+A recent alternative to statically compiling Cython code is to use a *dynamic jit-compiler*, Numba.

 Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.

 Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.

 .. note::

-   You will need to install ``numba``. This is easy with ``conda``, by using: ``conda install numba``, see :ref:`installing using miniconda<install.miniconda>`.
+   You will need to install Numba. This is easy with ``conda``: ``conda install numba``; see :ref:`installing using miniconda<install.miniconda>`.

 .. note::

-   As of ``numba`` version 0.20, pandas objects cannot be passed directly to numba-compiled functions. Instead, one must pass the ``numpy`` array underlying the ``pandas`` object to the numba-compiled function as demonstrated below.
+   As of Numba version 0.20, pandas objects cannot be passed directly to Numba-compiled functions. Instead, one must pass the NumPy array underlying the pandas object to the Numba-compiled function as demonstrated below.

 Jit
 ~~~

-Using ``numba`` to just-in-time compile your code. We simply take the plain Python code from above and annotate with the ``@jit`` decorator.
+We demonstrate how to use Numba to just-in-time compile our code. We simply
+take the plain Python code from above and annotate with the ``@jit`` decorator.

 .. code-block:: python
@@ -346,17 +355,19 @@ Using ``numba`` to just-in-time compile your code. We simply take the plain Pyth
        result = apply_integrate_f_numba(df['a'].values, df['b'].values, df['N'].values)
        return pd.Series(result, index=df.index, name='result')

-Note that we directly pass ``numpy`` arrays to the numba function. ``compute_numba`` is just a wrapper that provides a nicer interface by passing/returning pandas objects.
+Note that we directly pass NumPy arrays to the Numba function. ``compute_numba`` is just a wrapper that provides a nicer interface by passing/returning pandas objects.

 .. code-block:: ipython

    In [4]: %timeit compute_numba(df)
    1000 loops, best of 3: 798 us per loop

+In this example, using Numba was faster than Cython.
+
 Vectorize
 ~~~~~~~~~

-``numba`` can also be used to write vectorized functions that do not require the user to explicitly
+Numba can also be used to write vectorized functions that do not require the user to explicitly
 loop over the observations of a vector; a vectorized function will be applied to each row automatically.
 Consider the following toy example of doubling each observation:

@@ -389,13 +400,23 @@ Caveats

 .. note::

-   ``numba`` will execute on any function, but can only accelerate certain classes of functions.
+   Numba will execute on any function, but can only accelerate certain classes of functions.

-   ``numba`` is best at accelerating functions that apply numerical functions to NumPy arrays. When passed a function that only uses operations it knows how to accelerate, it will execute in ``nopython`` mode.
+   Numba is best at accelerating functions that apply numerical functions to NumPy
+   arrays. When passed a function that only uses operations it knows how to
+   accelerate, it will execute in ``nopython`` mode.

-   If ``numba`` is passed a function that includes something it doesn't know how to work with -- a category that currently includes sets, lists, dictionaries, or string functions -- it will revert to ``object mode``. In ``object mode``, numba will execute but your code will not speed up significantly. If you would prefer that ``numba`` throw an error if it cannot compile a function in a way that speeds up your code, pass numba the argument ``nopython=True`` (e.g. ``@numba.jit(nopython=True)``). For more on troubleshooting ``numba`` modes, see the `numba troubleshooting page <http://numba.pydata.org/numba-doc/0.20.0/user/troubleshoot.html#the-compiled-code-is-too-slow>`__.
+   If Numba is passed a function that includes something it doesn't know how to
+   work with -- a category that currently includes sets, lists, dictionaries, or
+   string functions -- it will revert to ``object mode``. In ``object mode``,
+   Numba will execute but your code will not speed up significantly. If you would
+   prefer that Numba throw an error if it cannot compile a function in a way that
+   speeds up your code, pass Numba the argument
+   ``nopython=True`` (e.g. ``@numba.jit(nopython=True)``). For more on
+   troubleshooting Numba modes, see the `Numba troubleshooting page
+   <http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#the-compiled-code-is-too-slow>`__.

-   Read more in the `numba docs <http://numba.pydata.org/>`__.
+   Read more in the `Numba docs <http://numba.pydata.org/>`__.

 .. _enhancingperf.eval:

@@ -448,7 +469,7 @@ These operations are supported by :func:`pandas.eval`:
 - Attribute access, e.g., ``df.a``
 - Subscript expressions, e.g., ``df[0]``
 - Simple variable evaluation, e.g., ``pd.eval('df')`` (this is not very useful)
-- Math functions, `sin`, `cos`, `exp`, `log`, `expm1`, `log1p`,
+- Math functions: `sin`, `cos`, `exp`, `log`, `expm1`, `log1p`,
   `sqrt`, `sinh`, `cosh`, `tanh`, `arcsin`, `arccos`, `arctan`, `arccosh`,
   `arcsinh`, `arctanh`, `abs` and `arctan2`.
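As a quick illustration of the operations this hunk lists -- attribute access, arithmetic, and math functions inside a single expression string -- here is a sketch assuming only that pandas and NumPy are installed (``engine='python'`` avoids requiring ``numexpr``):

```python
import numpy as np
import pandas as pd

# Small frame to evaluate expressions against.
df = pd.DataFrame({'a': np.arange(5.0), 'b': np.arange(5.0, 10.0)})

# Arithmetic plus attribute access in one eval expression.
result = pd.eval('df.a + df.b * 2', engine='python')

# Math functions such as sqrt are recognized inside the expression string.
roots = pd.eval('sqrt(df.b)', engine='python')
```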

@@ -581,7 +602,7 @@ on the original ``DataFrame`` or return a copy with the new column.
 For backwards compatibility, ``inplace`` defaults to ``True`` if not
 specified. This will change in a future version of pandas - if your
 code depends on an inplace assignment you should update to explicitly
-set ``inplace=True``
+set ``inplace=True``.

 .. ipython:: python
@@ -780,7 +801,7 @@ Technical Minutia Regarding Expression Evaluation
 Expressions that would result in an object dtype or involve datetime operations
 (because of ``NaT``) must be evaluated in Python space. The main reason for
 this behavior is to maintain backwards compatibility with versions of NumPy <
-1.7. In those versions of ``numpy`` a call to ``ndarray.astype(str)`` will
+1.7. In those versions of NumPy a call to ``ndarray.astype(str)`` will
 truncate any strings that are more than 60 characters in length. Second, we
 can't pass ``object`` arrays to ``numexpr`` thus string comparisons must be
 evaluated in Python space.

doc/source/sparse.rst (+4 −4)
@@ -17,11 +17,11 @@ Sparse data structures

 .. note:: The ``SparsePanel`` class has been removed in 0.19.0

-We have implemented "sparse" versions of Series and DataFrame. These are not sparse
+We have implemented "sparse" versions of ``Series`` and ``DataFrame``. These are not sparse
 in the typical "mostly 0". Rather, you can view these objects as being "compressed"
 where any data matching a specific value (``NaN`` / missing value, though any value
 can be chosen) is omitted. A special ``SparseIndex`` object tracks where data has been
-"sparsified". This will make much more sense in an example. All of the standard pandas
+"sparsified". This will make much more sense with an example. All of the standard pandas
 data structures have a ``to_sparse`` method:

 .. ipython:: python
@@ -32,15 +32,15 @@ data structures have a ``to_sparse`` method:
    sts

 The ``to_sparse`` method takes a ``kind`` argument (for the sparse index, see
-below) and a ``fill_value``. So if we had a mostly zero Series, we could
+below) and a ``fill_value``. So if we had a mostly zero ``Series``, we could
 convert it to sparse with ``fill_value=0``:

 .. ipython:: python

    ts.fillna(0).to_sparse(fill_value=0)

 The sparse objects exist for memory efficiency reasons. Suppose you had a
-large, mostly NA DataFrame:
+large, mostly NA ``DataFrame``:

 .. ipython:: python
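The "compressed" idea described in this hunk -- store only the values that differ from ``fill_value``, plus the positions where they occur -- can be sketched in plain Python. This is a toy illustration of the role a ``SparseIndex`` plays, not pandas' actual implementation:

```python
import math

def to_sparse_toy(values, fill_value):
    """Keep only values differing from fill_value, plus their positions."""
    def matches(v):
        # NaN fill values need care, since NaN != NaN.
        if isinstance(fill_value, float) and math.isnan(fill_value):
            return isinstance(v, float) and math.isnan(v)
        return v == fill_value
    positions = [i for i, v in enumerate(values) if not matches(v)]
    data = [values[i] for i in positions]
    return positions, data

def to_dense_toy(positions, data, length, fill_value):
    """Reconstruct the dense list from the sparse representation."""
    out = [fill_value] * length
    for pos, val in zip(positions, data):
        out[pos] = val
    return out

positions, data = to_sparse_toy([0, 0, 3, 0, 7], fill_value=0)
```

For mostly-``fill_value`` data, the two short lists are far smaller than the dense list, which is the memory-efficiency argument the hunk makes.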

doc/source/whatsnew/v0.23.0.txt (+1 −0)
@@ -343,6 +343,7 @@ Other API Changes
 - Addition and subtraction of ``NaN`` from a :class:`Series` with ``dtype='timedelta64[ns]'`` will raise a ``TypeError`` instead of treating the ``NaN`` as ``NaT`` (:issue:`19274`)
 - Set operations (union, difference...) on :class:`IntervalIndex` with incompatible index types will now raise a ``TypeError`` rather than a ``ValueError`` (:issue:`19329`)
 - :class:`DateOffset` objects render more simply, e.g. "<DateOffset: days=1>" instead of "<DateOffset: kwds={'days': 1}>" (:issue:`19403`)
+- :func:`pandas.merge` provides a more informative error message when trying to merge on timezone-aware and timezone-naive columns (:issue:`15800`)

 .. _whatsnew_0230.deprecations:

pandas/core/frame.py (+8 −0)
@@ -3080,6 +3080,14 @@ def fillna(self, value=None, method=None, axis=None, inplace=False,
                                          inplace=inplace, limit=limit,
                                          downcast=downcast, **kwargs)

+    @Appender(_shared_docs['replace'] % _shared_doc_kwargs)
+    def replace(self, to_replace=None, value=None, inplace=False, limit=None,
+                regex=False, method='pad', axis=None):
+        return super(DataFrame, self).replace(to_replace=to_replace,
+                                              value=value, inplace=inplace,
+                                              limit=limit, regex=regex,
+                                              method=method, axis=axis)
+
     @Appender(_shared_docs['shift'] % _shared_doc_kwargs)
     def shift(self, periods=1, freq=None, axis=0):
         return super(DataFrame, self).shift(periods=periods, freq=freq,
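The new ``DataFrame.replace`` override above delegates to the shared ``NDFrame`` implementation, so the user-facing behavior it forwards can be exercised like so (a sketch assuming only that pandas is importable):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [5, 0, 5]})

# Scalar form: replace a value everywhere; inplace=False returns a new frame.
replaced = df.replace(0, -1)

# Dict form: per-value replacements.
mapped = df.replace({5: 50})
```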
