CLN/ENH: Provide full suite of arithmetic (and flex) methods to all NDFrame objects. #4560

Closed · wants to merge 16 commits
66 changes: 58 additions & 8 deletions doc/source/api.rst
@@ -248,12 +248,30 @@ Binary operator functions
:toctree: generated/

Series.add
Series.div
Series.mul
Series.sub
Series.mul
Series.div
Series.truediv
Series.floordiv
Series.mod
Series.pow
Series.radd
Series.rsub
Series.rmul
Series.rdiv
Series.rtruediv
Series.rfloordiv
Series.rmod
Series.rpow
Series.combine
Series.combine_first
Series.round
Series.lt
Series.gt
Series.le
Series.ge
Series.ne
Series.eq
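
A minimal sketch (not part of this diff) of how the flex methods listed above differ from the plain operators; the ``fill_value`` argument and the reversed (``r``-prefixed) and comparison variants are the main additions:

.. code-block:: python

import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])

s1 + s2                    # label 'a' is missing from s2, so it becomes NaN
s1.add(s2, fill_value=0)   # flex method: missing labels are filled before adding
s1.rsub(s2)                # reversed variant, equivalent to s2 - s1
s1.gt(s2)                  # flex comparison, aligned on the union of the labels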

Function application, GroupBy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -451,13 +469,27 @@ Binary operator functions
:toctree: generated/

DataFrame.add
DataFrame.div
DataFrame.mul
DataFrame.sub
DataFrame.mul
DataFrame.div
DataFrame.truediv
DataFrame.floordiv
DataFrame.mod
DataFrame.pow
DataFrame.radd
DataFrame.rdiv
DataFrame.rmul
DataFrame.rsub
DataFrame.rmul
DataFrame.rdiv
DataFrame.rtruediv
DataFrame.rfloordiv
DataFrame.rmod
DataFrame.rpow
DataFrame.lt
DataFrame.gt
DataFrame.le
DataFrame.ge
DataFrame.ne
DataFrame.eq
DataFrame.combine
DataFrame.combineAdd
DataFrame.combine_first
@@ -680,9 +712,27 @@ Binary operator functions
:toctree: generated/

Panel.add
Panel.div
Panel.mul
Panel.sub
Panel.mul
Panel.div
Panel.truediv
Panel.floordiv
Panel.mod
Panel.pow
Panel.radd
Panel.rsub
Panel.rmul
Panel.rdiv
Panel.rtruediv
Panel.rfloordiv
Panel.rmod
Panel.rpow
Panel.lt
Panel.gt
Panel.le
Panel.ge
Panel.ne
Panel.eq

Function application, GroupBy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion doc/source/basics.rst
@@ -478,7 +478,7 @@ maximum value for each column occurred:

tsdf = DataFrame(randn(1000, 3), columns=['A', 'B', 'C'],
index=date_range('1/1/2000', periods=1000))
tsdf.apply(lambda x: x.index[x.dropna().argmax()])
tsdf.apply(lambda x: x[x.idxmax()])
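
As an aside on the change above (not part of the diff): ``idxmax`` returns the index label of the maximum, so indexing with it yields the maximum value itself. A minimal sketch:

.. code-block:: python

import pandas as pd

s = pd.Series([1, 5, 2], index=['a', 'b', 'c'])
s.idxmax()      # 'b' -- the index label where the maximum occurs
s[s.idxmax()]   # 5  -- the maximum value itself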

You may also pass additional arguments and keyword arguments to the ``apply``
method. For instance, consider the following function you would like to apply:
30 changes: 17 additions & 13 deletions doc/source/dsintro.rst
@@ -44,10 +44,15 @@ When using pandas, we recommend the following import convention:
Series
------

:class:`Series` is a one-dimensional labeled array (technically a subclass of
ndarray) capable of holding any data type (integers, strings, floating point
numbers, Python objects, etc.). The axis labels are collectively referred to as
the **index**. The basic method to create a Series is to call:
.. warning::

In 0.13.0 ``Series`` has internally been refactored to no longer subclass ``ndarray``
but instead to subclass ``NDFrame``, like the rest of the pandas containers. This should be
a transparent change with only very limited API implications (see the :ref:`release notes <release.refactoring_0_13_0>`).

:class:`Series` is a one-dimensional labeled array capable of holding any data
type (integers, strings, floating point numbers, Python objects, etc.). The axis
labels are collectively referred to as the **index**. The basic method to create a Series is to call:

::

@@ -109,9 +114,8 @@ provided. The value will be repeated to match the length of **index**
Series is ndarray-like
~~~~~~~~~~~~~~~~~~~~~~

As a subclass of ndarray, Series is a valid argument to most NumPy functions
and behaves similarly to a NumPy array. However, things like slicing also slice
the index.
``Series`` acts very similarly to an ``ndarray``, and is a valid argument to most NumPy functions.
However, things like slicing also slice the index.

.. ipython :: python

@@ -177,7 +181,7 @@ labels.

The result of an operation between unaligned Series will have the **union** of
the indexes involved. If a label is not found in one Series or the other, the
result will be marked as missing (NaN). Being able to write code without doing
result will be marked as missing ``NaN``. Being able to write code without doing
any explicit data alignment grants immense freedom and flexibility in
interactive data analysis and research. The integrated data alignment features
of the pandas data structures set pandas apart from the majority of related
@@ -924,11 +928,11 @@ Here we slice to a Panel4D.
from pandas.core import panelnd
Panel5D = panelnd.create_nd_panel_factory(
klass_name = 'Panel5D',
axis_orders = [ 'cool', 'labels','items','major_axis','minor_axis'],
axis_slices = { 'labels' : 'labels', 'items' : 'items',
'major_axis' : 'major_axis', 'minor_axis' : 'minor_axis' },
slicer = Panel4D,
axis_aliases = { 'major' : 'major_axis', 'minor' : 'minor_axis' },
orders = [ 'cool', 'labels','items','major_axis','minor_axis'],
slices = { 'labels' : 'labels', 'items' : 'items',
'major_axis' : 'major_axis', 'minor_axis' : 'minor_axis' },
slicer = Panel4D,
aliases = { 'major' : 'major_axis', 'minor' : 'minor_axis' },
stat_axis = 2)

p5d = Panel5D(dict(C1 = p4d))
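
A minimal sketch (not from this diff) of the ndarray-like and alignment behaviour described in the hunks above:

.. code-block:: python

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

s[1:3]           # slicing slices the index along with the values
np.exp(s)        # Series remains a valid argument to most NumPy functions

s[1:] + s[:-1]   # union of the indexes; 'a' and 'e' come back as NaN
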
34 changes: 26 additions & 8 deletions doc/source/enhancingperf.rst
@@ -26,7 +26,7 @@ Enhancing Performance
Cython (Writing C extensions for pandas)
----------------------------------------

For many use cases writing pandas in pure python and numpy is sufficient. In some
For many use cases writing pandas in pure python and numpy is sufficient. In some
computationally heavy applications however, it can be possible to achieve sizeable
speed-ups by offloading work to `cython <http://cython.org/>`__.

@@ -68,7 +68,7 @@ Here's the function in pure python:
We achieve our result by using ``apply`` (row-wise):

.. ipython:: python

%timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)

But clearly this isn't fast enough for us. Let's take a look and see where the
@@ -83,7 +83,7 @@ By far the majority of time is spent inside either ``integrate_f`` or ``f``,
hence we'll concentrate our efforts cythonizing these two functions.

.. note::

In python 2 replacing the ``range`` with its generator counterpart (``xrange``)
would mean the ``range`` line would vanish. In python 3 range is already a generator.

@@ -125,7 +125,7 @@ is here to distinguish between function versions):

%timeit df.apply(lambda x: integrate_f_plain(x['a'], x['b'], x['N']), axis=1)

Already this has shaved a third off, not too bad for a simple copy and paste.
Already this has shaved a third off, not too bad for a simple copy and paste.

.. _enhancingperf.type:

@@ -175,7 +175,7 @@ in python, so maybe we could minimise these by cythonizing the apply part.
We are now passing ndarrays into the cython function, fortunately cython plays
very nicely with numpy.

.. ipython::
.. ipython::

In [4]: %%cython
...: cimport numpy as np
@@ -205,20 +205,38 @@ The implementation is simple, it creates an array of zeros and loops over
the rows, applying our ``integrate_f_typed``, and putting this in the zeros array.


.. warning::

In 0.13.0, since ``Series`` has internally been refactored to no longer subclass ``ndarray``
but instead to subclass ``NDFrame``, you can **not** pass a ``Series`` directly as an ``ndarray``-typed parameter
to a cython function. Instead pass the actual ``ndarray`` using the ``.values`` attribute of the Series.

Prior to 0.13.0

.. code-block:: python

apply_integrate_f(df['a'], df['b'], df['N'])

Use ``.values`` to get the underlying ``ndarray``

.. code-block:: python

apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)

.. note::

Loops like this would be *extremely* slow in python, but in cython looping over
numpy arrays is *fast*.

.. ipython:: python

%timeit apply_integrate_f(df['a'], df['b'], df['N'])
%timeit apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)

We've gone another three times faster! Let's check again where the time is spent:

.. ipython:: python

%prun -l 4 apply_integrate_f(df['a'], df['b'], df['N'])
%prun -l 4 apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)

As one might expect, the majority of the time is now spent in ``apply_integrate_f``,
so if we wanted to gain any more efficiency we must continue to concentrate our
@@ -261,7 +279,7 @@ advanced cython techniques:

.. ipython:: python

%timeit apply_integrate_f_wrap(df['a'], df['b'], df['N'])
%timeit apply_integrate_f_wrap(df['a'].values, df['b'].values, df['N'].values)

This shaves another third off!

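A minimal sketch (not from this diff) restating the ``.values`` warning above; ``apply_integrate_f`` refers to the Cython function defined earlier in this document and is only shown in a comment:

.. code-block:: python

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(1000),
'b': np.random.randn(1000),
'N': np.random.randint(100, 1000, (1000,))})

isinstance(df['a'], np.ndarray)          # False as of 0.13.0; Series is no longer an ndarray subclass
isinstance(df['a'].values, np.ndarray)   # True; .values exposes the underlying ndarray

# pass the ndarrays, not the Series, to the typed cython function:
# apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
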
71 changes: 71 additions & 0 deletions doc/source/release.rst
@@ -110,6 +110,77 @@ pandas 0.13
- ``MultiIndex.astype()`` now only allows ``np.object_``-like dtypes and
now returns a ``MultiIndex`` rather than an ``Index``. (:issue:`4039`)

**Internal Refactoring**

.. _release.refactoring_0_13_0:

In 0.13.0 there is a major refactor, primarily to subclass ``Series`` from ``NDFrame``,
which is currently the base class for ``DataFrame`` and ``Panel``, in order to unify methods
and behaviors. ``Series`` formerly subclassed ``ndarray`` directly. (:issue:`4080`, :issue:`3862`, :issue:`816`)

- Refactor of series.py/frame.py/panel.py to move common code to generic.py
- added _setup_axes to create generic NDFrame structures
- moved methods

- from_axes,_wrap_array,axes,ix,loc,iloc,shape,empty,swapaxes,transpose,pop
- __iter__,keys,__contains__,__len__,__neg__,__invert__
- convert_objects,as_blocks,as_matrix,values
- __getstate__,__setstate__ (though compat remains in frame/panel)
- __getattr__,__setattr__
- _indexed_same,reindex_like,reindex,align,where,mask
- fillna,replace
- filter (also added axis argument to selectively filter on a different axis)
- reindex,reindex_axis (which was the biggest change to make generic)
- truncate (moved to become part of ``NDFrame``)

- These are API changes which make ``Panel`` more consistent with ``DataFrame``
- swapaxes on a Panel with the same axes specified now returns a copy
- support attribute access for setting
- filter supports the same API as the original DataFrame filter

- Reindex called with no arguments will now return a copy of the input object

- Series now inherits from ``NDFrame`` rather than directly from ``ndarray``.
There are several minor changes that affect the API; a short illustrative
sketch follows these notes.

- numpy functions that do not support the array interface will now
return ``ndarrays`` rather than series, e.g. ``np.diff`` and ``np.where``
- ``Series(0.5)`` would previously return the scalar ``0.5``; this is no
longer supported
- several methods from frame/series have moved to ``NDFrame``
(convert_objects, where, mask)
- ``TimeSeries`` is now an alias for ``Series``. The property ``is_time_series``
can be used to distinguish (if desired)

- Refactor of Sparse objects to use BlockManager

- Created a new block type in internals, ``SparseBlock``, which can hold multi-dtypes
and is non-consolidatable. ``SparseSeries`` and ``SparseDataFrame`` now inherit
more methods from their hierarchy (Series/DataFrame), and no longer inherit
from ``SparseArray`` (which instead is the object of the ``SparseBlock``)
- Sparse suite now supports integration with non-sparse data. Non-float sparse
data is supportable (partially implemented)
- Operations on sparse structures within DataFrames should preserve sparseness;
merging-type operations will convert to dense (and back to sparse), so they might
be somewhat inefficient
- enable setitem on ``SparseSeries`` for boolean/integer/slices
- ``SparsePanels`` implementation is unchanged (e.g. not using BlockManager, needs work)

- added ``ftypes`` method to Series/DataFrame, similar to ``dtypes``, but indicating
whether the underlying data is sparse or dense (as well as the dtype)

- All ``NDFrame`` objects now have a ``_prop_attributes``, which can be used to indicate various
values to propagate to a new object from an existing one (e.g. ``name`` in ``Series`` will follow
more automatically now)

- Internal type checking is now done via a suite of generated classes, allowing ``isinstance(value, klass)``
without having to directly import the klass, courtesy of @jtratner

- Bug in Series update where the parent frame is not updating its cache based on
changes (:issue:`4080`) or types (:issue:`3217`), fillna (:issue:`3386`)

- Indexing with dtype conversions fixed (:issue:`4463`, :issue:`4204`)
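
A minimal sketch (not from this diff) of a few of the user-visible bits above, assuming ``is_time_series`` and ``ftypes`` behave as described in these notes:

.. code-block:: python

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0], index=pd.date_range('2013-01-01', periods=3))

isinstance(s, pd.TimeSeries)   # True -- TimeSeries is now just an alias for Series
s.is_time_series               # True, since the index is a DatetimeIndex

np.diff(s)                     # now returns an ndarray rather than a Series

df = pd.DataFrame({'A': s})
df.ftypes                      # dtype plus dense/sparse, e.g. 'float64:dense'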

**Experimental Features**

**Bug Fixes**