ENH/REF: More options for interpolation and fillna #4915

Closed
89 changes: 87 additions & 2 deletions doc/source/missing_data.rst
@@ -271,8 +271,13 @@ examined :ref:`in the API <api.dataframe.missing>`.
Interpolation
~~~~~~~~~~~~~

A linear **interpolate** method has been implemented on Series. The default
interpolation assumes equally spaced points.
.. versionadded:: 0.13.0

DataFrame now has an ``interpolate`` method.
:meth:`~pandas.Series.interpolate` also gained several additional interpolation methods.

Both Series and DataFrame objects have an ``interpolate`` method that, by default,
performs linear interpolation at missing data points.

.. ipython:: python
   :suppress:
@@ -328,6 +333,86 @@ For a floating-point index, use ``method='values'``:

   ser.interpolate(method='values')
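
To make the difference between equally spaced interpolation and index-weighted interpolation concrete, here is a minimal pure-Python sketch. The function name ``interpolate_linear`` is hypothetical and this is not the pandas implementation; it assumes increasing coordinates and valid first and last values.

```python
import math


def interpolate_linear(y, x=None):
    """Fill NaN entries of y by linear interpolation between valid neighbours.

    With x=None the points are treated as equally spaced (positions 0, 1, 2, ...).
    Otherwise x supplies the coordinates, which is the idea behind method='values'.
    Assumption: x is increasing and the first and last entries of y are valid.
    """
    if x is None:
        x = list(range(len(y)))
    known = [i for i, v in enumerate(y) if not math.isnan(v)]
    out = list(y)
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            # Fraction of the way from the left neighbour to the right one.
            t = (x[i] - x[left]) / (x[right] - x[left])
            out[i] = y[left] + t * (y[right] - y[left])
    return out


nan = float("nan")
y = [0.0, nan, 10.0]
x = [0.0, 1.0, 10.0]

equal_spacing = interpolate_linear(y)     # missing point is the midpoint
index_weighted = interpolate_linear(y, x)  # missing point sits close to x=0
print(equal_spacing, index_weighted)
```

With equal spacing the gap is filled halfway between its neighbours; with the coordinates supplied, the filled value is pulled toward the neighbour that is closer along ``x``.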

You can also interpolate with a DataFrame:

.. ipython:: python

   df = DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
   df.interpolate()

The ``method`` argument gives access to fancier interpolation methods.
If you have scipy_ installed, you can pass the name of a 1-d interpolation routine to ``method``.
You'll want to consult the full scipy interpolation documentation_ and reference guide_ for details.
The appropriate interpolation method will depend on the type of data you are working with.
For example, if you are dealing with a time series that is growing at an increasing rate,
``method='quadratic'`` may be appropriate. If you have values approximating a cumulative
distribution function, then ``method='pchip'`` should work well.

.. warning::

   These methods require ``scipy``.

.. ipython:: python
Contributor (review comment): maybe put a 1-liner explaining the methods/uses?

   df.interpolate(method='barycentric')

   df.interpolate(method='pchip')

When interpolating via a polynomial or spline approximation, you must also specify
the degree or order of the approximation:

.. ipython:: python

   df.interpolate(method='spline', order=2)

   df.interpolate(method='polynomial', order=2)

Compare several methods:

.. ipython:: python

   np.random.seed(2)

   ser = Series(np.arange(1, 10.1, .25)**2 + np.random.randn(37))
   bad = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29, 34, 35, 36])
   ser[bad] = np.nan
   methods = ['linear', 'quadratic', 'cubic']

   df = DataFrame({m: ser.interpolate(method=m) for m in methods})
   @savefig compare_interpolations.png
   df.plot()

Another use case is interpolation at *new* values.
Suppose you have 100 observations from some distribution and are particularly
interested in what's happening around the middle.
You can combine pandas' ``reindex`` and ``interpolate`` methods to interpolate
at the new values.

.. ipython:: python

   ser = Series(np.sort(np.random.uniform(size=100)))

   # interpolate at new_index (the union of the old index and the new points)
   new_index = ser.index.union(Index([49.25, 49.5, 49.75,
                                      50.25, 50.5, 50.75]))

   interp_s = ser.reindex(new_index).interpolate(method='pchip')

   interp_s[49:51]

.. _scipy: http://www.scipy.org
.. _documentation: http://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation
.. _guide: http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html


Like other pandas fill methods, ``interpolate`` accepts a ``limit`` keyword argument.
Use this to limit the number of consecutive interpolations, keeping ``NaN`` values where an interpolation would be too far from the last valid observation:

.. ipython:: python

   ser = Series([1, 3, np.nan, np.nan, np.nan, 11])
   ser.interpolate(limit=2)
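
One way to picture the ``limit`` behaviour is as two steps: fill every gap, then put ``NaN`` back wherever a position and the ``limit`` positions before it were all missing. The sketch below is pure Python and only illustrates that idea; the name ``interpolate_with_limit`` is hypothetical, and it assumes equally spaced points with valid first and last values.

```python
import math

nan = float("nan")


def interpolate_with_limit(values, limit):
    """Linear fill, then re-mask entries beyond `limit` in each run of NaNs.

    Assumption: equally spaced points, first and last entries are valid.
    """
    invalid = [math.isnan(v) for v in values]
    known = [i for i, bad in enumerate(invalid) if not bad]
    out = list(values)

    # Step 1: fill every gap linearly between its valid neighbours.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)

    # Step 2: position p stays NaN when p and the `limit` positions
    # before it were all missing in the input.
    for p in range(limit, len(values)):
        if all(invalid[p - limit:p + 1]):
            out[p] = nan
    return out


filled = interpolate_with_limit([1.0, 3.0, nan, nan, nan, 11.0], limit=2)
print(filled)  # [1.0, 3.0, 5.0, 7.0, nan, 11.0]
```

The first two missing values are filled and the third is left as ``NaN`` because it is more than ``limit`` steps past the last valid observation.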

.. _missing_data.replace:

Replacing Generic Values
2 changes: 2 additions & 0 deletions doc/source/release.rst
@@ -174,6 +174,8 @@ Improvements to existing features
  - :meth:`~pandas.io.json.json_normalize` is a new method to allow you to create a flat table
    from semi-structured JSON data. :ref:`See the docs<io.json_normalize>` (:issue:`1067`)
  - ``DataFrame.from_records()`` will now accept generators (:issue:`4910`)
  - ``DataFrame.interpolate()`` and ``Series.interpolate()`` have been expanded to include
    interpolation methods from scipy. (:issue:`4915`)
Contributor (review comment): change this to issues 4434, 1892 as that's what this is actually closing


API Changes
~~~~~~~~~~~
28 changes: 28 additions & 0 deletions doc/source/v0.13.0.txt
@@ -614,6 +614,34 @@ Experimental

- Added PySide support for the qtpandas DataFrameModel and DataFrameWidget.

- DataFrame has a new ``interpolate`` method, similar to Series:

  .. ipython:: python

      df = DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                      'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
      df.interpolate()

  Additionally, the ``method`` argument to ``interpolate`` has been expanded
  to include 'nearest', 'zero', 'slinear', 'quadratic', 'cubic',
  'barycentric', 'krogh', 'piecewise_polynomial', 'pchip', 'polynomial', and 'spline'.
  The 'polynomial' and 'spline' methods also require an ``order`` argument, an integer
  giving the degree or order of the approximation. The new methods
  require scipy_. Consult the SciPy reference guide_ and documentation_ for more information
  about when the various methods are appropriate. See also the :ref:`pandas interpolation docs<missing_data.interpolate>`.

  Interpolate now also accepts a ``limit`` keyword argument.
  This works similarly to ``fillna``'s limit:

  .. ipython:: python

      ser = Series([1, 3, np.nan, np.nan, np.nan, 11])
      ser.interpolate(limit=2)

.. _scipy: http://www.scipy.org
.. _documentation: http://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation
.. _guide: http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html


Contributor (review comment): put closes issues 4434, 1892 here as well

.. _whatsnew_0130.refactoring:

Internal Refactoring
147 changes: 147 additions & 0 deletions pandas/core/common.py
@@ -1244,6 +1244,153 @@ def backfill_2d(values, limit=None, mask=None):
    return values


Contributor (review comment): I would move these routines to core/algorithms.py (not that we have more than one). in case you need helper functions will be easier

def _clean_interp_method(method, order=None, **kwargs):
    valid = ['linear', 'time', 'values', 'nearest', 'zero', 'slinear',
             'quadratic', 'cubic', 'barycentric', 'polynomial',
             'krogh', 'piecewise_polynomial',
             'pchip', 'spline']
    if method in ('spline', 'polynomial') and order is None:
        raise ValueError("You must specify the order of the spline or "
                         "polynomial.")
    if method not in valid:
        raise ValueError("method must be one of {0}. "
                         "Got '{1}' instead.".format(valid, method))
    return method


def interpolate_1d(xvalues, yvalues, method='linear', limit=None,
                   fill_value=None, bounds_error=False, **kwargs):
    """
    Logic for the 1-d interpolation. The result should be 1-d, inputs
    xvalues and yvalues will each be 1-d arrays of the same length.

    Bounds_error is currently hardcoded to False since non-scipy ones don't
    take it as an argument.
    """
    # Treat the original, non-scipy methods first.

    invalid = isnull(yvalues)
    valid = ~invalid

    valid_y = yvalues[valid]
    valid_x = xvalues[valid]
    new_x = xvalues[invalid]

    if method == 'time':
        if not getattr(xvalues, 'is_all_dates', None):
            # if not issubclass(xvalues.dtype.type, np.datetime64):
            raise ValueError('time-weighted interpolation only works '
                             'on Series or DataFrames with a '
                             'DatetimeIndex')
        method = 'values'

    def _interp_limit(invalid, limit):
        """mask off values that won't be filled since they exceed the limit"""
        all_nans = np.where(invalid)[0]
        violate = [invalid[x:x + limit + 1] for x in all_nans]
        violate = np.array([x.all() & (x.size > limit) for x in violate])
        return all_nans[violate] + limit

    xvalues = getattr(xvalues, 'values', xvalues)
    yvalues = getattr(yvalues, 'values', yvalues)

    if limit:
        violate_limit = _interp_limit(invalid, limit)
    if valid.any():
        firstIndex = valid.argmax()
        valid = valid[firstIndex:]
        invalid = invalid[firstIndex:]
        result = yvalues.copy()
        if valid.all():
            return yvalues
    else:
        # have to call np.array(xvalues) since xvalues could be an Index
        # which can't be mutated
        result = np.empty_like(np.array(xvalues), dtype=np.float64)
        result.fill(np.nan)
        return result

    if method in ['linear', 'time', 'values']:
        if method in ('values', 'index'):
            inds = np.asarray(xvalues)
            # hack for DatetimeIndex, #1646
            if issubclass(inds.dtype.type, np.datetime64):
                inds = inds.view(pa.int64)

            if inds.dtype == np.object_:
                inds = lib.maybe_convert_objects(inds)
        else:
            inds = xvalues

        inds = inds[firstIndex:]

        result[firstIndex:][invalid] = np.interp(inds[invalid], inds[valid],
Member (review comment): can u do this without chaining assignment

Contributor Author (reply): Sorry I'm not sure what chaining assignment is. Is it doing the operation on the RHS and assigning it to result in the same line?

Member (reply): by chaining i just mean that you're doing a slice then an index in one expression

forget my first comment ... this (and the one below) is ok since firstindex will be an int and a slice in numpy always returns a view

                                                 yvalues[firstIndex:][valid])

        if limit:
            result[violate_limit] = np.nan
        return result

    sp_methods = ['nearest', 'zero', 'slinear', 'quadratic', 'cubic',
                  'barycentric', 'krogh', 'spline', 'polynomial',
                  'piecewise_polynomial', 'pchip']
    if method in sp_methods:
        new_x = new_x[firstIndex:]
        xvalues = xvalues[firstIndex:]

        result[firstIndex:][invalid] = _interpolate_scipy_wrapper(valid_x,
            valid_y, new_x, method=method, fill_value=fill_value,
            bounds_error=bounds_error, **kwargs)
        if limit:
            result[violate_limit] = np.nan
        return result
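
The point raised in the review thread, that ``result[firstIndex:][invalid] = ...`` is safe because a basic NumPy slice is a view, can be checked with a small standalone sketch. The array contents and variable names here are illustrative only and are not taken from the PR.

```python
import numpy as np

result = np.array([np.nan, 1.0, np.nan, 3.0])
first = 1                            # position of the first valid value
invalid = np.isnan(result[first:])   # mask relative to the slice

# A basic slice is a view, so assigning through a boolean mask on the
# slice writes into `result` itself.
result[first:][invalid] = 2.0

# Boolean indexing returns a copy instead, so chaining in the opposite
# order assigns into a temporary and leaves the original untouched.
untouched = np.array([np.nan, 1.0, np.nan, 3.0])
mask = np.isnan(untouched)
untouched[mask][1:] = 2.0

print(result)
print(untouched)
```

Only the first form modifies the original array, which is why slicing with an integer start and then indexing with a mask works in ``interpolate_1d``.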


def _interpolate_scipy_wrapper(x, y, new_x, method, fill_value=None,
Contributor (review comment): add order=None, **kwargs

                               bounds_error=False, order=None, **kwargs):
    """
    passed off to scipy.interpolate.interp1d. method is scipy's kind.
    Returns an array interpolated at new_x. Add any new methods to
    the list in _clean_interp_method
    """
    try:
        from scipy import interpolate
    except ImportError:
        raise ImportError('{0} interpolation requires Scipy'.format(method))

    new_x = np.asarray(new_x)

    # ignores some kwargs that could be passed along.
    alt_methods = {
        'barycentric': interpolate.barycentric_interpolate,
        'krogh': interpolate.krogh_interpolate,
        'piecewise_polynomial': interpolate.piecewise_polynomial_interpolate,
    }

    try:
        alt_methods['pchip'] = interpolate.pchip_interpolate
    except AttributeError:
        if method == 'pchip':
            raise ImportError("Your version of scipy does not support "
                              "PCHIP interpolation.")

    interp1d_methods = ['nearest', 'zero', 'slinear', 'quadratic', 'cubic',
                        'polynomial']
    if method in interp1d_methods:
        if method == 'polynomial':
            method = order
        terp = interpolate.interp1d(x, y, kind=method, fill_value=fill_value,
                                    bounds_error=bounds_error)
        new_y = terp(new_x)
    elif method == 'spline':
        terp = interpolate.UnivariateSpline(x, y, k=order)
        new_y = terp(new_x)
    else:
        method = alt_methods[method]
        new_y = method(x, y, new_x)
    return new_y


def interpolate_2d(values, method='pad', axis=0, limit=None, fill_value=None):
    """ perform an actual interpolation of values, values will be made 2-d if needed
    fills inplace, returns the result """