Commit 35812ea

WBare authored and jreback committed
ENH limit_area added to interpolate1d
closes #16284
1 parent 09307dd commit 35812ea

7 files changed, +208 -73 lines changed

doc/source/missing_data.rst

+39 -14
@@ -190,7 +190,7 @@ Sum/Prod of Empties/Nans
190190
.. warning::
191191

192192
This behavior is now standard as of v0.21.0; previously sum/prod would give different
193-
results if the ``bottleneck`` package was installed.
193+
results if the ``bottleneck`` package was installed.
194194
See the :ref:`v0.21.0 whatsnew <whatsnew_0210.api_breaking.bottleneck>`.
195195

196196
With ``sum`` or ``prod`` on an empty or all-``NaN`` ``Series``, or columns of a ``DataFrame``, the result will be all-``NaN``.
@@ -353,7 +353,11 @@ examined :ref:`in the API <api.dataframe.missing>`.
353353
Interpolation
354354
~~~~~~~~~~~~~
355355

356-
Both Series and DataFrame objects have an :meth:`~DataFrame.interpolate` method
356+
.. versionadded:: 0.23.0
357+
358+
The ``limit_area`` keyword argument was added.
359+
360+
Both Series and DataFrame objects have an :meth:`~DataFrame.interpolate` method
357361
that, by default, performs linear interpolation at missing datapoints.
358362

359363
.. ipython:: python
@@ -477,33 +481,54 @@ at the new values.
477481
.. _documentation: http://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation
478482
.. _guide: http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
479483

484+
.. _missing_data.interp_limits:
485+
480486
Interpolation Limits
481487
^^^^^^^^^^^^^^^^^^^^
482488

483489
Like other pandas fill methods, ``interpolate`` accepts a ``limit`` keyword
484-
argument. Use this argument to limit the number of consecutive interpolations,
485-
keeping ``NaN`` values for interpolations that are too far from the last valid
486-
observation:
490+
argument. Use this argument to limit the number of consecutive ``NaN`` values
491+
filled since the last valid observation:
487492

488493
.. ipython:: python
489494
490-
ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13])
491-
ser.interpolate(limit=2)
495+
ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])
492496
493-
By default, ``limit`` applies in a forward direction, so that only ``NaN``
494-
values after a non-``NaN`` value can be filled. If you provide ``'backward'`` or
495-
``'both'`` for the ``limit_direction`` keyword argument, you can fill ``NaN``
496-
values before non-``NaN`` values, or both before and after non-``NaN`` values,
497-
respectively:
497+
# fill all consecutive values in a forward direction
498+
ser.interpolate()
498499
499-
.. ipython:: python
500+
# fill one consecutive value in a forward direction
501+
ser.interpolate(limit=1)
502+
503+
By default, ``NaN`` values are filled in a ``forward`` direction. Use the
504+
``limit_direction`` parameter to fill ``backward`` or from ``both`` directions.
500505

501-
ser.interpolate(limit=1) # limit_direction == 'forward'
506+
.. ipython:: python
502507
508+
# fill one consecutive value backwards
503509
ser.interpolate(limit=1, limit_direction='backward')
504510
511+
# fill one consecutive value in both directions
505512
ser.interpolate(limit=1, limit_direction='both')
506513
514+
# fill all consecutive values in both directions
515+
ser.interpolate(limit_direction='both')
516+
517+
By default, ``NaN`` values are filled whether they are inside (surrounded by)
518+
existing valid values, or outside existing valid values. Introduced in v0.23,
519+
the ``limit_area`` parameter restricts filling to either inside or outside values.
520+
521+
.. ipython:: python
522+
523+
# fill one consecutive inside value in both directions
524+
ser.interpolate(limit_direction='both', limit_area='inside', limit=1)
525+
526+
# fill all consecutive outside values backward
527+
ser.interpolate(limit_direction='backward', limit_area='outside')
528+
529+
# fill all consecutive outside values in both directions
530+
ser.interpolate(limit_direction='both', limit_area='outside')
531+
507532
.. _missing_data.replace:
508533

509534
Replacing Generic Values
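
The "inside" versus "outside" distinction drawn in the ``limit_area`` documentation above can be sketched in plain Python. This is only an illustration, not pandas code: ``classify_nans`` is a hypothetical helper, and a plain list with ``math.nan`` stands in for the ``Series`` used in the docs.

```python
import math


def classify_nans(values):
    """Label each NaN position as 'inside' or 'outside' (hypothetical helper).

    'inside' means the NaN lies between the first and last valid values;
    'outside' means it comes before the first or after the last valid value.
    Assumes at least one valid (non-NaN) value is present.
    """
    valid_positions = [i for i, v in enumerate(values) if not math.isnan(v)]
    first, last = valid_positions[0], valid_positions[-1]
    labels = {}
    for i, v in enumerate(values):
        if math.isnan(v):
            labels[i] = "inside" if first < i < last else "outside"
    return labels


nan = math.nan
# Same values as the documentation example
ser = [nan, nan, 5, nan, nan, nan, 13, nan, nan]
labels = classify_nans(ser)
```

With ``limit_area='inside'`` only the positions labelled ``'inside'`` are candidates for filling; with ``limit_area='outside'`` only those labelled ``'outside'`` are.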

doc/source/whatsnew/v0.23.0.txt

+32 -3
@@ -13,10 +13,38 @@ version.
1313
New features
1414
~~~~~~~~~~~~
1515

16-
-
17-
-
18-
-
16+
.. _whatsnew_0210.enhancements.limit_area:
17+
18+
``DataFrame.interpolate`` has gained the ``limit_area`` kwarg
19+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
20+
21+
:meth:`DataFrame.interpolate` has gained a ``limit_area`` parameter to allow further control of which ``NaN`` s are replaced.
22+
Use ``limit_area='inside'`` to fill only ``NaN`` s surrounded by valid values or use ``limit_area='outside'`` to fill only ``NaN`` s
23+
outside the existing valid values while preserving those inside. (:issue:`16284`) See the :ref:`full documentation here <missing_data.interp_limits>`.
24+
1925

26+
.. ipython:: python
27+
28+
ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])
29+
ser
30+
31+
Fill one consecutive inside value in both directions
32+
33+
.. ipython:: python
34+
35+
ser.interpolate(limit_direction='both', limit_area='inside', limit=1)
36+
37+
Fill all consecutive outside values backward
38+
39+
.. ipython:: python
40+
41+
ser.interpolate(limit_direction='backward', limit_area='outside')
42+
43+
Fill all consecutive outside values in both directions
44+
45+
.. ipython:: python
46+
47+
ser.interpolate(limit_direction='both', limit_area='outside')
2048

2149
.. _whatsnew_0210.enhancements.get_dummies_dtype:
2250

@@ -207,6 +235,7 @@ Other Enhancements
207235
:func:`pandas.api.extensions.register_index_accessor`, accessor for libraries downstream of pandas
208236
to register custom accessors like ``.cat`` on pandas objects. See
209237
:ref:`Registering Custom Accessors <developer.register-accessors>` for more (:issue:`14781`).
238+
210239
- ``IntervalIndex.astype`` now supports conversions between subtypes when passed an ``IntervalDtype`` (:issue:`19197`)
211240
- :class:`IntervalIndex` and its associated constructor methods (``from_arrays``, ``from_breaks``, ``from_tuples``) have gained a ``dtype`` parameter (:issue:`19262`)
212241

pandas/core/generic.py

+9 -1
@@ -5085,6 +5085,12 @@ def replace(self, to_replace=None, value=None, inplace=False, limit=None,
50855085
limit : int, default None.
50865086
Maximum number of consecutive NaNs to fill. Must be greater than 0.
50875087
limit_direction : {'forward', 'backward', 'both'}, default 'forward'
5088+
limit_area : {'inside', 'outside'}, default None
5089+
* None: (default) no fill restriction
5090+
* 'inside': only fill NaNs surrounded by valid values (interpolate).
5091+
* 'outside': only fill NaNs outside valid values (extrapolate).
5092+
.. versionadded:: 0.23.0
5093+
50885094
If limit is specified, consecutive NaNs will be filled in this
50895095
direction.
50905096
inplace : bool, default False
@@ -5118,7 +5124,8 @@ def replace(self, to_replace=None, value=None, inplace=False, limit=None,
51185124

51195125
@Appender(_shared_docs['interpolate'] % _shared_doc_kwargs)
51205126
def interpolate(self, method='linear', axis=0, limit=None, inplace=False,
5121-
limit_direction='forward', downcast=None, **kwargs):
5127+
limit_direction='forward', limit_area=None,
5128+
downcast=None, **kwargs):
51225129
"""
51235130
Interpolate values according to different methods.
51245131
"""
@@ -5167,6 +5174,7 @@ def interpolate(self, method='linear', axis=0, limit=None, inplace=False,
51675174
new_data = data.interpolate(method=method, axis=ax, index=index,
51685175
values=_maybe_transposed_self, limit=limit,
51695176
limit_direction=limit_direction,
5177+
limit_area=limit_area,
51705178
inplace=inplace, downcast=downcast,
51715179
**kwargs)
51725180
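
The docstring above lists the accepted ``limit_area`` values, and the check that enforces them is added in ``pandas/core/missing.py`` by this commit. As a rough standalone sketch of that check (``normalize_limit_area`` is a hypothetical name used only for illustration; the commit performs this inline inside ``interpolate_1d``):

```python
def normalize_limit_area(limit_area):
    """Hypothetical helper mirroring the limit_area check in this commit."""
    if limit_area is None:
        # Default: no restriction on which NaNs may be filled
        return None
    valid_limit_areas = ['inside', 'outside']
    # The commit lowercases the argument before comparing
    limit_area = limit_area.lower()
    if limit_area not in valid_limit_areas:
        raise ValueError('Invalid limit_area: expecting one of {}, got '
                         '{}.'.format(valid_limit_areas, limit_area))
    return limit_area
```

Because the value is lowercased first, ``'Inside'`` and ``'INSIDE'`` are accepted, while any other string raises ``ValueError``.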

pandas/core/internals.py

+6 -4
@@ -1073,8 +1073,8 @@ def coerce_to_target_dtype(self, other):
10731073

10741074
def interpolate(self, method='pad', axis=0, index=None, values=None,
10751075
inplace=False, limit=None, limit_direction='forward',
1076-
fill_value=None, coerce=False, downcast=None, mgr=None,
1077-
**kwargs):
1076+
limit_area=None, fill_value=None, coerce=False,
1077+
downcast=None, mgr=None, **kwargs):
10781078

10791079
inplace = validate_bool_kwarg(inplace, 'inplace')
10801080

@@ -1115,6 +1115,7 @@ def check_int_bool(self, inplace):
11151115
return self._interpolate(method=m, index=index, values=values,
11161116
axis=axis, limit=limit,
11171117
limit_direction=limit_direction,
1118+
limit_area=limit_area,
11181119
fill_value=fill_value, inplace=inplace,
11191120
downcast=downcast, mgr=mgr, **kwargs)
11201121

@@ -1148,8 +1149,8 @@ def _interpolate_with_fill(self, method='pad', axis=0, inplace=False,
11481149

11491150
def _interpolate(self, method=None, index=None, values=None,
11501151
fill_value=None, axis=0, limit=None,
1151-
limit_direction='forward', inplace=False, downcast=None,
1152-
mgr=None, **kwargs):
1152+
limit_direction='forward', limit_area=None,
1153+
inplace=False, downcast=None, mgr=None, **kwargs):
11531154
""" interpolate using scipy wrappers """
11541155

11551156
inplace = validate_bool_kwarg(inplace, 'inplace')
@@ -1177,6 +1178,7 @@ def func(x):
11771178
# i.e. not an arg to missing.interpolate_1d
11781179
return missing.interpolate_1d(index, x, method=method, limit=limit,
11791180
limit_direction=limit_direction,
1181+
limit_area=limit_area,
11801182
fill_value=fill_value,
11811183
bounds_error=False, **kwargs)
11821184

pandas/core/missing.py

+80 -50
@@ -111,7 +111,7 @@ def clean_interp_method(method, **kwargs):
111111

112112

113113
def interpolate_1d(xvalues, yvalues, method='linear', limit=None,
114-
limit_direction='forward', fill_value=None,
114+
limit_direction='forward', limit_area=None, fill_value=None,
115115
bounds_error=False, order=None, **kwargs):
116116
"""
117117
Logic for the 1-d interpolation. The result should be 1-d, inputs
@@ -151,28 +151,12 @@ def interpolate_1d(xvalues, yvalues, method='linear', limit=None,
151151
raise ValueError(msg.format(valid=valid_limit_directions,
152152
invalid=limit_direction))
153153

154-
from pandas import Series
155-
ys = Series(yvalues)
156-
start_nans = set(range(ys.first_valid_index()))
157-
end_nans = set(range(1 + ys.last_valid_index(), len(valid)))
158-
159-
# violate_limit is a list of the indexes in the series whose yvalue is
160-
# currently NaN, and should still be NaN after the interpolation.
161-
# Specifically:
162-
#
163-
# If limit_direction='forward' or None then the list will contain NaNs at
164-
# the beginning of the series, and NaNs that are more than 'limit' away
165-
# from the prior non-NaN.
166-
#
167-
# If limit_direction='backward' then the list will contain NaNs at
168-
# the end of the series, and NaNs that are more than 'limit' away
169-
# from the subsequent non-NaN.
170-
#
171-
# If limit_direction='both' then the list will contain NaNs that
172-
# are more than 'limit' away from any non-NaN.
173-
#
174-
# If limit=None, then use default behavior of filling an unlimited number
175-
# of NaNs in the direction specified by limit_direction
154+
if limit_area is not None:
155+
valid_limit_areas = ['inside', 'outside']
156+
limit_area = limit_area.lower()
157+
if limit_area not in valid_limit_areas:
158+
raise ValueError('Invalid limit_area: expecting one of {}, got '
159+
'{}.'.format(valid_limit_areas, limit_area))
176160

177161
# default limit is unlimited GH #16282
178162
if limit is None:
@@ -183,22 +167,43 @@ def interpolate_1d(xvalues, yvalues, method='linear', limit=None,
183167
elif limit < 1:
184168
raise ValueError('Limit must be greater than 0')
185169

186-
# each possible limit_direction
187-
# TODO: do we need sorted?
188-
if limit_direction == 'forward' and limit is not None:
189-
violate_limit = sorted(start_nans |
190-
set(_interp_limit(invalid, limit, 0)))
191-
elif limit_direction == 'forward':
192-
violate_limit = sorted(start_nans)
193-
elif limit_direction == 'backward' and limit is not None:
194-
violate_limit = sorted(end_nans |
195-
set(_interp_limit(invalid, 0, limit)))
170+
from pandas import Series
171+
ys = Series(yvalues)
172+
173+
# These are sets of index pointers to invalid values... i.e. {0, 1, etc...
174+
all_nans = set(np.flatnonzero(invalid))
175+
start_nans = set(range(ys.first_valid_index()))
176+
end_nans = set(range(1 + ys.last_valid_index(), len(valid)))
177+
mid_nans = all_nans - start_nans - end_nans
178+
179+
# Like the sets above, preserve_nans contains indices of invalid values,
180+
# but in this case, it is the final set of indices that need to be
181+
# preserved as NaN after the interpolation.
182+
183+
# For example if limit_direction='forward' then preserve_nans will
184+
# contain indices of NaNs at the beginning of the series, and NaNs that
185+
# are more than 'limit' away from the prior non-NaN.
186+
187+
# set preserve_nans based on direction using _interp_limit
188+
if limit_direction == 'forward':
189+
preserve_nans = start_nans | set(_interp_limit(invalid, limit, 0))
196190
elif limit_direction == 'backward':
197-
violate_limit = sorted(end_nans)
198-
elif limit_direction == 'both' and limit is not None:
199-
violate_limit = sorted(_interp_limit(invalid, limit, limit))
191+
preserve_nans = end_nans | set(_interp_limit(invalid, 0, limit))
200192
else:
201-
violate_limit = []
193+
# both directions... just use _interp_limit
194+
preserve_nans = set(_interp_limit(invalid, limit, limit))
195+
196+
# if limit_area is set, add either mid or outside indices
197+
# to preserve_nans GH #16284
198+
if limit_area == 'inside':
199+
# preserve NaNs on the outside
200+
preserve_nans |= start_nans | end_nans
201+
elif limit_area == 'outside':
202+
# preserve NaNs on the inside
203+
preserve_nans |= mid_nans
204+
205+
# sort preserve_nans and convert to list
206+
preserve_nans = sorted(preserve_nans)
202207

203208
xvalues = getattr(xvalues, 'values', xvalues)
204209
yvalues = getattr(yvalues, 'values', yvalues)
@@ -215,7 +220,7 @@ def interpolate_1d(xvalues, yvalues, method='linear', limit=None,
215220
else:
216221
inds = xvalues
217222
result[invalid] = np.interp(inds[invalid], inds[valid], yvalues[valid])
218-
result[violate_limit] = np.nan
223+
result[preserve_nans] = np.nan
219224
return result
220225

221226
sp_methods = ['nearest', 'zero', 'slinear', 'quadratic', 'cubic',
@@ -234,7 +239,7 @@ def interpolate_1d(xvalues, yvalues, method='linear', limit=None,
234239
fill_value=fill_value,
235240
bounds_error=bounds_error,
236241
order=order, **kwargs)
237-
result[violate_limit] = np.nan
242+
result[preserve_nans] = np.nan
238243
return result
239244

240245

@@ -646,8 +651,24 @@ def fill_zeros(result, x, y, name, fill):
646651

647652

648653
def _interp_limit(invalid, fw_limit, bw_limit):
649-
"""Get idx of values that won't be filled b/c they exceed the limits.
654+
"""
655+
Get indexers of values that won't be filled
656+
because they exceed the limits.
657+
658+
Parameters
659+
----------
660+
invalid : boolean ndarray
661+
fw_limit : int or None
662+
forward limit to index
663+
bw_limit : int or None
664+
backward limit to index
665+
666+
Returns
667+
-------
668+
set of indexers
650669
670+
Notes
671+
-----
651672
This is equivalent to the more readable, but slower
652673
653674
.. code-block:: python
@@ -660,6 +681,8 @@ def _interp_limit(invalid, fw_limit, bw_limit):
660681
# 1. operate on the reversed array
661682
# 2. subtract the returned indices from N - 1
662683
N = len(invalid)
684+
f_idx = set()
685+
b_idx = set()
663686

664687
def inner(invalid, limit):
665688
limit = min(limit, N)
@@ -668,18 +691,25 @@ def inner(invalid, limit):
668691
set(np.where((~invalid[:limit + 1]).cumsum() == 0)[0]))
669692
return idx
670693

671-
if fw_limit == 0:
672-
f_idx = set(np.where(invalid)[0])
673-
else:
674-
f_idx = inner(invalid, fw_limit)
694+
if fw_limit is not None:
675695

676-
if bw_limit == 0:
677-
# then we don't even need to care about backwards, just use forwards
678-
return f_idx
679-
else:
680-
b_idx = set(N - 1 - np.asarray(list(inner(invalid[::-1], bw_limit))))
681696
if fw_limit == 0:
682-
return b_idx
697+
f_idx = set(np.where(invalid)[0])
698+
else:
699+
f_idx = inner(invalid, fw_limit)
700+
701+
if bw_limit is not None:
702+
703+
if bw_limit == 0:
704+
# then we don't even need to care about backwards
705+
# just use forwards
706+
return f_idx
707+
else:
708+
b_idx = list(inner(invalid[::-1], bw_limit))
709+
b_idx = set(N - 1 - np.asarray(b_idx))
710+
if fw_limit == 0:
711+
return b_idx
712+
683713
return f_idx & b_idx
684714

685715
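
To make the ``preserve_nans`` bookkeeping in ``interpolate_1d`` easier to follow, here is a slow, pure-Python sketch of the same set logic. It is an illustration under stated assumptions, not the pandas implementation: ``interp_limit`` and ``preserve_nan_indices`` are hypothetical stand-ins, lists replace NumPy arrays, and the input is assumed to contain at least one valid value.

```python
import math


def interp_limit(invalid, fw_limit, bw_limit):
    """Readable stand-in for _interp_limit (hypothetical, slow version).

    Returns the positions x whose window [x - fw_limit, x + bw_limit]
    contains only invalid entries. A limit of None means "unlimited",
    for which the commit's defaults yield an empty set.
    """
    if fw_limit is None or bw_limit is None:
        return set()
    return {
        x for x, bad in enumerate(invalid)
        if bad and all(invalid[max(0, x - fw_limit): x + bw_limit + 1])
    }


def preserve_nan_indices(values, limit=None, limit_direction='forward',
                         limit_area=None):
    """Indices that must remain NaN after interpolation (sketch)."""
    invalid = [math.isnan(v) for v in values]
    valid_positions = [i for i, bad in enumerate(invalid) if not bad]
    first, last = valid_positions[0], valid_positions[-1]

    all_nans = {i for i, bad in enumerate(invalid) if bad}
    start_nans = set(range(first))
    end_nans = set(range(last + 1, len(values)))
    mid_nans = all_nans - start_nans - end_nans

    # Direction decides which edge NaNs are kept and how the limit applies
    if limit_direction == 'forward':
        preserve = start_nans | interp_limit(invalid, limit, 0)
    elif limit_direction == 'backward':
        preserve = end_nans | interp_limit(invalid, 0, limit)
    else:
        preserve = interp_limit(invalid, limit, limit)

    # limit_area adds the region that must not be filled
    if limit_area == 'inside':
        preserve |= start_nans | end_nans
    elif limit_area == 'outside':
        preserve |= mid_nans
    return sorted(preserve)


nan = math.nan
ser = [nan, nan, 5, nan, nan, nan, 13, nan, nan]
inside_both_1 = preserve_nan_indices(ser, limit=1, limit_direction='both',
                                     limit_area='inside')
outside_backward = preserve_nan_indices(ser, limit_direction='backward',
                                        limit_area='outside')
outside_both = preserve_nan_indices(ser, limit_direction='both',
                                    limit_area='outside')
```

For the documentation series, the first call leaves positions 3 and 5 to be filled, the second leaves only the two leading NaNs, and the third leaves both the leading and trailing NaNs.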
