Commit 4f7cb74

cbpygit, MarcoGorelli, and mroeschke authored
Fix/time series interpolation is wrong 21351 (#56515)
* fix: Fixes wrong doctest output in `pandas.core.resample.Resampler.interpolate` and the related explanation about consideration of anchor points when interpolating downsampled series with non-aligned result index.
* Resolved merge conflicts
* fix: Fixes wrong test case assumption for interpolation

  Fixes assumption in `test_interp_basic_with_non_range_index`. If the index is [1, 2, 3, 5] and values are [1, 2, np.nan, 4], it is wrong to expect that interpolation will result in 3 for the missing value in case of linear interpolation. It will rather be 2.666...
* fix: Make sure frequency indexes are preserved with new interpolation approach
* fix: Fixes new-style up-sampling interpolation for MultiIndexes resulting from groupby-operations
* fix: Fixes wrong test case assumption when using linear interpolation on series with datetime index using business days only (test case `pandas.tests.series.methods.test_interpolate.TestSeriesInterpolateData.test_interpolate`).
* fix: Fixes wrong test case assumption when using linear interpolation on irregular index (test case `pandas.tests.series.methods.test_interpolate.TestSeriesInterpolateData.test_nan_irregular_index`).
* fix: Adds test skips for interpolation methods that require scipy if scipy is not installed
* fix: Makes sure keyword argument "downcast" is not passed to scipy interpolation methods that are not using `interp1d` or spline.
* fix: Adjusted expected warning type in `test_groupby_resample_interpolate_off_grid`.
* fix: Fixes failing interpolation on groupby if the index has `name`=None. Adds this check to an existing test case.
* Trigger Actions
* feat: Raise error on attempt to interpolate a MultiIndex data frame, providing a useful error message that describes a working alternative syntax. Fixed related test cases and added test that makes sure the error is raised.
* Apply suggestions from code review

  Co-authored-by: Matthew Roeschke <[email protected]>
* refactor: Adjusted error type assertion in test case
* refactor: Removed unused parametrization definitions and switched to direct parametrization for interpolation methods in tests.
* fix: Adds forgotten "@" before pytest.mark.parametrize
* refactor: Apply suggestions from code review
* refactor: Switched to fixture params syntax for test case parametrization
* Update pandas/tests/resample/test_time_grouper.py

  Co-authored-by: Matthew Roeschke <[email protected]>
* Update pandas/tests/resample/test_base.py

  Co-authored-by: Matthew Roeschke <[email protected]>
* refactor: Fixes too long line
* tests: Fixes test that fails due to unimportant index name comparison
* docs: Added entry in whatsnew
* Empty-Commit
* Empty-Commit
* Empty-Commit
* docs: Sorted whatsnew
* docs: Adjusted bug fix note and moved it to the right section

---------

Co-authored-by: Marco Edward Gorelli <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>
1 parent 661d7f0 commit 4f7cb74

File tree

7 files changed

+239
-48
lines changed


doc/source/whatsnew/v3.0.0.rst

+1
@@ -438,6 +438,7 @@ Groupby/resample/rolling
 - Bug in :meth:`.DataFrameGroupBy.groups` and :meth:`.SeriesGroupby.groups` that would not respect groupby argument ``dropna`` (:issue:`55919`)
 - Bug in :meth:`.DataFrameGroupBy.median` where nat values gave an incorrect result. (:issue:`57926`)
 - Bug in :meth:`.DataFrameGroupBy.quantile` when ``interpolation="nearest"`` is inconsistent with :meth:`DataFrame.quantile` (:issue:`47942`)
+- Bug in :meth:`.Resampler.interpolate` on a :class:`DataFrame` with non-uniform sampling and/or indices not aligning with the resulting resampled index would result in wrong interpolation (:issue:`21351`)
 - Bug in :meth:`DataFrame.ewm` and :meth:`Series.ewm` when passed ``times`` and aggregation functions other than mean (:issue:`51695`)
 - Bug in :meth:`DataFrameGroupBy.apply` that was returning a completely empty DataFrame when all return values of ``func`` were ``None`` instead of returning an empty DataFrame with the original columns and dtypes. (:issue:`57775`)

pandas/core/missing.py

+13-1
@@ -314,7 +314,16 @@ def get_interp_index(method, index: Index) -> Index:
         # prior default
         from pandas import Index

-        index = Index(np.arange(len(index)))
+        if isinstance(index.dtype, DatetimeTZDtype) or lib.is_np_dtype(
+            index.dtype, "mM"
+        ):
+            # Convert datetime-like indexes to int64
+            index = Index(index.view("i8"))
+
+        elif not is_numeric_dtype(index.dtype):
+            # We keep behavior consistent with prior versions of pandas for
+            # non-numeric, non-datetime indexes
+            index = Index(range(len(index)))
     else:
         methods = {"index", "values", "nearest", "time"}
         is_numeric_or_datetime = (
@@ -616,6 +625,9 @@ def _interpolate_scipy_wrapper(
         terp = alt_methods.get(method, None)
         if terp is None:
             raise ValueError(f"Can not interpolate with method={method}.")
+
+        # Make sure downcast is not in kwargs for alt methods
+        kwargs.pop("downcast", None)
         new_y = terp(x, y, new_x, **kwargs)
     return new_y
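
The datetime branch in `get_interp_index` converts timestamps to int64 so that the real spacing between samples enters the interpolation. A minimal sketch of that idea follows. It uses NumPy's `np.interp` rather than the pandas internals, and the helper `to_ticks` and all variable names are assumptions made for illustration. The timestamps are borrowed from the irregular-sampling test added in this commit.

```python
import numpy as np
import pandas as pd


def to_ticks(index: pd.DatetimeIndex) -> np.ndarray:
    """Hypothetical helper: timestamps -> int64 nanoseconds since the epoch."""
    return index.to_numpy().astype("datetime64[ns]").astype("int64")


# Irregularly spaced timestamps carrying evenly spaced values.
index = pd.DatetimeIndex(
    [
        "2000-01-01 00:00:03",
        "2000-01-01 00:00:22",
        "2000-01-01 00:00:24",
        "2000-01-01 00:00:31",
        "2000-01-01 00:00:39",
    ]
)
values = np.linspace(0.0, 1.0, 5)

# Use the integer ticks as x-coordinates for linear interpolation.
x = to_ticks(index)
query = to_ticks(pd.DatetimeIndex(["2000-01-01 00:00:10"]))
by_time = np.interp(query, x, values)[0]

# The same value by hand: 7 s into a 19 s gap between 0.0 and 0.25.
by_hand = 0.25 * (10 - 3) / (22 - 3)
print(by_time, by_hand)
```

Because only ratios of differences matter, the resolution of the integer ticks does not change the result as long as sample and query timestamps use the same unit.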

pandas/core/resample.py

+55-13
@@ -80,6 +80,7 @@
     TimedeltaIndex,
     timedelta_range,
 )
+from pandas.core.reshape.concat import concat

 from pandas.tseries.frequencies import (
     is_subperiod,
@@ -885,30 +886,59 @@ def interpolate(
         Freq: 500ms, dtype: float64

         Internal reindexing with ``asfreq()`` prior to interpolation leads to
-        an interpolated timeseries on the basis the reindexed timestamps (anchors).
-        Since not all datapoints from original series become anchors,
-        it can lead to misleading interpolation results as in the following example:
+        an interpolated timeseries on the basis of the reindexed timestamps
+        (anchors). It is assured that all available datapoints from original
+        series become anchors, so it also works for resampling-cases that lead
+        to non-aligned timestamps, as in the following example:

         >>> series.resample("400ms").interpolate("linear")
         2023-03-01 07:00:00.000    1.0
-        2023-03-01 07:00:00.400    1.2
-        2023-03-01 07:00:00.800    1.4
-        2023-03-01 07:00:01.200    1.6
-        2023-03-01 07:00:01.600    1.8
+        2023-03-01 07:00:00.400    0.2
+        2023-03-01 07:00:00.800   -0.6
+        2023-03-01 07:00:01.200   -0.4
+        2023-03-01 07:00:01.600    0.8
         2023-03-01 07:00:02.000    2.0
-        2023-03-01 07:00:02.400    2.2
-        2023-03-01 07:00:02.800    2.4
-        2023-03-01 07:00:03.200    2.6
-        2023-03-01 07:00:03.600    2.8
+        2023-03-01 07:00:02.400    1.6
+        2023-03-01 07:00:02.800    1.2
+        2023-03-01 07:00:03.200    1.4
+        2023-03-01 07:00:03.600    2.2
         2023-03-01 07:00:04.000    3.0
         Freq: 400ms, dtype: float64

-        Note that the series erroneously increases between two anchors
+        Note that the series correctly decreases between two anchors
         ``07:00:00`` and ``07:00:02``.
         """
         assert downcast is lib.no_default  # just checking coverage
         result = self._upsample("asfreq")
-        return result.interpolate(
+
+        # If the original data has timestamps which are not aligned with the
+        # target timestamps, we need to add those points back to the data frame
+        # that is supposed to be interpolated. This does not work with
+        # PeriodIndex, so we skip this case. GH#21351
+        obj = self._selected_obj
+        is_period_index = isinstance(obj.index, PeriodIndex)
+
+        # Skip this step for PeriodIndex
+        if not is_period_index:
+            final_index = result.index
+            if isinstance(final_index, MultiIndex):
+                raise NotImplementedError(
+                    "Direct interpolation of MultiIndex data frames is not "
+                    "supported. If you tried to resample and interpolate on a "
+                    "grouped data frame, please use:\n"
+                    "`df.groupby(...).apply(lambda x: x.resample(...)."
+                    "interpolate(...), include_groups=False)`"
+                    "\ninstead, as resampling and interpolation has to be "
+                    "performed for each group independently."
+                )
+
+            missing_data_points_index = obj.index.difference(final_index)
+            if len(missing_data_points_index) > 0:
+                result = concat(
+                    [result, obj.loc[missing_data_points_index]]
+                ).sort_index()
+
+        result_interpolated = result.interpolate(
             method=method,
             axis=axis,
             limit=limit,
@@ -919,6 +949,18 @@ def interpolate(
             **kwargs,
         )

+        # No further steps if the original data has a PeriodIndex
+        if is_period_index:
+            return result_interpolated
+
+        # Make sure that original data points which do not align with the
+        # resampled index are removed
+        result_interpolated = result_interpolated.loc[final_index]
+
+        # Make sure frequency indexes are preserved
+        result_interpolated.index = final_index
+        return result_interpolated
+
     @final
     def asfreq(self, fill_value=None):
         """

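The approach in this hunk (add the off-grid original points as anchors, interpolate, then keep only the target grid) can be sketched outside the `Resampler` class roughly as follows. This is an illustrative approximation, not the code path pandas takes. It uses `reindex` on the union of both indexes instead of `concat`, and `method="time"` so that timestamps are used as x-coordinates. The sample series is an assumption chosen to reproduce the values in the updated doctest output, since its definition lies outside this hunk.

```python
import numpy as np
import pandas as pd

# Assumed sample data: one value per second (not shown in this hunk).
start = pd.Timestamp("2023-03-01 07:00:00")
timesteps = pd.date_range(start, periods=5, freq=pd.Timedelta(seconds=1))
series = pd.Series([1.0, -1.0, 2.0, 1.0, 3.0], index=timesteps)

# Target grid with 400 ms spacing; most grid points miss the original samples.
target = pd.date_range(start, timesteps[-1], freq=pd.Timedelta(milliseconds=400))

# Step 1: keep every original sample as an anchor by reindexing to the union.
combined = series.reindex(series.index.union(target))

# Step 2: interpolate using the timestamps as x-coordinates.
interpolated = combined.interpolate(method="time")

# Step 3: drop the anchors that are not part of the target grid.
resampled = interpolated.reindex(target)
print(resampled)
```

With these assumed inputs the values at the target timestamps follow the dip to -1.0 at `07:00:01`, which an interpolation anchored only at `07:00:00`, `07:00:02`, and `07:00:04` cannot see.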
pandas/tests/frame/methods/test_interpolate.py

+1-1
@@ -109,7 +109,7 @@ def test_interp_basic_with_non_range_index(self, using_infer_string):
         else:
             result = df.set_index("C").interpolate()
         expected = df.set_index("C")
-        expected.loc[3, "A"] = 3
+        expected.loc[3, "A"] = 2.66667
         expected.loc[5, "B"] = 9
         tm.assert_frame_equal(result, expected)
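
The new expected value follows from weighting by the index instead of by position. A small check of the arithmetic, using `np.interp` as a stand-in for index-aware linear interpolation (the arrays mirror the index and values quoted in the commit message; the variable names are illustrative):

```python
import numpy as np

index = np.array([1.0, 2.0, 3.0, 5.0])
values = np.array([1.0, 2.0, np.nan, 4.0])

# Interpolate the missing entry from the known neighbours, weighted by index.
known = ~np.isnan(values)
filled = np.interp(index, index[known], values[known])

# By hand: start at 2, move 1/3 of the way from index 2 to index 5.
by_hand = 2.0 + (4.0 - 2.0) * (3.0 - 2.0) / (5.0 - 2.0)
print(filled[2], by_hand)
```

Treating the four rows as equally spaced would instead place the gap halfway between 2 and 4, which is where the previous expectation of 3 came from.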

pandas/tests/resample/test_base.py

+73
@@ -25,6 +25,29 @@
 from pandas.core.resample import _asfreq_compat


+@pytest.fixture(
+    params=[
+        "linear",
+        "time",
+        "index",
+        "values",
+        "nearest",
+        "zero",
+        "slinear",
+        "quadratic",
+        "cubic",
+        "barycentric",
+        "krogh",
+        "from_derivatives",
+        "piecewise_polynomial",
+        "pchip",
+        "akima",
+    ],
+)
+def all_1d_no_arg_interpolation_methods(request):
+    return request.param
+
+
 @pytest.mark.parametrize("freq", ["2D", "1h"])
 @pytest.mark.parametrize(
     "index",
@@ -91,6 +114,56 @@ def test_resample_interpolate(index):
     tm.assert_frame_equal(result, expected)


+def test_resample_interpolate_regular_sampling_off_grid(
+    all_1d_no_arg_interpolation_methods,
+):
+    pytest.importorskip("scipy")
+    # GH#21351
+    index = date_range("2000-01-01 00:01:00", periods=5, freq="2h")
+    ser = Series(np.arange(5.0), index)
+
+    method = all_1d_no_arg_interpolation_methods
+    # Resample to 1 hour sampling and interpolate with the given method
+    ser_resampled = ser.resample("1h").interpolate(method)
+
+    # Check that none of the resampled values are NaN, except the first one
+    # which lies 1 minute before the first actual data point
+    assert np.isnan(ser_resampled.iloc[0])
+    assert not ser_resampled.iloc[1:].isna().any()
+
+    if method not in ["nearest", "zero"]:
+        # Check that the resampled values are close to the expected values
+        # except for methods with known inaccuracies
+        assert np.all(
+            np.isclose(ser_resampled.values[1:], np.arange(0.5, 4.5, 0.5), rtol=1.0e-1)
+        )
+
+
+def test_resample_interpolate_irregular_sampling(all_1d_no_arg_interpolation_methods):
+    pytest.importorskip("scipy")
+    # GH#21351
+    ser = Series(
+        np.linspace(0.0, 1.0, 5),
+        index=DatetimeIndex(
+            [
+                "2000-01-01 00:00:03",
+                "2000-01-01 00:00:22",
+                "2000-01-01 00:00:24",
+                "2000-01-01 00:00:31",
+                "2000-01-01 00:00:39",
+            ]
+        ),
+    )
+
+    # Resample to 5 second sampling and interpolate with the given method
+    ser_resampled = ser.resample("5s").interpolate(all_1d_no_arg_interpolation_methods)
+
+    # Check that none of the resampled values are NaN, except the first one
+    # which lies 3 seconds before the first actual data point
+    assert np.isnan(ser_resampled.iloc[0])
+    assert not ser_resampled.iloc[1:].isna().any()
+
+
 def test_raises_on_non_datetimelike_index():
     # this is a non datetimelike index
     xp = DataFrame()
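
The loose `rtol=1.0e-1` in `test_resample_interpolate_regular_sampling_off_grid` is easier to read once the one-minute offset is worked out. The sketch below reproduces the arithmetic for the purely linear case with `np.interp`. Expressing the timestamps as minutes since midnight is an assumption made to keep the example free of datetime handling.

```python
import numpy as np

# Samples at 00:01, 02:01, ..., 08:01, expressed as minutes since midnight.
sample_minutes = 1.0 + 120.0 * np.arange(5)
sample_values = np.arange(5.0)

# Hourly grid points inside the sampled range: 01:00, 02:00, ..., 08:00.
grid_minutes = 60.0 * np.arange(1, 9)

# Exact linear interpolation versus the nominal values used by the test.
exact = np.interp(grid_minutes, sample_minutes, sample_values)
nominal = np.arange(0.5, 4.5, 0.5)
max_offset = np.max(np.abs(exact - nominal))
print(exact[0], max_offset)
```

Each grid point lies one minute before the midpoint or sample it is compared against, so the linear result is lower than the nominal value by 1/120 throughout. The relative tolerance absorbs that offset for the linear case and leaves room for the non-linear methods in the fixture.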

pandas/tests/resample/test_time_grouper.py

+89-31
@@ -333,26 +333,98 @@ def test_upsample_sum(method, method_args, expected_values):
     tm.assert_series_equal(result, expected)


-def test_groupby_resample_interpolate():
+@pytest.fixture
+def groupy_test_df():
+    return DataFrame(
+        {"price": [10, 11, 9], "volume": [50, 60, 50]},
+        index=date_range("01/01/2018", periods=3, freq="W"),
+    )
+
+
+def test_groupby_resample_interpolate_raises(groupy_test_df):
+    # GH 35325
+
+    # Make a copy of the test data frame that has index.name=None
+    groupy_test_df_without_index_name = groupy_test_df.copy()
+    groupy_test_df_without_index_name.index.name = None
+
+    dfs = [groupy_test_df, groupy_test_df_without_index_name]
+
+    for df in dfs:
+        msg = "DataFrameGroupBy.resample operated on the grouping columns"
+        with tm.assert_produces_warning(DeprecationWarning, match=msg):
+            with pytest.raises(
+                NotImplementedError,
+                match="Direct interpolation of MultiIndex data frames is "
+                "not supported",
+            ):
+                df.groupby("volume").resample("1D").interpolate(method="linear")
+
+
+def test_groupby_resample_interpolate_with_apply_syntax(groupy_test_df):
     # GH 35325
-    d = {"price": [10, 11, 9], "volume": [50, 60, 50]}

-    df = DataFrame(d)
+    # Make a copy of the test data frame that has index.name=None
+    groupy_test_df_without_index_name = groupy_test_df.copy()
+    groupy_test_df_without_index_name.index.name = None

-    df["week_starting"] = date_range("01/01/2018", periods=3, freq="W")
+    dfs = [groupy_test_df, groupy_test_df_without_index_name]

-    msg = "DataFrameGroupBy.resample operated on the grouping columns"
-    with tm.assert_produces_warning(DeprecationWarning, match=msg):
-        result = (
-            df.set_index("week_starting")
-            .groupby("volume")
-            .resample("1D")
-            .interpolate(method="linear")
+    for df in dfs:
+        result = df.groupby("volume").apply(
+            lambda x: x.resample("1d").interpolate(method="linear"),
+            include_groups=False,
         )

-    volume = [50] * 15 + [60]
-    week_starting = list(date_range("2018-01-07", "2018-01-21")) + [
-        Timestamp("2018-01-14")
+        volume = [50] * 15 + [60]
+        week_starting = list(date_range("2018-01-07", "2018-01-21")) + [
+            Timestamp("2018-01-14")
+        ]
+        expected_ind = pd.MultiIndex.from_arrays(
+            [volume, week_starting],
+            names=["volume", df.index.name],
+        )
+
+        expected = DataFrame(
+            data={
+                "price": [
+                    10.0,
+                    9.928571428571429,
+                    9.857142857142858,
+                    9.785714285714286,
+                    9.714285714285714,
+                    9.642857142857142,
+                    9.571428571428571,
+                    9.5,
+                    9.428571428571429,
+                    9.357142857142858,
+                    9.285714285714286,
+                    9.214285714285714,
+                    9.142857142857142,
+                    9.071428571428571,
+                    9.0,
+                    11.0,
+                ]
+            },
+            index=expected_ind,
+        )
+        tm.assert_frame_equal(result, expected)
+
+
+def test_groupby_resample_interpolate_with_apply_syntax_off_grid(groupy_test_df):
+    """Similar test as test_groupby_resample_interpolate_with_apply_syntax but
+    with resampling that results in missing anchor points when interpolating.
+    See GH#21351."""
+    # GH#21351
+    result = groupy_test_df.groupby("volume").apply(
+        lambda x: x.resample("265h").interpolate(method="linear"), include_groups=False
+    )
+
+    volume = [50, 50, 60]
+    week_starting = [
+        Timestamp("2018-01-07"),
+        Timestamp("2018-01-18 01:00:00"),
+        Timestamp("2018-01-14"),
     ]
     expected_ind = pd.MultiIndex.from_arrays(
         [volume, week_starting],
@@ -363,24 +435,10 @@ def test_groupby_resample_interpolate():
         data={
             "price": [
                 10.0,
-                9.928571428571429,
-                9.857142857142858,
-                9.785714285714286,
-                9.714285714285714,
-                9.642857142857142,
-                9.571428571428571,
-                9.5,
-                9.428571428571429,
-                9.357142857142858,
-                9.285714285714286,
-                9.214285714285714,
-                9.142857142857142,
-                9.071428571428571,
-                9.0,
+                9.21131,
                 11.0,
-            ],
-            "volume": [50.0] * 15 + [60],
+            ]
         },
         index=expected_ind,
     )
-    tm.assert_frame_equal(result, expected)
+    tm.assert_frame_equal(result, expected, check_names=False)
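
The new `NotImplementedError` asks callers to resample and interpolate each group independently. The same per-group idea can be sketched with an explicit loop, which avoids depending on the `include_groups` argument of `apply`. The loop, the dictionary name `pieces`, and the level name `week_starting` are illustrative assumptions and are not part of this commit. The data mirrors the `groupy_test_df` fixture.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"price": [10, 11, 9], "volume": [50, 60, 50]},
    index=pd.date_range("2018-01-01", periods=3, freq="W"),
)

# Resample and interpolate every group on its own, keyed by group label.
pieces = {}
for volume, group in df.groupby("volume"):
    pieces[volume] = group[["price"]].resample("1D").interpolate(method="linear")

# Stack the per-group results under a two-level index.
result = pd.concat(pieces, names=["volume", "week_starting"])
print(result)
```

For this on-grid daily case the group with volume 50 is filled linearly from 10.0 on 2018-01-07 to 9.0 on 2018-01-21, and the single-row group with volume 60 is returned unchanged.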

pandas/tests/series/methods/test_interpolate.py

+7-2
@@ -94,7 +94,12 @@ def test_interpolate(self, datetime_series):
         ts = Series(np.arange(len(datetime_series), dtype=float), datetime_series.index)

         ts_copy = ts.copy()
-        ts_copy[5:10] = np.nan
+
+        # Set data between Tuesday and Thursday to NaN for 2 consecutive weeks.
+        # Linear interpolation should fill in the missing values correctly,
+        # as the index is equally-spaced within each week.
+        ts_copy[1:4] = np.nan
+        ts_copy[6:9] = np.nan

         linear_interp = ts_copy.interpolate(method="linear")
         tm.assert_series_equal(linear_interp, ts)
@@ -265,7 +270,7 @@ def test_nan_interpolate(self, kwargs):
     def test_nan_irregular_index(self):
         s = Series([1, 2, np.nan, 4], index=[1, 3, 5, 9])
         result = s.interpolate()
-        expected = Series([1.0, 2.0, 3.0, 4.0], index=[1, 3, 5, 9])
+        expected = Series([1.0, 2.0, 2.6666666666666665, 4.0], index=[1, 3, 5, 9])
         tm.assert_series_equal(result, expected)

     def test_nan_str_index(self):
