DOC warn user about potential information loss in Resampler.interpolate #52198

Merged
merged 10 commits into from
Apr 7, 2023
84 changes: 83 additions & 1 deletion pandas/core/resample.py
@@ -825,7 +825,6 @@ def fillna(self, method, limit: int | None = None):
"""
return self._upsample(method, limit=limit)

@doc(NDFrame.interpolate, **_shared_docs_kwargs)
Contributor Author

@kopytjuk kopytjuk Apr 2, 2023

@mroeschke I removed the docstring concatenation here. It was misleading, since the user's original data probably contains no NaNs. I also added links to the related documentation so that users can read up on the available methods.

def interpolate(
self,
method: QuantileInterpolation = "linear",
@@ -840,7 +839,90 @@ def interpolate(
):
"""
Interpolate values according to different methods.

Returns
-------
DataFrame or Series
Interpolated values at the specified freq.

See Also
--------
core.resample.Resampler.asfreq: Return the values at the new freq,
essentially a reindex.
DataFrame.interpolate: Fill NaN values using an interpolation method.

Notes
-----
The original index is first reindexed
(see :meth:`core.resample.Resampler.asfreq`) to new time buckets,
then the interpolation of NaNs via :meth:`DataFrame.interpolate` happens.
Original data points whose timestamps do not coincide with a new time
bucket are dropped by the reindexing step. For non-equidistant time series
this behaviour may lead to data loss, as shown in the last example.

Examples
--------

>>> import datetime as dt
>>> timesteps = [
... dt.datetime(2023, 3, 1, 7, 0, 0),
... dt.datetime(2023, 3, 1, 7, 0, 1),
... dt.datetime(2023, 3, 1, 7, 0, 2),
... dt.datetime(2023, 3, 1, 7, 0, 3),
... dt.datetime(2023, 3, 1, 7, 0, 4)]
>>> series = pd.Series(data=[1, -1, 2, 1, 3], index=timesteps)
>>> series
2023-03-01 07:00:00 1
2023-03-01 07:00:01 -1
2023-03-01 07:00:02 2
2023-03-01 07:00:03 1
2023-03-01 07:00:04 3
dtype: int64

Downsample the series to 0.5Hz by providing a period of 2s

>>> series.resample("2s").interpolate("linear")
2023-03-01 07:00:00 1
2023-03-01 07:00:02 2
2023-03-01 07:00:04 3
Freq: 2S, dtype: int64

Upsample the series to 2Hz by providing a period of 500ms

>>> series.resample("500ms").interpolate("linear")
2023-03-01 07:00:00.000 1.0
2023-03-01 07:00:00.500 0.0
2023-03-01 07:00:01.000 -1.0
2023-03-01 07:00:01.500 0.5
2023-03-01 07:00:02.000 2.0
2023-03-01 07:00:02.500 1.5
2023-03-01 07:00:03.000 1.0
2023-03-01 07:00:03.500 2.0
2023-03-01 07:00:04.000 3.0
Freq: 500L, dtype: float64

Internal reindexing with ``asfreq()`` prior to interpolation leads to
an interpolated timeseries based on the reindexed timestamps (anchors).
Since not all datapoints from the original series become anchors,
this can lead to misleading interpolation results, as in the following example:

>>> series.resample("400ms").interpolate("linear")
2023-03-01 07:00:00.000 1.0
2023-03-01 07:00:00.400 1.2
2023-03-01 07:00:00.800 1.4
2023-03-01 07:00:01.200 1.6
2023-03-01 07:00:01.600 1.8
2023-03-01 07:00:02.000 2.0
2023-03-01 07:00:02.400 2.2
2023-03-01 07:00:02.800 2.4
2023-03-01 07:00:03.200 2.6
2023-03-01 07:00:03.600 2.8
2023-03-01 07:00:04.000 3.0
Freq: 400L, dtype: float64

Note that the series erroneously increases between the two anchors
``07:00:00`` and ``07:00:02``, although the original value at
``07:00:01`` is ``-1``.
"""

result = self._upsample("asfreq")
return result.interpolate(
method=method,
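For readers of this diff, the two steps named in the Notes section (reindex with `asfreq`, then interpolate the resulting NaNs) can be sketched on the docstring's own series. This is a minimal illustration and not part of the PR. The `time_aware` variant at the end is a hypothetical workaround, assumed here only to show what keeping every original data point would look like. It is not something the PR proposes, and the names `reindexed`, `anchors`, `two_step`, and `time_aware` are made up for this sketch.

```python
import pandas as pd

# Same toy series as in the docstring: one sample per second.
index = pd.date_range("2023-03-01 07:00:00", periods=5, freq="s")
series = pd.Series([1, -1, 2, 1, 3], index=index)

# Step 1: reindex to the 400 ms grid. Only original samples whose
# timestamps fall exactly on the grid survive; the rest become NaN.
reindexed = series.resample("400ms").asfreq()
anchors = reindexed.dropna()  # samples at 07:00:00, 07:00:02 and 07:00:04

# Step 2: interpolate the NaNs between the surviving anchors.
# The samples at 07:00:01 and 07:00:03 no longer influence the result.
two_step = reindexed.interpolate("linear")

# Hypothetical workaround (an assumption, not the PR's approach):
# interpolate on the union of old and new timestamps so that every
# original sample acts as an anchor, then keep only the new grid.
union = series.index.union(reindexed.index)
time_aware = (
    series.reindex(union)
    .interpolate(method="time")
    .reindex(reindexed.index)
)

print(pd.DataFrame({"two_step": two_step, "time_aware": time_aware}))
```

Comparing the two columns shows the effect the docstring warns about: `two_step` rises steadily between the anchors, while `time_aware` follows the dip towards the original value at `07:00:01`.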