
Commit 4a267c6

ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' (#31809)
1 parent b3f54d7 commit 4a267c6

15 files changed (+1026 -252 lines changed)

doc/source/user_guide/timeseries.rst

+55 -6
Original file line number | Diff line number | Diff line change
@@ -1572,19 +1572,16 @@ end of the interval is closed:
15721572
15731573
ts.resample('5Min', closed='left').mean()
15741574
1575-
Parameters like ``label`` and ``loffset`` are used to manipulate the resulting
1576-
labels. ``label`` specifies whether the result is labeled with the beginning or
1577-
the end of the interval. ``loffset`` performs a time adjustment on the output
1578-
labels.
1575+
Parameters like ``label`` are used to manipulate the resulting labels.
1576+
``label`` specifies whether the result is labeled with the beginning or
1577+
the end of the interval.
15791578

15801579
.. ipython:: python
15811580
15821581
ts.resample('5Min').mean() # by default label='left'
15831582
15841583
ts.resample('5Min', label='left').mean()
15851584
1586-
ts.resample('5Min', label='left', loffset='1s').mean()
1587-
15881585
.. warning::
15891586

15901587
The default values for ``label`` and ``closed`` is '**left**' for all
@@ -1789,6 +1786,58 @@ natural and functions similarly to :py:func:`itertools.groupby`:
17891786
17901787
See :ref:`groupby.iterating-label` or :class:`Resampler.__iter__` for more.
17911788

1789+
.. _timeseries.adjust-the-start-of-the-bins:
1790+
1791+
Use `origin` or `offset` to adjust the start of the bins
1792+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1793+
1794+
.. versionadded:: 1.1.0
1795+
1796+
By default, the bins of the grouping are anchored to midnight of the day on which the time series starts. This works well with frequencies that are multiples of a day (like ``30D``) or that divide a day evenly (like ``90s`` or ``1min``), but it can create inconsistencies with frequencies that do not meet these criteria. To change this behavior, you can specify a fixed Timestamp with the argument ``origin``.
1797+
1798+
For example:
1799+
1800+
.. ipython:: python
1801+
1802+
start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
1803+
middle = '2000-10-02 00:00:00'
1804+
rng = pd.date_range(start, end, freq='7min')
1805+
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
1806+
ts
1807+
1808+
Here we can see that, when using ``origin`` with its default value (``'start_day'``), the results after ``'2000-10-02 00:00:00'`` differ depending on where the time series starts:
1809+
1810+
.. ipython:: python
1811+
1812+
ts.resample('17min', origin='start_day').sum()
1813+
ts[middle:end].resample('17min', origin='start_day').sum()
1814+
1815+
1816+
Here we can see that, when setting ``origin`` to ``'epoch'``, the results after ``'2000-10-02 00:00:00'`` are identical regardless of where the time series starts:
1817+
1818+
.. ipython:: python
1819+
1820+
ts.resample('17min', origin='epoch').sum()
1821+
ts[middle:end].resample('17min', origin='epoch').sum()
1822+
1823+
1824+
If needed you can use a custom timestamp for ``origin``:
1825+
1826+
.. ipython:: python
1827+
1828+
ts.resample('17min', origin='2001-01-01').sum()
1829+
ts[middle:end].resample('17min', origin=pd.Timestamp('2001-01-01')).sum()
1830+
1831+
If needed, you can shift the bins with an ``offset`` Timedelta, which is added to the default ``origin``.
1832+
The following two examples are equivalent for this time series:
1833+
1834+
.. ipython:: python
1835+
1836+
ts.resample('17min', origin='start').sum()
1837+
ts.resample('17min', offset='23h30min').sum()
1838+
1839+
1840+
Note the use of ``'start'`` for ``origin`` in the first of these two examples. In that case, ``origin`` is set to the first timestamp of the time series.
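The effect of ``origin`` and ``offset`` on the bin edges can be illustrated with plain datetime arithmetic. The sketch below is a simplified model for left-closed bins, not the pandas implementation: ``first_bin_start`` is a hypothetical helper introduced only for this illustration. It assumes the bin edges lie on the grid ``origin + offset + k * freq`` and returns the grid point at or before the first timestamp.

```python
from datetime import datetime, timedelta


def first_bin_start(first, freq, origin, offset=timedelta(0)):
    """Left edge of the bin containing `first` (hypothetical helper, not pandas API)."""
    anchor = origin + offset
    # timedelta // timedelta floors to an int: the number of whole bins
    # between the anchor and the first timestamp.
    n_bins = (first - anchor) // freq
    return anchor + n_bins * freq


first = datetime(2000, 10, 1, 23, 30)  # first timestamp of the example series
freq = timedelta(minutes=17)

# origin='start_day': midnight of the first day
start_day = first_bin_start(first, freq, datetime(2000, 10, 1))
# origin='epoch': 1970-01-01
epoch = first_bin_start(first, freq, datetime(1970, 1, 1))
# origin='2000-01-01': a custom fixed timestamp
custom = first_bin_start(first, freq, datetime(2000, 1, 1))
# origin='start': the first timestamp itself
start = first_bin_start(first, freq, first)
# default origin shifted by offset='2min'
shifted = first_bin_start(first, freq, datetime(2000, 10, 1), timedelta(minutes=2))

for name, edge in [("start_day", start_day), ("epoch", epoch),
                   ("custom", custom), ("start", start), ("shifted", shifted)]:
    print(f"{name:>9}: {edge}")
```

Under these assumptions the computed edges agree with the first labels shown in the ``resample`` docstring examples of this commit (23:14, 23:18, 23:24, 23:30 and 23:16 respectively).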
17921841

17931842
.. _timeseries.periods:
17941843

doc/source/whatsnew/v1.1.0.rst

+43
@@ -152,6 +152,49 @@ For example:
152152
pd.to_datetime(tz_strs, format='%Y-%m-%d %H:%M:%S %z', utc=True)
153153
pd.to_datetime(tz_strs, format='%Y-%m-%d %H:%M:%S %z')
154154
155+
.. _whatsnew_110.grouper_resample_origin:
156+
157+
Grouper and resample now support the arguments origin and offset
158+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
159+
160+
:class:`Grouper` and :meth:`DataFrame.resample` now support the arguments ``origin`` and ``offset``, which let the user control the timestamp on which to adjust the grouping. (:issue:`31809`)
161+
162+
By default, the bins of the grouping are anchored to midnight of the day on which the time series starts. This works well with frequencies that are multiples of a day (like ``30D``) or that divide a day evenly (like ``90s`` or ``1min``), but it can create inconsistencies with frequencies that do not meet these criteria. To change this behavior, you can now specify a fixed timestamp with the argument ``origin``.
163+
164+
Two arguments are now deprecated (more information in the documentation of :meth:`DataFrame.resample`):
165+
166+
- ``base`` should be replaced by ``offset``.
167+
- ``loffset`` should be replaced by adding an offset directly to the index of the DataFrame after it has been resampled.
168+
169+
Small example of the use of ``origin``:
170+
171+
.. ipython:: python
172+
173+
start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
174+
middle = '2000-10-02 00:00:00'
175+
rng = pd.date_range(start, end, freq='7min')
176+
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
177+
ts
178+
179+
Resample with the default behavior ``'start_day'`` (origin is ``2000-10-01 00:00:00``):
180+
181+
.. ipython:: python
182+
183+
ts.resample('17min').sum()
184+
ts.resample('17min', origin='start_day').sum()
185+
186+
Resample using a fixed origin:
187+
188+
.. ipython:: python
189+
190+
ts.resample('17min', origin='epoch').sum()
191+
ts.resample('17min', origin='2000-01-01').sum()
192+
193+
If needed, you can adjust the bins with the argument ``offset`` (a Timedelta), which is added to the default ``origin``.
194+
195+
For a full example, see: :ref:`timeseries.adjust-the-start-of-the-bins`.
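The examples above only use ``resample``, while the heading also mentions ``Grouper``. The following is a minimal sketch, assuming pandas 1.1.0 or later, of passing the same ``origin`` argument through ``pd.Grouper``. The variable names ``via_resample`` and ``via_grouper`` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same series as in the whatsnew example.
rng = pd.date_range('2000-10-01 23:30:00', '2000-10-02 00:30:00', freq='7min')
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

# Resample with a fixed origin ...
via_resample = ts.resample('17min', origin='epoch').sum()

# ... and the same grouping expressed with pd.Grouper.
via_grouper = ts.groupby(pd.Grouper(freq='17min', origin='epoch')).sum()

print(via_resample)
print(via_grouper)
```

Both calls are expected to produce the same bins, since each anchors the 17-minute grid to the epoch.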
196+
197+
155198
.. _whatsnew_110.enhancements.other:
156199

157200
Other enhancements

pandas/_typing.py

+10
@@ -1,3 +1,4 @@
1+
from datetime import datetime, timedelta
12
from pathlib import Path
23
from typing import (
34
IO,
@@ -43,6 +44,15 @@
4344
PandasScalar = Union["Period", "Timestamp", "Timedelta", "Interval"]
4445
Scalar = Union[PythonScalar, PandasScalar]
4546

47+
# timestamp and timedelta convertible types
48+
49+
TimestampConvertibleTypes = Union[
50+
"Timestamp", datetime, np.datetime64, int, np.int64, float, str
51+
]
52+
TimedeltaConvertibleTypes = Union[
53+
"Timedelta", timedelta, np.timedelta64, int, np.int64, float, str
54+
]
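The two unions above only describe which argument types the annotations accept. As a rough illustration, the sketch below shows that values of several of those types convert to the same ``pd.Timestamp`` or ``pd.Timedelta``. Normalizing through these constructors is an assumption made for this example; the hunk does not show how ``resample`` converts its arguments internally.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

# Several spellings of the same origin, one per accepted type.
origins = [
    '2000-01-01',
    datetime(2000, 1, 1),
    np.datetime64('2000-01-01'),
    pd.Timestamp('2000-01-01'),
]
normalized_origins = [pd.Timestamp(o) for o in origins]

# Several spellings of the same two-minute offset.
offsets = [
    '2min',
    timedelta(minutes=2),
    np.timedelta64(2, 'm'),
    pd.Timedelta(minutes=2),
]
normalized_offsets = [pd.Timedelta(o) for o in offsets]

# An integer is interpreted as nanoseconds since the epoch by pd.Timestamp.
epoch = pd.Timestamp(0)

print(normalized_origins)
print(normalized_offsets)
print(epoch)
```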
55+
4656
# other
4757

4858
Dtype = Union[

pandas/core/generic.py

+113 -2
@@ -39,6 +39,8 @@
3939
Label,
4040
Level,
4141
Renamer,
42+
TimedeltaConvertibleTypes,
43+
TimestampConvertibleTypes,
4244
ValueKeyFunc,
4345
)
4446
from pandas.compat import set_function_name
@@ -7758,9 +7760,11 @@ def resample(
77587760
convention: str = "start",
77597761
kind: Optional[str] = None,
77607762
loffset=None,
7761-
base: int = 0,
7763+
base: Optional[int] = None,
77627764
on=None,
77637765
level=None,
7766+
origin: Union[str, TimestampConvertibleTypes] = "start_day",
7767+
offset: Optional[TimedeltaConvertibleTypes] = None,
77647768
) -> "Resampler":
77657769
"""
77667770
Resample time-series data.
@@ -7795,17 +7799,40 @@ def resample(
77957799
By default the input representation is retained.
77967800
loffset : timedelta, default None
77977801
Adjust the resampled time labels.
7802+
7803+
.. deprecated:: 1.1.0
7804+
You should add the loffset to the `df.index` after the resample.
7805+
See below.
7806+
77987807
base : int, default 0
77997808
For frequencies that evenly subdivide 1 day, the "origin" of the
78007809
aggregated intervals. For example, for '5min' frequency, base could
78017810
range from 0 through 4. Defaults to 0.
7811+
7812+
.. deprecated:: 1.1.0
7813+
The new arguments that you should use are 'offset' or 'origin'.
7814+
78027815
on : str, optional
78037816
For a DataFrame, column to use instead of index for resampling.
78047817
Column must be datetime-like.
7805-
78067818
level : str or int, optional
78077819
For a MultiIndex, level (name or number) to use for
78087820
resampling. `level` must be datetime-like.
7821+
origin : {'epoch', 'start', 'start_day'}, Timestamp or str, default 'start_day'
7822+
The timestamp on which to adjust the grouping. The timezone of origin
7823+
must match the timezone of the index.
7824+
If a timestamp is not used, these values are also supported:
7825+
7826+
- 'epoch': `origin` is 1970-01-01
7827+
- 'start': `origin` is the first value of the timeseries
7828+
- 'start_day': `origin` is midnight of the first day of the timeseries
7829+
7830+
.. versionadded:: 1.1.0
7831+
7832+
offset : Timedelta or str, default is None
7833+
An offset timedelta added to the origin.
7834+
7835+
.. versionadded:: 1.1.0
78097836
78107837
Returns
78117838
-------
@@ -8023,6 +8050,88 @@ def resample(
80238050
2000-01-02 22 140
80248051
2000-01-03 32 150
80258052
2000-01-04 36 90
8053+
8054+
If you want to adjust the start of the bins based on a fixed timestamp:
8055+
8056+
>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
8057+
>>> rng = pd.date_range(start, end, freq='7min')
8058+
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
8059+
>>> ts
8060+
2000-10-01 23:30:00 0
8061+
2000-10-01 23:37:00 3
8062+
2000-10-01 23:44:00 6
8063+
2000-10-01 23:51:00 9
8064+
2000-10-01 23:58:00 12
8065+
2000-10-02 00:05:00 15
8066+
2000-10-02 00:12:00 18
8067+
2000-10-02 00:19:00 21
8068+
2000-10-02 00:26:00 24
8069+
Freq: 7T, dtype: int64
8070+
8071+
>>> ts.resample('17min').sum()
8072+
2000-10-01 23:14:00 0
8073+
2000-10-01 23:31:00 9
8074+
2000-10-01 23:48:00 21
8075+
2000-10-02 00:05:00 54
8076+
2000-10-02 00:22:00 24
8077+
Freq: 17T, dtype: int64
8078+
8079+
>>> ts.resample('17min', origin='epoch').sum()
8080+
2000-10-01 23:18:00 0
8081+
2000-10-01 23:35:00 18
8082+
2000-10-01 23:52:00 27
8083+
2000-10-02 00:09:00 39
8084+
2000-10-02 00:26:00 24
8085+
Freq: 17T, dtype: int64
8086+
8087+
>>> ts.resample('17min', origin='2000-01-01').sum()
8088+
2000-10-01 23:24:00 3
8089+
2000-10-01 23:41:00 15
8090+
2000-10-01 23:58:00 45
8091+
2000-10-02 00:15:00 45
8092+
Freq: 17T, dtype: int64
8093+
8094+
If you want to adjust the start of the bins with an `offset` Timedelta, the two
8095+
following lines are equivalent:
8096+
8097+
>>> ts.resample('17min', origin='start').sum()
8098+
2000-10-01 23:30:00 9
8099+
2000-10-01 23:47:00 21
8100+
2000-10-02 00:04:00 54
8101+
2000-10-02 00:21:00 24
8102+
Freq: 17T, dtype: int64
8103+
8104+
>>> ts.resample('17min', offset='23h30min').sum()
8105+
2000-10-01 23:30:00 9
8106+
2000-10-01 23:47:00 21
8107+
2000-10-02 00:04:00 54
8108+
2000-10-02 00:21:00 24
8109+
Freq: 17T, dtype: int64
8110+
8111+
To replace the use of the deprecated `base` argument, you can now use `offset`.
8112+
In this example, it is equivalent to `base=2`:
8113+
8114+
>>> ts.resample('17min', offset='2min').sum()
8115+
2000-10-01 23:16:00 0
8116+
2000-10-01 23:33:00 9
8117+
2000-10-01 23:50:00 36
8118+
2000-10-02 00:07:00 39
8119+
2000-10-02 00:24:00 24
8120+
Freq: 17T, dtype: int64
8121+
8122+
To replace the use of the deprecated `loffset` argument:
8123+
8124+
>>> from pandas.tseries.frequencies import to_offset
8125+
>>> loffset = '19min'
8126+
>>> ts_out = ts.resample('17min').sum()
8127+
>>> ts_out.index = ts_out.index + to_offset(loffset)
8128+
>>> ts_out
8129+
2000-10-01 23:33:00 0
8130+
2000-10-01 23:50:00 9
8131+
2000-10-02 00:07:00 21
8132+
2000-10-02 00:24:00 54
8133+
2000-10-02 00:41:00 24
8134+
Freq: 17T, dtype: int64
80268135
"""
80278136
from pandas.core.resample import get_resampler
80288137

@@ -8039,6 +8148,8 @@ def resample(
80398148
base=base,
80408149
key=on,
80418150
level=level,
8151+
origin=origin,
8152+
offset=offset,
80428153
)
80438154

80448155
def first(self: FrameOrSeries, offset) -> FrameOrSeries:

pandas/core/groupby/groupby.py

-9
@@ -1646,15 +1646,6 @@ def resample(self, rule, *args, **kwargs):
16461646
0 2000-01-01 00:00:00 0 1
16471647
2000-01-01 00:03:00 0 2
16481648
5 2000-01-01 00:03:00 5 1
1649-
1650-
Add an offset of twenty seconds.
1651-
1652-
>>> df.groupby('a').resample('3T', loffset='20s').sum()
1653-
a b
1654-
a
1655-
0 2000-01-01 00:00:20 0 2
1656-
2000-01-01 00:03:20 0 1
1657-
5 2000-01-01 00:00:20 5 1
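The hunk removes the ``loffset='20s'`` example without showing a grouped replacement. One possible replacement, sketched below, is to shift the datetime level of the resulting MultiIndex after resampling. The frame ``df`` is an assumption reconstructed to be consistent with the rows visible in the removed output (the original definition lies outside this hunk). Selecting column ``'b'`` and using the ``'3min'`` alias are choices made for this sketch.

```python
import pandas as pd

# Assumed input: four one-minute rows, with 'a' acting as the group key.
idx = pd.date_range('2000-01-01', periods=4, freq='min')
df = pd.DataFrame({'a': [0, 0, 5, 0], 'b': [1, 1, 1, 1]}, index=idx)

# Resample each group into 3-minute bins (no loffset).
out = df.groupby('a')['b'].resample('3min').sum()

# Emulate loffset='20s' by shifting the datetime level (level 1) of the index.
shifted_level = out.index.levels[1] + pd.Timedelta('20s')
out.index = out.index.set_levels(shifted_level, level=1)

print(out)
```

This mirrors the approach the commit recommends for the ungrouped case, where the offset is added to the index after resampling.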
16581649
"""
16591650
from pandas.core.resample import get_resampler_for_grouping
16601651
