DOC: warning on look-ahead bias with resampling #26754

Merged
merged 2 commits into from
Jun 12, 2019
79 changes: 42 additions & 37 deletions doc/source/user_guide/timeseries.rst
@@ -761,34 +761,6 @@ regularity will result in a ``DatetimeIndex``, although frequency is lost:

    ts2[[0, 2, 6]].index

-.. _timeseries.iterating-label:
-
-Iterating through groups
-------------------------
-
-With the ``Resampler`` object in hand, iterating through the grouped data is very
-natural and functions similarly to :py:func:`itertools.groupby`:
-
-.. ipython:: python
-
-   small = pd.Series(
-       range(6),
-       index=pd.to_datetime(['2017-01-01T00:00:00',
-                             '2017-01-01T00:30:00',
-                             '2017-01-01T00:31:00',
-                             '2017-01-01T01:00:00',
-                             '2017-01-01T03:00:00',
-                             '2017-01-01T03:05:00'])
-   )
-   resampled = small.resample('H')
-
-   for name, group in resampled:
-       print("Group: ", name)
-       print("-" * 27)
-       print(group, end="\n\n")
-
-See :ref:`groupby.iterating-label` or :class:`Resampler.__iter__` for more.
-
 .. _timeseries.components:

 Time/Date Components
@@ -1628,24 +1600,29 @@ labels.

    ts.resample('5Min', label='left', loffset='1s').mean()

-.. note::
+.. warning::

-    The default values for ``label`` and ``closed`` is 'left' for all
+    The default values for ``label`` and ``closed`` is '**left**' for all
     frequency offsets except for 'M', 'A', 'Q', 'BM', 'BA', 'BQ', and 'W'
     which all have a default of 'right'.

+    This might lead to unintended look-ahead bias as in the following example:
Contributor

Is "look-ahead bias" a common term?

Contributor Author

@0x0L 0x0L Jun 10, 2019

It is in the context of timeseries forecasting.
EDIT: what about dropping the word "bias"?

Contributor

@TomAugspurger TomAugspurger Jun 10, 2019

Yes, dropping "bias", and perhaps adding a small definition sounds good. Something like:

    This might lead to unintended looking ahead, where the value for a later date (Sunday) is pulled back to
    a previous date (Friday).
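
For readers following this thread, here is a minimal sketch of the effect under discussion. It is not part of the PR. It assumes a recent pandas, so it uses `Series.dt.day_name()` in place of the `weekday_name` attribute shown in the diff (that attribute was removed in pandas 1.0). The helper `has_lookahead` is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

# Daily series whose values equal their own timestamps, so every resampled
# value reveals which original date it was taken from.
s = pd.date_range("2000-01-01", "2000-01-05").to_series()
s.iloc[2] = pd.NaT  # blank out Monday's value; its index label stays

# Defaults for 'B' are label='left', closed='left'.
default = s.resample("B").last()

# Explicitly right-labelled, right-closed bins.
right = s.resample("B", label="right", closed="right").last()


def has_lookahead(resampled: pd.Series) -> bool:
    """Return True if any bin label is earlier than the observation it reports.

    Hypothetical helper for this sketch, not a pandas API.
    """
    observed = resampled.dropna()
    return bool((observed.index < pd.DatetimeIndex(observed)).any())


print(default.dt.day_name())
print(right.dt.day_name())
print("default looks ahead:", has_lookahead(default))
print("right/right looks ahead:", has_lookahead(right))
```

With the defaults, the weekend observations fall into the bin labelled with the preceding Friday, which is the "pulled back" behaviour the proposed wording describes. With `label='right', closed='right'` the same observation is reported under the following Monday instead.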


     .. ipython:: python

-       rng2 = pd.date_range('1/1/2012', end='3/31/2012', freq='D')
-       ts2 = pd.Series(range(len(rng2)), index=rng2)
+       s = pd.date_range('2000-01-01', '2000-01-05').to_series()
+       s.iloc[2] = pd.NaT
+       s.dt.weekday_name

-       # default: label='right', closed='right'
-       ts2.resample('M').max()
+       # default: label='left', closed='left'
+       s.resample('B').last().dt.weekday_name

-       # default: label='left', closed='left'
-       ts2.resample('SM').max()
+    Notice how the value for Sunday got pulled back to the previous Friday.
+    To prevent any look-ahead bias, use instead

-       ts2.resample('SM', label='right', closed='right').max()
+    .. ipython:: python
+
+       s.resample('B', label='right', closed='right').last().dt.weekday_name

 The ``axis`` parameter can be set to 0 or 1 and allows you to resample the
 specified axis for a ``DataFrame``.
@@ -1796,6 +1773,34 @@ level of ``MultiIndex``, its name or location can be passed to the

    df.resample('M', level='d').sum()

+.. _timeseries.iterating-label:
+
+Iterating through groups
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+With the ``Resampler`` object in hand, iterating through the grouped data is very
+natural and functions similarly to :py:func:`itertools.groupby`:
+
+.. ipython:: python
+
+   small = pd.Series(
+       range(6),
+       index=pd.to_datetime(['2017-01-01T00:00:00',
+                             '2017-01-01T00:30:00',
+                             '2017-01-01T00:31:00',
+                             '2017-01-01T01:00:00',
+                             '2017-01-01T03:00:00',
+                             '2017-01-01T03:05:00'])
+   )
+   resampled = small.resample('H')
+
+   for name, group in resampled:
+       print("Group: ", name)
+       print("-" * 27)
+       print(group, end="\n\n")
+
+See :ref:`groupby.iterating-label` or :class:`Resampler.__iter__` for more.
+

 .. _timeseries.periods:
