PeriodIndex count & size fix #12874


Closed · wants to merge 1 commit
3 changes: 3 additions & 0 deletions doc/source/whatsnew/v0.18.1.txt
@@ -215,6 +215,9 @@ Bug Fixes
- ``usecols`` parameter in ``pd.read_csv`` is now respected even when the lines of a CSV file are not even (:issue:`12203`)
- Bug in ``groupby.transform(..)`` when ``axis=1`` is specified with a non-monotonic ordered index (:issue:`12713`)
- Bug in ``Period`` and ``PeriodIndex`` creation raising ``KeyError`` if ``freq="Minute"`` is specified. Note that the "Minute" freq was deprecated in v0.17.0; it is recommended to use ``freq="T"`` instead (:issue:`11854`)
- Bug in ``PeriodIndex.resample(...).count()`` always raised a ``TypeError`` (:issue:`12774`)
- Bug in ``PeriodIndex.resample`` casting to ``DatetimeIndex`` when empty (:issue:`12868`)
- Bug in ``PeriodIndex.resample`` when resampling to the existing frequency (:issue:`12770`)
- Bug in printing data containing ``Period`` objects with different ``freq``, which raised ``ValueError`` (:issue:`12615`)
- Bug in numpy compatibility of ``np.round()`` on a ``Series`` (:issue:`12600`)
- Bug in ``Series`` construction with ``Categorical`` when ``dtype='category'`` is specified (:issue:`12574`)
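
A brief sketch of the fixed resampling behaviour, mirroring the tests added in this PR (the outputs shown as comments follow the tests' expectations; exact reprs may vary by version):

import pandas as pd

s = pd.Series(1, index=pd.period_range(start='2000', periods=100, freq='D'))

# Previously raised TypeError (GH12774); now counts per target period
s.resample('M').count()
# 2000-01    31
# 2000-02    29
# 2000-03    31
# 2000-04     9
# Freq: M, dtype: int64

# Resampling to the same frequency now passes through instead of raising (GH12770)
s.resample('D').mean()

# Resampling an empty PeriodIndex now keeps a PeriodIndex (GH12868)
empty = pd.Series(index=pd.PeriodIndex([], freq='D', name='idx'))
empty.resample('M').sum().index  # PeriodIndex([], name='idx', freq='M')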
41 changes: 19 additions & 22 deletions pandas/tseries/resample.py
@@ -653,9 +653,6 @@ def _convert_obj(self, obj):
# Cannot have multiple of periods, convert to timestamp
self.kind = 'timestamp'

if not len(obj):
self.kind = 'timestamp'

# convert to timestamp
if not (self.kind is None or self.kind == 'period'):
obj = obj.to_timestamp(how=self.convention)
@@ -673,18 +670,15 @@ def aggregate(self, arg, *args, **kwargs):
def _get_new_index(self):
""" return our new index """
ax = self.ax
ax_attrs = ax._get_attributes_dict()
ax_attrs['freq'] = self.freq
obj = self._selected_obj

if len(ax) == 0:
new_index = PeriodIndex(data=[], **ax_attrs)
return obj.reindex(new_index)

start = ax[0].asfreq(self.freq, how=self.convention)
end = ax[-1].asfreq(self.freq, how='end')
values = []
else:
start = ax[0].asfreq(self.freq, how=self.convention)
Contributor:

So this doesn't break the test; I suspect if you did how=self.convention it would work, no?
Or maybe just ignore convention entirely and make:

start = ax[0].asfreq(self.freq, how='start')
end = ax[-1].asfreq(self.freq, how='end')

Contributor Author:

That's the suggestion here, but it breaks that test above. Or am I misunderstanding?

Contributor:

Actually, I think your change makes sense. Make a new sub-section on resampling with PeriodIndex (and put your bug fixes there, with a small example which shows this change).

As an aside, I think convention here is really referring to something completely different. IOW, it has to do with whether to take the start or end of the original series as the counting point for the freq, NOT the period resampling, which is independent of that.

Contributor Author:

Unless you strongly object, I'm going to add an issue on this and follow up; I'm already a bit overextended on this and don't want to leave the stuff completed so far hanging around.

Contributor:

No, that's fine.

end = ax[-1].asfreq(self.freq, how='end')
Contributor:

hmm, should this also be self.convention?

Contributor Author:

If we treat a PeriodIndex as an interval-like index, I think this should always be how='start' and how='end' for start & end respectively. Being able to specify these seems like misapplying the logic from DatetimeIndex.

So if you're upsampling, the whole super-period at the start would be included, and if you're downsampling, every sub-period would be included.

Making that change makes only this test fail. I can't see the logic behind that test's expected output: include only the final point of the first year?

If people agree I'd be up for doing that, but it probably counts as an API breakage. The failing test's output:

Series length are different
[left]:  2191, PeriodIndex(['1990-01-01', '1990-01-02', '1990-01-03', '1990-01-04',
             '1990-01-05', '1990-01-06', '1990-01-07', '1990-01-08',
             '1990-01-09', '1990-01-10',
             ...
             '1995-12-22', '1995-12-23', '1995-12-24', '1995-12-25',
             '1995-12-26', '1995-12-27', '1995-12-28', '1995-12-29',
             '1995-12-30', '1995-12-31'],
            dtype='int64', length=2191, freq='D')
[right]: 1827, PeriodIndex(['1990-12-31', '1991-01-01', '1991-01-02', '1991-01-03',
             '1991-01-04', '1991-01-05', '1991-01-06', '1991-01-07',
             '1991-01-08', '1991-01-09',
             ...
             '1995-12-22', '1995-12-23', '1995-12-24', '1995-12-25',
             '1995-12-26', '1995-12-27', '1995-12-28', '1995-12-29',
             '1995-12-30', '1995-12-31'],
            dtype='int64', length=1827, freq='D')

Contributor:

here is that test in master. What exactly is the problem?

In [10]: p = pd.period_range('1/1/1990', '6/30/1995', freq='M')

In [11]: ts = Series(range(len(p)), index=p)

In [12]: ts
Out[12]: 
1990-01     0
1990-02     1
1990-03     2
1990-04     3
1990-05     4
           ..
1995-02    61
1995-03    62
1995-04    63
1995-05    64
1995-06    65
Freq: M, dtype: int64

In [13]: ts.resample('a-dec').mean()
Out[13]: 
1990     5.5
1991    17.5
1992    29.5
1993    41.5
1994    53.5
1995    62.5
Freq: A-DEC, dtype: float64

In [14]: result = ts.resample('a-dec').mean()

In [15]: result.resample('D', convention='end').ffill()
Out[15]: 
1990-12-31     5.5
1991-01-01     5.5
1991-01-02     5.5
1991-01-03     5.5
1991-01-04     5.5
              ... 
1995-12-27    53.5
1995-12-28    53.5
1995-12-29    53.5
1995-12-30    53.5
1995-12-31    62.5
Freq: D, dtype: float64

In [16]: result.resample('D', convention='start').ffill()
Out[16]: 
1990-01-01     5.5
1990-01-02     5.5
1990-01-03     5.5
1990-01-04     5.5
1990-01-05     5.5
              ... 
1995-12-27    62.5
1995-12-28    62.5
1995-12-29    62.5
1995-12-30    62.5
1995-12-31    62.5
Freq: D, dtype: float64

Contributor Author:

> What exactly is the problem?

That it doesn't make conceptual sense. If we think about a Period as a span of time with a start and an end, then what is this example doing, in concept? It takes all the subperiods of the second period but only the final subperiod of the first period, with the value from the first period bleeding through to the second.
(It makes sense with point-in-time indexes, so I can see where it came from; it's just misapplied here.)

There are two operations that make sense with these interval-like items:

  1. Upsample: Get all the subperiods of each period and put some transformation of the values in them
  2. Downsample: Bin the values into a set of superperiods and aggregate them in some way
In [67]: pd.Series(range(2), pd.period_range('1/1/1990', periods=2, freq='M')).resample('D',convention='end').pad()
Out[67]: 
1990-01-31    0
1990-02-01    0
1990-02-02    0
1990-02-03    0
1990-02-04    0
1990-02-05    0
1990-02-06    0
1990-02-07    0
1990-02-08    0
1990-02-09    0
1990-02-10    0
1990-02-11    0
1990-02-12    0
1990-02-13    0
1990-02-14    0
1990-02-15    0
1990-02-16    0
1990-02-17    0
1990-02-18    0
1990-02-19    0
1990-02-20    0
1990-02-21    0
1990-02-22    0
1990-02-23    0
1990-02-24    0
1990-02-25    0
1990-02-26    0
1990-02-27    0
1990-02-28    1
Freq: D, dtype: int64

Contributor:

I disagree. Do this with .asfreq() rather than pad and it is clear that the snap dates when upsampling are either start/end depending on the convention.

In [25]: s = pd.Series(range(2), pd.period_range('1/1/1990', periods=2, freq='M'))

In [26]: s
Out[26]:
1990-01 0
1990-02 1
Freq: M, dtype: int64

In [27]: s.resample('5D', convention='start').asfreq()
Out[27]: 
1990-01-01    0.0
1990-01-06    NaN
1990-01-11    NaN
1990-01-16    NaN
1990-01-21    NaN
1990-01-26    NaN
1990-01-31    NaN
Freq: 5D, dtype: float64

In [28]: s.resample('5D', convention='end').asfreq()
Out[28]: 
1990-01-31    0.0
1990-02-05    NaN
1990-02-10    NaN
1990-02-15    NaN
1990-02-20    NaN
1990-02-25    NaN
Freq: 5D, dtype: float64

Contributor Author:

Agree in part: your example is valid because it's finding the edges of the Periods (and so casts to edges / a DatetimeIndex):

In [73]: s.resample('5D', convention='start').asfreq().index
Out[73]: 
DatetimeIndex(['1990-01-01', '1990-01-06', '1990-01-11', '1990-01-16',
               '1990-01-21', '1990-01-26', '1990-01-31'],
              dtype='datetime64[ns]', freq='5D')

...but that doesn't mean it's valid when converted back to Periods.

Rather than resample doing this through a convention kwarg and changing the type, I think that could be a method to get the edges. I realize this is leaking into the IntervalIndex discussion; that's because that's the concept that Periods are attempting to represent.

CC @shoyer
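
For illustration, a minimal sketch of what such an edges method could look like (the helper name period_edges is invented here and is not a pandas API; the to_timestamp calls are real, and the reprs shown are approximate):

import pandas as pd

def period_edges(pidx):
    # Hypothetical helper: expose each Period's interval edges directly,
    # rather than overloading resample's `convention` kwarg.
    return pidx.to_timestamp(how='start'), pidx.to_timestamp(how='end')

p = pd.period_range('1990-01', periods=2, freq='M')
starts, ends = period_edges(p)
# starts: DatetimeIndex(['1990-01-01', '1990-02-01'], freq='MS')
# ends:   the corresponding period-end timestamps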

Contributor:

Ahh, so this should actually be returning a PeriodIndex where freq is 5D; then the convention wouldn't matter, as we are not going to a DTI. Hmm, I think we need to be more explicit about returning PeriodIndex then, right?

Contributor Author:

> ahh, so this should actually be returning a PeriodIndex where freq is 5D, then the convention wouldn't matter as we are not going to a DTI.

👍

> I think we need to be more explicit about returning PeriodIndex then, right?

Yes, I would vote to only return PeriodIndex unless there's an explicit conversion. Rules [here](https://github.com//pull/12874#discussion_r59448140).

The additional feature is how to deal with bins that aren't strict subperiods or superperiods, e.g. resampling from monthly to weekly. That's a bigger question, I think, and probably not for this PR; see the sketch below.
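
As a rough illustration of that ambiguity (a sketch; the reprs are approximate and depend on the week anchoring):

import pandas as pd

# A week can straddle a month boundary, so weeks are neither strict
# subperiods nor superperiods of months; the binning is ambiguous.
m = pd.Period('2000-01', freq='M')
m.asfreq('W', how='start')  # Period('1999-12-27/2000-01-02', 'W-SUN'), starts in December
m.asfreq('W', how='end')    # Period('2000-01-31/2000-02-06', 'W-SUN'), ends in February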

values = period_range(start, end, freq=self.freq).values

return period_range(start, end, **ax_attrs)
return ax._shallow_copy(values, freq=self.freq)
Contributor Author:

Delegating the attribute consistency to _shallow_copy here.

Contributor:

Yes (of course be sure to add tests as needed; more are better!), and I know you have tests!

Contributor Author:

Ha!

This was actually tested in #12771; this is making it more panda-ntic.

Contributor:

I like panda-ntic! (usually I use pandonic)
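
A rough sketch of what the _shallow_copy-based construction does, mirroring the diff lines above (this uses the internal 0.18-era API, not a public interface):

import pandas as pd

ax = pd.period_range('2000-01', periods=3, freq='M', name='idx')

# Build the resampled values, then copy the original index's attributes
# (e.g. name), overriding only the frequency.
start = ax[0].asfreq('D', how='start')
end = ax[-1].asfreq('D', how='end')
values = pd.period_range(start, end, freq='D').values
new_index = ax._shallow_copy(values, freq='D')
# new_index is a daily PeriodIndex that still carries name='idx'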


def _downsample(self, how, **kwargs):
"""
@@ -705,7 +699,7 @@ def _downsample(self, how, **kwargs):

new_index = self._get_new_index()
if len(new_index) == 0:
return self._wrap_result(new_index)
return self._wrap_result(self._selected_obj.reindex(new_index))

# Start vs. end of period
memb = ax.asfreq(self.freq, how=self.convention)
@@ -718,6 +712,8 @@ def _downsample(self, how, **kwargs):
return self._groupby_and_aggregate(grouper, how)
elif is_superperiod(ax.freq, self.freq):
return self.asfreq()
elif ax.freq == self.freq:
return self.asfreq()

raise ValueError('Frequency {axfreq} cannot be '
'resampled to {freq}'.format(
@@ -743,23 +739,24 @@ def _upsample(self, method, limit=None):

ax = self.ax
obj = self.obj

new_index = self._get_new_index()
if len(new_index) == 0:
return self._wrap_result(new_index)

if not is_superperiod(ax.freq, self.freq):
return self.asfreq()
if len(new_index) == 0:
return self._wrap_result(self._selected_obj.reindex(new_index))

# Start vs. end of period
memb = ax.asfreq(self.freq, how=self.convention)

# Get the fill indexer
indexer = memb.get_indexer(new_index, method=method, limit=limit)
return self._wrap_result(_take_new_index(obj,
indexer,
new_index,
axis=self.axis))
return self._wrap_result(_take_new_index(
obj, indexer, new_index, axis=self.axis))

def _groupby_and_aggregate(self, grouper, how, *args, **kwargs):
Contributor:

Instead of doing this you can just change where this is called (IOW where size/count/nunique are actually defined) and just call ._downsample there.

Contributor Author:

For all resamplers? Or branch there so the PeriodIndexResampler hits _downsample and the others hit _groupby_and_aggregate?

Contributor:

Maybe we need to set the functions (count, size, nunique) specifically on PeriodIndexResampler. I think that might be more clear.

OR

create a new function, _aggregate_without_grouper, in DatetimeResampler (and override it in PeriodIndexResampler), and just set the functions to call this.

Not sure which will make the code more clear.

Contributor Author:

I'm not sure that makes it that much clearer. I also imagine a lot of this will go away when PeriodIndex becomes an IntervalIndex, so there's a high discount rate on any cleaning here.

Are you OK to leave as-is for the moment?

Contributor:

that's a while off

Contributor:

clean when you can, don't defer

Contributor Author:

I haven't made any changes on this PR.

Although I have added a bunch of tests that go through every resample method, as some compensation 😉.

Let me know if you really think we can't push this without redoing it; I can open another issue for it in the meantime.
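
For reference, a minimal sketch of the _aggregate_without_grouper alternative floated above, with heavily simplified class bodies (this is not the pandas source):

class DatetimeIndexResampler(object):
    def _groupby_and_aggregate(self, grouper, how, *args, **kwargs):
        ...  # existing grouped-aggregation path

    def _downsample(self, how, **kwargs):
        ...  # existing downsample path

    def _aggregate_without_grouper(self, how, **kwargs):
        # default: count/size/nunique go through the grouped path
        return self._groupby_and_aggregate(None, how, **kwargs)

class PeriodIndexResampler(DatetimeIndexResampler):
    def _aggregate_without_grouper(self, how, **kwargs):
        # periods have no separate grouper here; go straight to _downsample
        return self._downsample(how, **kwargs)

The code as merged (below) instead branches on grouper is None inside _groupby_and_aggregate: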

if grouper is None:
return self._downsample(how, **kwargs)
return super(PeriodIndexResampler, self)._groupby_and_aggregate(
grouper, how, *args, **kwargs)


class TimedeltaResampler(DatetimeIndexResampler):
97 changes: 61 additions & 36 deletions pandas/tseries/tests/test_resample.py
@@ -3,33 +3,34 @@
from datetime import datetime, timedelta
from functools import partial

from pandas.compat import range, lrange, zip, product, OrderedDict
import nose
import numpy as np

import pandas as pd
import pandas.tseries.offsets as offsets
import pandas.util.testing as tm
from pandas import (Series, DataFrame, Panel, Index, isnull,
notnull, Timestamp)

from pandas.compat import range, lrange, zip, product, OrderedDict
from pandas.core.base import SpecificationError
from pandas.core.common import ABCSeries, ABCDataFrame
from pandas.core.groupby import DataError
from pandas.tseries.frequencies import MONTHS, DAYS
from pandas.tseries.index import date_range
from pandas.tseries.tdi import timedelta_range
from pandas.tseries.offsets import Minute, BDay
from pandas.tseries.period import period_range, PeriodIndex, Period
from pandas.tseries.resample import (DatetimeIndex, TimeGrouper,
DatetimeIndexResampler)
from pandas.tseries.frequencies import MONTHS, DAYS
from pandas.core.common import ABCSeries, ABCDataFrame
from pandas.core.base import SpecificationError

import pandas.tseries.offsets as offsets
import pandas as pd

import nose

from pandas.tseries.tdi import timedelta_range
from pandas.util.testing import (assert_series_equal, assert_almost_equal,
assert_frame_equal)
import pandas.util.testing as tm

bday = BDay()
downsample_methods = ['min', 'max', 'first', 'last', 'sum', 'mean', 'sem',
'median', 'prod', 'ohlc']
upsample_methods = ['count', 'size']
series_methods = ['nunique']
resample_methods = downsample_methods + upsample_methods + series_methods


class TestResampleAPI(tm.TestCase):
@@ -95,12 +96,13 @@ def test_api_changes_v018(self):
self.assertRaises(ValueError, lambda: r.iat[0])
self.assertRaises(ValueError, lambda: r.ix[0])
self.assertRaises(ValueError, lambda: r.loc[
Timestamp('2013-01-01 00:00:00', offset='H')])
Timestamp('2013-01-01 00:00:00', offset='H')])
self.assertRaises(ValueError, lambda: r.at[
Timestamp('2013-01-01 00:00:00', offset='H')])
Timestamp('2013-01-01 00:00:00', offset='H')])

def f():
r[0] = 5

self.assertRaises(ValueError, f)

# str/repr
@@ -144,7 +146,6 @@ def f():

# comparison ops
for op in ['__lt__', '__le__', '__gt__', '__ge__', '__eq__', '__ne__']:

r = self.series.resample('H')

with tm.assert_produces_warning(FutureWarning,
@@ -259,6 +260,7 @@ def test_attribute_access(self):
# setting
def f():
r.F = 'bah'

self.assertRaises(ValueError, f)

def test_api_compat_before_use(self):
@@ -509,10 +511,10 @@ def test_agg_misc(self):
# errors
# invalid names in the agg specification
for t in [r, g]:

def f():
r[['A']].agg({'A': ['sum', 'std'],
'B': ['mean', 'std']})

self.assertRaises(SpecificationError, f)

def test_agg_nested_dicts(self):
@@ -679,7 +681,7 @@ def _ohlc(group):
assert_series_equal(result, expected)
except BaseException as exc:

exc.args += ('how=%s' % arg, )
exc.args += ('how=%s' % arg,)
raise

def test_resample_how_callables(self):
@@ -692,7 +694,6 @@ def fn(x, a=1):
return str(type(x))

class fn_class:

def __call__(self, x):
return str(type(x))

@@ -768,7 +769,7 @@ def test_resample_rounding(self):

from pandas.compat import StringIO
df = pd.read_csv(StringIO(data), parse_dates={'timestamp': [
'date', 'time']}, index_col='timestamp')
'date', 'time']}, index_col='timestamp')
df.index.name = None
result = df.resample('6s').sum()
expected = DataFrame({'value': [
@@ -1061,10 +1062,10 @@ def test_resample_ohlc_dataframe(self):

df.columns = [['a', 'b'], ['c', 'd']]
res = df.resample('H').ohlc()
exp.columns = pd.MultiIndex.from_tuples([('a', 'c', 'open'), (
'a', 'c', 'high'), ('a', 'c', 'low'), ('a', 'c', 'close'), (
'b', 'd', 'open'), ('b', 'd', 'high'), ('b', 'd', 'low'), (
'b', 'd', 'close')])
exp.columns = pd.MultiIndex.from_tuples([
('a', 'c', 'open'), ('a', 'c', 'high'), ('a', 'c', 'low'),
('a', 'c', 'close'), ('b', 'd', 'open'), ('b', 'd', 'high'),
('b', 'd', 'low'), ('b', 'd', 'close')])
assert_frame_equal(exp, res)

# dupe columns fail atm
@@ -1449,11 +1450,12 @@ def test_resample_anchored_multiday(self):
#
# See: https://github.com/pydata/pandas/issues/8683

s = pd.Series(np.random.randn(5),
index=pd.date_range('2014-10-14 23:06:23.206',
periods=3, freq='400L') |
pd.date_range('2014-10-15 23:00:00',
periods=2, freq='2200L'))
index = pd.date_range(
'2014-10-14 23:06:23.206', periods=3, freq='400L'
) | pd.date_range(
'2014-10-15 23:00:00', periods=2, freq='2200L')

s = pd.Series(np.random.randn(5), index=index)

# Ensure left closing works
result = s.resample('2200L').mean()
@@ -1763,7 +1765,6 @@ def _simple_pts(start, end, freq='D'):


class TestResamplePeriodIndex(tm.TestCase):

_multiprocess_can_split_ = True

def test_annual_upsample_D_s_f(self):
@@ -1907,16 +1908,40 @@ def test_resample_basic(self):

def test_resample_empty(self):

# GH12771
# GH12771 & GH12868
index = PeriodIndex(start='2000', periods=0, freq='D', name='idx')
s = Series(index=index)
result = s.resample('M').sum()

# after GH12774 is resolved, this should be a PeriodIndex
expected_index = DatetimeIndex([], name='idx')
expected_index = PeriodIndex([], name='idx', freq='M')
expected = Series(index=expected_index)

for method in resample_methods:
result = getattr(s.resample('M'), method)()
assert_series_equal(result, expected)

def test_resample_count(self):

# GH12774
series = pd.Series(1, index=pd.period_range(start='2000',
periods=100))
result = series.resample('M').count()

expected_index = pd.period_range(start='2000', freq='M', periods=4)
expected = pd.Series([31, 29, 31, 9], index=expected_index)

assert_series_equal(result, expected)

def test_resample_same_freq(self):

Contributor:

same

Contributor Author:

done

# GH12770
series = pd.Series(range(3), index=pd.period_range(
start='2000', periods=3, freq='M'))
expected = series

for method in resample_methods:
result = getattr(series.resample('M'), method)()
assert_series_equal(result, expected)

def test_with_local_timezone_pytz(self):
# GH5430
tm._skip_if_no_pytz()
@@ -2493,8 +2518,8 @@ def test_aggregate_with_nat(self):
# GH 9925
self.assertEqual(dt_result.index.name, 'key')

# if NaT is included, 'var', 'std', 'mean', 'first','last' and 'nth'
# doesn't work yet
# if NaT is included, 'var', 'std', 'mean', 'first','last'
# and 'nth' doesn't work yet


if __name__ == '__main__':