Cannot aggregate by mean when using PeriodIndex and high-frequency series does not cross between bins #2070

Closed
abielr opened this issue Oct 15, 2012 · 3 comments
Labels
Bug Datetime Datetime data dtype

Comments

abielr commented Oct 15, 2012

When downsampling a Series indexed by a PeriodIndex, resample() fails with how='mean' if the series to be resampled does not span more than one lower-frequency bin.

For example:

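import numpy as np
from pandas import Series, period_range

# Twelve monthly periods, all inside the single annual bin 2012
ix = period_range(start="2012-01-01", end="2012-12-31", freq="M")
s = Series(np.random.randn(len(ix)), index=ix)
s.resample("A", how='mean')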

This fails because the period range is entirely contained within a single year. I've been able to replicate it going from quarterly to annual, monthly to quarterly, etc. As of 0.9.1-dev, Python crashes without raising an exception, because the Cython function group_mean_bin() attempts to index into an empty bins array:

if bins[len(bins) - 1] == len(values): # Crash
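To illustrate the failure mode, here is a rough pure-NumPy sketch of a bin-wise mean that guards against an empty bins array. The function name bin_means and the convention that each entry of bins is the end position of a bin are assumptions made for this example; this is not the actual Cython code in src/groupby.pyx.

```python
import numpy as np

def bin_means(values, bins):
    """Mean of values[start:stop] for consecutive bin end positions.

    Hypothetical helper for illustration only; not pandas' group_mean_bin.
    """
    values = np.asarray(values, dtype=float)
    bins = np.asarray(bins, dtype=int)

    # Guard: check for an empty bins array BEFORE looking at bins[-1].
    # If there are no edges, or the last edge stops short of the data,
    # the remaining values form one trailing group.
    if len(bins) == 0 or bins[-1] != len(values):
        edges = np.append(bins, len(values))
    else:
        edges = bins

    out = np.full(len(edges), np.nan)
    start = 0
    for i, stop in enumerate(edges):
        if stop > start:  # leave NaN for empty groups
            out[i] = values[start:stop].mean()
        start = stop
    return out

single_bin = bin_means([1.0, 2.0, 3.0], [])        # all values in one group
two_bins = bin_means([1.0, 2.0, 3.0, 4.0], [2, 4])  # two full groups
```

With an empty bins array the sketch returns a single mean instead of reading past the start of the array, which is the situation the monthly-to-annual example runs into.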

I don't know how fine-grained pandas is right now when aggregating partially filled periods, but it would be nice to have an option to return NaN when the higher-frequency window is only partially filled. For example, suppose we sum daily data to monthly and take a percent change across months, and either the recording started partway through the first month or data is only available partway through the last month. The percent change for the first or last period may then show a dramatic swing, and the user may not realize it's simply an artifact of data availability rather than a truly interesting move in the underlying process. When running a lot of automated aggregations, the user may wish to skip any partially filled periods to avoid reaching a false conclusion about the time-series trend at the beginning or end of the series.
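A minimal sketch of the behaviour requested here, written against a recent pandas rather than 0.9.x: sum daily data by month and return NaN for any month that does not have one observation per calendar day. The helper name sum_complete_months is made up for this example, and "complete" is assumed to mean that the non-null count equals the number of days in the month.

```python
import numpy as np
import pandas as pd

def sum_complete_months(daily):
    """Monthly sums, with NaN for months that are only partially filled.

    Hypothetical helper: assumes `daily` has a DatetimeIndex with at most
    one observation per calendar day.
    """
    months = daily.index.to_period("M")
    grouped = daily.groupby(months)
    totals = grouped.sum()
    counts = grouped.count()
    # A month is complete when it has one value per calendar day.
    full = counts.to_numpy() == totals.index.days_in_month.to_numpy()
    return totals.where(full)

# Recording starts partway through January 2012.
idx = pd.date_range("2012-01-15", "2012-03-31", freq="D")
daily = pd.Series(1.0, index=idx)
monthly = sum_complete_months(daily)
```

Here January is masked because only 17 of its 31 days are present, so a percent change computed on monthly would not show a spurious jump from January to February.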

Author

abielr commented Oct 16, 2012

This issue seems resolvable by uncommenting the Cython decorators at the top of the function group_mean_bin in src/groupby.pyx:

@cython.boundscheck(False)

This was the only place where I could see these commented out; I didn't know whether it was just an oversight or whether some testing was in progress at some point in the past.

Author

abielr commented Oct 16, 2012

I noticed that when I install the pre-built binaries for Python 2.7 on Windows I don't run into quite the same error; in that case Python doesn't crash, but the call to resample() nonetheless returns a NaN. I built the dev version using MinGW, after which I noticed unexpected failures popping up elsewhere, so the compilation may not have been good. I will also try building with Visual C++. In any case, the original problem of taking the mean when the data is all within one period is still outstanding.
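For reference, the result one would expect from the example is a single annual value equal to the mean of the twelve monthly values. The sketch below checks that expectation without going through resample(), by grouping on the year of each period; it assumes a recent pandas where PeriodIndex exposes a year attribute, and it is a cross-check only, not the fix that was later committed.

```python
import numpy as np
import pandas as pd

# Same data shape as the example: twelve monthly periods inside one year.
ix = pd.period_range(start="2012-01-01", end="2012-12-31", freq="M")
s = pd.Series(np.random.randn(len(ix)), index=ix)

# Group by calendar year instead of resampling; every value lands in 2012.
annual = s.groupby(s.index.year).mean()
```

The result has one row, labelled 2012, whose value matches s.mean() rather than NaN.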

Member

wesm commented Jan 19, 2013

thanks i will have a look

ghost assigned wesm Jan 21, 2013
wesm closed this as completed in 7c6e30a Jan 21, 2013
yarikoptic added a commit to neurodebian/pandas that referenced this issue Jan 23, 2013
Version 0.10.1

* tag 'v0.10.1': (195 commits)
  RLS: set released to true
  RLS: Version 0.10.1
  TST: skip problematic xlrd test
  Merging in MySQL support pandas-dev#2482
  Revert "Merging in MySQL support pandas-dev#2482"
  BUG: don't let np.prod overflow int64
  RLS: note changed return type in DatetimeIndex.unique
  RLS: more what's new for 0.10.1
  RLS: some what's new for 0.10.1
  API: restore inplace=TRue returns self, add FutureWarnings. re pandas-dev#1893
  Merging in MySQL support pandas-dev#2482
  BUG: fix python 3 dtype issue
  DOC: fix what's new 0.10 doc bug re pandas-dev#2651
  BUG: fix C parser thread safety. verify gil release close pandas-dev#2608
  BUG: usecols bug with implicit first index column. close pandas-dev#2654
  BUG: plotting bug when base is nonzero pandas-dev#2571
  BUG: period resampling bug when all values fall into a single bin. close pandas-dev#2070
  BUG: fix memory error in sortlevel when many multiindex levels. close pandas-dev#2684
  STY: CRLF
  BUG: perf_HEAD reports wrong vbench name when an exception is raised
  ...