Skip to content

ENH: Custom date boundaries for resamplers? #44712

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
manoelpqueiroz opened this issue Dec 1, 2021 · 4 comments
Closed

ENH: Custom date boundaries for resamplers? #44712

manoelpqueiroz opened this issue Dec 1, 2021 · 4 comments
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@manoelpqueiroz
Copy link

Is your feature request related to a problem?

No, not in terms of pandas' performance or exectution.

Over this year I have finally grasped the true power of pandas and allowed me to go deep into the library's differentials. One particular behaviour that helps a lot is resampling time-series with such an easy-to-use API (people who come from Excel will also point out how atrocious dealing with dates is with spreadsheets).

However, one thing that would make it easier for data analysts was to allow the resampler to receive sepcific boundaries to create the resampled output, either applied to the DataFrame overall or within a GroupBy operation.

Describe the solution you'd like

Examples using the current API

Let's consider a DataFrame with a datetime column, some values and a discrete variable column:

import pandas as pd

df = pd.DataFrame({
    'class':{
        0:'beta', 1:'beta', 2:'beta', 3:'beta', 4:'gaga', 5:'beta', 6:'alpha',
        7:'gaga', 8:'alpha', 9:'gaga', 10:'gaga', 11:'beta', 12:'alpha',
        13:'alpha', 14:'beta'},
    'a':{
        0:74, 1:34, 2:89, 3:32, 4:44, 5:49, 6:52, 7:28, 8:19, 9:95, 10:78,
        11:64, 12:68, 13:76, 14:40},
    'b':{
        0:53, 1:25, 2:34, 3:66, 4:55, 5:72, 6:64, 7:30, 8:63, 9:48, 10:46,
        11:54, 12:41, 13:11, 14:35},
    'date':{
        0:pd.Timestamp('2021-09-07 00:00:00'),
        1:pd.Timestamp('2021-06-06 00:00:00'),
        2:pd.Timestamp('2021-09-16 00:00:00'),
        3:pd.Timestamp('2021-05-26 00:00:00'),
        4:pd.Timestamp('2021-11-12 00:00:00'),
        5:pd.Timestamp('2021-07-15 00:00:00'),
        6:pd.Timestamp('2021-11-14 00:00:00'),
        7:pd.Timestamp('2021-09-29 00:00:00'),
        8:pd.Timestamp('2021-10-27 00:00:00'),
        9:pd.Timestamp('2021-05-24 00:00:00'),
        10:pd.Timestamp('2021-06-09 00:00:00'),
        11:pd.Timestamp('2021-05-22 00:00:00'),
        12:pd.Timestamp('2021-05-19 00:00:00'),
        13:pd.Timestamp('2021-04-01 00:00:00'),
        14:pd.Timestamp('2021-06-26 00:00:00')}})

It's easy to see what this feature would be if we resample by month end:

>>> df.set_index('date').resample('M').sum()
              a    b
date                
2021-04-30   76   11
2021-05-31  259  209
2021-06-30  152  106
2021-07-31   49   72
2021-08-31    0    0
2021-09-30  191  117
2021-10-31   19   63
2021-11-30   96  119

What is really handy from resamplers is that they automatically fill the missing months (e.g., August above), which a SQL GROUP BY or even pandas' GroupBy objects wouldn't do.

However, in many cases when dealing with real data, we want to resample to get the picture over time of a higher order (e.g., if we're resampling by month, we would actually want to see an entire year), so we could get something like:

              a    b
2021-01-31    0    0
2021-02-28    0    0
2021-03-31    0    0
2021-04-30   76   11
2021-05-31  259  209
2021-06-30  152  106
2021-07-31   49   72
2021-08-31    0    0
2021-09-30  191  117
2021-10-31   19   63
2021-11-30   96  119
2021-12-31    0    0

When applying this to a whole DataFrame, that becomes easy with a date_range and reindex:

>>> dr = pd.date_range('20210101', '20211231', freq='M')
>>> rs = df.set_index('date').resample('M').sum().reindex(dr, fill_value=0)

But this becomes a bit more cumbersome when mixing resample with groupby. In the example, we would get this:

>>> df.set_index('date').groupby('class').resample('M').sum()
                    a    b
class date                
alpha 2021-04-30   76   11
      2021-05-31   68   41
      2021-06-30    0    0
      2021-07-31    0    0
      2021-08-31    0    0
      2021-09-30    0    0
      2021-10-31   19   63
      2021-11-30   52   64
beta  2021-05-31   96  120
      2021-06-30   74   60
      2021-07-31   49   72
      2021-08-31    0    0
      2021-09-30  163   87
gaga  2021-05-31   95   48
      2021-06-30   78   46
      2021-07-31    0    0
      2021-08-31    0    0
      2021-09-30   28   30
      2021-10-31    0    0
      2021-11-30   44   55

Using the same rule, we should get a way to have this as an output:

                    a    b
class date                
alpha 2021-01-31    0    0
      2021-02-28    0    0
      2021-03-31    0    0
      2021-04-30   76   11
      2021-05-31   68   41
      2021-06-30    0    0
      2021-07-31    0    0
      2021-08-31    0    0
      2021-09-30    0    0
      2021-10-31   19   63
      2021-11-30   52   64
      2021-12-31    0    0
beta  2021-01-31    0    0
      2021-02-28    0    0
      2021-03-31    0    0
      2021-04-30    0    0
      2021-05-31   96  120
      2021-06-30   74   60
      2021-07-31   49   72
      2021-08-31    0    0
      2021-09-30  163   87
      2021-10-31    0    0
      2021-11-30    0    0
      2021-12-31    0    0
gaga  2021-01-31    0    0
      2021-02-28    0    0
      2021-03-31    0    0
      2021-04-30    0    0
      2021-05-31   95   48
      2021-06-30   78   46
      2021-07-31    0    0
      2021-08-31    0    0
      2021-09-30   28   30
      2021-10-31    0    0
      2021-11-30   44   55
      2021-12-31    0    0

Again, this is doable by itself with reindex, but in this case it requires a bit more of code:

>>> grs = df.set_index('date').groupby('class').resample('M').sum()
>>> mi = pd.MultiIndex.from_product([grs.index.levels[0], dr], names=grs.index.names)
>>> grsf = grs.reindex(mi, fill_value=0)

One caveat: the above solution works because there was only one aggregation level besides the date. Should one be willing to use more, we would have more trouble, as each level of the MultiIndex should be specified individually to the list of the from_product method.

How this solution would work?

I though that new optional parameters could be supplied when resampling the data so the user can provide pandas a custom start date for the period and/or a custom end date:

>>> df.set_index('date').resample('M', start='20210101', end='20211231').sum()
              a    b
2021-01-31    0    0
2021-02-28    0    0
2021-03-31    0    0
2021-04-30   76   11
2021-05-31  259  209
2021-06-30  152  106
2021-07-31   49   72
2021-08-31    0    0
2021-09-30  191  117
2021-10-31   19   63
2021-11-30   96  119
2021-12-31    0    0

I first though about adapting the origin parameter to be able to receive a string or Timestamp object, but that would leave the end date without a parameter.

API breaking implications

My suggestion relates to implementing two new parameters to the resample method of DataFrames (start and end), but I'm not sure if allowing the user to set very specific dates for beginning and end of a resampler would make it more innefficient. After all, one could go wild and unlike me (who only asked for every month in 2021) set the beginning to 1970 and end to 2100 and beyond.

Describe alternatives you've considered

One could use the API as is along with date_range and reindex to get the same result of this proposed new feature, or perhaps create a custom function to handle the necessary steps, but in my opinion having this built in the method makes it even more expansive that it already is (again, resample is fantastic and saves a lot of work compared to doing time-series analysis in something like Excel).

Additional context

This feature could actually prove useful in daily usage of pandas

I've come up with this suggestion because I messed up a rather basic analysis at work and got me thinking that it might be more common than one might guess:

  • Have a dataset with each individual invoice made by the company, along with its date, client and value (e.g., dollars);
  • Compute monthly mean by each client;
  • Those clients below a threshold of 1 should be directed to another department;
  • Used groupby, resample, sum to get monthly invoicing;
  • Then used groupby and mean to get final data, got a small number of clients below the threshold (say, 5);
  • However, say a client had invoicing only on August (value of 3) and September (value of 2). Using resample gives a mean of 2.5 (i.e., above the threshold), but considering the mean over twelve months of a year, this client's value should be 0.42 (i.e., below the threshold);
  • Naturally, what originally were only 5 clients quickly turned into 100 by adjusting the code to address the entire year 😮
@manoelpqueiroz manoelpqueiroz added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 1, 2021
@jreback
Copy link
Contributor

jreback commented Dec 2, 2021

this is defacto supporting cartesian product for MultiIndez via reindex which we sort of support (meaning u can do it but it's not super pretty)

i would rather find an idiomatic way to reindex on levels than add arguments to resample (as it has already too many)

@jreback
Copy link
Contributor

jreback commented Dec 2, 2021

pretty much #7895

@ms7463
Copy link
Contributor

ms7463 commented Dec 15, 2021

A potentially cleaner existing approach in the meantime could look like this.

(
    df.set_index('date').groupby('class')
    .apply(lambda dfx:
        dfx.reindex(pd.date_range('2021-01-01', '2021-12-31')).resample('M').sum()
    )
)

although as it is above it wouldn't handle duplicate dates within a group

Edit: or this one which does handle it

(
    df.set_index('date').groupby('class')
    .apply(lambda dfx:
        dfx.resample('M').sum().reindex(pd.date_range('2021-01-01', '2021-12-31', freq='M'))
    ).fillna(0)
)

Edit2: or could use this code that I suggested in the PR linked above (#7895 (comment)):

grs = df.set_index('date').groupby('class').resample('M').sum()
mi_reindex(grs, pd.date_range('20210101', '20211231', freq='M', name='date')).fillna(0)

@mroeschke
Copy link
Member

Since it appears that #7895 is the preferred root approach to this request, closing in favor of that issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member
Projects
None yet
Development

No branches or pull requests

4 participants