ENH: Custom date boundaries for resamplers? #44712

manoelpqueiroz · 2021-12-01T22:22:18Z

Is your feature request related to a problem?

No, not in terms of pandas' performance or exectution.

Over this year I have finally grasped the true power of pandas and allowed me to go deep into the library's differentials. One particular behaviour that helps a lot is resampling time-series with such an easy-to-use API (people who come from Excel will also point out how atrocious dealing with dates is with spreadsheets).

However, one thing that would make it easier for data analysts was to allow the resampler to receive sepcific boundaries to create the resampled output, either applied to the DataFrame overall or within a GroupBy operation.

Describe the solution you'd like

Examples using the current API

Let's consider a DataFrame with a datetime column, some values and a discrete variable column:

import pandas as pd

df = pd.DataFrame({
    'class':{
        0:'beta', 1:'beta', 2:'beta', 3:'beta', 4:'gaga', 5:'beta', 6:'alpha',
        7:'gaga', 8:'alpha', 9:'gaga', 10:'gaga', 11:'beta', 12:'alpha',
        13:'alpha', 14:'beta'},
    'a':{
        0:74, 1:34, 2:89, 3:32, 4:44, 5:49, 6:52, 7:28, 8:19, 9:95, 10:78,
        11:64, 12:68, 13:76, 14:40},
    'b':{
        0:53, 1:25, 2:34, 3:66, 4:55, 5:72, 6:64, 7:30, 8:63, 9:48, 10:46,
        11:54, 12:41, 13:11, 14:35},
    'date':{
        0:pd.Timestamp('2021-09-07 00:00:00'),
        1:pd.Timestamp('2021-06-06 00:00:00'),
        2:pd.Timestamp('2021-09-16 00:00:00'),
        3:pd.Timestamp('2021-05-26 00:00:00'),
        4:pd.Timestamp('2021-11-12 00:00:00'),
        5:pd.Timestamp('2021-07-15 00:00:00'),
        6:pd.Timestamp('2021-11-14 00:00:00'),
        7:pd.Timestamp('2021-09-29 00:00:00'),
        8:pd.Timestamp('2021-10-27 00:00:00'),
        9:pd.Timestamp('2021-05-24 00:00:00'),
        10:pd.Timestamp('2021-06-09 00:00:00'),
        11:pd.Timestamp('2021-05-22 00:00:00'),
        12:pd.Timestamp('2021-05-19 00:00:00'),
        13:pd.Timestamp('2021-04-01 00:00:00'),
        14:pd.Timestamp('2021-06-26 00:00:00')}})

It's easy to see what this feature would be if we resample by month end:

>>> df.set_index('date').resample('M').sum()
              a    b
date                
2021-04-30   76   11
2021-05-31  259  209
2021-06-30  152  106
2021-07-31   49   72
2021-08-31    0    0
2021-09-30  191  117
2021-10-31   19   63
2021-11-30   96  119

What is really handy from resamplers is that they automatically fill the missing months (e.g., August above), which a SQL GROUP BY or even pandas' GroupBy objects wouldn't do.

However, in many cases when dealing with real data, we want to resample to get the picture over time of a higher order (e.g., if we're resampling by month, we would actually want to see an entire year), so we could get something like:

              a    b
2021-01-31    0    0
2021-02-28    0    0
2021-03-31    0    0
2021-04-30   76   11
2021-05-31  259  209
2021-06-30  152  106
2021-07-31   49   72
2021-08-31    0    0
2021-09-30  191  117
2021-10-31   19   63
2021-11-30   96  119
2021-12-31    0    0

When applying this to a whole DataFrame, that becomes easy with a date_range and reindex:

>>> dr = pd.date_range('20210101', '20211231', freq='M')
>>> rs = df.set_index('date').resample('M').sum().reindex(dr, fill_value=0)

But this becomes a bit more cumbersome when mixing resample with groupby. In the example, we would get this:

>>> df.set_index('date').groupby('class').resample('M').sum()
                    a    b
class date                
alpha 2021-04-30   76   11
      2021-05-31   68   41
      2021-06-30    0    0
      2021-07-31    0    0
      2021-08-31    0    0
      2021-09-30    0    0
      2021-10-31   19   63
      2021-11-30   52   64
beta  2021-05-31   96  120
      2021-06-30   74   60
      2021-07-31   49   72
      2021-08-31    0    0
      2021-09-30  163   87
gaga  2021-05-31   95   48
      2021-06-30   78   46
      2021-07-31    0    0
      2021-08-31    0    0
      2021-09-30   28   30
      2021-10-31    0    0
      2021-11-30   44   55

Using the same rule, we should get a way to have this as an output:

                    a    b
class date                
alpha 2021-01-31    0    0
      2021-02-28    0    0
      2021-03-31    0    0
      2021-04-30   76   11
      2021-05-31   68   41
      2021-06-30    0    0
      2021-07-31    0    0
      2021-08-31    0    0
      2021-09-30    0    0
      2021-10-31   19   63
      2021-11-30   52   64
      2021-12-31    0    0
beta  2021-01-31    0    0
      2021-02-28    0    0
      2021-03-31    0    0
      2021-04-30    0    0
      2021-05-31   96  120
      2021-06-30   74   60
      2021-07-31   49   72
      2021-08-31    0    0
      2021-09-30  163   87
      2021-10-31    0    0
      2021-11-30    0    0
      2021-12-31    0    0
gaga  2021-01-31    0    0
      2021-02-28    0    0
      2021-03-31    0    0
      2021-04-30    0    0
      2021-05-31   95   48
      2021-06-30   78   46
      2021-07-31    0    0
      2021-08-31    0    0
      2021-09-30   28   30
      2021-10-31    0    0
      2021-11-30   44   55
      2021-12-31    0    0

Again, this is doable by itself with reindex, but in this case it requires a bit more of code:

>>> grs = df.set_index('date').groupby('class').resample('M').sum()
>>> mi = pd.MultiIndex.from_product([grs.index.levels[0], dr], names=grs.index.names)
>>> grsf = grs.reindex(mi, fill_value=0)

One caveat: the above solution works because there was only one aggregation level besides the date. Should one be willing to use more, we would have more trouble, as each level of the MultiIndex should be specified individually to the list of the from_product method.

How this solution would work?

I though that new optional parameters could be supplied when resampling the data so the user can provide pandas a custom start date for the period and/or a custom end date:

>>> df.set_index('date').resample('M', start='20210101', end='20211231').sum()
              a    b
2021-01-31    0    0
2021-02-28    0    0
2021-03-31    0    0
2021-04-30   76   11
2021-05-31  259  209
2021-06-30  152  106
2021-07-31   49   72
2021-08-31    0    0
2021-09-30  191  117
2021-10-31   19   63
2021-11-30   96  119
2021-12-31    0    0

I first though about adapting the origin parameter to be able to receive a string or Timestamp object, but that would leave the end date without a parameter.

API breaking implications

My suggestion relates to implementing two new parameters to the resample method of DataFrames (start and end), but I'm not sure if allowing the user to set very specific dates for beginning and end of a resampler would make it more innefficient. After all, one could go wild and unlike me (who only asked for every month in 2021) set the beginning to 1970 and end to 2100 and beyond.

Describe alternatives you've considered

One could use the API as is along with date_range and reindex to get the same result of this proposed new feature, or perhaps create a custom function to handle the necessary steps, but in my opinion having this built in the method makes it even more expansive that it already is (again, resample is fantastic and saves a lot of work compared to doing time-series analysis in something like Excel).

Additional context

This feature could actually prove useful in daily usage of pandas

I've come up with this suggestion because I messed up a rather basic analysis at work and got me thinking that it might be more common than one might guess:

Have a dataset with each individual invoice made by the company, along with its date, client and value (e.g., dollars);
Compute monthly mean by each client;
Those clients below a threshold of 1 should be directed to another department;
Used groupby, resample, sum to get monthly invoicing;
Then used groupby and mean to get final data, got a small number of clients below the threshold (say, 5);
However, say a client had invoicing only on August (value of 3) and September (value of 2). Using resample gives a mean of 2.5 (i.e., above the threshold), but considering the mean over twelve months of a year, this client's value should be 0.42 (i.e., below the threshold);
Naturally, what originally were only 5 clients quickly turned into 100 by adjusting the code to address the entire year 😮

The text was updated successfully, but these errors were encountered:

jreback · 2021-12-02T03:31:13Z

this is defacto supporting cartesian product for MultiIndez via reindex which we sort of support (meaning u can do it but it's not super pretty)

i would rather find an idiomatic way to reindex on levels than add arguments to resample (as it has already too many)

jreback · 2021-12-02T03:36:08Z

pretty much #7895

ms7463 · 2021-12-15T07:12:41Z

A potentially cleaner existing approach in the meantime could look like this.

(
    df.set_index('date').groupby('class')
    .apply(lambda dfx:
        dfx.reindex(pd.date_range('2021-01-01', '2021-12-31')).resample('M').sum()
    )
)

although as it is above it wouldn't handle duplicate dates within a group

Edit: or this one which does handle it

(
    df.set_index('date').groupby('class')
    .apply(lambda dfx:
        dfx.resample('M').sum().reindex(pd.date_range('2021-01-01', '2021-12-31', freq='M'))
    ).fillna(0)
)

Edit2: or could use this code that I suggested in the PR linked above (#7895 (comment)):

grs = df.set_index('date').groupby('class').resample('M').sum()
mi_reindex(grs, pd.date_range('20210101', '20211231', freq='M', name='date')).fillna(0)

mroeschke · 2021-12-27T19:50:29Z

Since it appears that #7895 is the preferred root approach to this request, closing in favor of that issue.

manoelpqueiroz added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 1, 2021

mroeschke closed this as completed Dec 27, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Custom date boundaries for resamplers? #44712

ENH: Custom date boundaries for resamplers? #44712

manoelpqueiroz commented Dec 1, 2021

jreback commented Dec 2, 2021

jreback commented Dec 2, 2021

ms7463 commented Dec 15, 2021 •

edited

Loading

mroeschke commented Dec 27, 2021

ENH: Custom date boundaries for resamplers? #44712

ENH: Custom date boundaries for resamplers? #44712

Comments

manoelpqueiroz commented Dec 1, 2021

Is your feature request related to a problem?

Describe the solution you'd like

Examples using the current API

How this solution would work?

API breaking implications

Describe alternatives you've considered

Additional context

This feature could actually prove useful in daily usage of pandas

jreback commented Dec 2, 2021

jreback commented Dec 2, 2021

ms7463 commented Dec 15, 2021 • edited Loading

mroeschke commented Dec 27, 2021

ms7463 commented Dec 15, 2021 •

edited

Loading