Is your feature request related to a problem?
No, not in terms of pandas' performance or execution.
Over this year I have finally grasped the true power of pandas, which allowed me to go deep into what sets the library apart. One particular behaviour that helps a lot is resampling time series with such an easy-to-use API (people who come from Excel will also point out how atrocious dealing with dates is in spreadsheets).
However, one thing that would make life easier for data analysts would be to allow the resampler to receive specific boundaries for the resampled output, either applied to the DataFrame overall or within a GroupBy operation.
Describe the solution you'd like
Examples using the current API
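Let's consider a DataFrame with a datetime column, some values and a discrete variable column. It's easy to see what this feature would be if we resample by month end.
What is really handy about resamplers is that they automatically fill the missing months (e.g., August in this example), which a SQL GROUP BY or even pandas' GroupBy objects wouldn't do.
However, in many cases when dealing with real data, we want the resampled output to cover a complete higher-order period (e.g., if we're resampling by month, we would actually want to see the entire year). When applying this to a whole DataFrame, that becomes easy with date_range and reindex. It becomes a bit more cumbersome when mixing resample with groupby, where, using the same rule, we should have a way to get every month of the year for each group. Again, this is doable with reindex, but in this case it requires a bit more code.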
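As a minimal sketch of the kind of data described here, the block below builds a small DataFrame and resamples it by month. The column names and values are made up for illustration, and month-start bins ("MS") are used instead of month-end ones only so the snippet runs unchanged across pandas versions, since the month-end alias differs between releases.

```python
import pandas as pd

# Hypothetical data: values and group labels are invented for illustration.
# Note that there is no observation in August.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2021-06-03", "2021-06-20", "2021-07-11", "2021-09-05", "2021-10-17"]
        ),
        "value": [10, 5, 7, 3, 8],
        "group": ["a", "b", "a", "b", "a"],
    }
)

# Resample by month using the datetime column as the time axis.
monthly = df.resample("MS", on="date")["value"].sum()

# The result runs from June to October; August is present with a sum of 0.
print(monthly)
```

The output starts at the first observed month and stops at the last one, which is the behaviour this request is about.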
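The reindex workaround can be sketched as follows, first for the whole DataFrame and then per group. The data is the same made-up example (hypothetical column names and values), and the choice of month-start bins is an assumption made for portability rather than something taken from the original example.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2021-06-03", "2021-06-20", "2021-07-11", "2021-09-05", "2021-10-17"]
        ),
        "value": [10, 5, 7, 3, 8],
        "group": ["a", "b", "a", "b", "a"],
    }
)

# Every month start of 2021: the boundaries we actually want to see.
full_year = pd.date_range("2021-01-01", "2021-12-01", freq="MS")

# Whole DataFrame: resample, then reindex onto the full year.
monthly = df.resample("MS", on="date")["value"].sum()
monthly_full = monthly.reindex(full_year, fill_value=0)

# Per group: each group is resampled only between its own first and last date.
by_group = df.set_index("date").groupby("group")["value"].resample("MS").sum()

# Build the complete (group, month) grid and reindex onto it.
grid = pd.MultiIndex.from_product(
    [df["group"].unique(), full_year], names=["group", "date"]
)
by_group_full = by_group.reindex(grid, fill_value=0)

print(len(by_group), len(by_group_full))  # 9 24
```

The grouped case needs the extra MultiIndex step because a plain date_range no longer matches the (group, date) index.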
One caveat: the above solution works because there was only one aggregation level besides the date. Should one want to use more, we would have more trouble, as each level of the MultiIndex has to be specified individually in the list passed to the from_product method.
How would this solution work?
I thought that new optional parameters could be supplied when resampling the data so the user can provide pandas a custom start date for the period and/or a custom end date:
I first thought about adapting the origin parameter to be able to receive a string or Timestamp object, but that would leave the end date without a parameter.
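To illustrate the caveat, here is a rough sketch with two aggregation levels besides the date. The column names (region, product) and the values are hypothetical; the point is only that every level has to be listed explicitly when building the grid.

```python
import pandas as pd

# Hypothetical data with two grouping columns, invented for illustration.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-06-03", "2021-07-11", "2021-09-05"]),
        "region": ["north", "north", "south"],
        "product": ["x", "y", "x"],
        "value": [10, 7, 3],
    }
)

full_year = pd.date_range("2021-01-01", "2021-12-01", freq="MS")

# Monthly sums per (region, product) pair, limited to each pair's own dates.
by_keys = (
    df.set_index("date").groupby(["region", "product"])["value"].resample("MS").sum()
)

# Each level must be passed individually, in index order, to from_product.
grid = pd.MultiIndex.from_product(
    [df["region"].unique(), df["product"].unique(), full_year],
    names=["region", "product", "date"],
)
by_keys_full = by_keys.reindex(grid, fill_value=0)

print(len(by_keys_full))  # 48 = 2 regions x 2 products x 12 months
```

Note that from_product also creates combinations that never occur in the data (here south/y), which may or may not be what one wants.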
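Since start and end are only a proposal and do not exist in resample today, the sketch below emulates the idea with a small helper built from existing tools. The helper name resample_sum_between, its signature and the sample data are all hypothetical and only meant to show the intended result.

```python
import pandas as pd


def resample_sum_between(df, rule, on, value, start, end):
    """Sum `value` per `rule` bin, covering every bin from `start` to `end`.

    Hypothetical helper, not part of pandas: it emulates the proposed
    start/end arguments with the existing resample and reindex methods.
    """
    bins = pd.date_range(start, end, freq=rule)
    return df.resample(rule, on=on)[value].sum().reindex(bins, fill_value=0)


# Hypothetical data, invented for illustration.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2021-06-03", "2021-06-20", "2021-07-11", "2021-09-05", "2021-10-17"]
        ),
        "value": [10, 5, 7, 3, 8],
    }
)

# Proposed call (NOT valid pandas today):
#   df.resample("MS", on="date", start="2021-01-01", end="2021-12-31")
out = resample_sum_between(
    df, "MS", on="date", value="value", start="2021-01-01", end="2021-12-31"
)
print(len(out))  # 12
```

This version only handles a sum over a single column; a built-in implementation would presumably work with any aggregation.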
API breaking implications
My suggestion relates to adding two new parameters to the resample method of DataFrames (start and end), but I'm not sure if allowing the user to set very specific dates for the beginning and end of a resampler would make it more inefficient. After all, one could go wild and, unlike me (who only asked for every month in 2021), set the beginning to 1970 and the end to 2100 and beyond.
Describe alternatives you've considered
One could use the API as is along with date_range and reindex to get the same result as this proposed new feature, or perhaps create a custom function to handle the necessary steps, but in my opinion having this built into the method makes it even more expansive than it already is (again, resample is fantastic and saves a lot of work compared to doing time-series analysis in something like Excel).
Additional context
This feature could actually prove useful in daily usage of pandas
I've come up with this suggestion because I messed up a rather basic analysis at work, which got me thinking that this might be more common than one might guess:
Have a dataset with each individual invoice made by the company, along with its date, client and value (e.g., dollars);
Compute monthly mean by each client;
Those clients below a threshold of 1 should be directed to another department;
Used groupby, resample, sum to get monthly invoicing;
Then used groupby and mean to get final data, got a small number of clients below the threshold (say, 5);
However, say a client had invoicing only in August (value of 3) and September (value of 2). Using resample gives a mean of 2.5 (i.e., above the threshold), but considering the mean over the twelve months of the year, this client's value should be 0.42 (i.e., below the threshold);
Naturally, what originally were only 5 clients quickly turned into 100 by adjusting the code to address the entire year 😮
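The arithmetic in the August/September example can be checked with a short sketch. The client name and the invoice dates are made up; only the two monthly values (3 and 2) come from the description above.

```python
import pandas as pd

# Hypothetical invoices for one client: 3 in August and 2 in September 2021.
invoices = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-08-10", "2021-09-15"]),
        "client": ["acme", "acme"],
        "value": [3.0, 2.0],
    }
)

# Monthly invoicing per client, covering only the observed months.
monthly = invoices.set_index("date").groupby("client")["value"].resample("MS").sum()
mean_observed = monthly.groupby(level="client").mean()

# Same data reindexed onto every month of 2021 before averaging.
full_year = pd.date_range("2021-01-01", "2021-12-01", freq="MS")
grid = pd.MultiIndex.from_product(
    [invoices["client"].unique(), full_year], names=["client", "date"]
)
mean_full_year = monthly.reindex(grid, fill_value=0).groupby(level="client").mean()

print(mean_observed["acme"])  # 2.5
print(round(mean_full_year["acme"], 2))  # 0.42
```

The first mean divides 5 by the two observed months, while the second divides it by twelve, which is what moves the client below the threshold of 1.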