weighted mean #10030
Why not just write:
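(The suggested snippet did not survive extraction; a plausible sketch of the one-liner idea, with `df` and `weights` as hypothetical names. Note that a later comment in this thread points out that one-liners of this style mishandle missing values in the denominator.)

```python
# illustrative only -- not the verbatim original suggestion:
df.mul(weights, axis=0).sum() / weights.sum()
```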
I agree with @bgrayburn, weighted statistics would be very useful in pandas. One can use statsmodels, but extending DataFrame methods to accept weights would be very useful for people working with weighted survey data.
@stephan: I agree your code snippet accomplishes (nearly) the same thing.
Okay, fair enough. This seems within scope for pandas.
Again, I think we're open to most of these changes (all of which are backward compatible with the existing API).
@shoyer @benjello sorry for the delay on this, still planning on submitting a PR; I'll be coding this weekend. Regarding the numpy thing: for weighted means they use `np.average`, which you can see here. My plan was to implement it pretty much as written, by multiplying the input dataframe's columns by the weight vector. Alternatively we could call `np.average` when a weights parameter is present, or (3rd option) we could implement `pandas.DataFrame(a).average(weights=[...])` to mirror numpy. One last question: should weighting be applicable in either axis=0 or axis=1 mode? I'm assuming yes, but wanted to check. Let me know your preferences, or if somehow this should be incorporated into a larger change as mentioned above. Best
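(A minimal sketch of that multiply-by-the-weight-vector approach, illustrative only; `df` and the weight list are hypothetical:)

```python
import pandas as pd

def weighted_mean(df, weights):
    # multiply each column by the weight vector (row-wise), then normalize
    w = pd.Series(weights, index=df.index)
    return df.mul(w, axis=0).sum() / w.sum()

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
print(weighted_mean(df, [0.2, 0.3, 0.5]))  # a: 2.3, b: 5.3
```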
why are you not just adding a weights keyword to .mean? much more consistent in the API; we don't implement average, I suspect, because it's just confusing
and this needs to trickle down to nanops.py, where all of the actual computation is done - this handles many different dtypes
@bgrayburn: i think @jreback's suggestion is worth following: using mean with weights is what you're expecting for a weighted mean
+1 Would make working with microdata samples (think PUMS) so much nicer.
I agree that it would be better to incorporate this directly into aggregation functions like `mean`.
+1 to @mattayes & @shoyer -- when working with weighted data you pretty much want to weight EVERY statistic and graph that you generate. It's pretty much a necessity to have some weighting option if you're working with such data. Adding a weights argument to as many functions as possible over time sounds like the way to go, to the extent it isn't going to be handled in numpy/statsmodels/matplotlib. Stata, for example, allows a weight option for practically all functions. I use Stata's tabstat with the weight option very frequently, and at the moment there isn't any good analog in pandas that I know of.
A possible complication to consider: there are potentially different kinds of weights. Stata, for example, defines 4 types of weights: frequency, analytical, probability, and importance (although the last one is just an abstract catchall); see http://www.stata.com/help.cgi?weight. I'm thinking that in this thread most people are thinking of frequency weights, but it might be necessary to clarify this. Also, it probably won't matter for something like mean or median, but could affect something like variance.
There has been some recent discussion about implementing efficient algorithms for weighted partition (e.g., to do weighted median) upstream in NumPy as well. In any case, a first draft that uses sorting to do weighted median would still be valuable.
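(A sorting-based weighted median is short enough to sketch; `weighted_median` is a hypothetical helper that treats the weighted median as the smallest value whose cumulative weight reaches half the total:)

```python
import numpy as np

def weighted_median(values, weights):
    # sort by value, then find where cumulative weight crosses half the total
    order = np.argsort(values)
    cum_weights = np.cumsum(np.asarray(weights)[order])
    cutoff = cum_weights[-1] / 2.0
    return np.asarray(values)[order][cum_weights >= cutoff][0]

print(weighted_median([3, 1, 2], [1, 1, 10]))  # 2
```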
From pydata/xarray#650: How about designing this as a `.weighted(weights)` method that returns an intermediate object? So for example `df.weighted(weights).mean()`. And then this is extensible, clean, pandan-tic.
what other things would you do with a .weighted(..).mean() interface? IOW what other parameters would it accept aside from the actual weights?
@jreback I think
@shoyer I can see some syntax from that, e.g. `df.weighted(weights).mean()`, though any idea how R does this? (julia?)
I used R's wtd.stats functions (from the Hmisc package):

```r
wtd.mean(x, weights=NULL, normwt="ignored", na.rm=TRUE)
wtd.var(x, weights=NULL, normwt=FALSE, na.rm=TRUE)
wtd.quantile(x, weights=NULL, probs=c(0, .25, .5, .75, 1),
             type=c('quantile','(i-1)/(n-1)','i/(n+1)','i/n'),
             normwt=FALSE, na.rm=TRUE)
wtd.Ecdf(x, weights=NULL,
         type=c('i/n','(i-1)/(n-1)','i/(n+1)'),
         normwt=FALSE, na.rm=TRUE)
wtd.table(x, weights=NULL, type=c('list','table'),
          normwt=FALSE, na.rm=TRUE)
wtd.rank(x, weights=NULL, normwt=FALSE, na.rm=TRUE)
wtd.loess.noiter(x, y, weights=rep(1,n), robust=rep(1,n),
                 span=2/3, degree=1, cell=.13333,
                 type=c('all','ordered all','evaluate'),
                 evaluation=100, na.rm=TRUE)
```
@benjello hmm, that's interesting.
On the face of it, this does look as nice. But from an API design perspective, adding a keyword argument for weights to every statistical method doesn't scale. Being "weighted" is orthogonal to the type of statistical calculation. With this proposal, instead of adding the keyword to each individual method, we only need a single entry point.
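(A minimal sketch of what such a single entry point could look like; the `Weighted` wrapper here is hypothetical, not pandas API:)

```python
import pandas as pd

class Weighted:
    """Hypothetical wrapper returned by a .weighted(weights) accessor."""
    def __init__(self, obj, weights):
        self._obj = obj
        self._weights = pd.Series(weights, index=obj.index)

    def mean(self):
        w = self._weights
        # denominator: weights summed over valid (non-NaN) values only
        return self._obj.mul(w, axis=0).sum() / self._obj.notna().mul(w, axis=0).sum()

df = pd.DataFrame({"a": [1.0, 2.0, None]})
print(Weighted(df, [1, 1, 10]).mean())  # a: 1.5
```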
Interesting package addressing weighted calculations here - https://github.com/jsvine/weightedcalcs. Has an API along the lines of `wc.Calculator("weight_col").mean(df, "value_col")`.
I am looking for weighted means in groupby aggregates, where the weights are another column of the dataframe. Does this thread include support for this? To make my question explicit, the setup and operation look roughly like the sketch below. The real problem is that this requires two columns in the aggregate input. I have looked at workarounds like the ones suggested here, but I find this to be slow: in my case, when the original dataframe has about ~10000 rows and 10000 groups, a direct `apply` is quite slow.
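(The original snippets were not preserved; a hypothetical reconstruction of the pattern being described, with made-up column names:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": [0, 0, 1, 1],
    "x": [1.0, 2.0, 3.0, 4.0],  # values
    "w": [1.0, 3.0, 1.0, 1.0],  # per-row weights, stored as another column
})

# weighted mean per group via apply -- correct, but slow with many groups
result = df.groupby("group").apply(lambda g: np.average(g["x"], weights=g["w"]))
print(result)  # group 0: 1.75, group 1: 3.5
```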
Same problem here.
You can improve performance significantly by not using apply, but instead building the calc out of existing vectorized ops. An example for mean is sketched below.
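(The original example was not preserved; the standard vectorized pattern looks like this, reusing the same hypothetical `df` as in the question above:)

```python
# vectorized grouped weighted mean: one multiply, two grouped sums
df["wx"] = df["x"] * df["w"]
g = df.groupby("group")
weighted_mean = g["wx"].sum() / g["w"].sum()
print(weighted_mean)  # group 0: 1.75, group 1: 3.5
```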
This is so complicated between weight types, API, etc., but just to chime in while stuff is fresh in my mind: from a statistical point of view, basic statistics seem to break down into 2 categories: (1) statistics like mean, max, and median, and (2) standard deviation and variance.
I believe that the first category is very straightforward to handle. I'm an amateur statistician at best, but I think there is really only one basic way to calculate weighted mean/max/median etc. Conversely, std dev & variance are a lot more complicated than you might think -- not that the math is that hard, but more that "std dev" can mean more than one thing here. There's a really great article that lays out the issues. For example, Stata can compute a weighted standard deviation under frequency weights or analytic weights, and the two give different answers on the same data; see the sketch below. Also, to the extent pandas is handing this sort of thing off to statsmodels, they do have a library that does some weighting for most basic stats (although min & max seem to be missing). See a recent answer I wrote at SO using the statsmodels library.
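(To make the ambiguity concrete, a sketch of the two common conventions; the formulas are assumptions matching Stata's fweight and aweight behavior as I understand it:)

```python
import numpy as np

def std_frequency(x, w):
    # frequency weights: as if x[i] were literally repeated w[i] times
    x, w = np.asarray(x, float), np.asarray(w, float)
    m = np.average(x, weights=w)
    return np.sqrt(np.sum(w * (x - m) ** 2) / (w.sum() - 1))

def std_analytic(x, w):
    # analytic weights: rescale weights to sum to n, then divide by n - 1
    x, w = np.asarray(x, float), np.asarray(w, float)
    n = len(x)
    m = np.average(x, weights=w)
    return np.sqrt(np.sum(w * (x - m) ** 2) * n / (w.sum() * (n - 1)))

x, w = [1.0, 2.0, 4.0], [1.0, 2.0, 3.0]
print(std_frequency(x, w), std_analytic(x, w))  # different numbers, same data
```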
look im sorry but this is largely incorrect. weighted statistics are really basic stuff
OK, then which of these is correct?
FWIW:

```python
import pandas as pd

def resample_weighted_quantile(frame, weight=None, rule='D', q=0.5):
    if weight is None:
        return frame.resample(rule).apply(lambda x: x.quantile(q))
    else:
        data = [series.resample(rule).apply(_weighted_quantile, weight=weight[col], q=q).rename(col)
                for col, series in frame.items()]
        return pd.concat(data, axis=1)

def _weighted_quantile(series, weight, q):
    # sort values, then take the first value whose cumulative weight reaches q
    series = series.sort_values()
    cumsum = weight.reindex(series.index).cumsum()
    cutoff = cumsum.iloc[-1] * q
    return series[cumsum >= cutoff].iloc[0]
```

frame and weight are dataframes with the same index and columns. It could probably be optimized, but at least it works.
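(A usage sketch with made-up data, assuming a datetime index:)

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=48, freq="H")
frame = pd.DataFrame({"a": np.random.randn(48)}, index=idx)
weight = pd.DataFrame({"a": np.random.rand(48)}, index=idx)

daily_weighted_median = resample_weighted_quantile(frame, weight, rule="D", q=0.5)
```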
This would be a great addition to pandas!
I would love to have such functionality - this is one of the things I sorely miss in pandas in comparison to R: `df.groupby(...).mean(weights={'a': df.d, 'b': df.e})`
this is already available in master / 1.3, see https://pandas.pydata.org/pandas-docs/dev/user_guide/window.html?highlight=weighted_mean#overview, look for weighted_mean.
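(For reference, the windowed weighted mean referred to there looks roughly like this; `win_type` windows require scipy:)

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
# rolling mean where the 3 observations in each window are weighted
# by a triangular window rather than equally
print(s.rolling(window=3, win_type="triang").mean())
```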
Thank you, though I have one small problem with that: I would like to apply it after a groupby that does not have a fixed size.
I'm involved in a project to extend pandas to include weights: https://github.com/PSLmodels/microdf. This allows normal interaction with many pandas functions, including weighted aggregations; a sketch of the interface follows below. It currently only supports one set of weights, so @JacekPliszka's request would require a post-hoc step. Of course, we'd gladly shut it down if this functionality lands in pandas itself.
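(A sketch of microdf usage as I understand the project; treat the exact names and signatures as assumptions rather than verified API:)

```python
import microdf as mdf
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "w": [1.0, 1.0, 2.0]})
# wrap the data with its weights; aggregations then weight automatically
mdf_df = mdf.MicroDataFrame(df[["x"]], weights=df["w"])
print(mdf_df["x"].mean())  # weighted mean: (1 + 2 + 3*2) / 4 = 2.25
```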
As a reference, xarray supports weighted with a groupby-like `.weighted` method: http://xarray.pydata.org/en/stable/examples/area_weighted_temperature.html#Weighted-mean
statsmodels also has a pretty decent suite of weighted descriptive stats:
https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.html
means, quantiles, corrcoef, and various statistical tests
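(A small example of DescrStatsW usage for reference:)

```python
import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 1.0, 2.0])
d = DescrStatsW(x, weights=w)
print(d.mean)             # weighted mean: 2.25
print(d.std)              # weighted standard deviation
print(d.quantile([0.5]))  # weighted median
```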
What's the status on this issue? Any extra help needed? Or is this implemented?
@lababidi: not implemented.
Hi! Are there any plans to eventually review this feature? Just wanted to add a note to the conversation, as I unfortunately was a bit misled by it. The original comment by @shoyer had a bit of an issue (might want to make a quick note/edit since it's the top comment 😊): it doesn't really apply a weighted mean, as the denominator should be the sum of weights across all valid values, not the count of valid values. Here's what I wrote, which also should handle missing values appropriately:

```python
(
    data[
        [
            ...,  # variables
            "weight",  # must include weights!
        ]
    ]
    .groupby("groups")
    .apply(  # apply to each group df
        lambda df: df.apply(  # apply to each col within df
            lambda col: col.mul(df["weight"], axis=0).sum()
            # use `.mask(col.isna())` so weights for missing values don't count towards the denominator
            / df["weight"].mask(col.isna()).sum()
        )
    )
    .drop(columns="weight")  # drop weights after aggregation
)
```

This of course feels very clunky, so I would also favor the `.weighted(...)` interface discussed above. Hopefully this can help a bit with implementation once it happens. 👍
A "weights" labeled parameter in the df.mean method would be extremely useful. In numpy this functionality is provided via np.average instead of np.mean which I'm assuming is how similar functionality would be added to pandas.
ex requested feature:
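(The original example was not preserved; presumably something along these lines, with `w` as a hypothetical weight vector:)

```python
# requested: a weights keyword on DataFrame.mean, mirroring np.average
df.mean(axis=0, weights=w)
```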
If this is a desired feature, I'll complete it and submit a pull request. If this already exists somewhere I've overlooked, my apologies; I've tried to look thoroughly.