
weighted mean #10030

Open

bgrayburn opened this issue Apr 30, 2015 · 55 comments
Labels
Enhancement Reduction Operations sum, mean, min, max, etc. Window rolling, ewma, expanding

Comments

@bgrayburn

A "weights" labeled parameter in the df.mean method would be extremely useful. In numpy this functionality is provided via np.average instead of np.mean which I'm assuming is how similar functionality would be added to pandas.

ex requested feature:

a = np.array([[2, 4, 6], [8, 10, 12]])
w = [.5, 0, .5]
np_avg = np.average(a, weights=w, axis=1)
# output -> array([ 4., 10.])
pd_avg = pandas.DataFrame(a).mean(weights=w, axis=1)  # requested API
# desired output -> Series with entries 4.0 and 10.0

If this is a desired feature, I'll complete it and submit a pull request. If this already exists somewhere and I've overlooked it, my apologies; I've tried to look thoroughly.

@shoyer
Member

shoyer commented Apr 30, 2015

Why not just write:

pd_avg = (np.array(w) * pandas.DataFrame(a)).mean(axis=1)

@benjello
Contributor

benjello commented May 2, 2015

I agree with @bgrayburn: weighted statistics would be very useful in pandas. One can use statsmodels, but extending DataFrame methods to accept weights would be very useful for people using weighted survey data.

@bgrayburn
Author

@shoyer: I agree your code snippet accomplishes (nearly) the same thing from a functional perspective, but from a code-readability standpoint and a code-reuse standpoint, including a weights parameter seems optimal. Also, numpy's weights parameter automatically normalizes the weights vector, which is extremely useful.
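
To make the normalization point concrete, here is a minimal sketch (mine, not from the thread) contrasting np.average with the bare multiply-then-mean trick:

import numpy as np
import pandas as pd

a = np.array([[2, 4, 6], [8, 10, 12]])
w = [5, 0, 5]  # deliberately unnormalized weights

# np.average divides by the weight total: sum(w * x) / sum(w)
np.average(a, weights=w, axis=1)
# -> array([ 4., 10.])

# multiply-then-mean divides by the column count instead, so it only
# matches np.average when the weights happen to sum to 1
(np.array(w) * pd.DataFrame(a)).mean(axis=1)
# -> 0    13.333333
#    1    33.333333
#    dtype: float64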


@shoyer
Member

shoyer commented May 3, 2015

Okay, fair enough. This seems within scope for pandas. We recently added a sample method which includes a similar weights argument (#9666) which might be useful as a starting point.

@benjello
Contributor

@shoyer asked elsewhere (#10000) about a list of methods that could be enhanced by a 'weighted' version.
Almost all the statistical functions that can be found here are candidates. I can also think of describe, value_counts, qcut, hist and margin computations in pivot tables.

@shoyer
Member

shoyer commented Jun 13, 2015

Again, I think we're open to most of these changes (all of which are backward compatible with weights=None). The main obstacle is that we need implementations, documentation, and benchmarks to show we aren't slowing anything down. PRs would be welcome, although it would also be worth checking whether any of these could be pushed upstream to numpy.

@bgrayburn
Author

@shoyer @benjello sorry for the delay on this; I'm still planning on submitting a PR and will be coding this weekend.

In regard to numpy: for weighted means it uses .average, which you can see here. My plan was to implement

pd_avg = (np.array(w) * pandas.DataFrame(a)).mean(axis=1)

pretty much as written, by multiplying the input dataframe's columns by the weights vector. Alternatively, we could call np.average when a weights parameter is present, or (third option) we could implement pandas.DataFrame(a).average(weights=[...]) to mirror numpy.

One last question: should weighting be applicable along either axis (axis=0 and axis=1)? I'm assuming yes, but wanted to check.

Let me know your preferences, or whether this should be incorporated into a larger change as mentioned above.

Best

@jreback
Contributor

jreback commented Jul 24, 2015

Why are you not just adding a weights keyword to .mean?

That's much more consistent with the API; I suspect we don't implement average because it's just confusing.

@jreback
Contributor

jreback commented Jul 24, 2015

And this needs to trickle down to nanops.py, where all of the actual computation is done; it handles many different dtypes.

@benjello
Contributor

@bgrayburn: I think @jreback's suggestion is worth following: mean with a weights argument is what you'd expect for a weighted mean.

@mattayes
Contributor

+1 Would make working with microdata samples (think PUMS) so much nicer.

@shoyer
Member

shoyer commented Jul 29, 2015

I agree that it would be better to incorporate this directly into aggregation functions like mean and var instead of adding specialized methods like average.

@johne13

johne13 commented Feb 27, 2016

+1 to @mattayes & @shoyer -- when working with weighted data you want to weight virtually every statistic and graph that you generate, so some weighting option is a practical necessity for such data.

Adding a weights argument to as many functions as possible over time sounds like the way to go to the extent it isn't going to be handled in numpy/statsmodels/matplotlib.

Stata, for example, allows a weight option for practically all functions. I use Stata's tabstat with the weight option very frequently, and at the moment there isn't any good analog in pandas that I know of.

@johne13

johne13 commented Feb 28, 2016

A possible complication to consider: there are potentially different kinds of weights. Stata, for example, defines four types of weights: frequency, analytical, probability, and importance (although the last one is just an abstract catch-all). [http://www.stata.com/help.cgi?weight]

I'm thinking that in this thread most people have frequency weights in mind, but it might be necessary to clarify this. Also, the distinction probably won't matter for something like mean or median, but it could affect something like variance.
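
To illustrate the variance point, a minimal numpy sketch (my own, following Stata's fw/aw definitions; the mean is identical under both conventions):

import numpy as np

x = np.array([1.0, 2.0, 4.0])
w = np.array([2.0, 3.0, 5.0])

m = np.average(x, weights=w)        # 2.8 under either weight type
sq = (w * (x - m) ** 2).sum()       # weighted sum of squared deviations

# frequency weights: each row counts w[i] times, so n = w.sum()
var_fw = sq / (w.sum() - 1)                      # 15.6 / 9 = 1.7333...

# analytic weights: rescale weights to sum to len(x), then use n - 1
var_aw = sq * len(x) / w.sum() / (len(x) - 1)    # 2.34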

@shoyer
Member

shoyer commented Mar 17, 2016

There has been some recent discussion about implementing efficient algorithms for weighted partition (e.g., to do weighted median) upstream in NumPy, as well:
https://mail.scipy.org/pipermail/numpy-discussion/2016-February/075000.html

In any case, a first draft that uses sorting to do weighted median would still be valuable.

@max-sixty
Contributor

From pydata/xarray#650:

How about designing this as a groupby-like interface? In the same way as .rolling (or .expanding & .ewm in pandas)?

So for example ds.weighted(weights=ds.dim).mean().

And then this is extensible, clean, pandan-tic.
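
As a rough sketch of how such an interface could compose (the Weighted class below is invented for illustration; it is not an existing pandas API):

import numpy as np
import pandas as pd

class Weighted:
    def __init__(self, obj, weights):
        self._obj = obj
        self._w = np.asarray(weights, dtype=float)

    def mean(self):
        # column-wise sum(w * x) / sum(w)
        return self._obj.mul(self._w, axis=0).sum() / self._w.sum()

    def var(self, ddof=1):
        # frequency-weight convention: denominator is sum(w) - ddof
        dev = (self._obj - self.mean()) ** 2
        return dev.mul(self._w, axis=0).sum() / (self._w.sum() - ddof)

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
Weighted(df, [2, 1, 1]).mean()   # x    1.75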

@jreback
Contributor

jreback commented May 11, 2016

what other things would you do with a .weighted(..).mean() interface?

IOW what other parameters would it accept aside from the actual weights?

@shoyer
Member

shoyer commented May 11, 2016

@jreback I think .weighted() would only accept weights, which could be either an array or a callable of the usual form (lambda df: ....). But the WeightedMethods class could also expose weighted implementations of other methods, such as std, var, median, sum, value_counts, hist, etc. I would even consider moving over sample and deprecating the weights argument.

@jreback
Contributor

jreback commented May 11, 2016

@shoyer I can see some syntax from that e.g.

df.weighted('A').B.mean() is pretty clear

though df.B.mean(weights=df.A) is just as clear, so looking for a case where this is significantly nicer.

Any idea how/whether R does this? (Julia?)

@benjello
Contributor

benjello commented May 11, 2016

I used R's wtd.stats functions:

wtd.mean(x, weights=NULL, normwt="ignored", na.rm=TRUE)
wtd.var(x, weights=NULL, normwt=FALSE, na.rm=TRUE)
wtd.quantile(x, weights=NULL, probs=c(0, .25, .5, .75, 1), 
             type=c('quantile','(i-1)/(n-1)','i/(n+1)','i/n'), 
             normwt=FALSE, na.rm=TRUE)
wtd.Ecdf(x, weights=NULL, 
         type=c('i/n','(i-1)/(n-1)','i/(n+1)'), 
         normwt=FALSE, na.rm=TRUE)
wtd.table(x, weights=NULL, type=c('list','table'), 
          normwt=FALSE, na.rm=TRUE)
wtd.rank(x, weights=NULL, normwt=FALSE, na.rm=TRUE)
wtd.loess.noiter(x, y, weights=rep(1,n), robust=rep(1,n), 
                 span=2/3, degree=1, cell=.13333, 
                 type=c('all','ordered all','evaluate'), 
                 evaluation=100, na.rm=TRUE)

@jreback
Contributor

jreback commented May 11, 2016

@benjello hmm, that's interesting.

@shoyer
Member

shoyer commented May 11, 2016

though df.B.mean(weights=df.A) is just as clear, so looking for a case where this is significantly nicer.

On the face of it, this does look as nice. But from an API design perspective, adding a keyword argument for weights is much less elegant.

Being "weighted" is orthogonal to the type of statistical calculation. With this proposal, instead of adding the weights keyword argument to N different methods, we define a single weighted method, and add statistical methods to it that exactly match the signature of the same methods on DataFrame/Series. This makes it obvious that all these methods share the same approach, and keeps method signatures from growing additional arguments that trigger entirely independent code paths (which is a sign of code smell).

Separately: perhaps weightby is a slightly better name than weighted? It suggests more similarity to groupby.
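
Sketching how that could read in practice (hypothetical names, purely illustrative):

# weights live in one place; the statistical methods keep their usual signatures
df.weightby(df["w"]).mean()
df.weightby(lambda d: d["w"]).var()
df.weightby("w").quantile(0.5)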

jreback added a commit to jreback/pandas that referenced this issue Jan 2, 2017
@chris-b1
Contributor

chris-b1 commented Apr 5, 2017

Interesting package addressing weighted calculations here - https://github.com/jsvine/weightedcalcs. Has an api along the lines of weightby.

@rbiswas4

I am looking for weighted means in groupby aggregates, where the weights are another column of the dataframe. Does this thread cover support for that? To make my question explicit:

lcc:

           snid       mjd           band   flux          weights
obsHistID
609734     141374000  60493.416959  lsstg  2.825651e-09  6.442312e+20
609733     141374000  60493.416511  lsstg  2.893961e-09  5.962141e+20
609732     141374000  60493.416062  lsstg  2.834461e-09  6.590458e+20
...
611542     141374000  60495.426047  lssti  6.722778e-09  1.307280e+20
610790     141374000  60494.432074  lsstz  6.619978e-09  6.156260e+19

and I do operations like:

grouped = lcc.groupby(['snid', 'band', 'night'])
res = grouped.agg(dict(flux=np.mean))

What I really want is a fast version of:

res = grouped.agg(dict(flux=weightedmean(weights='weights')))

The real problem is that this requires two columns as aggregate input. I have looked at workarounds like the ones suggested here, but I find them to be slow:

In my case, when the original dataframe has about ~10000 rows and 10000 groups, a direct np.mean aggregate times to a best of 1.6 ms per loop, while running through the workaround takes ~2 min per loop. Is there an implementation/workaround that will speed this up for me?

@ilemhadri

Same problem here.
At the moment I am using .apply as a workaround, but this is slow.
Being able to call .agg on groupby objects using more than one column as input would be highly appreciated.

@chris-b1
Contributor

chris-b1 commented Aug 3, 2017

You can improve performance significantly by not using apply, and instead building the calculation out of existing vectorized ops. Example for mean below.

In [49]: df = pd.DataFrame({'id': np.repeat(np.arange(100, dtype='i8'), 100),
    ...:                    'v': np.random.randn(10000),
    ...:                    'w': np.random.randn(10000)})


In [46]: %timeit mean1 = df.groupby('id').apply(lambda x: (x['v'] * x['w']).sum() / x['w'].sum())
32.4 ms ± 2.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [47]: %%timeit
    ...: df['interim'] = df['v'] * df['w']
    ...: gb = df.groupby('id')
    ...: mean2 = gb['interim'].sum() / gb['w'].sum()
1.21 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [48]: np.allclose(mean1, mean2)
Out[48]: True
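
One caveat worth noting (my addition, not part of the benchmark above): if the value column can contain NaN, those rows' weights should arguably be excluded from the denominator, e.g.:

# mask out weights whose value is missing so they don't inflate the denominator
valid_w = df['w'].where(df['v'].notna())
df['interim'] = df['v'] * df['w']
gb = df.groupby('id')
mean3 = gb['interim'].sum() / valid_w.groupby(df['id']).sum()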

@johne13

johne13 commented Nov 18, 2017

This is so complicated between weight types, API, etc. but just to chime in while stuff is fresh in my mind:

From a statistical point of view, basic statistics seem to break down into 2 categories:

  1. means and order statistics (min/max/median/quantiles)
  2. higher order moments (std dev/variance/skew/kurtosis)

I believe that the first category is very straightforward to handle. I'm an amateur statistician at best, but I think there is really only one basic way to calculate mean/max/median etc.

Conversely, std dev & variance are a lot more complicated than you might think -- not that the math is that hard, but more that "std dev" can mean more than one thing here. Really great article here that lays out the issues:
https://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/

For example, if you type these two commands in Stata:
sum x [fw=weight], detail
sum x [aw=weight], detail
you'll get the same results for all stats except std & var

Also, to the extent pandas is handing this sort of thing off to statsmodels, they do have a library here that does some weighting for most basic stats (although min & max seem to be missing). See this link for more (a recent answer I wrote at SO using the statsmodels library):

https://stackoverflow.com/questions/17689099/using-describe-with-weighted-data-mean-standard-deviation-median-quantil/47368071#47368071
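
For example, a minimal usage sketch of that statsmodels class (DescrStatsW; its ddof follows the frequency-weight convention):

import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

x = np.array([1.0, 2.0, 4.0])
w = np.array([2.0, 3.0, 5.0])

d = DescrStatsW(x, weights=w, ddof=1)
d.mean  # sum(w * x) / sum(w)
d.std   # sqrt(sum(w * (x - mean)**2) / (sum(w) - ddof))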

@randomgambit

Look, I'm sorry, but this is largely incorrect. Weighted statistics are really basic stuff.

@johne13

johne13 commented Nov 18, 2017

@randomgambit

OK, then which of these is correct?

sum x [fw=weight], detail
sum x [aw=weight], detail

@kdebrab
Contributor

kdebrab commented Jun 12, 2018

FWIW:
I needed a resampled weighted quantile and implemented it as follows.

def resample_weighted_quantile(frame, weight=None, rule='D', q=0.5):
    if weight is None:
        return frame.resample(rule).apply(lambda x: x.quantile(q))
    else:
        data = [series.resample(rule)
                      .apply(_weighted_quantile, weight=weight[col], q=q)
                      .rename(col)
                for col, series in frame.items()]
        return pd.concat(data, axis=1)

def _weighted_quantile(series, weight, q):
    # sort the values, accumulate their weights, and return the first value
    # whose cumulative weight reaches the q-th fraction of the total weight
    series = series.sort_values()
    cumsum = weight.reindex(series.index).cumsum()
    cutoff = cumsum.iloc[-1] * q
    return series[cumsum >= cutoff].iloc[0]

frame and weight are dataframes with the same index and columns. It could probably be optimized, but at least it works.
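
A minimal usage sketch under those assumptions (synthetic data; names are mine):

import numpy as np
import pandas as pd

idx = pd.date_range('2018-06-01', periods=48, freq='H')
frame = pd.DataFrame({'a': np.random.randn(48)}, index=idx)
weight = pd.DataFrame({'a': np.random.rand(48)}, index=idx)

daily_median = resample_weighted_quantile(frame, weight, rule='D', q=0.5)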

@Heuertje

Heuertje commented Aug 8, 2018

This would be a great addition to pandas!

@JacekPliszka

I would love to have such functionality; it is one of the things I sorely miss in pandas in comparison to R.
My use case needs different weights for different columns, so something like this would be great:

df.groupby(...).mean(weights={'a': df.d, 'b': df.e})
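
Lacking that, a sketch of the same per-column computation with existing groupby ops (assuming a grouping key g and the columns named above):

weights = {'a': df.d, 'b': df.e}
out = pd.DataFrame({
    col: (df[col] * w).groupby(df.g).sum() / w.groupby(df.g).sum()
    for col, w in weights.items()
})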

@jreback
Contributor

jreback commented Mar 11, 2021

this is already available in master / 1.3, see https://pandas.pydata.org/pandas-docs/dev/user_guide/window.html?highlight=weighted_mean#overview, look for weighted_mean.
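
For reference, the weighted-window path looks like this (win_type draws its weights from scipy's window functions, so scipy must be installed):

import pandas as pd

s = pd.Series(range(10), dtype=float)
s.rolling(window=5, win_type='triang').mean()   # triangular-weighted rolling mean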

@JacekPliszka

Thank you, though I have one small problem with that: I would like to apply it after a groupby that does not have a fixed size.

@MaxGhenis

MaxGhenis commented Mar 12, 2021

I'm involved in a project to extend pandas to include weights: https://github.com/PSLmodels/microdf

This allows normal interaction with many pandas functions, including mean and groupby, after defining weights upfront, e.g. (notebook):

import microdf as mdf

d = mdf.MicroDataFrame({"x": [1, 2, 3], "g": ["a", "a", "b"]}, weights=[2, 1, 1])
d.groupby("g").x.mean()

Result:

g
a    1.333333
b    3.000000
dtype: float64

It currently only supports one set of weights, so @JacekPliszka's request would require a post-hoc merge.

Of course, we'd gladly shut it down if pandas included it natively at some point :)

@max-sixty
Contributor

As a reference, xarray supports weighted with a groupby-like .weighted method: http://xarray.pydata.org/en/stable/examples/area_weighted_temperature.html#Weighted-mean


@lababidi

What's the status on this issue? Any extra help needed? Or is this implemented?

@benjello
Contributor

@lababidi: not implemented.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel jbrockmendel added Reduction Operations sum, mean, min, max, etc. and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels Mar 28, 2023
@nachomaiz

nachomaiz commented Apr 11, 2025

Hi!

Are there any plans to eventually review this feature?

Just wanted to add a note to the conversation, as I unfortunately was a bit misled by it.

The original comment by @shoyer had a bit of an issue (might want to make a quick note/edit since it's the top comment 😊):

Why not just write:

pd_avg = (np.array(w) * pandas.DataFrame(a)).mean(axis=1)

This doesn't really apply a weighted mean in general: the denominator should be the sum of weights across all valid values, not the count of valid values that the .mean() method uses. (The snippet happens to give the right answer in the original example only because those weights sum to 1.)

Here's what I wrote, which also should handle missing values appropriately:

(
    data[
        [
            ..., # variables
            "weight",  # must include weights!
        ]
    ]
    .groupby("groups")
    .apply(  # apply to each group df
        lambda df: df.apply(  # apply to each col within df
            lambda col: col.mul(df["weight"], axis=0).sum()
            # use `.mask(col.isna())` so weights for missing values don't count towards the denominator
            / df["weight"].mask(col.isna()).sum()
        )
    )
    .drop(columns="weight")  # drop weights after aggregation
)

This of course feels very clunky, so I would also favor the .weighted or .weightby API that was mentioned above.

Hopefully this can help a bit with implementation once it happens. 👍
