
weighted mean #10030

Open

bgrayburn opened this issue Apr 30, 2015 · 55 comments
Labels
Enhancement Reduction Operations sum, mean, min, max, etc. Window rolling, ewma, expanding

Comments

@bgrayburn

A "weights" labeled parameter in the df.mean method would be extremely useful. In numpy this functionality is provided via np.average instead of np.mean which I'm assuming is how similar functionality would be added to pandas.

ex requested feature:

a = np.array([[2, 4, 6], [8, 10, 12]])
w = [.5, 0, .5]
np_avg = np.average(a, weights=w, axis=1)
# output -> array([ 4., 10.])
pd_avg = pandas.DataFrame(a).mean(weights=w, axis=1)  # requested API
# desired output -> Series with entries 4.0 and 10.0

If this is a desired feature, I'll complete it and submit a pull request. If this already exists somewhere and I've overlooked it, my apologies; I've tried to look thoroughly.

@shoyer
Member

shoyer commented Apr 30, 2015

Why not just write:

pd_avg = (np.array(w) * pandas.DataFrame(a)).mean(axis=1)

@benjello
Contributor

benjello commented May 2, 2015

I agree with @bgrayburn: weighted statistics would be very useful in pandas. One can use statsmodels, but extending DataFrame methods to accept weights would be very useful for people using weighted survey data.

@bgrayburn
Author

@shoyer: I agree your code snippet accomplishes (nearly) the same thing from a functional perspective, but from a code-readability standpoint and a code-reuse standpoint, including a weights parameter seems optimal. Also, numpy's weights parameter automatically normalizes the weights vector, which is extremely useful.
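
To make the normalization point concrete, here is a minimal sketch (mine, not from the thread) contrasting np.average with the bare multiply-then-mean trick:

import numpy as np
import pandas as pd

a = np.array([[2, 4, 6], [8, 10, 12]])
w = [5, 0, 5]  # deliberately unnormalized weights

# np.average divides by the weight total: sum(w * x) / sum(w)
np.average(a, weights=w, axis=1)
# -> array([ 4., 10.])

# multiply-then-mean divides by the column count instead, so it only
# matches np.average when the weights happen to sum to 1
(np.array(w) * pd.DataFrame(a)).mean(axis=1)
# -> 0    13.333333
#    1    33.333333
#    dtype: float64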


@shoyer
Member

shoyer commented May 3, 2015

Okay, fair enough. This seems within scope for pandas. We recently added a sample method which includes a similar weights argument (#9666) which might be useful as a starting point.

@benjello
Contributor

@shoyer asked elsewhere (#10000) about a list of methods that could be enhanced by a 'weighted' version.
Almost all the statistical functions that can be found here are candidates. I can also think of describe, value_counts, qcut, hist and margin computations in pivot tables.

@shoyer
Member

shoyer commented Jun 13, 2015

Again, I think we're open to most of these changes (all of which are backward compatible with weights=None). The main obstacle is that we need implementations, documentation, and benchmarks to show we aren't slowing anything down. PRs would be welcome, although it would also be worth checking whether any of these could be pushed upstream to numpy.

@bgrayburn
Author

@shoyer @benjello sorry for the delay on this; I'm still planning on submitting a PR and will be coding this weekend.

In regard to numpy: for weighted means it uses .average, which you can see here. My plan was to implement

pd_avg = (np.array(w) * pandas.DataFrame(a)).mean(axis=1)

pretty much as written, by multiplying the input dataframe's columns by the weights vector. Alternatively, we could call np.average when a weights parameter is present, or (third option) we could implement pandas.DataFrame(a).average(weights=[...]) to mirror numpy.

One last question: should weighting be applicable along either axis (axis=0 and axis=1)? I'm assuming yes, but wanted to check.

Let me know your preferences, or whether this should be incorporated into a larger change as mentioned above.

Best

@jreback
Contributor

jreback commented Jul 24, 2015

Why are you not just adding a weights keyword to .mean?

That's much more consistent with the API; I suspect we don't implement average because it's just confusing.

@jreback
Contributor

jreback commented Jul 24, 2015

And this needs to trickle down to nanops.py, where all of the actual computation is done; it handles many different dtypes.

@benjello
Contributor

@bgrayburn: I think @jreback's suggestion is worth following: mean with a weights argument is what you'd expect for a weighted mean.

@mattayes
Contributor

+1 Would make working with microdata samples (think PUMS) so much nicer.

@shoyer
Member

shoyer commented Jul 29, 2015

I agree that it would be better to incorporate this directly into aggregation functions like mean and var instead of adding specialized methods like average.

@johne13

johne13 commented Feb 27, 2016

+1 to @mattayes & @shoyer -- when working with weighted data you want to weight virtually every statistic and graph that you generate, so some weighting option is a practical necessity for such data.

Adding a weights argument to as many functions as possible over time sounds like the way to go to the extent it isn't going to be handled in numpy/statsmodels/matplotlib.

Stata, for example, allows a weight option for practically all functions. I use Stata's tabstat with the weight option very frequently, and at the moment there isn't any good analog in pandas that I know of.

@johne13

johne13 commented Feb 28, 2016

A possible complication to consider: there are potentially different kinds of weights. Stata, for example, defines four types of weights: frequency, analytical, probability, and importance (although the last one is just an abstract catch-all). [http://www.stata.com/help.cgi?weight]

I'm thinking that in this thread most people have frequency weights in mind, but it might be necessary to clarify this. Also, the distinction probably won't matter for something like mean or median, but it could affect something like variance.
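
To illustrate the variance point, a minimal numpy sketch (my own, following Stata's fw/aw definitions; the mean is identical under both conventions):

import numpy as np

x = np.array([1.0, 2.0, 4.0])
w = np.array([2.0, 3.0, 5.0])

m = np.average(x, weights=w)        # 2.8 under either weight type
sq = (w * (x - m) ** 2).sum()       # weighted sum of squared deviations

# frequency weights: each row counts w[i] times, so n = w.sum()
var_fw = sq / (w.sum() - 1)                      # 15.6 / 9 = 1.7333...

# analytic weights: rescale weights to sum to len(x), then use n - 1
var_aw = sq * len(x) / w.sum() / (len(x) - 1)    # 2.34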

@shoyer
Member

shoyer commented Mar 17, 2016

There has been some recent discussion about implementing efficient algorithms for weighted partition (e.g., to do weighted median) upstream in NumPy, as well:
https://mail.scipy.org/pipermail/numpy-discussion/2016-February/075000.html

In any case, a first draft that uses sorting to do weighted median would still be valuable.

@max-sixty
Contributor

From pydata/xarray#650:

How about designing this as a groupby-like interface? In the same way as .rolling (or .expanding & .ewm in pandas)?

So for example ds.weighted(weights=ds.dim).mean().

And then this is extensible, clean, pandan-tic.
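
As a rough sketch of how such an interface could compose (the Weighted class below is invented for illustration; it is not an existing pandas API):

import numpy as np
import pandas as pd

class Weighted:
    def __init__(self, obj, weights):
        self._obj = obj
        self._w = np.asarray(weights, dtype=float)

    def mean(self):
        # column-wise sum(w * x) / sum(w)
        return self._obj.mul(self._w, axis=0).sum() / self._w.sum()

    def var(self, ddof=1):
        # frequency-weight convention: denominator is sum(w) - ddof
        dev = (self._obj - self.mean()) ** 2
        return dev.mul(self._w, axis=0).sum() / (self._w.sum() - ddof)

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
Weighted(df, [2, 1, 1]).mean()   # x    1.75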

@jreback
Contributor

jreback commented May 11, 2016

what other things would you do with a .weighted(..).mean() interface?

IOW what other parameters would it accept aside from the actual weights?

@shoyer
Member

shoyer commented May 11, 2016

@jreback I think .weighted() would only accept weights, which could be either an array or a callable of the usual form (lambda df: ....). But the WeightedMethods class could also expose weighted implementations of other methods, such as std, var, median, sum, value_counts, hist, etc. I would even consider moving over sample and deprecating the weights argument.

@jreback
Contributor

jreback commented May 11, 2016

@shoyer I can see some syntax from that e.g.

df.weighted('A').B.mean() is pretty clear

though df.B.mean(weights=df.A) is just as clear, so looking for a case where this is significantly nicer.

Any idea how/whether R does this? (Julia?)

@benjello
Contributor

benjello commented May 11, 2016

I used R's wtd.stats functions:

wtd.mean(x, weights=NULL, normwt="ignored", na.rm=TRUE)
wtd.var(x, weights=NULL, normwt=FALSE, na.rm=TRUE)
wtd.quantile(x, weights=NULL, probs=c(0, .25, .5, .75, 1), 
             type=c('quantile','(i-1)/(n-1)','i/(n+1)','i/n'), 
             normwt=FALSE, na.rm=TRUE)
wtd.Ecdf(x, weights=NULL, 
         type=c('i/n','(i-1)/(n-1)','i/(n+1)'), 
         normwt=FALSE, na.rm=TRUE)
wtd.table(x, weights=NULL, type=c('list','table'), 
          normwt=FALSE, na.rm=TRUE)
wtd.rank(x, weights=NULL, normwt=FALSE, na.rm=TRUE)
wtd.loess.noiter(x, y, weights=rep(1,n), robust=rep(1,n), 
                 span=2/3, degree=1, cell=.13333, 
                 type=c('all','ordered all','evaluate'), 
                 evaluation=100, na.rm=TRUE)

@jreback
Contributor

jreback commented May 11, 2016

@benjello hmm, that's interesting.

@shoyer
Member

shoyer commented May 11, 2016

though df.B.mean(weights=df.A) is just as clear, so looking for a case where this is significantly nicer.

On the face of it, this does look as nice. But from an API design perspective, adding a keyword argument for weights is much less elegant.

Being "weighted" is orthogonal to the type of statistical calculation. With this proposal, instead of adding the weights keyword argument to N different methods, we define a single weighted method, and add statistical methods to it that exactly match the signature of the same methods on DataFrame/Series. This makes it obvious that all these methods share the same approach, and keeps method signatures from growing additional arguments that trigger entirely independent code paths (which is a sign of code smell).

Separately: perhaps weightby is a slightly better name than weighted? It suggests more similarity to groupby.
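
Sketching how that could read in practice (hypothetical names, purely illustrative):

# weights live in one place; the statistical methods keep their usual signatures
df.weightby(df["w"]).mean()
df.weightby(lambda d: d["w"]).var()
df.weightby("w").quantile(0.5)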

jreback added a commit to jreback/pandas that referenced this issue Jan 2, 2017
@chris-b1
Contributor

chris-b1 commented Apr 5, 2017

Interesting package addressing weighted calculations here - https://github.com/jsvine/weightedcalcs. Has an api along the lines of weightby.

@rbiswas4

I am looking for weighted means in groupby aggregates, where the weights are another column of the dataframe. Does this thread cover support for that? To make my question explicit:

lcc:

           snid       mjd           band   flux          weights
obsHistID
609734     141374000  60493.416959  lsstg  2.825651e-09  6.442312e+20
609733     141374000  60493.416511  lsstg  2.893961e-09  5.962141e+20
609732     141374000  60493.416062  lsstg  2.834461e-09  6.590458e+20
...
611542     141374000  60495.426047  lssti  6.722778e-09  1.307280e+20
610790     141374000  60494.432074  lsstz  6.619978e-09  6.156260e+19

and I do operations like:

grouped = lcc.groupby(['snid', 'band', 'night'])
res = grouped.agg(dict(flux=np.mean))

What I really want is a fast version of:

res = grouped.agg(dict(flux=weightedmean(weights='weights')))

The real problem is that this requires two columns as aggregate input. I have looked at workarounds like the ones suggested here, but I find them to be slow:

In my case, when the original dataframe has about ~10000 rows and 10000 groups, a direct np.mean aggregate times to a best of 1.6 ms per loop, while running through the workaround takes ~2 min per loop. Is there an implementation/workaround that will speed this up for me?

@ilemhadri

Same problem here.
At the moment I am using .apply as a workaround, but this is slow.
Being able to call .agg on groupby objects using more than one column as input would be highly appreciated.

@chris-b1
Contributor

chris-b1 commented Aug 3, 2017

You can improve performance significantly by not using apply, and instead building the calculation out of existing vectorized ops. Example for mean below.

In [49]: df = pd.DataFrame({'id': np.repeat(np.arange(100, dtype='i8'), 100),
    ...:                    'v': np.random.randn(10000),
    ...:                    'w': np.random.randn(10000)})


In [46]: %timeit mean1 = df.groupby('id').apply(lambda x: (x['v'] * x['w']).sum() / x['w'].sum())
32.4 ms ± 2.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [47]: %%timeit
    ...: df['interim'] = df['v'] * df['w']
    ...: gb = df.groupby('id')
    ...: mean2 = gb['interim'].sum() / gb['w'].sum()
1.21 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [48]: np.allclose(mean1, mean2)
Out[48]: True
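
One caveat worth noting (my addition, not part of the benchmark above): if the value column can contain NaN, those rows' weights should arguably be excluded from the denominator, e.g.:

# mask out weights whose value is missing so they don't inflate the denominator
valid_w = df['w'].where(df['v'].notna())
df['interim'] = df['v'] * df['w']
gb = df.groupby('id')
mean3 = gb['interim'].sum() / valid_w.groupby(df['id']).sum()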

@johne13

johne13 commented Nov 18, 2017

This is so complicated between weight types, API, etc. but just to chime in while stuff is fresh in my mind:

From a statistical point of view, basic statistics seem to break down into 2 categories:

  1. means and order statistics (min/max/median/quantiles)
  2. higher order moments (std dev/variance/skew/kurtosis)

I believe that the first category is very straightforward to handle. I'm an amateur statistician at best, but I think there is really only one basic way to calculate mean/max/median etc.

Conversely, std dev & variance are a lot more complicated than you might think -- not that the math is that hard, but more that "std dev" can mean more than one thing here. Really great article here that lays out the issues:
https://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/

For example, if you type these two commands in Stata:
sum x [fw=weight], detail
sum x [aw=weight], detail
you'll get the same results for all stats except std & var

Also, to the extent pandas is handing this sort of thing off to statsmodels, they do have a library here that does some weighting for most basic stats (although min & max seem to be missing). See this link for more (a recent answer I wrote at SO using the statsmodels library):

https://stackoverflow.com/questions/17689099/using-describe-with-weighted-data-mean-standard-deviation-median-quantil/47368071#47368071
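
For example, a minimal usage sketch of that statsmodels class (DescrStatsW; its ddof follows the frequency-weight convention):

import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

x = np.array([1.0, 2.0, 4.0])
w = np.array([2.0, 3.0, 5.0])

d = DescrStatsW(x, weights=w, ddof=1)
d.mean  # sum(w * x) / sum(w)
d.std   # sqrt(sum(w * (x - mean)**2) / (sum(w) - ddof))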

@randomgambit

Look, I'm sorry, but this is largely incorrect. Weighted statistics are really basic stuff.

@johne13

johne13 commented Nov 18, 2017

@randomgambit

OK, then which of these is correct?

sum x [fw=weight], detail
sum x [aw=weight], detail

@kdebrab
Contributor

kdebrab commented Jun 12, 2018

FWIW:
I needed a resampled weighted quantile and implemented it as follows.

def resample_weighted_quantile(frame, weight=None, rule='D', q=0.5):
    if weight is None:
        return frame.resample(rule).apply(lambda x: x.quantile(q))
    else:
        data = [series.resample(rule)
                      .apply(_weighted_quantile, weight=weight[col], q=q)
                      .rename(col)
                for col, series in frame.items()]
        return pd.concat(data, axis=1)

def _weighted_quantile(series, weight, q):
    # sort the values, accumulate their weights, and return the first value
    # whose cumulative weight reaches the q-th fraction of the total weight
    series = series.sort_values()
    cumsum = weight.reindex(series.index).cumsum()
    cutoff = cumsum.iloc[-1] * q
    return series[cumsum >= cutoff].iloc[0]

frame and weight are dataframes with the same index and columns. It could probably be optimized, but at least it works.
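
A minimal usage sketch under those assumptions (synthetic data; names are mine):

import numpy as np
import pandas as pd

idx = pd.date_range('2018-06-01', periods=48, freq='H')
frame = pd.DataFrame({'a': np.random.randn(48)}, index=idx)
weight = pd.DataFrame({'a': np.random.rand(48)}, index=idx)

daily_median = resample_weighted_quantile(frame, weight, rule='D', q=0.5)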

@Heuertje

Heuertje commented Aug 8, 2018

This would be a great addition to pandas!

@JacekPliszka

I would love to have such functionality; it is one of the things I sorely miss in pandas in comparison to R.
My use case needs different weights for different columns, so something like this would be great:

df.groupby(...).mean(weights={'a': df.d, 'b': df.e})
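
Lacking that, a sketch of the same per-column computation with existing groupby ops (assuming a grouping key g and the columns named above):

weights = {'a': df.d, 'b': df.e}
out = pd.DataFrame({
    col: (df[col] * w).groupby(df.g).sum() / w.groupby(df.g).sum()
    for col, w in weights.items()
})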

@jreback
Contributor

jreback commented Mar 11, 2021

this is already available in master / 1.3, see https://pandas.pydata.org/pandas-docs/dev/user_guide/window.html?highlight=weighted_mean#overview, look for weighted_mean.
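
For reference, the weighted-window path looks like this (win_type draws its weights from scipy's window functions, so scipy must be installed):

import pandas as pd

s = pd.Series(range(10), dtype=float)
s.rolling(window=5, win_type='triang').mean()   # triangular-weighted rolling mean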

@JacekPliszka

Thank you, though I have one small problem with that: I would like to apply it after a groupby that does not have a fixed size.

@MaxGhenis

MaxGhenis commented Mar 12, 2021

I'm involved in a project to extend pandas to include weights: https://github.com/PSLmodels/microdf

This allows normal interaction with many pandas functions, including mean and groupby, after defining weights upfront, e.g. (notebook):

import microdf as mdf

d = mdf.MicroDataFrame({"x": [1, 2, 3], "g": ["a", "a", "b"]}, weights=[2, 1, 1])
d.groupby("g").x.mean()

Result:

g
a    1.333333
b    3.000000
dtype: float64

It currently only supports one set of weights, so @JacekPliszka's request would require a post-hoc merge.

Of course, we'd gladly shut it down if pandas included it natively at some point :)

@max-sixty
Contributor

As a reference, xarray supports weighted with a groupby-like .weighted method: http://xarray.pydata.org/en/stable/examples/area_weighted_temperature.html#Weighted-mean


@lababidi

What's the status on this issue? Any extra help needed? Or is this implemented?

@benjello
Contributor

@lababidi: not implemented.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel jbrockmendel added Reduction Operations sum, mean, min, max, etc. and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels Mar 28, 2023
@nachomaiz

nachomaiz commented Apr 11, 2025

Hi!

Are there any plans to eventually review this feature?

Just wanted to add a note to the conversation, as I unfortunately was a bit misled by it.

The original comment by @shoyer had a bit of an issue (might want to make a quick note/edit since it's the top comment 😊):

Why not just write:

pd_avg = (np.array(w) * pandas.DataFrame(a)).mean(axis=1)

This doesn't really apply a weighted mean in general: the denominator should be the sum of weights across all valid values, not the count of valid values that the .mean() method uses. (The snippet happens to give the right answer in the original example only because those weights sum to 1.)

Here's what I wrote, which also should handle missing values appropriately:

(
    data[
        [
            ..., # variables
            "weight",  # must include weights!
        ]
    ]
    .groupby("groups")
    .apply(  # apply to each group df
        lambda df: df.apply(  # apply to each col within df
            lambda col: col.mul(df["weight"], axis=0).sum()
            # use `.mask(col.isna())` so weights for missing values don't count towards the denominator
            / df["weight"].mask(col.isna()).sum()
        )
    )
    .drop(columns="weight")  # drop weights after aggregation
)

This of course feels very clunky, so I would also favor the .weighted or .weightby API that was mentioned above.

Hopefully this can help a bit with implementation once it happens. 👍
