ENH/Discussion: "event"/"per sample" weights support #43960

jonas-eschle · 2021-10-10T21:06:46Z

Why this discussion

Having looked around, I was surprised not to find any real discussion of per-sample weights for pandas DataFrames, so I am opening this thread to start one. (If I am mistaken and the discussion already exists somewhere, I must have missed it and this can gladly be closed.)

Per sample weights

A data point can have a weight assigned to it reflecting the "count" of that point. Weighted data samples are not uncommon in data science (many machine learning APIs such as Scikit-learn and Keras accept them) and are an essential part of the data, insofar as a data point can be regarded as a tuple of two elements: the data and the weight.

A data weight changes many things fundamentally: plotting of a histogram or scatter plot (or others), calculation of quantities such as mean, std, ...

Pandas, AFAIK, does not offer any support for dealing with weights, so calculating e.g. a weighted variance means that one needs to use a function from another library.
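For illustration, a minimal sketch of that workaround, assuming the weights live in an ordinary column (the column names "x" and "w" and the values are made up for this example). It uses NumPy's `np.average` and `np.histogram`, both of which accept a `weights` argument:

```python
import numpy as np
import pandas as pd

# Illustrative data: "w" holds the per-sample weight of each row.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "w": [1, 1, 2, 4]})
x = df["x"].to_numpy()
w = df["w"].to_numpy()

# Unweighted mean from pandas: every row counts once.
plain_mean = df["x"].mean()

# Weighted mean via NumPy.
weighted_mean = np.average(x, weights=w)

# Weighted (population) variance around the weighted mean.
weighted_var = np.average((x - weighted_mean) ** 2, weights=w)

# With integer weights, this matches repeating each row "w" times.
repeated = np.repeat(x, w)

# Weighted histogram: each row adds its weight to the bin it falls into.
counts, edges = np.histogram(x, bins=[0.5, 1.5, 2.5, 3.5, 4.5], weights=w)
```

The `repeated` array shows the "count" interpretation: for integer weights, the weighted mean and variance agree with the plain statistics of the data with each row repeated according to its weight.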

What it could look like

DataFrame could take an additional argument "weights" and have an attribute with the same name that is technically a Series and behaves like another column in the dataframe (broadcasting, index) but is not accessible over the normal indexing.

A lot of the behavior is straightforward (though each point has its open questions):

any kind of slicing will always return a weighted DataFrame as well (but what about concat, e.g. of two different columns?)
plots will use weights (but what if weights can't be used?)
calculated quantities will use the weighted version (what if not available? Error?)
Other libraries (such as sklearn) could access them directly

API breaking implications

Since this would constitute a completely new feature, raising errors where weights can't be handled (or any other change in behavior for weighted frames) would not break any existing code, modulo a user-defined attribute "weights", but those users will likely want the new feature.

Describe alternatives you've considered

This is of course always achievable in other ways, such as calling matplotlib directly, passing a weights argument explicitly, etc. But the same can be said about anything in pandas, given that it is here to help with tabular data.

In fact, that would be the main reason to kick off the discussion: if one has weighted data, pandas (AFAIK) does not support it at all, as it doesn't know about the weights (in the sense that many features cannot simply be used). Adding them would make pandas a platform that can also handle weighted data.

Discussion

Needless to say, there are many cases that would need to be sorted out and many fundamental objections. To be clear: this post is not saying we should introduce weights. But it opens the discussion, since

weights are a common occurrence in data science
weights are an essential part of a dataset as they change the meaning and a lot of functionality. They are not merely a specialized attribute that can be handled as meta-data but fundamentally part of a data point.
pandas makes analysis of (tabular) data easier by offering handles for it, yet (AFAIK) offers little to no help with weighted data
It doesn't break any backwards compatibility and could be gradually introduced (with methods first erroring if weights are not yet supported).

I have not settled on a side myself: it seems to me at once an unreasonably big change in so many places that it will, realistically, not happen, and yet absolutely safe (no breakage, not every feature needs to be supported, no conflict) and a fundamental property of data.

What are your thoughts on this?

jreback · 2021-10-10T21:12:52Z

see #10030

mzeitlin11 · 2021-10-11T17:34:11Z

Sounds interesting, but seems like it might add a bunch of complexity if added directly to DataFrame since for a bunch of places we'd now need to have conditional logic for the presence of weights. Since most users likely wouldn't specify weights, what about instead creating something like WeightedDataFrame which subclasses DataFrame and can override methods for which weights matter?
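As a rough sketch of what such a subclass could look like, assuming pandas' documented subclassing hooks (`_constructor` and `_metadata`). The names `WeightedDataFrame`, `weights_col`, and `weighted_mean` are hypothetical and not part of pandas, and for simplicity the weights stay in a regular column rather than in a hidden attribute as the proposal suggests:

```python
import pandas as pd


class WeightedDataFrame(pd.DataFrame):
    """Hypothetical sketch: a DataFrame that remembers its weights column."""

    # Names listed in _metadata are carried over to derived objects.
    _metadata = ["weights_col"]

    # Class-level default so the attribute always exists.
    weights_col = None

    @property
    def _constructor(self):
        # Make slicing and similar operations return this subclass.
        return WeightedDataFrame

    def weighted_mean(self):
        """Per-column mean, weighting each row by the weights column."""
        if self.weights_col is None:
            raise ValueError("no weights column has been set")
        w = self[self.weights_col]
        values = self.drop(columns=self.weights_col)
        return values.mul(w, axis=0).sum() / w.sum()


# Illustrative usage with made-up data.
wdf = WeightedDataFrame({"x": [1.0, 2.0, 3.0, 4.0], "w": [1, 1, 2, 4]})
wdf.weights_col = "w"

full_mean = wdf.weighted_mean()

# A boolean slice is again a WeightedDataFrame.
subset = wdf[wdf["x"] > 1]
```

Only `weighted_mean` knows about the weights here; inherited methods such as `mean()` still treat "w" as ordinary data, which is the kind of method-by-method overriding the suggestion implies.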

mroeschke · 2021-10-16T04:08:20Z

It appears that this request is largely similar to #10030, which includes API design considerations. Let's centralize discussion in that issue; closing this one as a duplicate.

jonas-eschle added the Enhancement and Needs Triage labels Oct 10, 2021

jonas-eschle changed the title ~~ENH/Discussion: "event"/"per sample" weights~~ ENH/Discussion: "event"/"per sample" weights support Oct 10, 2021

mroeschke closed this as completed Oct 16, 2021
