ENH/Discussion: "event"/"per sample" weights support #43960

Closed
jonas-eschle opened this issue Oct 10, 2021 · 3 comments
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@jonas-eschle

Why this discussion

Having looked around, I was surprised not to find any real discussion of per-sample weights for pandas DataFrames, so I am opening this thread to start one. (If I am mistaken and the discussion already exists somewhere, I must have missed it and this can gladly be closed.)

Per sample weights

A data point can have a weight assigned to it that reflects the "count" of that point. Weighted data samples are not uncommon in data science (many machine learning APIs, such as Scikit-learn and Keras, accept them) and are an essential part of the data, insofar as a data point can be regarded as a tuple of two values: the data and the weight.

A data weight changes many things fundamentally: the plotting of a histogram or scatter plot (among others) and the calculation of quantities such as mean, std, ...

Pandas, AFAIK, does not offer any support for dealing with weights, so calculating e.g. a weighted variance means using a function from another library.
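To make the point concrete, here is a minimal sketch of what "using another library" looks like today. It assumes the weights live in a separate Series aligned with the DataFrame (the names `df`, `weights` and the numbers are made up for illustration) and uses `numpy.average`, which accepts a `weights` argument. The variance shown is the plain weighted (population-style) variance, without any bias correction.

```python
import numpy as np
import pandas as pd

# Illustrative data: one column of values and one weight per row.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
weights = pd.Series([1.0, 1.0, 2.0, 4.0], index=df.index)

# pandas itself only knows the unweighted mean.
unweighted_mean = df["x"].mean()

# The weighted mean and variance have to come from NumPy.
weighted_mean = np.average(df["x"], weights=weights)
weighted_var = np.average((df["x"] - weighted_mean) ** 2, weights=weights)

print(unweighted_mean, weighted_mean, weighted_var)
```

Note that the weights have to be passed along by hand at every call, which is exactly the bookkeeping this issue is about.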

What it could look like

DataFrame could take an additional argument "weights" and expose an attribute of the same name that is technically a Series and behaves like another column of the dataframe (broadcasting, index) but is not accessible through normal indexing.

A lot of the behavior is straightforward (though each point has its open questions):

  • any kind of slicing will always return a weighted dataframe as well (but what about concat? like two different columns)
  • plots will use weights (but what if weights can't be used?)
  • calculated quantities will use the weighted version (what if not available? Error?)
  • Other libraries (such as sklearn) could access them directly

API breaking implications

Since it would constitute a completely new feature, raising errors where weights can't be handled (or other considerations of changed behavior) would not break any existing code (modulo a user-defined attribute "weights", but these users will likely want the new feature).

Describe alternatives you've considered

This is of course always achievable in other ways, such as calling matplotlib directly and passing a weights argument explicitly. But that can be said about anything in pandas, given that it is here to help with tabular data.

In fact, that is the main reason to kick off the discussion: if one has weighted data, pandas (AFAIK) does not support them at all, as it doesn't know about them (in the sense that many features cannot simply be used). Adding them would make pandas a platform that can also handle weighted data.
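For reference, the explicit route mentioned above looks roughly like this. The sketch assumes the weights are kept as an ordinary column (here called `w`, a made-up name) and handed to `numpy.histogram` through its `weights` parameter; matplotlib's `hist` takes a `weights` argument in the same spirit.

```python
import numpy as np
import pandas as pd

# Illustrative data: values in "x", per-sample weights in "w".
df = pd.DataFrame({"x": [0.5, 1.5, 1.5, 2.5], "w": [2.0, 1.0, 1.0, 0.5]})

# Each entry contributes its weight, not 1, to the bin it falls into.
counts, edges = np.histogram(
    df["x"], bins=3, range=(0.0, 3.0), weights=df["w"]
)

print(counts)  # sum of weights per bin
print(edges)   # bin edges: 0, 1, 2, 3
```

This works, but the DataFrame has no idea that `w` is special, so every plotting or statistics call has to be told about it again.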

Discussion

Needless to say, there are many cases that would need to be sorted out and many fundamental objections. To be clear: this post is not saying we should introduce weights. But it raises the question, since

  • weights are a common occurrence in data science
  • weights are an essential part of a dataset as they change the meaning and a lot of functionality. They are not merely a specialized attribute that can be handled as meta-data but fundamentally part of a data point.
  • pandas makes analysis of (tabular) data easier by offering handles for it, yet (AFAIK) offers at most minimal help with weighted data
  • It doesn't break any backwards compatibility and could be gradually introduced (with methods first erroring if weights are not yet supported).

I have not settled on a side myself: it seems to me at the same time an unreasonably big change in so many places that it will, realistically, not happen, and yet also absolutely safe (no breakage, not every feature needs to be supported, no conflict) and a fundamental property of data.

What are your thoughts on this?

jonas-eschle added the Enhancement and Needs Triage labels on Oct 10, 2021
jonas-eschle changed the title from ENH/Discussion: "event"/"per sample" weights to ENH/Discussion: "event"/"per sample" weights support on Oct 10, 2021
@jreback
Contributor

jreback commented Oct 10, 2021

see #10030

@mzeitlin11
Member

Sounds interesting, but seems like it might add a bunch of complexity if added directly to DataFrame since for a bunch of places we'd now need to have conditional logic for the presence of weights. Since most users likely wouldn't specify weights, what about instead creating something like WeightedDataFrame which subclasses DataFrame and can override methods for which weights matter?
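A rough sketch of what such a subclass might look like, purely for discussion. `WeightedDataFrame` and its `weights` argument are hypothetical and not part of pandas; the sketch relies only on the documented subclassing hooks (`_metadata` and `_constructor`) and overrides just the column-wise `mean`, ignoring NaN handling and all other methods.

```python
import pandas as pd


class WeightedDataFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass carrying one weight per row."""

    # Attributes listed here are copied to results of pandas operations.
    _metadata = ["weights"]

    def __init__(self, *args, weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        if weights is not None:
            # Store weights as a Series aligned to the row index.
            weights = pd.Series(weights, index=self.index)
        self.weights = weights

    @property
    def _constructor(self):
        # Make slices and copies come back as WeightedDataFrame.
        return WeightedDataFrame

    def mean(self, *args, **kwargs):
        # Fall back to the normal behavior when no weights are set.
        if self.weights is None:
            return super().mean(*args, **kwargs)
        # Weights are propagated unsliced, so realign them to the rows
        # that are actually present (assumes a unique index).
        w = self.weights.reindex(self.index)
        # Column-wise weighted mean; NaNs are not treated specially.
        return self.mul(w, axis=0).sum() / w.sum()


wdf = WeightedDataFrame({"x": [1.0, 2.0, 3.0, 4.0]}, weights=[1.0, 1.0, 2.0, 4.0])
full_mean = wdf.mean()["x"]
sliced_mean = wdf[wdf["x"] > 1].mean()["x"]
print(full_mean, sliced_mean)
```

Even this small sketch shows where the open questions sit: the weights have to be realigned after every row selection, and it is unclear what should happen on concat or for methods that have no weighted counterpart.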

@mroeschke
Member

It appears that this request is largely similar to #10030, which includes API design considerations. Let's centralize discussion in that issue, so closing this one as a duplicate.


4 participants