Why this discussion
Having looked around, I was surprised not to find any real discussion of per-sample weights for pandas DataFrames, so I am opening this thread to have one. (If I am mistaken and the discussion exists somewhere, I must have missed it, and this can gladly be closed.)
Per sample weights
A data point can have a weight assigned to it reflecting the "count" of that point. Weighted data samples are not uncommon in data science (many machine learning APIs, such as scikit-learn, Keras, ..., accept them) and are an essential part of the data (insofar as a data point can be regarded as a tuple of two elements: the data and the weight).
A data weight changes many things fundamentally: plotting of a histogram or scatter plot (or others), calculation of quantities such as mean, std, ...
Pandas, AFAIK, does not offer any support for dealing with weights, so calculating e.g. a weighted variance means that one needs to use a function from another library.
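To make the workaround concrete, here is a minimal sketch of computing a weighted mean and variance with NumPy on top of a DataFrame. The column names `x` and `w` are made up for illustration, and the weights are treated as counts, as described above.

```python
import numpy as np
import pandas as pd

# Illustrative data: column "w" holds the per-sample weight ("count") of each row.
df = pd.DataFrame({"x": [1.0, 2.0, 4.0], "w": [1.0, 2.0, 1.0]})

# Weighted mean: sum(w * x) / sum(w)
weighted_mean = np.average(df["x"], weights=df["w"])

# Weighted (population) variance: sum(w * (x - mean)^2) / sum(w)
weighted_var = np.average((df["x"] - weighted_mean) ** 2, weights=df["w"])

# With count-like weights this matches repeating each row "w" times.
expanded = np.repeat(df["x"].to_numpy(), df["w"].astype(int).to_numpy())
```

Here `expanded` is `[1, 2, 2, 4]`, and its plain mean and population variance agree with the weighted results, which is exactly the "count" interpretation of a weight.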
What it could look like
DataFrame could take an additional argument "weights" and have an attribute with the same name that is technically a Series and behaves like another column in the DataFrame (broadcasting, index) but is not accessible via normal indexing.
A lot of the behavior is straightforward (though each point has its open questions):
any kind of slicing will always return a weighted dataframe as well (but what about concat? like two different columns)
plots will use weights (but what if weights can't be used?)
calculated quantities will use the weighted version (what if not available? Error?)
Other libraries (such as sklearn) could access them directly
API breaking implications
Since it would constitute a completely new feature, raising errors where weights can't be handled, or other changes in behavior, would not break any existing code (modulo a user-defined attribute "weights", but these users will likely want the new feature).
Describe alternatives you've considered
This is of course always achievable in other ways, such as calling matplotlib directly, passing a weights argument explicitly, etc. But this can be said about anything in pandas, given that it is here to help with tabular data.
In fact, that would be the main reason to kick off the discussion: if one has weighted data, pandas (AFAIK) does not support it at all, as it doesn't know about the weights (in the sense that many features cannot simply be used). Adding them would make pandas a platform that can also handle weighted data.
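As an illustration of the "pass a weights argument explicitly" route, here is a small sketch using `numpy.histogram`, which accepts a `weights` parameter. The column names and bin edges are invented for the example.

```python
import numpy as np
import pandas as pd

# Illustrative data: "w" is the per-sample weight of each row.
df = pd.DataFrame({"x": [1.0, 2.0, 4.0], "w": [1.0, 2.0, 1.0]})
bin_edges = [0.0, 3.0, 6.0]

# Unweighted: every row counts once.
plain_counts, _ = np.histogram(df["x"], bins=bin_edges)

# Weighted: every row contributes its weight; the weights have to be
# pulled out of the frame and passed along by hand.
weighted_counts, _ = np.histogram(df["x"], bins=bin_edges, weights=df["w"])
```

The point is that the weights have to travel separately from the data at every call site, which is what a weights-aware DataFrame would avoid.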
Discussion
Needless to say, there are many cases that will need to be sorted out and many fundamental objections. To be clear: this post is not saying we should introduce weights. But it opens the discussion, since
weights are a common occurrence in data science
weights are an essential part of a dataset as they change the meaning and a lot of functionality. They are not merely a specialized attribute that can be handled as meta-data but fundamentally part of a data point.
pandas makes analysis of (tabular) data easier by offering handles for it, yet (AFAIK) does not support weights, offering at most minimal help with weighted data
It doesn't break any backwards compatibility and could be gradually introduced (with methods first erroring if weights are not yet supported).
I am undecided myself: it seems to me at once an unreasonably big change in so many places that it will, realistically, not happen, and yet absolutely safe (no break, not every feature needs to be supported, no conflict) and a fundamental property of data.
What are your thoughts on this?
Sounds interesting, but seems like it might add a bunch of complexity if added directly to DataFrame since for a bunch of places we'd now need to have conditional logic for the presence of weights. Since most users likely wouldn't specify weights, what about instead creating something like WeightedDataFrame which subclasses DataFrame and can override methods for which weights matter?
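A rough sketch of what such a subclass might look like, using pandas' documented subclassing hooks (`_metadata` and `_constructor`). Everything else here is an assumption for illustration: `WeightedDataFrame` and `weight_col` are hypothetical names, and the weights are kept as an ordinary column whose name is stored as metadata, rather than as the hidden Series proposed in the issue. Missing values are ignored.

```python
import pandas as pd


class WeightedDataFrame(pd.DataFrame):
    """Hypothetical sketch; pandas itself ships no such class."""

    # Names listed here are carried over to results by pandas' __finalize__.
    _metadata = ["weight_col"]

    def __init__(self, *args, weight_col=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.weight_col = weight_col

    @property
    def _constructor(self):
        # Make pandas operations return WeightedDataFrame instead of DataFrame.
        return WeightedDataFrame

    def mean(self, *args, **kwargs):
        # Fall back to the plain mean whenever weights cannot be applied
        # or the caller passes extra arguments this sketch does not handle.
        has_weights = self.weight_col is not None and self.weight_col in self.columns
        if not has_weights or args or kwargs:
            return super().mean(*args, **kwargs)
        w = self[self.weight_col]
        values = self.drop(columns=self.weight_col)
        # Column-wise weighted mean: sum(w * x) / sum(w)
        return values.mul(w, axis=0).sum() / w.sum()


wdf = WeightedDataFrame({"x": [1.0, 2.0, 4.0], "w": [1.0, 2.0, 1.0]}, weight_col="w")
full_mean = wdf.mean()["x"]

# Row selection keeps both the subclass and the weight column.
subset = wdf[wdf["x"] > 1.0]
subset_mean = subset.mean()["x"]
```

Keeping the weights as a real column means slicing carries them along for free, at the cost of them being visible through normal indexing. A hidden, index-aligned Series as in the original proposal would need extra care to stay aligned after every operation.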
It appears that this request is largely similar to #10030, which includes API design considerations. Let's centralize discussion in that issue, so closing this one as a duplicate.