-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: low variance arrays' kurtosis wrongfully set to zero #58176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
dontgoto
wants to merge
13
commits into
pandas-dev:main
from
dontgoto:240323_57972_kurtosis_low_value_cutoff
Closed
Changes from 10 commits
Commits
Show all changes
13 commits
Select commit
Hold shift + click to select a range
9923460
add check for not completely unstable parameter range
dontgoto d413792
refine cutoff for small arrays
dontgoto 15aa86e
add cutoff in cython
dontgoto 79e3381
add test
dontgoto 2e67416
add comment and code path
dontgoto 2a958af
Merge remote-tracking branch 'upstream/main' into 240323_57972_kurtos…
dontgoto b030d9b
add whatsnew
dontgoto 6c807cd
remove skew
dontgoto 7a65c8b
update threshold
dontgoto ac68587
update comments
dontgoto c96fc4f
Update v3.0.0.rst
dontgoto af13513
Revert aggregations.pyx
dontgoto 35b3c70
Merge branch 'main' into 240323_57972_kurtosis_low_value_cutoff
dontgoto File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this constant? The minimum number of significant decimal digits that have guaranteed precision for a double is 15, which I assume is where
1e-14
came from.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the variance of the observations. The e-14 cutoff is too conservative and also sets numerically stable results to NaN, e.g. when the mean of the observations is very low.
I did some more testing after your comment and setting the cutoff as low as in nanops (e-281) prevents the false positives, but it also lets numerically unstable results pass, so I reverted it. Schemes that take into account the mean etc. of the observations in the cutoff were also not really satisfactory.
I'll look into this and make another PR if I find some satisfactory solution for the equation in the .pyx here. Numerically it behaves very different from the one in nanops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since double precision is only guaranteed up to 15 significant decimal digits across implementations choosing anything smaller than
1e-14
is not going to workThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The e-14 might make one think that it is about a float's number of significant digits, but
B <= e-14
only checks for potential numerical instabilities irrespective of that.E.g.,
Is numerically unstable, though the variance is larger than e-14 (rolling kurt = e15).
On the other hand
Is numerically stable, but the variance is smaller than e-14 (rolling kurt = 2.5).
In nanops the equations become unstable only around e-281 in my tests, but here it's more complex.
Example code:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a whitepaper that lays out what you are trying to accomplish? The problem with your local results is that it is dependent on your hardware, and floating point implementations an vary.
I don't believe that a statement like
# scipy.kurt is nan at e-81
is generally True (nan can be generated from quite a few different patterns, although technically platforms should be choosing one canonical pattern), and thee-72
ande-281
sentinels seem arbitraryThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also agree adjusting this arbitrary limit is not ideal. IMO ideally we shouldn't have one in the first place. We removed a similar arbitrary limit for var and std a few releases ago #40505
I would support just removing this limit and just documenting floating point precision artifacts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points @WillAyd, @mroeschke.
I was setting the limit(s) to better reflect the actual stability range of the calculations, but the limit is arbitrary due to dependencies on inputs as well as machine platforms.
For the nanops version I would agree with removing the check, since it was introduced before the equation there was stabilised. From my tests the kurt calculation there is about as stable as the scipy implementation and only becomes unstable for very extreme cases.
From my tests, the kurt implementation in
aggregation.pyx
here seems comparatively much more unstable. The equations involved have a lot of potential cancellations. I would suggest first stabilising the equations there. That's something I could do.