-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
QST: Consistently apply Welford Method and Kahan Summation in roll_xxx functions #59715
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@kaixiongg I was just linking those issues because multiple users have noted that the pandas kurtosis functions seem less stable than similar functions from e.g. scipy. To the extent that some stability techniques are not being used that may be the root cause of the kurtosis issues reported. |
Thanks for clarifying. I have also noticed that roll_var/std face similar issues. (#52407 + #54518 + #53829) Could you please reopen those issues? I would appreciate it if they could be reassigned to someone who can address them. Additionally, if there are any better algorithms or solutions available, I would be glad to hear about them. |
I think PRs to improve the consistency of numerical operations used would be welcome, provided they do not significantly degrade performance. Below are my thoughts on the issue, but I haven't been too involved in this aspect in the past. cc @pandas-dev/pandas-core for any thoughts.
I believe thus far it has been done on an ad-hoc basis, mostly dependent on whether there is a contributor interested in improving the stability of an operation.
I would say this is not our goal. Rather, it is to balance stability and performance. However I do not know of any objective criteria utilized in doing so, to my awareness it has been done on an ad-hoc basis. My ideal solution would be to allow the user to choose between reasonable implementations of these methods, defaulting to a numerically stable operation but allowing less stable options for more performance. However, I worry about the maintenance burden this would impose. |
I agree that our goal is to balance stability and performance. However, the precision issue is more of an optimization problem rather than a right/wrong issue. Some issues show that 'incorrect' results are due to precision loss. Even with the most precise algorithm we can implement, there will always be some corner cases where the result appears 'incorrect.' From my understanding, Welford’s method is commonly used for rolling calculations. If we were to add Kahan summation to these algorithms, we’d need to apply it to each operation, which seems like it would come with a significant performance cost, and still wouldn't completely solve the 'bug' caused by precision loss.
Yes, that would be great, but it seems like most open-source packages don’t take that approach. |
The two that are issues are open (and have never been closed); the third is an old PR which I believe should stay closed.
It seems to me this should be experimented with to determine what the performance cost is.
This isn't possible to do, so I don't think this should enter into consideration when evaluating a method for improving numerical stability. |
Research
I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.
Link to question on StackOverflow
https://stackoverflow.com/questions/78951576/pandas-consistently-apply-welford-method-and-kahan-summation-in-roll-xxx-functi
Question about pandas
For the following functions:
1.def nancorr
2.cdef void add_var
3.cdef void add_skew
4.cdef void add_mean
It appears that both the Welford method and Kahan summation are taken into account. However, for second-order functions like correlation and variance, only the Welford method is used without Kahan summation for the means (meanx or meany). For third-order functions like skewness, only Kahan summation for the naive one-pass algorithm is employed without using Welford.
My question is: How does the Pandas community decide which method to use for stable precision? If our goal is to achieve the highest possible stability, it seems that all these functions should utilize a combination of Welford and Kahan methods.
Could you please clarify the rationale behind these choices?
The text was updated successfully, but these errors were encountered: