QST: Consistently apply Welford Method and Kahan Summation in roll_xxx functions #59715

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Open

2 tasks done

kaixiongg opened this issue Sep 5, 2024 · 7 comments

Open

2 tasks done

QST: Consistently apply Welford Method and Kahan Summation in roll_xxx functions #59715

kaixiongg opened this issue Sep 5, 2024 · 7 comments

Labels

API Design Needs Discussion Performance Reduction Operations

kaixiongg commented Sep 5, 2024

Research

I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/78951576/pandas-consistently-apply-welford-method-and-kahan-summation-in-roll-xxx-functi

Question about pandas

For the following functions:

1.def nancorr
2.cdef void add_var
3.cdef void add_skew
4.cdef void add_mean

It appears that both the Welford method and Kahan summation are taken into account. However, for second-order functions like correlation and variance, only the Welford method is used without Kahan summation for the means (meanx or meany). For third-order functions like skewness, only Kahan summation for the naive one-pass algorithm is employed without using Welford.

My question is: How does the Pandas community decide which method to use for stable precision? If our goal is to achieve the highest possible stability, it seems that all these functions should utilize a combination of Welford and Kahan methods.

Could you please clarify the rationale behind these choices?

kaixiongg added Needs Triage Usage Question labels

kaixiongg changed the title ~~QST: Consistently Apply Welford Method and Kahan Summation in roll_xxx Functions~~ QST: Consistently apply Welford Method and Kahan Summation in roll_xxx functions

Contributor

Liam3851 commented Sep 5, 2024

Seems related to #57972 and #59572 (perhaps root-cause of the above issues?).

Author

kaixiongg commented Sep 6, 2024

Seems related to #57972 and #59572 (perhaps root-cause of the above issues?).

So, does that mean only applying Welford for those online functions is enough? We don't need Kahan summation for those complicated multi-order functions?

Contributor

Liam3851 commented Sep 9, 2024

@kaixiongg I was just linking those issues because multiple users have noted that the pandas kurtosis functions seem less stable than similar functions from e.g. scipy. To the extent that some stability techniques are not being used that may be the root cause of the kurtosis issues reported.

Author

kaixiongg commented Sep 10, 2024

@kaixiongg I was just linking those issues because multiple users have noted that the pandas kurtosis functions seem less stable than similar functions from e.g. scipy. To the extent that some stability techniques are not being used that may be the root cause of the kurtosis issues reported.

Thanks for clarifying. I have also noticed that roll_var/std face similar issues. (#52407 + #54518 + #53829) Could you please reopen those issues? I would appreciate it if they could be reassigned to someone who can address them. Additionally, if there are any better algorithms or solutions available, I would be glad to hear about them.

Member

rhshadrach commented Sep 15, 2024

I think PRs to improve the consistency of numerical operations used would be welcome, provided they do not significantly degrade performance.

Below are my thoughts on the issue, but I haven't been too involved in this aspect in the past. cc @pandas-dev/pandas-core for any thoughts.

How does the Pandas community decide which method to use for stable precision?

I believe thus far it has been done on an ad-hoc basis, mostly dependent on whether there is a contributor interested in improving the stability of an operation.

If our goal is to achieve the highest possible stability

I would say this is not our goal. Rather, it is to balance stability and performance. However I do not know of any objective criteria utilized in doing so, to my awareness it has been done on an ad-hoc basis.

My ideal solution would be to allow the user to choose between reasonable implementations of these methods, defaulting to a numerically stable operation but allowing less stable options for more performance. However, I worry about the maintenance burden this would impose.

rhshadrach added Needs Discussion Reduction Operations API Design Performance and removed Needs Triage Usage Question labels

Author

kaixiongg commented Sep 18, 2024 •

edited

Loading

I would say this is not our goal. Rather, it is to balance stability and performance. However I do not know of any objective criteria utilized in doing so, to my awareness it has been done on an ad-hoc basis.

I agree that our goal is to balance stability and performance. However, the precision issue is more of an optimization problem rather than a right/wrong issue. Some issues show that 'incorrect' results are due to precision loss. Even with the most precise algorithm we can implement, there will always be some corner cases where the result appears 'incorrect.' From my understanding, Welford’s method is commonly used for rolling calculations. If we were to add Kahan summation to these algorithms, we’d need to apply it to each operation, which seems like it would come with a significant performance cost, and still wouldn't completely solve the 'bug' caused by precision loss.

My ideal solution would be to allow the user to choose between reasonable implementations of these methods, defaulting to a numerically stable operation but allowing less stable options for more performance. However, I worry about the maintenance burden this would impose.

Yes, that would be great, but it seems like most open-source packages don’t take that approach.

Member

rhshadrach commented Sep 22, 2024

I have also noticed that roll_var/std face similar issues. (#52407 + #54518 + #53829) Could you please reopen those issues?

The two that are issues are open (and have never been closed); the third is an old PR which I believe should stay closed.

which seems like it would come with a significant performance cost...

It seems to me this should be experimented with to determine what the performance cost is.

...and still wouldn't completely solve the 'bug' caused by precision loss.

This isn't possible to do, so I don't think this should enter into consideration when evaluating a method for improving numerical stability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment