Skip to content

QST: Consistently apply Welford Method and Kahan Summation in roll_xxx functions #59715

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
2 tasks done
kaixiongg opened this issue Sep 5, 2024 · 7 comments
Open
2 tasks done
Labels
API Design Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance Reduction Operations sum, mean, min, max, etc.

Comments

@kaixiongg
Copy link

Research

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/78951576/pandas-consistently-apply-welford-method-and-kahan-summation-in-roll-xxx-functi

Question about pandas

For the following functions:

1.def nancorr
2.cdef void add_var
3.cdef void add_skew
4.cdef void add_mean

It appears that both the Welford method and Kahan summation are taken into account. However, for second-order functions like correlation and variance, only the Welford method is used without Kahan summation for the means (meanx or meany). For third-order functions like skewness, only Kahan summation for the naive one-pass algorithm is employed without using Welford.

My question is: How does the Pandas community decide which method to use for stable precision? If our goal is to achieve the highest possible stability, it seems that all these functions should utilize a combination of Welford and Kahan methods.

Could you please clarify the rationale behind these choices?

@kaixiongg kaixiongg added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Sep 5, 2024
@kaixiongg kaixiongg changed the title QST: Consistently Apply Welford Method and Kahan Summation in roll_xxx Functions QST: Consistently apply Welford Method and Kahan Summation in roll_xxx functions Sep 5, 2024
@Liam3851
Copy link
Contributor

Liam3851 commented Sep 5, 2024

Seems related to #57972 and #59572 (perhaps root-cause of the above issues?).

@kaixiongg
Copy link
Author

Seems related to #57972 and #59572 (perhaps root-cause of the above issues?).

So, does that mean only applying Welford for those online functions is enough? We don't need Kahan summation for those complicated multi-order functions?

@Liam3851
Copy link
Contributor

Liam3851 commented Sep 9, 2024

@kaixiongg I was just linking those issues because multiple users have noted that the pandas kurtosis functions seem less stable than similar functions from e.g. scipy. To the extent that some stability techniques are not being used that may be the root cause of the kurtosis issues reported.

@kaixiongg
Copy link
Author

@kaixiongg I was just linking those issues because multiple users have noted that the pandas kurtosis functions seem less stable than similar functions from e.g. scipy. To the extent that some stability techniques are not being used that may be the root cause of the kurtosis issues reported.

Thanks for clarifying. I have also noticed that roll_var/std face similar issues. (#52407 + #54518 + #53829) Could you please reopen those issues? I would appreciate it if they could be reassigned to someone who can address them. Additionally, if there are any better algorithms or solutions available, I would be glad to hear about them.

@rhshadrach
Copy link
Member

I think PRs to improve the consistency of numerical operations used would be welcome, provided they do not significantly degrade performance.

Below are my thoughts on the issue, but I haven't been too involved in this aspect in the past. cc @pandas-dev/pandas-core for any thoughts.

How does the Pandas community decide which method to use for stable precision?

I believe thus far it has been done on an ad-hoc basis, mostly dependent on whether there is a contributor interested in improving the stability of an operation.

If our goal is to achieve the highest possible stability

I would say this is not our goal. Rather, it is to balance stability and performance. However I do not know of any objective criteria utilized in doing so, to my awareness it has been done on an ad-hoc basis.

My ideal solution would be to allow the user to choose between reasonable implementations of these methods, defaulting to a numerically stable operation but allowing less stable options for more performance. However, I worry about the maintenance burden this would impose.

@rhshadrach rhshadrach added Needs Discussion Requires discussion from core team before further action Reduction Operations sum, mean, min, max, etc. API Design Performance Memory or execution speed performance and removed Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Sep 15, 2024
@kaixiongg
Copy link
Author

kaixiongg commented Sep 18, 2024

I would say this is not our goal. Rather, it is to balance stability and performance. However I do not know of any objective criteria utilized in doing so, to my awareness it has been done on an ad-hoc basis.

I agree that our goal is to balance stability and performance. However, the precision issue is more of an optimization problem rather than a right/wrong issue. Some issues show that 'incorrect' results are due to precision loss. Even with the most precise algorithm we can implement, there will always be some corner cases where the result appears 'incorrect.' From my understanding, Welford’s method is commonly used for rolling calculations. If we were to add Kahan summation to these algorithms, we’d need to apply it to each operation, which seems like it would come with a significant performance cost, and still wouldn't completely solve the 'bug' caused by precision loss.

My ideal solution would be to allow the user to choose between reasonable implementations of these methods, defaulting to a numerically stable operation but allowing less stable options for more performance. However, I worry about the maintenance burden this would impose.

Yes, that would be great, but it seems like most open-source packages don’t take that approach.

@rhshadrach
Copy link
Member

I have also noticed that roll_var/std face similar issues. (#52407 + #54518 + #53829) Could you please reopen those issues?

The two that are issues are open (and have never been closed); the third is an old PR which I believe should stay closed.

which seems like it would come with a significant performance cost...

It seems to me this should be experimented with to determine what the performance cost is.

...and still wouldn't completely solve the 'bug' caused by precision loss.

This isn't possible to do, so I don't think this should enter into consideration when evaluating a method for improving numerical stability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance Reduction Operations sum, mean, min, max, etc.
Projects
None yet
Development

No branches or pull requests

3 participants