
BUG/ENH: GH10319 added higher_precision argument to rolling_mean/sum #10328


Closed

Conversation

kawochen
Contributor

This is not specifically for #10319, but it should close it for most cases, provided users pass in higher_precision=True. #10319 is not really a bug in the Cython code, just precision loss from doing finite-precision arithmetic naively.

>>> sum([0.00012456, 0.0003, -0.00012456, -0.0003])
-5.421010862427522e-20

The magnitude of this error is 2**-64 and quite negligible, although it can be made much larger with carefully chosen numbers. I think the rolling_* functions that require summation could use some more precision, since doing everything in one pass accumulates rounding error, even if the rolling window is narrow.
The only reason [0.00012456, 0.0003, 0.0, 0.0] currently returns the 'correct' result while [0.00012456, 0.0003, -0.0, -0.0] doesn't is that negative entries are explicitly counted. This counting is done in rolling_mean but not in rolling_sum, and the lack of symmetry doesn't seem desirable. Perhaps we could remove the counting, or add it to rolling_sum as well?
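
For illustration (this is not necessarily what this PR implements), a compensated Kahan sum is one standard way to keep extra precision in a running sum without switching to a wider dtype; on the example above the residue disappears:

def kahan_sum(values):
    # Compensated summation: 'c' carries the low-order bits that a plain
    # running sum throws away at each step.
    total = 0.0
    c = 0.0
    for v in values:
        y = v - c              # re-inject the part lost previously
        t = total + y          # low-order bits of y may be lost here
        c = (t - total) - y    # recover exactly what was lost
        total = t
    return total

>>> kahan_sum([0.00012456, 0.0003, -0.00012456, -0.0003])
0.0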

@jreback
Contributor

jreback commented Jun 11, 2015

This is going to kill performance. I also think an option to do this is really a bad API. I would only not be -1 if this could be done with another non-slow algorithm (e.g. like we did using the Welford method for variance).
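
For reference, a bare-bones sketch of the Welford-style recurrence being referred to here (illustrative Python only, not the Cython code pandas uses); it computes a variance in one pass without the cancellation problems of the naive sum-of-squares formula:

def welford_variance(values):
    # Track the running mean and the running sum of squared deviations (m2),
    # updating both for every new element.
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # note: uses the updated mean
    return m2 / (n - 1) if n > 1 else float('nan')

>>> welford_variance([4.0, 7.0, 13.0, 16.0])
30.0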

@kawochen
Contributor Author

Point taken. The algorithm itself doesn't seem that slow to me, though my implementation certainly is. I will write everything C-style, remove the option (no branching), and measure the performance.

@shoyer
Member

shoyer commented Jun 11, 2015

I agree with @jreback, I don't think this is the right API. Double-doubles are fine (in theory), but the right way to use them would be to cast your column to a double-double explicitly, e.g., via astype. But I would also want to see this float128 type in numpy first rather than trying to implement it in pandas.
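
A rough sketch of the explicit-cast usage described here, using numpy directly (pandas' rolling code does not accept extended precision at this point, and np.longdouble is only wider than float64 on some platforms, e.g. 80-bit extended on typical x86 Linux builds):

>>> import numpy as np
>>> vals = np.array([0.00012456, 0.0003, -0.00012456, -0.0003])
>>> print(vals.sum())                        # plain float64 accumulation
-5.421010862427522e-20
>>> print(vals.astype(np.longdouble).sum())  # widen explicitly first, e.g. via astype
0.0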

@jreback
Contributor

jreback commented Jun 11, 2015

actually float128 DOES exist in numpy (though it is not supported on some systems) - so better would be to add it to the code generators (but these would have to be protected by try/excepts where the signature doesn't exist) - maybe a better solution for that is to use ifdef, though
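
The platform caveat can also be seen from Python: np.float128 is simply not defined on builds whose C long double is 64-bit, so any float128-specialised path has to be optional (a small sketch, separate from the code-generator/ifdef change being discussed):

import numpy as np

# np.float128 only exists where the platform's long double is wider than
# 64 bits; elsewhere the attribute is missing entirely.
try:
    wide_dtype = np.float128
except AttributeError:          # e.g. typical Windows builds
    wide_dtype = np.float64     # fall back to ordinary doubles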

@kawochen
Contributor Author

@shoyer My concern is that the current rolling_sum/mean algorithm, although fast, introduces a lot more numerical instability than the most naive algorithm, without users being aware. Because the window slides by adding each entering value and subtracting each leaving one, the sum for the last window is the result of 2*n operations even if the window size is 1.
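
To make the 2*n point concrete, here is a rough pure-Python sketch of the sliding add/subtract update (the real implementation is Cython and also handles NaN counting and min_periods); every output, including the last one, inherits the rounding error of all earlier updates, while recomputing each window from scratch does not:

import numpy as np

def rolling_sum_sliding(x, window):
    # Add the entering value, subtract the leaving one: O(n) work in total,
    # but rounding error accumulates across the whole series.
    out = np.full(len(x), np.nan)
    acc = 0.0
    for i, v in enumerate(x):
        acc += v
        if i >= window:
            acc -= x[i - window]
        if i >= window - 1:
            out[i] = acc
    return out

def rolling_sum_naive(x, window):
    # Recompute every window from scratch: O(n*window) work, but the error
    # in one output never leaks into the next.
    out = np.full(len(x), np.nan)
    for i in range(window - 1, len(x)):
        out[i] = sum(x[i - window + 1 : i + 1])
    return out

x = [0.00012456, 0.0003, -0.00012456, -0.0003] * 1000
diff = np.abs(rolling_sum_sliding(x, 1) - rolling_sum_naive(x, 1))
print(diff.max())   # small but nonzero: accumulated rounding, even with window == 1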

@jreback
Contributor

jreback commented Jun 11, 2015

@kawochen separately you could short-circuit the rolling functions when window == 1 (though only for sum/mean; var/std etc. are all NaN at that point).
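
What such a short-circuit could look like, as a sketch (the wrapper name and the slow-path helper are hypothetical, not pandas internals):

import numpy as np

def rolling_sum(values, window):
    arr = np.asarray(values, dtype=np.float64)
    if window == 1:
        # each window holds exactly one element, so the rolling sum is just
        # the data itself; skip the accumulating loop entirely
        return arr.copy()
    return _rolling_sum_kernel(arr, window)   # hypothetical slow path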

kawochen closed this Jun 11, 2015