-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Improve numerical stability for window functions skew and kurt #37557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Edit: Have to run asvs them again because of code changes |
Got somewhat ugly now through the Kahan summation for all parts. |
Would have expected a bigger hit. |
pandas/_libs/window/aggregations.pyx
Outdated
|
||
with nogil: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can leave the original nogil right? e.g. no reason to do it twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The subtraction of the mean does not wirk with nogil, that is the reason did it there again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm why? this is array - scalar right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought so too, but got an error. Will try again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, getting the following error without gil:
Error compiling Cython file:
------------------------------------------------------------
...
nobs_mean += 1
sum_val += val
mean_val = sum_val / nobs_mean
# Other cases would lead to imprecision for smallest values
if min_val - mean_val > -1e5:
values = values - round(mean_val)
^
------------------------------------------------------------
pandas/_libs/window/aggregations.pyx:545:36: Converting to Python object not allowed without gil
Traceback (most recent call last):
File "setup.py", line 790, in <module>
setup_package()
File "setup.py", line 760, in setup_package
ext_modules=maybe_cythonize(extensions, compiler_directives=directives),
File "setup.py", line 539, in maybe_cythonize
return cythonize(extensions, *args, **kwargs)
File "/home/developer/anaconda3/envs/omi_reports/pandas-dev/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 1102, in cythonize
cythonize_one(*args)
File "/home/developer/anaconda3/envs/omi_reports/pandas-dev/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 1225, in cythonize_one
raise CompileError(None, pyx_file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm no, mean_value is a float. This does not work without round either. Maybe the broadcasting is the problem here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah maybe have to use np.broadcast_to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found a solution for the round issue. Used a loop for the subtraction, because np.broadcast_to
did not work either. No matter this, I think the loop is the most intuitive solution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from libc.math cimport round
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works, thanks very much.
pandas/_libs/window/aggregations.pyx
Outdated
if notnan(val): | ||
nobs_mean += 1 | ||
sum_val += val | ||
mean_val = sum_val / nobs_mean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm, why are you adding this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issues mentioned above describes a use case. where the series is shifted by 5000. In this case, if we do not subtract the mean, The decimal places will be so small, that they will be ignored. Leading to the numerical imprecision described in the issue.
Kahan summation only solves issues, when the range of the values between the windows is really big. Could post a more precise and explicit example, if this helps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think i understand the issue, but am unclear why you are not always just subtracting the mean rather than conditionally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aaah sorry. Misunderstood you.
In case of something like [1e12, 0.023, 0.04565, 0.343545, 0.343434]
and subtract the mean from the complete series, we lose precision for the second window starting at 0.023
, because this abs value gets really big too
� Conflicts: � doc/source/whatsnew/v1.2.0.rst
� Conflicts: � doc/source/whatsnew/v1.2.0.rst � pandas/tests/window/test_rolling.py
� Conflicts: � doc/source/whatsnew/v1.2.0.rst
� Conflicts: � pandas/tests/window/test_expanding.py
looks good, can you merge master and ping on green (also maybe want to re-run the perf tests to give a good read) |
asvs:
|
cc @jreback pipelines are green |
thanks @phofl keep em coming! |
Why not consider implementing it using the Welford method? |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Used the same procedure as in
DataFrame.skew()