-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
use z-scores in rolling skew/kurt calculations #8270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c232707
to
28cea38
Compare
Note also #8086. |
Yep, separate issue. Just hoping someone (:-)) will want to unify |
@behzadnouri can you create a separate issue about |
@behzadnouri tests! |
does ths close #6929 or is their a further extension needed and ths is a partial cas |
cc @jaimefro |
@behzadnouri I realized #8086 is this issue @seth-p pointing at (unifying skew) |
28cea38
to
5b60546
Compare
@jreback tests added |
well, I think a more complete soln to #6929 is in order no? |
@jaimefrio I tried to take that route before making my code changes and that is definitely the right way to go if the code was doing expanding window moments not rolling window skew/kurt while handling nan values gracefully. take kurtosis for example, you need to maintain effective number of observations and even doing all that, at the end of the day, it is not like:
no! definitely not. for all real world series that I have seen, and any time series which does not change scale/location with multiple orders of magnitude z-scores will do the job with minimal code complexity and a lot more code maintainability. |
I don't know... A mock up of your solution runs on my system a little over 2x slower than the current code. And while it improves stability for what arguably are the most common use cases, it is still going to fail for a number of not-so-pathological ones, e.g. I am not much of a Pandas user or contributor, so I may be getting the philosophy all wrong. But on a project that uses home brewed implementations of skip lists (for Just to make it explicitly clear, I am -1 on this approach. But if it is deemed the right one, I would suggest that you put your code in the Python wrapper, as I really don't think you are taking any advantage of Cython. |
@jaimefrio your impl that you pointed above seems pretty straightforward and you have nicely refactored out the counting issue. I think it seems pretty reasonable and is likely to have similar perf to the current routines, pls continue work and submit a PR when ready. |
@jaimefrio what you are calling "not-so-pathological" has a jump more thank 34K of trailing standard deviation:
what actual real-world time series makes such a jump and still someone would ask "lets look at skew numbers"?
you are calling it "objectively worse" based on a contrived example, while at the same time admitting
and looking at this i honestly do not know what to say for what you call "pretty weak argument". |
@behzadnouri I was calling it objectively worse because I timed it an it run 2x slower than the current code. Which is kind of surprisingly fast, since you are iterating at least eight times over the input array before spitting an answer. |
It turns out that it isn't that hard to come up with very simple examples that fail big time:
|
@jaimefrio the numbers you are quoting do not come out of this code (especially that That you have got even this simpler algorithm wrong, proves my point of why code simplicity and maintainability matters. also, you are looking at a series with such characteristics:
Do you honestly believe that a series which changes location around I am happy that you mentioned earlier that this patch
and that your mock up code passes sanity check for the sane example. That said, I do not feel this discussion is an effective use of our time. If you think I am wrong, please spend time on implementing what you think is right, rather than trying to prove me wrong. Please do not continue this argument, and instead focus on correct implementation of what you think is a better path. |
Trust me, @behzadnouri, I have better things to do than proving you wrong. This is supposed to be a collaborative effort to produce great software, not a contest to see who has the bestest PR. Let's try to get back to that... I have done some more tests over the weekend, and it is terribly easy to find simple examples with terrible behavior. Perhaps the most illustrative is a long enough linear ramp, with a sufficiently small window. The kurtosis should be
There is also the thing with performance... Some timings to put things in context, with the same above
I think there is plenty of food for thought here. My views on this:
So based on all of the above, my original -1 is now a +0.25. To make it a full +1, I would suggest:
|
@jaimefrio sounds like a nice comprehensive approach. Pls submit a PR when ready. |
adds some stability to rolling
skew
andkurt
calculation by using thez-scores
instead of original values; related: #6929on master:
on branch:
This does not handle all situations that causes instability, but I think it is generally more stable to work with z-scores; In particular, affine transformations generate same results.
Also,
input
is a python 3 built-in. I suggest renaming all instances ofinput
inalgo.pyx
file toinp
or something else.