BUG: Use stable algorithm for _nanvar. #10679
Conversation
def test_nanstd_roundoff(self):
    # Regression test for GH 10242 (test data taken from GH 10489). Ensure
    # that variance is stable.
    data = Series(766897346 * np.ones(10))
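Not part of the diff, but for context: a standalone sketch of the roundoff this test guards against, comparing the old sum-of-squares formula (shown further down in this diff) with the mean-deviation form.

```python
import numpy as np

values = 766897346 * np.ones(10)
count, ddof = len(values), 1
d = count - ddof

# Old formula: sum of squares minus squared sum -- prone to catastrophic
# cancellation when the values are large and nearly identical.
XX = (values ** 2).sum()
X = values.sum()
naive = np.fabs((XX - X * X / count) / d)

# Mean-deviation form: subtract the mean first, then square.
stable = ((values - values.mean()) ** 2).sum() / d

print(naive)   # can come out as a small nonzero value (the bug in GH 10489)
print(stable)  # exactly 0.0 for constant data
```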
Can you add a couple more tests with some fixed values and a fixed result (that is not 0)? Also include an example with NaNs.
So the plan is to replace this with numpy's version once we bump the minimum to 1.8? Could you open a separate issue to remind us to do that?
We won't be eliminating 1.7 for quite some time, FYI.
This looks pretty reasonable to me.
@@ -397,4 +397,4 @@ Bug Fixes
- Bug in ``io.common.get_filepath_or_buffer`` which caused reading of valid S3 files to fail if the bucket also contained keys for which the user does not have read permission (:issue:`10604`)
- Bug in vectorised setting of timestamp columns with python ``datetime.date`` and numpy ``datetime64`` (:issue:`10408`, :issue:`10412`)
- Bug in ``pd.DataFrame`` when constructing an empty DataFrame with a string dtype (:issue:`9428`)
- Bug in ``_nanvar`` causing inaccurate results due to unstable algorithm (:issue:`10242`)
Say ``Bug in .var()`` instead; no one knows (or cares) how this is implemented internally. Be much more explicit about when this occurs, e.g. highly similar values. This sounds too alarmist.
Done.
    XX = _ensure_numeric((values ** 2).sum(axis))
    result = np.fabs((XX - X * X / count) / d)
    return result
# This implementation is based on that of nanvar in numpy 1.8 and up. TODO:
Pls add an explanation of the algo as well (and that it's stable).
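A minimal sketch of the kind of explanation being asked for, assuming the mean-deviation (two-pass) approach that numpy >= 1.8's nanvar uses; the helper name and structure are illustrative, not the PR's actual code.

```python
import numpy as np

def stable_nanvar_sketch(values, ddof=1):
    # Two-pass, mean-deviation variance: compute the mean of the non-NaN
    # entries first, then average the squared deviations from that mean.
    # Subtracting the mean before squaring is what makes this numerically
    # stable, unlike the E[X^2] - E[X]^2 formula.
    values = np.asarray(values, dtype=np.float64)
    mask = np.isnan(values)
    count = values.size - mask.sum()

    total = np.where(mask, 0.0, values).sum()
    mean = total / count

    sqr = np.where(mask, 0.0, (values - mean) ** 2).sum()
    return sqr / (count - ddof)
```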
Force-pushed from 7a6d0e1 to e1a23d5.
Thanks all for the extensive comments! @jreback I made a first pass at addressing your comments. Regarding the relative lack of tests, I assumed that the existing tests in …
else:
    np_nanvar = getattr(np, 'nanvar', None)  # Present in Numpy >= 1.8
    if np_nanvar is None:
        return _nanvar(values, axis=axis, skipna=skipna, ddof=ddof)
I would rather put _nanvar here. Unless we are calling it for another reason?
@jvkersch I would like to see an explicit test that checks against a hard value (with and w/o NaNs), just to lock this down (rather than computing against …).
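A hedged sketch of what such a hard-value test could look like (the values, import path, and assertions are illustrative, not the tests that ended up in the PR); the sample variance of [1, 2, 3, 4] with ddof=1 is 5/3, and a trailing NaN should not change that when it is skipped.

```python
import numpy as np
from pandas.core import nanops

def test_nanvar_fixed_values():
    values = np.array([1.0, 2.0, 3.0, 4.0])
    # mean = 2.5, squared deviations sum to 5.0, ddof=1 -> 5/3
    assert np.isclose(nanops.nanvar(values, ddof=1), 5.0 / 3)

    # The same hard value with a NaN appended, relying on skipna.
    with_nan = np.array([1.0, 2.0, 3.0, 4.0, np.nan])
    assert np.isclose(nanops.nanvar(with_nan, skipna=True, ddof=1), 5.0 / 3)
```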
Force-pushed from a68f4e0 to f445ea8.
Just a quick note: I haven't forgotten about this, but I'm currently lacking the time to finish this off. What's hindering me a bit is the lack of uniformity in the API for the nanops: some cast their arguments to float64 right away, some only do that for int dtypes, ... there's the implicit assumption that the nanop will raise …
Force-pushed from 5ac4b68 to ac30ea7.
Force-pushed from ac30ea7 to 79534d4.
@jreback Would you be able to give this another review? Since last time, I took care of your earlier comments (explicit tests, folding …). Thanks!
@disallow('M8')
@bottleneck_switch(ddof=1)
Why are you changing this? No real need to go to numpy at all. Either it's hit in bottleneck, or we have our own algo which is modelled on nanvar.
@jreback I think I misunderstood one of your earlier suggestions. To clarify, are you suggesting that we don't go to Numpy at all in nanvar, and just have our own implementation (which used to be _nanvar)? I do think that would simplify matters a bit.
yep
@jreback Done (and squashed)
Force-pushed from a9ef9ff to 60ce2c4.
def test_nanvar_all_finite(self):
    samples = self.samples
    actual_variance = nanops.nanvar(samples)
    np.testing.assert_almost_equal(
Use tm.assert_almost_equal or tm.assert_numpy_array_equal, as appropriate.
I resorted to using the Numpy variants because I needed to set the decimal precision to something low; is there a Pandas version that also allows the precision to be adjusted?
I think we have a less_precise option where you can set the decimal places to compare (or it defaults to 3, I think).
IIUC the number of decimal places is fixed to 5 or 3 (depending on whether check_less_precise is True) and I really need 2 ;-) I'll see if I can work around this.
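For reference, numpy's helper exposes the precision directly via its decimal argument, which is presumably why it was reached for here; a tiny illustration with made-up values:

```python
import numpy as np

actual, expected = 1.004, 1.0
np.testing.assert_almost_equal(actual, expected, decimal=2)    # passes
# np.testing.assert_almost_equal(actual, expected, decimal=3)  # would raise
```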
Force-pushed from fa9caef to dba02ad.
@jreback I've added some tests for …
Force-pushed from dba02ad to a60dc0a.
Looks good. Pls rebase/squash. Ping when green.
Force-pushed from a60dc0a to 1d03df2.
A missing 'raise' caused any exception in check_fun_ddof to be ignored. Restoring the raise revealed some issues with the existing tests:

1. Optional keyword arguments were not passed into the function to be tested.
2. nanvar, nanstd, nansem were being tested on string data, but are not able to handle string inputs.
3. test_nansem compared the output of nanops.nansem to numpy.var, which probably should have been scipy.stats.sem, judging from the conditional at the top of the test.

This commit also replaces all BaseExceptions by checks for AssertionError.
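A minimal sketch (names and structure illustrative, not the actual test harness) of the failure mode the first sentence describes: without re-raising, the except clause swallows every failure and the check can never fail.

```python
def check_fun_ddof_sketch(testfunc, targfunc, values, ddof=1):
    try:
        result = testfunc(values, ddof=ddof)
        expected = targfunc(values, ddof=ddof)
        assert abs(result - expected) < 1e-10
    except BaseException:
        # Bug: a bare handler with no `raise` silently ignores assertion
        # failures (and every other exception) raised above.
        pass
```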
Force-pushed from 1d03df2 to 8631741.
@jreback Done!

@jreback Thanks for the swift and thorough review!
closes #10242
This PR replaces the sum-of-squares algorithm used to compute the variance with a more stable algorithm. The algorithm here is essentially the same as the one used in numpy 1.8 and up, and I've added a TODO to replace the implementation with a direct call to numpy when that version is the default.
Somewhat counter to the discussion in #10242, I chose not to go with the Welford algorithm, for two reasons: numpy, and the fact that _nanvar needs to be able to deal with arrays of different shape, which is tricky to get right in Cython.
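For comparison, a minimal single-pass Welford update over a flat 1-D input (shown only to illustrate the alternative that was not chosen; generalising this update over arbitrary axes is the part that is tricky in Cython):

```python
def welford_variance(values, ddof=1):
    # Single-pass (Welford) variance: update a running mean and a running
    # sum of squared deviations as each value arrives.
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)   # uses the updated mean
    return m2 / (count - ddof)
```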