BUG: Use stable algorithm for _nanvar. #10679


Closed
wants to merge 2 commits

Conversation

jvkersch

closes #10242

This PR replaces the sum-of-squares algorithm used to compute the variance with a more numerically stable algorithm. The algorithm here is essentially the same as the one used in numpy 1.8 and up, and I've added a TODO to replace the implementation with a direct call to numpy when that version is the default.

Somewhat counter to the discussion in #10242, I chose not to go with the Welford algorithm, for two reasons: consistency with numpy, and the fact that _nanvar needs to be able to deal with arrays of different shapes, which is tricky to get right in Cython.
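For readers unfamiliar with the numpy approach mentioned above, here is a rough sketch of a numpy-1.8-style nanvar (1-D only, greatly simplified; `nanvar_sketch` is a hypothetical name and this is not the PR's actual code):

```python
import numpy as np

def nanvar_sketch(values, ddof=1):
    # Two-pass ("mean of squared deviations") algorithm, in the spirit
    # of np.nanvar in numpy >= 1.8: mask out NaNs, compute the mean,
    # then average the squared deviations from that mean.
    values = np.asarray(values, dtype=np.float64)
    mask = ~np.isnan(values)
    count = mask.sum()
    mean = values[mask].sum() / count
    sqdev = (values[mask] - mean) ** 2
    return sqdev.sum() / (count - ddof)

nanvar_sketch(np.array([1.0, 2.0, np.nan, 4.0]))  # matches np.nanvar(..., ddof=1)
```

Subtracting the mean before squaring is what makes this two-pass formulation stable: the squared deviations are small numbers, so no catastrophic cancellation occurs.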

@jvkersch
Author

@shoyer or @jreback Would either of you mind giving this a quick review? Thanks!

def test_nanstd_roundoff(self):
    # Regression test for GH 10242 (test data taken from GH 10489). Ensure
    # that variance is stable.
    data = Series(766897346 * np.ones(10))
Contributor

Can you add a couple more tests with some fixed values and a fixed result (that is not 0)?

Also include an example with NaNs.
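For context, the roundoff failure that the regression test in this thread guards against can be reproduced directly with plain numpy (a sketch for illustration; not the pandas implementation):

```python
import numpy as np

data = 766897346 * np.ones(10)  # the GH 10489 test data
n = len(data)

# Naive sum-of-squares formula: (sum(x^2) - sum(x)^2 / n) / (n - 1).
# sum(x^2) is ~5.9e18, so float64's ~16 significant digits are
# exhausted and subtracting two nearly equal huge numbers loses
# essentially all precision.
naive = ((data ** 2).sum() - data.sum() ** 2 / n) / (n - 1)

# Two-pass formula: subtract the mean first. The deviations are
# exactly zero here, so the result is exactly 0.0.
stable = ((data - data.mean()) ** 2).sum() / (n - 1)

print(naive)   # typically a small nonzero value (the GH 10242 bug)
print(stable)  # 0.0
```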

@TomAugspurger
Contributor

So the plan is to replace this with numpy's version once we bump the minimum requirement to 1.8? Could you open a separate issue to remind us to do that?

@jreback
Contributor

jreback commented Jul 26, 2015

FYI, we won't be dropping numpy 1.7 support for quite some time,
so this would have to be conditional (and could even go in now, with the conditional on the version).

@shoyer
Member

shoyer commented Jul 27, 2015

This looks pretty reasonable to me.

@@ -397,4 +397,4 @@ Bug Fixes
- Bug in ``io.common.get_filepath_or_buffer`` which caused reading of valid S3 files to fail if the bucket also contained keys for which the user does not have read permission (:issue:`10604`)
- Bug in vectorised setting of timestamp columns with python ``datetime.date`` and numpy ``datetime64`` (:issue:`10408`, :issue:`10412`)
- Bug in ``pd.DataFrame`` when constructing an empty DataFrame with a string dtype (:issue:`9428`)

- Bug in ``_nanvar`` causing inaccurate results due to unstable algorithm (:issue:`10242`)
Contributor

Say ``Bug in .var()`` — no one knows (or cares) how this is implemented internally. Be much more explicit about when this occurs, e.g. with highly similar values. As written, this sounds too alarmist.

Author

Done.

@jreback added the Bug and Numeric Operations labels on Jul 27, 2015
@jreback added this to the 0.17.0 milestone on Jul 27, 2015
    XX = _ensure_numeric((values ** 2).sum(axis))
    result = np.fabs((XX - X * X / count) / d)
    return result
# This implementation is based on that of nanvar in numpy 1.8 and up. TODO:
Contributor

Please add an explanation of the algorithm as well (and note that it's stable).

@jvkersch
Author

Thanks all for the extensive comments!

@jreback I made a first pass at addressing your comments. Regarding the relative lack of tests, I assumed that the existing tests in test_nanops would be sufficient to check the behavior of the function in the face of NaNs, different ddofs, etc. However, I noticed that the test infrastructure in test_nanops was missing a raise, causing some test errors to be swallowed. I can make that commit into a separate PR if necessary.

else:
    np_nanvar = getattr(np, 'nanvar', None)  # present in numpy >= 1.8
    if np_nanvar is None:
        return _nanvar(values, axis=axis, skipna=skipna, ddof=ddof)
Contributor

I would rather put _nanvar here. Unless we are calling it for another reason?

@jreback
Contributor

jreback commented Jul 28, 2015

@jvkersch I would like to see an explicit test that checks against a hard-coded value (with and without NaNs), just to lock this down (rather than comparing against np.var, which is what we do IIRC). That is generally fine, but since you are addressing this anyway, it can't hurt.

@jvkersch
Author

Just a quick note: I haven't forgotten about this, but I'm currently lacking the time to finish it off. What's hindering me a bit is the lack of uniformity in the API for the nanops: some cast their arguments to float64 right away, some only do that for int dtypes, ... there's the implicit assumption that the nanop will raise TypeError for non-numeric input, etc. I think it might be worth fleshing out this API in a future PR. If I can't get things to work in a satisfactory manner, I'll just go for the minimal viable implementation here and work on that instead, before revisiting.

@jvkersch force-pushed the fix/nanvar_stable branch 2 times, most recently from 5ac4b68 to ac30ea7 on August 14, 2015
@jreback modified the milestones: Next Major Release, 0.17.0 on Aug 15, 2015
@jvkersch
Author

@jreback Would you be able to give this another review? Since last time, I took care of your earlier comments (explicit tests, folding _nanvar into nanvar, using a wide accumulator), but some unexpected side-effects plagued me a little:

  1. Since nanstd called _nanvar directly, having it call nanvar instead meant fixing up some tests.
  2. Numpy generates a RuntimeWarning whenever the number of dofs is not strictly positive. To avoid being swamped by those warnings, I turned off RuntimeWarnings when calling Numpy's var or nanvar.
  3. Some tests for numeric_only called nanvar with non-numeric data and listened for a TypeError. This happened to work with the earlier implementation, but in some cases the new implementation raises a ValueError instead. Implementation-wise this works fine, too, but one test needed to be made a little more flexible.
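The warning suppression mentioned in point 2 can be sketched with the standard `warnings` machinery (an illustration, not the PR's actual code; numpy's exact warning text varies by version):

```python
import warnings
import numpy as np

values = np.array([5.0])  # a single observation: ddof=1 leaves 0 degrees of freedom

with warnings.catch_warnings():
    # numpy emits a "Degrees of freedom <= 0" RuntimeWarning for this
    # input; silence it and accept the NaN result.
    warnings.simplefilter("ignore", RuntimeWarning)
    result = np.nanvar(values, ddof=1)

np.isnan(result)  # True: variance is undefined with 0 degrees of freedom
```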

Thanks!



@disallow('M8')
@bottleneck_switch(ddof=1)
Contributor

Why are you changing this? There's no real need to go to numpy at all. Either it's handled by bottleneck, or we have our own algorithm, which is modelled on nanvar.

Author

@jreback I think I misunderstood one of your earlier suggestions. To clarify, are you suggesting that we don't go to Numpy at all in nanvar, and just have our own implementation (which used to be _nanvar)? I do think that would simplify matters a bit.

Contributor

yep

Author

@jreback Done (and squashed)

@jvkersch force-pushed the fix/nanvar_stable branch 2 times, most recently from a9ef9ff to 60ce2c4 on August 27, 2015
def test_nanvar_all_finite(self):
    samples = self.samples
    actual_variance = nanops.nanvar(samples)
    np.testing.assert_almost_equal(
Contributor

use tm.assert_almost_equal or tm.assert_numpy_array_equal as appropriate

Author

I resorted to using the Numpy variants because I needed to set the decimal precision to something low; is there a Pandas version that also allows the precision to be adjusted?

Contributor

I think we have a check_less_precise option where you can set the decimal places to compare (it defaults to 3, I think).

Author

IIUC the number of decimal places is fixed to 5 or 3 (depending on whether check_less_precise is True) and I really need 2 ;-) I'll see if I can work around this.
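For reference, the numpy helper the author fell back to exposes the precision directly through its `decimal` argument (the values here are illustrative, not from the PR's tests):

```python
import numpy as np

# assert_almost_equal(actual, desired, decimal=k) passes when
# abs(desired - actual) < 1.5 * 10**(-k).
np.testing.assert_almost_equal(3.14159, 3.14, decimal=2)  # passes: diff ~0.0016 < 0.015

# With decimal=4 the same comparison would raise an AssertionError,
# since 0.0016 > 0.00015.
```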

@jvkersch force-pushed the fix/nanvar_stable branch 3 times, most recently from fa9caef to dba02ad on August 31, 2015
@jvkersch
Author

@jreback I've added some tests for nanstd (though they are not as extensive as the nanvar tests). I couldn't get things to work reliably with tm.assert_array_equal (because of the fixed precision) so I'm hoping this can stay as-is (and I wouldn't mind revisiting this after #10788 has been addressed). Does this look good to you?

@jreback modified the milestones: 0.17.0, Next Major Release on Sep 2, 2015
@jreback
Contributor

jreback commented Sep 2, 2015

looks good. pls rebase/squash. ping when green.

Joris Vankerschaver added 2 commits September 4, 2015 10:49
A missing 'raise' caused any exception in check_fun_ddof to be
ignored. Restoring the raise revealed some issues with the existing
tests:

1. Optional keyword arguments were not passed into the function to be
   tested.
2. nanvar, nanstd, nansem were being tested on string data, but are
   not able to handle string inputs.
3. test_nansem compared the output of nanops.nansem to numpy.var, which
   probably should have been scipy.stats.sem, judging from the
   conditional at the top of the test.

This commit also replaces all BaseException catches with checks for AssertionError.
@jvkersch
Author

jvkersch commented Sep 4, 2015

@jreback Done!

@jreback
Contributor

jreback commented Sep 4, 2015

merged via 2d0d58c

@jvkersch awesome job!

thanks!

@jvkersch
Author

jvkersch commented Sep 4, 2015

@jreback Thanks for the swift and thorough review!

Labels: Bug, Numeric Operations (Arithmetic, Comparison, and Logical operations)

Successfully merging this pull request may close these issues:

Variance is not calculated correctly in some cases + inconsistent definition

4 participants