BUG: Use stable algorithm for _nanvar. #10679
Conversation
def test_nanstd_roundoff(self):
    # Regression test for GH 10242 (test data taken from GH 10489). Ensure
    # that variance is stable.
    data = Series(766897346 * np.ones(10))
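Not part of the diff, but for context: a standalone sketch of the roundoff this test guards against, comparing the old sum-of-squares formula (shown further down in this diff) with the mean-deviation form.

```python
import numpy as np

values = 766897346 * np.ones(10)
count, ddof = len(values), 1
d = count - ddof

# Old formula: sum of squares minus squared sum -- prone to catastrophic
# cancellation when the values are large and nearly identical.
XX = (values ** 2).sum()
X = values.sum()
naive = np.fabs((XX - X * X / count) / d)

# Mean-deviation form: subtract the mean first, then square.
stable = ((values - values.mean()) ** 2).sum() / d

print(naive)   # can come out as a small nonzero value (the bug in GH 10489)
print(stable)  # exactly 0.0 for constant data
```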
Can you add a couple more tests with some fixed values and a fixed result (that is not 0)? Also include an example with NaNs.
So the plan is to replace this with numpy's version once we bump the minimum to 1.8? Could you open a separate issue to remind us to do that?
We won't be eliminating 1.7 for quite some time, FYI.
This looks pretty reasonable to me.
@@ -397,4 +397,4 @@ Bug Fixes
- Bug in ``io.common.get_filepath_or_buffer`` which caused reading of valid S3 files to fail if the bucket also contained keys for which the user does not have read permission (:issue:`10604`)
- Bug in vectorised setting of timestamp columns with python ``datetime.date`` and numpy ``datetime64`` (:issue:`10408`, :issue:`10412`)
- Bug in ``pd.DataFrame`` when constructing an empty DataFrame with a string dtype (:issue:`9428`)
- Bug in ``_nanvar`` causing inaccurate results due to unstable algorithm (:issue:`10242`)
Say ``Bug in .var()`` instead; no one knows (or cares) how this is implemented internally. Be much more explicit about when this occurs, e.g. highly similar values. This sounds too alarmist.
Done.
    XX = _ensure_numeric((values ** 2).sum(axis))
    result = np.fabs((XX - X * X / count) / d)
    return result
# This implementation is based on that of nanvar in numpy 1.8 and up. TODO:
Pls add an explanation of the algo as well (and that it's stable).
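A minimal sketch of the kind of explanation being asked for, assuming the mean-deviation (two-pass) approach that numpy >= 1.8's nanvar uses; the helper name and structure are illustrative, not the PR's actual code.

```python
import numpy as np

def stable_nanvar_sketch(values, ddof=1):
    # Two-pass, mean-deviation variance: compute the mean of the non-NaN
    # entries first, then average the squared deviations from that mean.
    # Subtracting the mean before squaring is what makes this numerically
    # stable, unlike the E[X^2] - E[X]^2 formula.
    values = np.asarray(values, dtype=np.float64)
    mask = np.isnan(values)
    count = values.size - mask.sum()

    total = np.where(mask, 0.0, values).sum()
    mean = total / count

    sqr = np.where(mask, 0.0, (values - mean) ** 2).sum()
    return sqr / (count - ddof)
```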
Force-pushed from 7a6d0e1 to e1a23d5.
Thanks all for the extensive comments! @jreback I made a first pass at addressing your comments. Regarding the relative lack of tests, I assumed that the existing tests in …
else:
    np_nanvar = getattr(np, 'nanvar', None)  # Present in Numpy >= 1.8
    if np_nanvar is None:
        return _nanvar(values, axis=axis, skipna=skipna, ddof=ddof)
I would rather put _nanvar here. Unless we are calling it for another reason?
@jvkersch I would like to see an explicit test that checks against a hard value (with and w/o NaNs), just to lock this down (rather than computing against …).
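A hedged sketch of what such a hard-value test could look like (the values, import path, and assertions are illustrative, not the tests that ended up in the PR); the sample variance of [1, 2, 3, 4] with ddof=1 is 5/3, and a trailing NaN should not change that when it is skipped.

```python
import numpy as np
from pandas.core import nanops

def test_nanvar_fixed_values():
    values = np.array([1.0, 2.0, 3.0, 4.0])
    # mean = 2.5, squared deviations sum to 5.0, ddof=1 -> 5/3
    assert np.isclose(nanops.nanvar(values, ddof=1), 5.0 / 3)

    # The same hard value with a NaN appended, relying on skipna.
    with_nan = np.array([1.0, 2.0, 3.0, 4.0, np.nan])
    assert np.isclose(nanops.nanvar(with_nan, skipna=True, ddof=1), 5.0 / 3)
```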
Force-pushed from a68f4e0 to f445ea8.
Just a quick note: I haven't forgotten about this, but I'm currently lacking the time to finish this off. What's hindering me a bit is the lack of uniformity in the API for the nanops: some cast their arguments to float64 right away, some only do that for int dtypes, ... there's the implicit assumption that the nanop will raise …
Force-pushed from 5ac4b68 to ac30ea7.
Force-pushed from ac30ea7 to 79534d4.
@jreback Would you be able to give this another review? Since last time, I took care of your earlier comments (explicit tests, folding …). Thanks!
@disallow('M8')
@bottleneck_switch(ddof=1)
Why are you changing this? No real need to go to numpy at all. Either it's hit in bottleneck, or we have our own algo which is modelled on nanvar.
@jreback I think I misunderstood one of your earlier suggestions. To clarify, are you suggesting that we don't go to Numpy at all in nanvar, and just have our own implementation (which used to be _nanvar)? I do think that would simplify matters a bit.
yep
@jreback Done (and squashed)
Force-pushed from a9ef9ff to 60ce2c4.
def test_nanvar_all_finite(self):
    samples = self.samples
    actual_variance = nanops.nanvar(samples)
    np.testing.assert_almost_equal(
Use tm.assert_almost_equal or tm.assert_numpy_array_equal, as appropriate.
I resorted to using the Numpy variants because I needed to set the decimal precision to something low; is there a Pandas version that also allows the precision to be adjusted?
I think we have a less_precise option where you can set the decimal places to compare (or it defaults to 3, I think).
IIUC the number of decimal places is fixed to 5 or 3 (depending on whether check_less_precise is True) and I really need 2 ;-) I'll see if I can work around this.
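For reference, numpy's helper exposes the precision directly via its decimal argument, which is presumably why it was reached for here; a tiny illustration with made-up values:

```python
import numpy as np

actual, expected = 1.004, 1.0
np.testing.assert_almost_equal(actual, expected, decimal=2)    # passes
# np.testing.assert_almost_equal(actual, expected, decimal=3)  # would raise
```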
Force-pushed from fa9caef to dba02ad.
@jreback I've added some tests for …
Force-pushed from dba02ad to a60dc0a.
Looks good. Pls rebase/squash. Ping when green.
Force-pushed from a60dc0a to 1d03df2.
A missing 'raise' caused any exception in check_fun_ddof to be ignored. Restoring the raise revealed some issues with the existing tests:

1. Optional keyword arguments were not passed into the function to be tested.
2. nanvar, nanstd, nansem were being tested on string data, but are not able to handle string inputs.
3. test_nansem compared the output of nanops.nansem to numpy.var, which probably should have been scipy.stats.sem, judging from the conditional at the top of the test.

This commit also replaces all BaseExceptions by checks for AssertionError.
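A minimal sketch (names and structure illustrative, not the actual test harness) of the failure mode the first sentence describes: without re-raising, the except clause swallows every failure and the check can never fail.

```python
def check_fun_ddof_sketch(testfunc, targfunc, values, ddof=1):
    try:
        result = testfunc(values, ddof=ddof)
        expected = targfunc(values, ddof=ddof)
        assert abs(result - expected) < 1e-10
    except BaseException:
        # Bug: a bare handler with no `raise` silently ignores assertion
        # failures (and every other exception) raised above.
        pass
```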
Force-pushed from 1d03df2 to 8631741.
@jreback Done!

@jreback Thanks for the swift and thorough review!
closes #10242
This PR replaces the sum-of-squares algorithm used to compute the variance with a more stable algorithm. The algorithm here is essentially the same as the one used in numpy 1.8 and up, and I've added a TODO to replace the implementation with a direct call to numpy when that version is the default.
Somewhat counter to the discussion in #10242, I chose not to go with the Welford algorithm, for two reasons: numpy, and the fact that _nanvar needs to be able to deal with arrays of different shape, which is tricky to get right in Cython.
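For comparison, a minimal single-pass Welford update over a flat 1-D input (shown only to illustrate the alternative that was not chosen; generalising this update over arbitrary axes is the part that is tricky in Cython):

```python
def welford_variance(values, ddof=1):
    # Single-pass (Welford) variance: update a running mean and a running
    # sum of squared deviations as each value arrives.
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)   # uses the updated mean
    return m2 / (count - ddof)
```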