BUG: mean overflows for integer dtypes (fixes #10155) #10172

mortada · 2015-05-19T16:43:57Z

shoyer · 2015-05-19T16:58:16Z

pandas/core/nanops.py

@@ -254,7 +254,7 @@ def nansum(values, axis=None, skipna=True):
 @bottleneck_switch()
 def nanmean(values, axis=None, skipna=True):
    values, mask, dtype, dtype_max = _get_values(values, skipna, 0)
-    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_max))
+    the_sum = _ensure_numeric(values.sum(axis, dtype=np.float64))


For float32 input, the sum should still be float32.

Oh, you are right. This means there is more to fix than the int cases, because dtype_max would be float64 even when the inputs are float32. I'll update shortly.

mortada · 2015-05-21T04:23:02Z

There are technically two changes now:

old behavior

For int dtypes, mean() first sums up the values as int and only then converts to float, so it can potentially hit int overflow
For float dtypes, mean() sums up the values in float64 and returns the result in float64, even if the input is a lower-precision dtype such as float32

new behavior

For int dtypes, mean() sums up values in float64 and returns the result as float64
For float dtypes, mean() sums up values using the same input float dtype and returns the result in the same input dtype. Namely, float32 input will have both the computation and return value in float32

Both points in the old behavior are not consistent with numpy, and both points in the new behavior are.
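To make the two changes concrete, here is a rough sketch in plain numpy. `mean_old` and `mean_new` are hypothetical names, not pandas functions, and the sketch ignores NaN handling and axis support; it only mirrors the choice of accumulator dtype described above (and simplifies the old int case by summing in the input's own int dtype).

```python
import numpy as np


def mean_old(values):
    # Hypothetical sketch of the old behavior: ints are summed as ints
    # (and can wrap around), floats are always summed as float64.
    values = np.asarray(values)
    if values.dtype.kind in "iu":
        dtype_max = values.dtype
    else:
        dtype_max = np.dtype(np.float64)
    the_sum = values.sum(dtype=dtype_max)
    return the_sum / np.float64(values.size)


def mean_new(values):
    # Hypothetical sketch of the new behavior: ints are summed as float64,
    # floats are summed (and divided) in their own dtype.
    values = np.asarray(values)
    if values.dtype.kind in "iu":
        dtype_sum = np.dtype(np.float64)
    else:
        dtype_sum = values.dtype
    the_sum = values.sum(dtype=dtype_sum)
    count = dtype_sum.type(values.size)
    return the_sum / count


a = 20150515061816532
ints = np.full(500, a, dtype=np.int64)
print(mean_old(ints))   # negative: the int64 sum wrapped around
print(mean_new(ints))   # close to a

floats = np.ones(4, dtype=np.float32)
print(mean_old(floats).dtype)   # float64
print(mean_new(floats).dtype)   # float32
```

The int case fails in the old version because 500 * a is larger than the biggest int64 value, so the sum wraps before the division ever happens.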

shoyer · 2015-05-21T04:35:23Z

LGTM... just needs a release note

jreback · 2015-05-21T04:43:25Z

ok - in another issue should audit the rest of the nan funcs for overflow / dtype preserving effects

let's just put in place some tests

mortada · 2015-05-21T04:44:49Z

So I was actually quite puzzled about why the unit tests wouldn't pass in some of the virtual environments, and I think I tracked down the problem: it seems to be a bug in numpy < 1.9.0. For python 3.4 this is not a problem because numpy >= 1.9.0 is required, but in other environments with older numpy versions the test can fail:

(python2):~/code/github/pandas$ nosetests -s pandas/tests/test_nanops.py 
............S.S....S....F....SS...
======================================================================
FAIL: test_nanmean_overflow (pandas.tests.test_nanops.TestnanopsDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mortada/code/github/pandas/pandas/tests/test_nanops.py", line 336, in test_nanmean_overflow
    self.assertEqual(result, a)
AssertionError: 20150515061816464.0 != 20150515061816532

I verified that np.mean() produces this wrong value for numpy versions prior to 1.9.0.

mortada · 2015-05-21T05:34:56Z

Ok I've updated this with a numpy version check, travis should pass now. Also added a release note.
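For readers who want to see what such a version gate might look like, here is a minimal sketch. `numpy_at_least` and `mean_or_skip` are hypothetical helpers made up for illustration; the actual check added in this PR may be written differently.

```python
import numpy as np


def numpy_at_least(major, minor):
    # Hypothetical helper: compare only the leading "major.minor" part of
    # np.__version__, which is enough for a 1.9 cutoff.
    parts = np.__version__.split(".")[:2]
    return (int(parts[0]), int(parts[1])) >= (major, minor)


def mean_or_skip():
    # Same data as the failing test: 500 copies of a large int64 value.
    a = 20150515061816532
    arr = np.full(500, a, dtype=np.int64)
    if not numpy_at_least(1, 9):
        return None  # older numpy is known to return a wrong mean here
    return arr.mean()


print(mean_or_skip())
```

In a real test the `None` branch would be a skip, and the other branch would compare the result against `a`, as the failing assertion above does.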

jreback · 2015-05-21T13:26:19Z

pandas/core/nanops.py

+        dtype_sum = np.float64
+    elif is_float_dtype(dtype):
+        dtype_sum = dtype
+        count = dtype.type(count)


why are you changing the count?

Oh, that's because for float dtypes count can have higher precision than the input, which changes the dtype of the returned result. E.g. for float32 input, count is still float64, so the result would be cast to float64 when we compute sum / count, and that's not what we want.

shouldn't count always be integer type?

hmm, seems we are casting that, I don't really remember why. You can try changing _get_counts to simply return an int64, which I think will always be correct.

@jreback actually changing count to int64 may not help, this is what I'm seeing:

In [1]: import numpy as np
In [2]: (np.float32(100) / np.int64(100)).dtype
Out[2]: dtype('float64')
In [3]: (np.float32(100) / np.int32(100)).dtype
Out[3]: dtype('float64')
In [4]: (np.float32(100) / np.float32(100)).dtype
Out[4]: dtype('float32')

It still gets cast to float64 unless we have the denominator as float32 as well.

hmm, ok.

maybe better to pass the dtype through to _get_counts as well (and cast to it inside _get_counts), to avoid having this specific code all over the place

sure sounds good, will update shortly

mortada · 2015-05-22T00:34:08Z

@jreback I added an optional dtype parameter to _get_counts() as you suggested
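As an illustration of the idea (not the code in this PR), a counting helper with an optional dtype could look roughly like the sketch below. `count_valid` is a hypothetical stand-in for `_get_counts()`, and its signature is an assumption.

```python
import numpy as np


def count_valid(mask, axis=None, dtype=np.float64):
    # Hypothetical stand-in for _get_counts(): count the non-masked entries
    # and return the count already cast to the requested dtype.
    dtype = np.dtype(dtype)
    if axis is None:
        return dtype.type(mask.size - mask.sum())
    count = mask.shape[axis] - mask.sum(axis)
    return np.asarray(count).astype(dtype)


values = np.array([1.0, 2.0, np.nan, 4.0], dtype=np.float32)
mask = np.isnan(values)
# Zero out the NaNs and accumulate in float32, like the input.
the_sum = np.where(mask, np.float32(0), values).sum(dtype=np.float32)

print((the_sum / count_valid(mask)).dtype)                     # float64
print((the_sum / count_valid(mask, dtype=np.float32)).dtype)   # float32
```

Doing the cast inside the helper keeps callers from having to repeat `dtype.type(count)` at every call site, which is the point of the suggestion above.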

mortada · 2015-05-29T23:20:29Z

@jreback @shoyer this should be ready for review.

I'll create another issue/PR to audit the rest of the nan funcs for overflow and returned dtypes

jreback · 2015-05-29T23:37:16Z

pandas/tests/test_nanops.py

+        # is now consistent with numpy
+        from pandas import Series
+
+        # numpy < 1.9.0 is not computing this correctly


what does numpy do in < 1.9.0?

for numpy < 1.9.0: (wrong result)

In [1]: import numpy as np
In [2]: np.__version__
Out[2]: '1.8.2'
In [3]: a = 20150515061816532
In [4]: arr = np.array(np.ones(500) * a, dtype=np.int64)
In [5]: arr.mean()
Out[5]: 20150515061816464.0

numpy >= 1.9.0: (correct result)

In [1]: import numpy as np
In [2]: np.__version__
Out[2]: '1.9.0'
In [3]: a = 20150515061816532
In [4]: arr = np.array(np.ones(500) * a, dtype=np.int64)
In [5]: arr.mean()
Out[5]: 20150515061816532.0

shoyer · 2015-05-30T18:22:03Z

Indeed, I think this is good to go... I'll wait a little bit and then merge.

jreback · 2015-05-30T20:15:45Z

yep, lgtm

BUG: mean overflows for integer dtypes (fixes #10155)

shoyer · 2015-05-30T20:41:25Z

thanks @mortada !

mortada · 2015-05-30T23:05:47Z

cool thanks guys!

shoyer reviewed May 19, 2015

mortada force-pushed the mean_overflow branch 4 times, most recently from 4155bbb to 4ac5a80 Compare May 20, 2015 23:56

mortada force-pushed the mean_overflow branch 2 times, most recently from a2e4822 to 3d85fc7 Compare May 21, 2015 05:33

jreback added the Bug, Dtype Conversions (unexpected or buggy dtype conversions), and Numeric Operations (arithmetic, comparison, and logical operations) labels May 21, 2015

jreback added this to the 0.17.0 milestone May 21, 2015

jreback reviewed May 21, 2015

mortada force-pushed the mean_overflow branch from 3d85fc7 to 420c5c0 Compare May 21, 2015 16:35

BUG: mean overflows for integer dtypes (fixes pandas-dev#10155)

3896e5e

mortada force-pushed the mean_overflow branch from 420c5c0 to 3896e5e Compare May 21, 2015 18:02

jreback reviewed May 29, 2015

shoyer added a commit that referenced this pull request May 30, 2015

Merge pull request #10172 from mortada/mean_overflow

ed000e9

BUG: mean overflows for integer dtypes (fixes #10155)

shoyer merged commit ed000e9 into pandas-dev:master May 30, 2015

mortada deleted the mean_overflow branch May 30, 2015 23:05

mortada mentioned this pull request Jun 2, 2015

ENH: make sure return dtypes for nan funcs are consistent #10251

Merged

jorisvandenbossche modified the milestones: 0.17.0, 0.16.2 Jun 2, 2015

c-garcia mentioned this pull request Sep 27, 2015

mean of int64 results in int64 instead of float64 #11199

Closed

rhshadrach mentioned this pull request Sep 24, 2022

BUG: in describe() result, mean is to NaN or Inf, when change float64 to float32 or float16 #48757

Open

3 tasks
