Skip to content

BUG: mean overflows for integer dtypes (fixes #10155) #10172

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 30, 2015

Conversation

mortada
Copy link
Contributor

@mortada mortada commented May 19, 2015

closes #10155

@@ -254,7 +254,7 @@ def nansum(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmean(values, axis=None, skipna=True):
values, mask, dtype, dtype_max = _get_values(values, skipna, 0)
the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_max))
the_sum = _ensure_numeric(values.sum(axis, dtype=np.float64))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For float32 input, the sum should still be float32.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh ... you are right. This means we actually have more to fix than the cases for int types. Because dtype_max would be float64 even when the inputs are float32... I'll update shortly.

@mortada mortada force-pushed the mean_overflow branch 4 times, most recently from 4155bbb to 4ac5a80 Compare May 20, 2015 23:56
@mortada
Copy link
Contributor Author

mortada commented May 21, 2015

There are technically two changes now:

old behavior

  1. For int dtypes, mean() first sums up values in int and converts to float, therefore it can potential have int overflow
  2. For float dtypes, mean() sums up values in float64 and returns the result in float64, even if the input dtype is a lower precision dtype such as float32

new behavior

  1. For int dtypes, mean() sums up values in float64 and returns the result as float64
  2. For float dtypes, mean() sums up values using the same input float dtype and returns the result in the same input dtype. Namely, float32 input will have both the computation and return value in float32

Both points in the old behavior are not consistent with numpy, and both points in the new behavior are.

@shoyer
Copy link
Member

shoyer commented May 21, 2015

LGTM... just needs a release note

@jreback
Copy link
Contributor

jreback commented May 21, 2015

ok - in another issue should audit the rest of the nan funcs for overflow / dtype preserving effects

  • lets just put in place some tests

@mortada
Copy link
Contributor Author

mortada commented May 21, 2015

So I was actually quite puzzled by why the unit tests wouldn't pass for some of the virtual environments. And I think I tracked down the problem ... seems like a bug in numpy < 1.9.0. For python 3.4 this is not a problem because numpy >= 1.9.0 is required, but for other environments with older numpy versions this can fail

(python2):~/code/github/pandas$ nosetests -s pandas/tests/test_nanops.py 
............S.S....S....F....SS...
======================================================================
FAIL: test_nanmean_overflow (pandas.tests.test_nanops.TestnanopsDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mortada/code/github/pandas/pandas/tests/test_nanops.py", line 336, in test_nanmean_overflow
    self.assertEqual(result, a)
AssertionError: 20150515061816464.0 != 20150515061816532

I verified that np.mean() produces this wrong value for older versions prior to 1.9.0.

@mortada mortada force-pushed the mean_overflow branch 2 times, most recently from a2e4822 to 3d85fc7 Compare May 21, 2015 05:33
@mortada
Copy link
Contributor Author

mortada commented May 21, 2015

Ok I've updated this with a numpy version check, travis should pass now. Also added a release note.

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations labels May 21, 2015
@jreback jreback added this to the 0.17.0 milestone May 21, 2015
dtype_sum = np.float64
elif is_float_dtype(dtype):
dtype_sum = dtype
count = dtype.type(count)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you changing the count?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that's because for float dtypes count can be higher precision and therefore changing the dtype of the returned result. E.g. for float32 input count is still float64 and therefore the result would be cast to float64 when we do sum / count, and that's not what we want

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't count always be integer type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, seems we are casting that, don't really remember why. You can try changing get_count to simply return an int64 which I think will always be correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback actually changing count to int64 may not help, this is what I'm seeing:

In [1]: import numpy as np

In [2]: (np.float32(100) / np.int64(100)).dtype
Out[2]: dtype('float64')

In [3]: (np.float32(100) / np.int32(100)).dtype
Out[3]: dtype('float64')

In [4]: (np.float32(100) / np.float32(100)).dtype
Out[4]: dtype('float32')

It still gets cast to float64 unless we have the denominator as float32 as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, ok.

maybe better to add pass thru the dtype to get_counts then as well (and cast to that inside get_counts), to avoid this specific code all over the place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure sounds good, will update shortly

@mortada
Copy link
Contributor Author

mortada commented May 22, 2015

@jreback I added an optional dtype parameter to _get_counts() as you suggested

@mortada
Copy link
Contributor Author

mortada commented May 29, 2015

@jreback @shoyer this should be ready for review.

I'll create another issue/PR to audit the rest of the nan funcs for overflow and returned dtypes

# is now consistent with numpy
from pandas import Series

# numpy < 1.9.0 is not computing this correctly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does numpy do in < 1.9.0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for numpy < 1.9.0: (wrong result)

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.8.2'

In [3]: a = 20150515061816532

In [4]: arr = np.array(np.ones(500) * a, dtype=np.int64)

In [5]: arr.mean()
Out[5]: 20150515061816464.0

numpy >= 1.9.0: (correct result)

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.9.0'

In [3]: a = 20150515061816532

In [4]: arr = np.array(np.ones(500) * a, dtype=np.int64)

In [5]: arr.mean()
Out[5]: 20150515061816532.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, ok

@shoyer
Copy link
Member

shoyer commented May 30, 2015

Indeed, I think this is good to go... I'll wait a little bit and then merge.

@jreback
Copy link
Contributor

jreback commented May 30, 2015

yep, lgtm

shoyer added a commit that referenced this pull request May 30, 2015
BUG: mean overflows for integer dtypes (fixes #10155)
@shoyer shoyer merged commit ed000e9 into pandas-dev:master May 30, 2015
@shoyer
Copy link
Member

shoyer commented May 30, 2015

thanks @mortada !

@mortada
Copy link
Contributor Author

mortada commented May 30, 2015

cool thanks guys!

@mortada mortada deleted the mean_overflow branch May 30, 2015 23:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Mean overflows for integer dtypes
4 participants