-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: mean overflows for integer dtypes (fixes #10155) #10172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -254,7 +254,7 @@ def nansum(values, axis=None, skipna=True): | |||
@bottleneck_switch() | |||
def nanmean(values, axis=None, skipna=True): | |||
values, mask, dtype, dtype_max = _get_values(values, skipna, 0) | |||
the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_max)) | |||
the_sum = _ensure_numeric(values.sum(axis, dtype=np.float64)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For float32
input, the sum should still be float32
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh ... you are right. This means we actually have more to fix than the cases for int types. Because dtype_max
would be float64
even when the inputs are float32... I'll update shortly.
4155bbb
to
4ac5a80
Compare
There are technically two changes now: old behavior
new behavior
Both points in the old behavior are not consistent with |
LGTM... just needs a release note |
ok - in another issue should audit the rest of the nan funcs for overflow / dtype preserving effects
|
So I was actually quite puzzled by why the unit tests wouldn't pass for some of the virtual environments. And I think I tracked down the problem ... seems like a bug in
I verified that |
a2e4822
to
3d85fc7
Compare
Ok I've updated this with a numpy version check, travis should pass now. Also added a release note. |
dtype_sum = np.float64 | ||
elif is_float_dtype(dtype): | ||
dtype_sum = dtype | ||
count = dtype.type(count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you changing the count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh that's because for float
dtypes count
can be higher precision and therefore changing the dtype of the returned result. E.g. for float32
input count is still float64
and therefore the result would be cast to float64
when we do sum / count
, and that's not what we want
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't count always be integer type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, seems we are casting that, don't really remember why. You can try changing get_count
to simply return an int64
which I think will always be correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback actually changing count to int64
may not help, this is what I'm seeing:
In [1]: import numpy as np
In [2]: (np.float32(100) / np.int64(100)).dtype
Out[2]: dtype('float64')
In [3]: (np.float32(100) / np.int32(100)).dtype
Out[3]: dtype('float64')
In [4]: (np.float32(100) / np.float32(100)).dtype
Out[4]: dtype('float32')
It still gets cast to float64
unless we have the denominator as float32
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, ok.
maybe better to add pass thru the dtype to get_counts
then as well (and cast to that inside get_counts
), to avoid this specific code all over the place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure sounds good, will update shortly
@jreback I added an optional |
# is now consistent with numpy | ||
from pandas import Series | ||
|
||
# numpy < 1.9.0 is not computing this correctly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does numpy do in < 1.9.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for numpy
< 1.9.0: (wrong result)
In [1]: import numpy as np
In [2]: np.__version__
Out[2]: '1.8.2'
In [3]: a = 20150515061816532
In [4]: arr = np.array(np.ones(500) * a, dtype=np.int64)
In [5]: arr.mean()
Out[5]: 20150515061816464.0
numpy
>= 1.9.0: (correct result)
In [1]: import numpy as np
In [2]: np.__version__
Out[2]: '1.9.0'
In [3]: a = 20150515061816532
In [4]: arr = np.array(np.ones(500) * a, dtype=np.int64)
In [5]: arr.mean()
Out[5]: 20150515061816532.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, ok
Indeed, I think this is good to go... I'll wait a little bit and then merge. |
yep, lgtm |
BUG: mean overflows for integer dtypes (fixes #10155)
thanks @mortada ! |
cool thanks guys! |
closes #10155