Mean overflows for integer dtypes #10155

su9204 · 2015-05-16T18:00:20Z

Say there is a array of type int64
for convenience, that me just put is some large number

test1 = pd.Series(20150515061816532, index=list(range(500)), dtype='int64')
test1.describe()
Out[152]:
count 5.000000e+02
mean -1.674297e+16
std 0.000000e+00
min 2.015052e+16
25% 2.015052e+16
50% 2.015052e+16
75% 2.015052e+16
max 2.015052e+16

Look at the mean, it overflow, and become negative. Obviously the mean should be 20150515061816532

In [153]: test1.sum()
Out[153]: -8371486542801285616 This is wrong.

The computation should have been sum them up as float, and devided by total count.

I think we need to examine other parts of code that involve similar situation.

su9204 · 2015-05-16T18:04:16Z

How do I label it as Bug?

mortada · 2015-05-16T23:59:23Z

For numpy arrays, this is the behavior:

In [1]: import numpy as np
In [2]: a = np.array(np.ones(500) * 20150515061816532, dtype='int64')
In [3]: a.sum()
Out[3]: -8371486542801285616
In [4]: a.mean()
Out[4]: 20150515061816532.0

Note that the sum() does still overflow, but the mean() is done with floating point calculations

jreback · 2015-05-17T00:16:24Z

I think that we could do the mean calc in np.errstate context manager - and catch the OverFlow (then do the calc as float); u normally have to do this in the same dtype

shoyer · 2015-05-17T00:48:37Z

I agree that we could probably fix mean (since it should be done with floating point), but I don't see how we can fix test1.sum(), unless we always convert sums to floating point. Otherwise we end up with methods with inconsistent type signatures.

su9204 · 2015-05-17T01:01:34Z

I don't see any reason why sum should not be in floating points space. When you need to sum up thousands of large numbers, say, trading volumes (int type) of all stocks across a market, it could potentially overflow. If the user, want int type, he/she can always try after the fact. Trying to maintain the sum in the integer space is simply wrong, and could cause uncaught error, imaging that overflown number get fed into next stage.

mortada · 2015-05-17T01:09:34Z

@jreback it doesn't seem like np.errstate would be able to catch int overflow for the reasons mentioned here: http://mail.scipy.org/pipermail/numpy-discussion/2009-April/041691.html

In [1]: import numpy as np

In [2]: np.int64(2**62) + np.int64(2**62)
/Users/mortada_mehyar/anaconda/bin/ipython:1: RuntimeWarning: overflow encountered in long_scalars
  #!/bin/bash /Users/mortada_mehyar/anaconda/bin/python.app
Out[2]: -9223372036854775808

In [3]: np.array([2**62, 2**62], dtype=np.int64)
Out[3]: array([4611686018427387904, 4611686018427387904])

In [4]: np.array([2**62, 2**62], dtype=np.int64).sum()
Out[4]: -9223372036854775808

Note that the last line does not produce a warning because it's an array operation.

su9204 · 2015-05-17T01:17:29Z

@mortada @jreback
Even if np.errstatee could catch this, one possibly should not do it. It is inconsistent. We can't say, ok, let me try this first, it may return you integer, or it may return you float.

jreback · 2015-05-17T01:22:59Z

@qqsusu you are missing the point here
switching to floats loses precision so you never want to do this unless you have to

mortada · 2015-05-17T18:19:26Z

I feel the same way as what @shoyer described above - we could change mean to always do floating point calculations. This would be consistent with numpy too.

However with sum it doesn't seem right to always convert to floats.

mortada · 2015-05-17T18:32:56Z

@qqsusu here's an example where precision would be lost by doing floats:

In [1]: import numpy as np
In [2]: a = 2 ** 55
In [2]: np.int64(a + 10) - np.int64(a)
Out[2]: 10

In [3]: np.float64(a + 10) - np.float(a)
Out[3]: 8.0

mortada · 2015-05-17T18:46:47Z

How about this as a proposal for sum:

If the dtype is int, we compute an additional sum with floats and compare that with the sum using ints. If they are different, we throw a warning. The returned value is still int, the same as before.

shoyer · 2015-05-18T09:20:38Z

If the dtype is int, we compute an additional sum with floats and compare that with the sum using ints. If they are different, we throw a warning. The returned value is still int, the same as before.

Compute the sum twice for dtype=int? That would be a serious performance regression.

jreback · 2015-05-18T11:32:26Z

xref is #9442 (for timedelta, though that is easily handled by other means).

so mean is easy, just cast to float, then astype back to the input type if they are equal (the scalar result).

you might want to see what/if the bottleneck guys do about this. (I suspect they don't do anything as numpy does not do anything).

mortada · 2015-05-18T14:59:22Z

@shoyer yeah definitely a performance hit, especially for something as basic as .sum(). Just for the sake of argument, I suppose it's possible to have something like a verify_results kwarg that lets the user choose whether to run this check or not. And by default it'd be False. However one can argue that the user can easily just do the floating sum check themselves without needing this kwarg.

shoyer · 2015-05-18T17:34:45Z

@jreback NumPy casts to float for computing mean, but it doesn't convert back to int -- nor should/does pandas:

In [11]: x = np.arange(11)

In [12]: x
Out[12]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [13]: np.mean(x)
Out[13]: 5.0

In [14]: pd.Series(x).mean()
Out[14]: 5.0

Bottleneck does handle nanmean properly on integer arrays (no overflow), so this bug in pandas only appears when our implementation of mean is used (if bottleneck is not installed). We should probably just be using np.mean for integer dtypes, because there's no possibility of missing values with integers.

mortada · 2015-05-19T16:45:07Z

ok I submitted a PR for mean (with sum unchanged)

su9204 · 2015-05-19T16:54:30Z

We still need to raise an OverflowError for the sum overflow when it occurs. As long as things does not fail silently, it should be Ok.

shoyer · 2015-05-19T16:59:31Z

@qqsusu How would you propose checking for sum overflow? Numpy's sum notes: "Arithmetic is modular when using integer types, and no error is raised on overflow."

su9204 · 2015-05-19T17:16:08Z

@shoyer I am just throwing out some thinking / pseudo code here:
assuming data type int64,
sum = // some number; assuming sum > 0 here, one can check for negative sum as well.
inside the loop, for each number x to be added to sum
if( x > INT64_MAX - sum) {
raise OverflowError;
} else {
sum += x;
}

//continue;

similar logic, for adding two numbers, x and y

if (x > INT64_MAX - y) {
raise error;
} else {
return x + y;
}

mortada · 2015-05-19T17:21:08Z

@qqsusu as I mentioned in a comment above, this wouldn't actually be feasible because it'd be a huge performance hit, and that's why numpy also does not raise on int overflow for the array case. You can find the explanation here:

http://mail.scipy.org/pipermail/numpy-discussion/2009-April/041691.html

Unlike true floating point errors (where the hardware FPU sets a flag
whenever it does an atomic operation that overflows), we need to
implement the integer overflow detection ourselves. We do it on the
scalars, but not arrays because it would be too slow to implement for
every atomic operation on arrays.

BUG: mean overflows for integer dtypes (fixes #10155)

su9204 changed the title ~~A Serious bug in dataframe or Series.describe().~~ A Serious bug in dataframe or Series.sum() and .describe() May 16, 2015

jreback added Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations labels May 18, 2015

shoyer changed the title ~~A Serious bug in dataframe or Series.sum() and .describe()~~ Mean overflows for integer dtypes May 18, 2015

shoyer added the Bug label May 18, 2015

mortada added a commit to mortada/pandas that referenced this issue May 19, 2015

BUG: mean overflows for integer dtypes (fixes pandas-dev#10155)

a4a4008

mortada added a commit to mortada/pandas that referenced this issue May 19, 2015

BUG: mean overflows for integer dtypes (fixes pandas-dev#10155)

b5216ba

mortada added a commit to mortada/pandas that referenced this issue May 20, 2015

BUG: mean overflows for integer dtypes (fixes pandas-dev#10155)

d63667c

mortada added a commit to mortada/pandas that referenced this issue May 20, 2015

BUG: mean overflows for integer dtypes (fixes pandas-dev#10155)

4155bbb

mortada added a commit to mortada/pandas that referenced this issue May 20, 2015

BUG: mean overflows for integer dtypes (fixes pandas-dev#10155)

4ac5a80

mortada added a commit to mortada/pandas that referenced this issue May 21, 2015

BUG: mean overflows for integer dtypes (fixes pandas-dev#10155)

a2e4822

mortada added a commit to mortada/pandas that referenced this issue May 21, 2015

BUG: mean overflows for integer dtypes (fixes pandas-dev#10155)

3d85fc7

mortada mentioned this issue May 21, 2015

BUG: mean overflows for integer dtypes (fixes #10155) #10172

Merged

jreback added this to the 0.17.0 milestone May 21, 2015

mortada added a commit to mortada/pandas that referenced this issue May 21, 2015

BUG: mean overflows for integer dtypes (fixes pandas-dev#10155)

420c5c0

mortada added a commit to mortada/pandas that referenced this issue May 21, 2015

BUG: mean overflows for integer dtypes (fixes pandas-dev#10155)

3896e5e

shoyer closed this as completed in #10172 May 30, 2015

shoyer added a commit that referenced this issue May 30, 2015

Merge pull request #10172 from mortada/mean_overflow

ed000e9

BUG: mean overflows for integer dtypes (fixes #10155)

jorisvandenbossche modified the milestones: 0.17.0, 0.16.2 Jun 2, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mean overflows for integer dtypes #10155

Mean overflows for integer dtypes #10155

su9204 commented May 16, 2015

su9204 commented May 16, 2015

mortada commented May 16, 2015

jreback commented May 17, 2015

shoyer commented May 17, 2015

su9204 commented May 17, 2015

mortada commented May 17, 2015

su9204 commented May 17, 2015

jreback commented May 17, 2015

mortada commented May 17, 2015

mortada commented May 17, 2015

mortada commented May 17, 2015

shoyer commented May 18, 2015

jreback commented May 18, 2015

mortada commented May 18, 2015

shoyer commented May 18, 2015

mortada commented May 19, 2015

su9204 commented May 19, 2015

shoyer commented May 19, 2015

su9204 commented May 19, 2015

mortada commented May 19, 2015

Mean overflows for integer dtypes #10155

Mean overflows for integer dtypes #10155

Comments

su9204 commented May 16, 2015

su9204 commented May 16, 2015

mortada commented May 16, 2015

jreback commented May 17, 2015

shoyer commented May 17, 2015

su9204 commented May 17, 2015

mortada commented May 17, 2015

su9204 commented May 17, 2015

jreback commented May 17, 2015

mortada commented May 17, 2015

mortada commented May 17, 2015

mortada commented May 17, 2015

shoyer commented May 18, 2015

jreback commented May 18, 2015

mortada commented May 18, 2015

shoyer commented May 18, 2015

mortada commented May 19, 2015

su9204 commented May 19, 2015

shoyer commented May 19, 2015

su9204 commented May 19, 2015

mortada commented May 19, 2015