Skip to content

Mean overflows for integer dtypes #10155

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
su9204 opened this issue May 16, 2015 · 20 comments · Fixed by #10172
Closed

Mean overflows for integer dtypes #10155

su9204 opened this issue May 16, 2015 · 20 comments · Fixed by #10172
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@su9204
Copy link

su9204 commented May 16, 2015

Say there is a array of type int64
for convenience, that me just put is some large number

test1 = pd.Series(20150515061816532, index=list(range(500)), dtype='int64')
test1.describe()
Out[152]:
count 5.000000e+02
mean -1.674297e+16
std 0.000000e+00
min 2.015052e+16
25% 2.015052e+16
50% 2.015052e+16
75% 2.015052e+16
max 2.015052e+16

Look at the mean, it overflow, and become negative. Obviously the mean should be 20150515061816532

In [153]: test1.sum()
Out[153]: -8371486542801285616 This is wrong.

The computation should have been sum them up as float, and devided by total count.

I think we need to examine other parts of code that involve similar situation.

@su9204 su9204 changed the title A Serious bug in dataframe or Series.describe(). A Serious bug in dataframe or Series.sum() and .describe() May 16, 2015
@su9204
Copy link
Author

su9204 commented May 16, 2015

How do I label it as Bug?

@mortada
Copy link
Contributor

mortada commented May 16, 2015

For numpy arrays, this is the behavior:

In [1]: import numpy as np
In [2]: a = np.array(np.ones(500) * 20150515061816532, dtype='int64')
In [3]: a.sum()
Out[3]: -8371486542801285616
In [4]: a.mean()
Out[4]: 20150515061816532.0

Note that the sum() does still overflow, but the mean() is done with floating point calculations

@jreback
Copy link
Contributor

jreback commented May 17, 2015

I think that we could do the mean calc in np.errstate context manager - and catch the OverFlow (then do the calc as float); u normally have to do this in the same dtype

@shoyer
Copy link
Member

shoyer commented May 17, 2015

I agree that we could probably fix mean (since it should be done with floating point), but I don't see how we can fix test1.sum(), unless we always convert sums to floating point. Otherwise we end up with methods with inconsistent type signatures.

@su9204
Copy link
Author

su9204 commented May 17, 2015

I don't see any reason why sum should not be in floating points space. When you need to sum up thousands of large numbers, say, trading volumes (int type) of all stocks across a market, it could potentially overflow. If the user, want int type, he/she can always try after the fact. Trying to maintain the sum in the integer space is simply wrong, and could cause uncaught error, imaging that overflown number get fed into next stage.

@mortada
Copy link
Contributor

mortada commented May 17, 2015

@jreback it doesn't seem like np.errstate would be able to catch int overflow for the reasons mentioned here: http://mail.scipy.org/pipermail/numpy-discussion/2009-April/041691.html

In [1]: import numpy as np

In [2]: np.int64(2**62) + np.int64(2**62)
/Users/mortada_mehyar/anaconda/bin/ipython:1: RuntimeWarning: overflow encountered in long_scalars
  #!/bin/bash /Users/mortada_mehyar/anaconda/bin/python.app
Out[2]: -9223372036854775808

In [3]: np.array([2**62, 2**62], dtype=np.int64)
Out[3]: array([4611686018427387904, 4611686018427387904])

In [4]: np.array([2**62, 2**62], dtype=np.int64).sum()
Out[4]: -9223372036854775808

Note that the last line does not produce a warning because it's an array operation.

@su9204
Copy link
Author

su9204 commented May 17, 2015

@mortada @jreback
Even if np.errstatee could catch this, one possibly should not do it. It is inconsistent. We can't say, ok, let me try this first, it may return you integer, or it may return you float.

@jreback
Copy link
Contributor

jreback commented May 17, 2015

@qqsusu you are missing the point here
switching to floats loses precision so you never want to do this unless you have to

@mortada
Copy link
Contributor

mortada commented May 17, 2015

I feel the same way as what @shoyer described above - we could change mean to always do floating point calculations. This would be consistent with numpy too.

However with sum it doesn't seem right to always convert to floats.

@mortada
Copy link
Contributor

mortada commented May 17, 2015

@qqsusu here's an example where precision would be lost by doing floats:

In [1]: import numpy as np
In [2]: a = 2 ** 55
In [2]: np.int64(a + 10) - np.int64(a)
Out[2]: 10

In [3]: np.float64(a + 10) - np.float(a)
Out[3]: 8.0

@mortada
Copy link
Contributor

mortada commented May 17, 2015

How about this as a proposal for sum:

If the dtype is int, we compute an additional sum with floats and compare that with the sum using ints. If they are different, we throw a warning. The returned value is still int, the same as before.

@shoyer
Copy link
Member

shoyer commented May 18, 2015

If the dtype is int, we compute an additional sum with floats and compare that with the sum using ints. If they are different, we throw a warning. The returned value is still int, the same as before.

Compute the sum twice for dtype=int? That would be a serious performance regression.

@jreback
Copy link
Contributor

jreback commented May 18, 2015

xref is #9442 (for timedelta, though that is easily handled by other means).

so mean is easy, just cast to float, then astype back to the input type if they are equal (the scalar result).

you might want to see what/if the bottleneck guys do about this. (I suspect they don't do anything as numpy does not do anything).

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations labels May 18, 2015
@mortada
Copy link
Contributor

mortada commented May 18, 2015

@shoyer yeah definitely a performance hit, especially for something as basic as .sum(). Just for the sake of argument, I suppose it's possible to have something like a verify_results kwarg that lets the user choose whether to run this check or not. And by default it'd be False. However one can argue that the user can easily just do the floating sum check themselves without needing this kwarg.

@shoyer
Copy link
Member

shoyer commented May 18, 2015

@jreback NumPy casts to float for computing mean, but it doesn't convert back to int -- nor should/does pandas:

In [11]: x = np.arange(11)

In [12]: x
Out[12]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [13]: np.mean(x)
Out[13]: 5.0

In [14]: pd.Series(x).mean()
Out[14]: 5.0

Bottleneck does handle nanmean properly on integer arrays (no overflow), so this bug in pandas only appears when our implementation of mean is used (if bottleneck is not installed). We should probably just be using np.mean for integer dtypes, because there's no possibility of missing values with integers.

@shoyer shoyer changed the title A Serious bug in dataframe or Series.sum() and .describe() Mean overflows for integer dtypes May 18, 2015
@shoyer shoyer added the Bug label May 18, 2015
mortada added a commit to mortada/pandas that referenced this issue May 19, 2015
@mortada
Copy link
Contributor

mortada commented May 19, 2015

ok I submitted a PR for mean (with sum unchanged)

@su9204
Copy link
Author

su9204 commented May 19, 2015

We still need to raise an OverflowError for the sum overflow when it occurs. As long as things does not fail silently, it should be Ok.

@shoyer
Copy link
Member

shoyer commented May 19, 2015

@qqsusu How would you propose checking for sum overflow? Numpy's sum notes: "Arithmetic is modular when using integer types, and no error is raised on overflow."

@su9204
Copy link
Author

su9204 commented May 19, 2015

@shoyer I am just throwing out some thinking / pseudo code here:
assuming data type int64,
sum = // some number; assuming sum > 0 here, one can check for negative sum as well.
inside the loop, for each number x to be added to sum
if( x > INT64_MAX - sum) {
raise OverflowError;
} else {
sum += x;
}

//continue;

similar logic, for adding two numbers, x and y

if (x > INT64_MAX - y) {
raise error;
} else {
return x + y;
}

@mortada
Copy link
Contributor

mortada commented May 19, 2015

@qqsusu as I mentioned in a comment above, this wouldn't actually be feasible because it'd be a huge performance hit, and that's why numpy also does not raise on int overflow for the array case. You can find the explanation here:

http://mail.scipy.org/pipermail/numpy-discussion/2009-April/041691.html

Unlike true floating point errors (where the hardware FPU sets a flag
whenever it does an atomic operation that overflows), we need to
implement the integer overflow detection ourselves. We do it on the
scalars, but not arrays because it would be too slow to implement for
every atomic operation on arrays.

mortada added a commit to mortada/pandas that referenced this issue May 19, 2015
mortada added a commit to mortada/pandas that referenced this issue May 20, 2015
mortada added a commit to mortada/pandas that referenced this issue May 20, 2015
mortada added a commit to mortada/pandas that referenced this issue May 20, 2015
mortada added a commit to mortada/pandas that referenced this issue May 21, 2015
mortada added a commit to mortada/pandas that referenced this issue May 21, 2015
@jreback jreback added this to the 0.17.0 milestone May 21, 2015
mortada added a commit to mortada/pandas that referenced this issue May 21, 2015
mortada added a commit to mortada/pandas that referenced this issue May 21, 2015
shoyer added a commit that referenced this issue May 30, 2015
BUG: mean overflows for integer dtypes (fixes #10155)
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.17.0, 0.16.2 Jun 2, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants