Mean overflows for integer dtypes #10155
How do I label it as Bug?
I think that we could do the mean calc in an `np.errstate` context manager and catch the overflow (then do the calc as float); you normally have to do this in the same dtype.
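A minimal sketch of that idea (the helper `mean_with_float_fallback` is hypothetical, not pandas internals; note the caveat raised below that `np.errstate` does not catch overflow in array operations):

```python
import numpy as np

def mean_with_float_fallback(values):
    # Illustrative sketch only: try the integer reduction with overflow set
    # to raise, and redo the calculation in float space if it trips.
    # Caveat: np.errstate only reports overflow for *scalar* integer ops,
    # so this would not actually catch an overflowing int64 array sum.
    try:
        with np.errstate(over='raise'):
            return values.sum() / len(values)
    except FloatingPointError:
        return values.astype(np.float64).mean()
```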
I agree that we could probably fix `mean`.
I don't see any reason why sum should not be in floating-point space. When you need to sum up thousands of large numbers, say, trading volumes (int type) of all stocks across a market, it could potentially overflow. If the user wants int type, he/she can always cast after the fact. Trying to maintain the sum in the integer space is simply wrong and could cause an uncaught error; imagine that overflowed number getting fed into the next stage.
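For instance (an illustrative sketch with made-up values, not code from the thread):

```python
import numpy as np

# Four "volumes" of 2**62 sum to 2**64, which wraps around to 0 in int64 --
# silently, with no warning or error.
volumes = np.full(4, 2**62, dtype=np.int64)
print(volumes.sum())                     # 0
print(volumes.astype(np.float64).sum())  # 1.8446744073709552e+19
```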
@jreback it doesn't seem like `np.errstate` catches the overflow when it comes from an array operation:
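For example (a sketch of the kind of comparison described; the exact values are illustrative):

```python
import numpy as np

x = np.int64(2**62)
try:
    with np.errstate(over='raise'):
        x + x                        # scalar op: overflow is caught and raised
except FloatingPointError as exc:
    print("scalar overflow raised:", exc)

arr = np.array([2**62, 2**62], dtype=np.int64)
with np.errstate(over='raise'):
    print(arr.sum())                 # array op: silently wraps to -2**63
```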
Note that the last line does not produce a warning because it's an array operation.
@qqsusu you are missing the point here.
I feel the same way as what @shoyer described above - we could change `mean` to compute as float. However, with `sum`, converting to float could lose precision for large integers.
@qqsusu here's an example where precision would be lost by doing floats:
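An illustrative sketch (2**62 + 1 is not representable in float64, whose mantissa has only 53 bits):

```python
import numpy as np

x = np.array([2**62, 1], dtype=np.int64)
print(x.sum())                          # 4611686018427387905 (exact in int64)
print(int(x.astype(np.float64).sum()))  # 4611686018427387904 -- off by one
```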
How about this as a proposal for `sum`: if the dtype is int, we compute an additional sum with floats and compare that with the sum using ints. If they are different, we throw a warning. The returned value is still int, the same as before.
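A sketch of what that proposal might look like (a hypothetical helper, not pandas code; the float comparison is itself approximate because of float64 precision):

```python
import warnings
import numpy as np

def checked_int_sum(values):
    # Hypothetical: sum in int64 and again in float64, and warn when the two
    # disagree, which suggests the integer sum wrapped around.
    int_result = values.sum()
    float_result = values.astype(np.float64).sum()
    if float(int_result) != float_result:
        warnings.warn("integer sum may have overflowed", RuntimeWarning)
    return int_result
```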
Compute the sum twice for dtype=int? That would be a serious performance regression.
xref #9442 (for timedelta, though that is easily handled by other means). So you might want to see what, if anything, the bottleneck guys do about this. (I suspect they don't do anything, as numpy does not do anything.)
@shoyer yeah, definitely a performance hit, especially for something as basic as `sum`.
@jreback NumPy casts to float for computing mean, but it doesn't convert back to int -- nor should/does pandas:
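For example (illustrative):

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3], dtype=np.int64)
print(a.mean(), a.mean().dtype)  # 2.0 float64 -- NumPy returns the mean as float
print(pd.Series(a).mean())       # 2.0 -- pandas also returns a float
```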
Bottleneck does handle this.
ok, I submitted a PR for this.
We still need to raise an OverflowError for the sum overflow when it occurs. As long as things do not fail silently, it should be OK.
@shoyer I am just throwing out some thinking / pseudo code here:

```c
// ...continue; similar logic for adding two numbers x and y
// (assumes x and y are non-negative int64_t values):
if (x > INT64_MAX - y) {
    /* would overflow -- raise an error instead of wrapping */
}
```
@qqsusu as I mentioned in a comment above, this wouldn't actually be feasible because it'd be a huge performance hit, and that's why numpy also does not raise on int overflow for the array case. You can find the explanation here: http://mail.scipy.org/pipermail/numpy-discussion/2009-April/041691.html
BUG: mean overflows for integer dtypes (fixes #10155)
Say there is an array of type int64; for convenience, let me just put in some large number:
```python
test1 = pd.Series(20150515061816532, index=list(range(500)), dtype='int64')
test1.describe()
Out[152]:
count    5.000000e+02
mean    -1.674297e+16
std      0.000000e+00
min      2.015052e+16
25%      2.015052e+16
50%      2.015052e+16
75%      2.015052e+16
max      2.015052e+16
```
Look at the mean: it overflows and becomes negative. Obviously the mean should be 20150515061816532.
```python
In [153]: test1.sum()
Out[153]: -8371486542801285616
```

This is wrong.
The computation should have summed them up as float and divided by the total count.
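A workaround sketch along those lines (casting before the reduction; values beyond 2**53 lose some precision in float64):

```python
import pandas as pd

test1 = pd.Series(20150515061816532, index=list(range(500)), dtype='int64')
# Cast to float64 first so the accumulation happens in floating-point space.
print(test1.astype('float64').mean())  # ~2.0150515061816532e+16, not negative
```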
I think we need to examine other parts of the code that involve similar situations.