-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: DataFrame.sum() changes int columns to float when only int and float columns #26219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Can you provide a full example? See http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports |
It turns out it is not the numeric_only=True that causes the problem but apparently only when there are only int and float columns, as the following examples demonstrate. First the output:
Now the code: #!/usr/bin/env python3
import pandas as pd
print('**** Example 1: Dataframe with just an int column; sum() leaves it as int64 ****')
df1 = pd.DataFrame({'int_col': [1, 2]})
print(f'df1 before df1.sum():\n{df1}\n')
print(f'Dtypes before df1.sum():\n{df1.dtypes}\n')
df1.loc['Totals:'] = df1.sum()
print(f'df1 after df1.sum():\n{df1}\n')
print(f'Dtypes after df1.sum():\n{df1.dtypes}\n\n')
print('**** Example 2: Dataframe with int and float columns; sum() changes int64 to float64 ****')
df2 = pd.DataFrame({'int_col': [1, 2], 'float_col': [3.2, 4.7]})
print(f'df2 before df2.sum():\n{df2}\n')
print(f'Dtypes before df2.sum():\n{df2.dtypes}\n')
df2.loc['Totals:'] = df2.sum()
print(f'df2 after df2.sum():\n{df2}\n')
print(f'Dtypes after df2.sum():\n{df2.dtypes}\n\n')
print('**** Example 3: Dataframe with alpha column; sum() leaves int64 column unchanged ****')
df3 = pd.DataFrame({'int_col': [1, 2], 'float_col': [3.2, 4.7], 'alpha_col': ['a', 'b']})
print(f'df3 before df3.sum():\n{df3}\n')
print(f'Dtypes before df3.sum():\n{df3.dtypes}\n')
df3.loc['Totals:'] = df3.sum()
print(f'df3 after df3.sum():\n{df3}\n')
print(f'Dtypes after df3.sum():\n{df3.dtypes}\n') |
I found this which seems to explain the behavior in part, indicating that NaNs cannot be stored in integer series, but the only NaN introduced by my totals line in my production code occurs in the last column, not the int column, so it is still a bit unclear whether this is intended or accidental behavior. Maybe a NaN in any column causes the coercing. https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html When introducing NAs into an existing Series or DataFrame via reindex() or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. The promotions are summarized in this table:
|
A possible work-around is doing the summation per column:
An alternative is saving the
When the sum of the integer columns are too large to be represented exactly as |
On May 2, 2019, at 5:43 AM, vadella ***@***.***> wrote:
df2.sum() is a Series with one dtype. In this case float64. This series gets assigned to the total-row, so this behaviour is logical in that sense. The reason why this doesn't happen to df3 is because df3.sum() is a Series with dtype object.
Thanks for the summary. How much overhead would it add if sum automatically cast any sum of mixed types to object, rather than the highest common type? Then behavior would be consistent and programmers would not have to special case the combination of only int and float columns.
A possible work-around is doing the summation per column:
def sum_per_column(df):
for label, column in df2.items():
column.loc["total"] = column.sum()
yield label, column
pd.DataFrame.from_dict(dict(sum_per_column(df2)))
An alternative is saving the dtypes and then going back to them:
dtypes_before = df2.dtypes
df2.loc["Totals: "] = df2.sum()
df2 = df2.astype(dtypes_before)
The second option seems simpler and more direct.
These are helpful but eliminating the need for more special case programming would be even better. Can I force the output type of the sum to be a series of objects to prevent the int coercion?
When the sum of the integer columns are too large to be represented exactly as Float64 (>2**53 if I recall correctly), this intermediate cast to float can lead to rounding differences with the exact, per column summation.
Another reason not to cast it.
Thanks much.
|
Thanks @vadella, I think you are correct. I believe we have an issue elsewhere for keeping the result as a 1-row DataFrame, which would avoid the castint issue. |
Hi guys, I am coming across a very similar problem related to the .sum() function. I am not sure if it the exact same thing but Core problem: using the .sum() function over a groupby of a dataframe of boolean values returns the results as floats instead of integers. I can quick fix it by doing .applymap(int) but I would rather not. Any suggestions? @TomAugspurger |
We have another issue open specifically about groupby().sum().
…On Thu, Aug 20, 2020 at 8:50 PM Jaydeep Mistry ***@***.***> wrote:
Hi guys, I am coming across a very similar problem related to the .sum()
function. I am not sure if it the exact same thing but
definitely an issue with using the .sum() function over boolean values
Core problem: using the .sum() function over a groupby of a dataframe of
boolean values returns the results as floats instead of integers. I can
quick fix it by doing .applymap(int) but I would rather not. Any
suggestions? @TomAugspurger <https://github.com/TomAugspurger>
See my example code below:
[image: image]
<https://user-images.githubusercontent.com/8127024/90842940-7938fc80-e32e-11ea-8424-ed39b2e39e8f.png>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#26219 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOISRQ2OQG3P4LF2QLTTSBXHF5ANCNFSM4HI2IO3Q>
.
|
After reading through this I have a workaround. If I want to sum a dataframe with some int and some float columns, I add a new dummy column containing text, then call |
Herein, I use int and float as abbrevs. for int64 and float64 dtypes.
In Pandas 0.24.2 running under Debian on Windows Subsytem for Linux, if I have a dataframe, df, with an int column and I execute:
to add a summation row, then the int column remains of type int, as expected. But if I instead use:
to filter to just numeric columns (which I need to do), then any int column is changed to a float, causing display formatting problems, e.g. when showing row and column numbers.
Is there an easy fix for this? Thanks.
May be related to issue #13416.
The text was updated successfully, but these errors were encountered: