Skip to content

BUG: DataFrame.sum() changes int columns to float when only int and float columns #26219

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rswgnu opened this issue Apr 26, 2019 · 9 comments
Closed
Labels
Needs Info Clarification about behavior needed to assess issue

Comments

@rswgnu
Copy link

rswgnu commented Apr 26, 2019

Herein, I use int and float as abbrevs. for int64 and float64 dtypes.

In Pandas 0.24.2 running under Debian on Windows Subsytem for Linux, if I have a dataframe, df, with an int column and I execute:

df.loc['Totals'] = df.sum()

to add a summation row, then the int column remains of type int, as expected. But if I instead use:

df.loc['Totals'] = df.sum(numeric_only=True)

to filter to just numeric columns (which I need to do), then any int column is changed to a float, causing display formatting problems, e.g. when showing row and column numbers.

Is there an easy fix for this? Thanks.

May be related to issue #13416.

@rswgnu rswgnu changed the title DataFrame.sum(numeric_only=True) changes int columns to float BUG: DataFrame.sum(numeric_only=True) changes int columns to float Apr 26, 2019
@TomAugspurger
Copy link
Contributor

Can you provide a full example? See http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@TomAugspurger TomAugspurger added the Needs Info Clarification about behavior needed to assess issue label Apr 26, 2019
@rswgnu
Copy link
Author

rswgnu commented Apr 27, 2019

It turns out it is not the numeric_only=True that causes the problem but apparently only when there are only int and float columns, as the following examples demonstrate.

First the output:

**** Example 1: Dataframe with just an int column; sum() leaves it as int64 ****
df1 before df1.sum():
   int_col
0        1
1        2

Dtypes before df1.sum():
int_col    int64
dtype: object

df1 after df1.sum():
         int_col
0              1
1              2
Totals:        3

Dtypes after df1.sum():
int_col    int64
dtype: object


**** Example 2: Dataframe with int and float columns; sum() changes int64 to float64 ****
df2 before df2.sum():
   int_col  float_col
0        1        3.2
1        2        4.7

Dtypes before df2.sum():
int_col        int64
float_col    float64
dtype: object

df2 after df2.sum():
         int_col  float_col
0            1.0        3.2
1            2.0        4.7
Totals:      3.0        7.9

Dtypes after df2.sum():
int_col      float64
float_col    float64
dtype: object


**** Example 3: Dataframe with alpha column; sum() leaves int64 unchanged ****
df3 before df3.sum():
   int_col  float_col alpha_col
0        1        3.2         a
1        2        4.7         b

Dtypes before df3.sum():
int_col        int64
float_col    float64
alpha_col     object
dtype: object

df3 after df3.sum():
         int_col  float_col alpha_col
0              1        3.2         a
1              2        4.7         b
Totals:        3        7.9        ab

Dtypes after df3.sum():
int_col        int64
float_col    float64
alpha_col     object
dtype: object

Now the code:

#!/usr/bin/env python3

import pandas as pd

print('**** Example 1: Dataframe with just an int column; sum() leaves it as int64 ****')
df1 = pd.DataFrame({'int_col': [1, 2]})
print(f'df1 before df1.sum():\n{df1}\n')
print(f'Dtypes before df1.sum():\n{df1.dtypes}\n')
df1.loc['Totals:'] = df1.sum()
print(f'df1 after df1.sum():\n{df1}\n')
print(f'Dtypes after df1.sum():\n{df1.dtypes}\n\n')


print('**** Example 2: Dataframe with int and float columns; sum() changes int64 to float64 ****')
df2 = pd.DataFrame({'int_col': [1, 2], 'float_col': [3.2, 4.7]})
print(f'df2 before df2.sum():\n{df2}\n')
print(f'Dtypes before df2.sum():\n{df2.dtypes}\n')
df2.loc['Totals:'] = df2.sum()
print(f'df2 after df2.sum():\n{df2}\n')
print(f'Dtypes after df2.sum():\n{df2.dtypes}\n\n')

print('**** Example 3: Dataframe with alpha column; sum() leaves int64 column unchanged ****')
df3 = pd.DataFrame({'int_col': [1, 2], 'float_col': [3.2, 4.7], 'alpha_col': ['a', 'b']})
print(f'df3 before df3.sum():\n{df3}\n')
print(f'Dtypes before df3.sum():\n{df3.dtypes}\n')
df3.loc['Totals:'] = df3.sum()
print(f'df3 after df3.sum():\n{df3}\n')
print(f'Dtypes after df3.sum():\n{df3.dtypes}\n')

@rswgnu rswgnu changed the title BUG: DataFrame.sum(numeric_only=True) changes int columns to float BUG: DataFrame.sum() changes int columns to float when only int and float columns Apr 27, 2019
@rswgnu
Copy link
Author

rswgnu commented Apr 27, 2019

I found this which seems to explain the behavior in part, indicating that NaNs cannot be stored in integer series, but the only NaN introduced by my totals line in my production code occurs in the last column, not the int column, so it is still a bit unclear whether this is intended or accidental behavior. Maybe a NaN in any column causes the coercing.


https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html

When introducing NAs into an existing Series or DataFrame via reindex() or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. The promotions are summarized in this table:

Typeclass Promotion dtype for storing NAs
floating no change
object no change
integer cast to float64
boolean cast to object

@vadella
Copy link

vadella commented May 2, 2019

df2.sum() is a Series with one dtype. In this case float64. This series gets assigned to the total-row, so this behaviour is logical in that sense. The reason why this doesn't happen to df3 is because df3.sum() is a Series with dtype object.

A possible work-around is doing the summation per column:

def sum_per_column(df):
    for label, column in df2.items():
        column.loc["total"] = column.sum()
        yield label, column

pd.DataFrame.from_dict(dict(sum_per_column(df2)))

An alternative is saving the dtypes and then going back to them:

dtypes_before = df2.dtypes
df2.loc["Totals: "] = df2.sum()
df2 = df2.astype(dtypes_before)

When the sum of the integer columns are too large to be represented exactly as Float64 (>2**53 if I recall correctly), this intermediate cast to float can lead to rounding differences with the exact, per column summation.

@rswgnu
Copy link
Author

rswgnu commented May 2, 2019 via email

@TomAugspurger
Copy link
Contributor

Thanks @vadella, I think you are correct.

I believe we have an issue elsewhere for keeping the result as a 1-row DataFrame, which would avoid the castint issue.

@jrmistry
Copy link

Hi guys, I am coming across a very similar problem related to the .sum() function. I am not sure if it the exact same thing but
definitely an issue with using the .sum() function over boolean values

Core problem: using the .sum() function over a groupby of a dataframe of boolean values returns the results as floats instead of integers. I can quick fix it by doing .applymap(int) but I would rather not. Any suggestions? @TomAugspurger

See my example code below:
image

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Aug 21, 2020 via email

@edmundhighcock
Copy link

After reading through this I have a workaround. If I want to sum a dataframe with some int and some float columns, I add a new dummy column containing text, then call sum(), then remove the dummy element from the resulting series... abracadabra! Thanks for the discussion which showed me what was going on!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Info Clarification about behavior needed to assess issue
Projects
None yet
Development

No branches or pull requests

5 participants