BUG: DataFrame.sum() changes int columns to float when only int and float columns #26219

rswgnu · 2019-04-26T20:54:58Z

Herein, I use int and float as abbrevs. for int64 and float64 dtypes.

In Pandas 0.24.2 running under Debian on Windows Subsytem for Linux, if I have a dataframe, df, with an int column and I execute:

df.loc['Totals'] = df.sum()

to add a summation row, then the int column remains of type int, as expected. But if I instead use:

df.loc['Totals'] = df.sum(numeric_only=True)

to filter to just numeric columns (which I need to do), then any int column is changed to a float, causing display formatting problems, e.g. when showing row and column numbers.

Is there an easy fix for this? Thanks.

May be related to issue #13416.

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2019-04-26T21:59:27Z

Can you provide a full example? See http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

rswgnu · 2019-04-27T13:55:28Z

It turns out it is not the numeric_only=True that causes the problem but apparently only when there are only int and float columns, as the following examples demonstrate.

First the output:

**** Example 1: Dataframe with just an int column; sum() leaves it as int64 ****
df1 before df1.sum():
   int_col
0        1
1        2

Dtypes before df1.sum():
int_col    int64
dtype: object

df1 after df1.sum():
         int_col
0              1
1              2
Totals:        3

Dtypes after df1.sum():
int_col    int64
dtype: object


**** Example 2: Dataframe with int and float columns; sum() changes int64 to float64 ****
df2 before df2.sum():
   int_col  float_col
0        1        3.2
1        2        4.7

Dtypes before df2.sum():
int_col        int64
float_col    float64
dtype: object

df2 after df2.sum():
         int_col  float_col
0            1.0        3.2
1            2.0        4.7
Totals:      3.0        7.9

Dtypes after df2.sum():
int_col      float64
float_col    float64
dtype: object


**** Example 3: Dataframe with alpha column; sum() leaves int64 unchanged ****
df3 before df3.sum():
   int_col  float_col alpha_col
0        1        3.2         a
1        2        4.7         b

Dtypes before df3.sum():
int_col        int64
float_col    float64
alpha_col     object
dtype: object

df3 after df3.sum():
         int_col  float_col alpha_col
0              1        3.2         a
1              2        4.7         b
Totals:        3        7.9        ab

Dtypes after df3.sum():
int_col        int64
float_col    float64
alpha_col     object
dtype: object

Now the code:

#!/usr/bin/env python3

import pandas as pd

print('**** Example 1: Dataframe with just an int column; sum() leaves it as int64 ****')
df1 = pd.DataFrame({'int_col': [1, 2]})
print(f'df1 before df1.sum():\n{df1}\n')
print(f'Dtypes before df1.sum():\n{df1.dtypes}\n')
df1.loc['Totals:'] = df1.sum()
print(f'df1 after df1.sum():\n{df1}\n')
print(f'Dtypes after df1.sum():\n{df1.dtypes}\n\n')


print('**** Example 2: Dataframe with int and float columns; sum() changes int64 to float64 ****')
df2 = pd.DataFrame({'int_col': [1, 2], 'float_col': [3.2, 4.7]})
print(f'df2 before df2.sum():\n{df2}\n')
print(f'Dtypes before df2.sum():\n{df2.dtypes}\n')
df2.loc['Totals:'] = df2.sum()
print(f'df2 after df2.sum():\n{df2}\n')
print(f'Dtypes after df2.sum():\n{df2.dtypes}\n\n')

print('**** Example 3: Dataframe with alpha column; sum() leaves int64 column unchanged ****')
df3 = pd.DataFrame({'int_col': [1, 2], 'float_col': [3.2, 4.7], 'alpha_col': ['a', 'b']})
print(f'df3 before df3.sum():\n{df3}\n')
print(f'Dtypes before df3.sum():\n{df3.dtypes}\n')
df3.loc['Totals:'] = df3.sum()
print(f'df3 after df3.sum():\n{df3}\n')
print(f'Dtypes after df3.sum():\n{df3.dtypes}\n')

rswgnu · 2019-04-27T20:46:17Z

I found this which seems to explain the behavior in part, indicating that NaNs cannot be stored in integer series, but the only NaN introduced by my totals line in my production code occurs in the last column, not the int column, so it is still a bit unclear whether this is intended or accidental behavior. Maybe a NaN in any column causes the coercing.

https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html

When introducing NAs into an existing Series or DataFrame via reindex() or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. The promotions are summarized in this table:

Typeclass	Promotion dtype for storing NAs
floating	no change
object	no change
integer	cast to float64
boolean	cast to object

vadella · 2019-05-02T09:43:13Z

df2.sum() is a Series with one dtype. In this case float64. This series gets assigned to the total-row, so this behaviour is logical in that sense. The reason why this doesn't happen to df3 is because df3.sum() is a Series with dtype object.

A possible work-around is doing the summation per column:

def sum_per_column(df):
    for label, column in df2.items():
        column.loc["total"] = column.sum()
        yield label, column

pd.DataFrame.from_dict(dict(sum_per_column(df2)))

An alternative is saving the dtypes and then going back to them:

dtypes_before = df2.dtypes
df2.loc["Totals: "] = df2.sum()
df2 = df2.astype(dtypes_before)

When the sum of the integer columns are too large to be represented exactly as Float64 (>2**53 if I recall correctly), this intermediate cast to float can lead to rounding differences with the exact, per column summation.

rswgnu · 2019-05-02T13:43:31Z

On May 2, 2019, at 5:43 AM, vadella ***@***.***> wrote: df2.sum() is a Series with one dtype. In this case float64. This series gets assigned to the total-row, so this behaviour is logical in that sense. The reason why this doesn't happen to df3 is because df3.sum() is a Series with dtype object.

Thanks for the summary. How much overhead would it add if sum automatically cast any sum of mixed types to object, rather than the highest common type? Then behavior would be consistent and programmers would not have to special case the combination of only int and float columns.

A possible work-around is doing the summation per column: def sum_per_column(df): for label, column in df2.items(): column.loc["total"] = column.sum() yield label, column pd.DataFrame.from_dict(dict(sum_per_column(df2))) An alternative is saving the dtypes and then going back to them: dtypes_before = df2.dtypes df2.loc["Totals: "] = df2.sum() df2 = df2.astype(dtypes_before)

The second option seems simpler and more direct. These are helpful but eliminating the need for more special case programming would be even better. Can I force the output type of the sum to be a series of objects to prevent the int coercion?

When the sum of the integer columns are too large to be represented exactly as Float64 (>2**53 if I recall correctly), this intermediate cast to float can lead to rounding differences with the exact, per column summation.

Another reason not to cast it. Thanks much.

TomAugspurger · 2019-05-03T16:25:12Z

Thanks @vadella, I think you are correct.

I believe we have an issue elsewhere for keeping the result as a 1-row DataFrame, which would avoid the castint issue.

jrmistry · 2020-08-21T01:50:09Z

Hi guys, I am coming across a very similar problem related to the .sum() function. I am not sure if it the exact same thing but
definitely an issue with using the .sum() function over boolean values

Core problem: using the .sum() function over a groupby of a dataframe of boolean values returns the results as floats instead of integers. I can quick fix it by doing .applymap(int) but I would rather not. Any suggestions? @TomAugspurger

See my example code below:

TomAugspurger · 2020-08-21T11:28:43Z

We have another issue open specifically about groupby().sum().

…

On Thu, Aug 20, 2020 at 8:50 PM Jaydeep Mistry ***@***.***> wrote: Hi guys, I am coming across a very similar problem related to the .sum() function. I am not sure if it the exact same thing but definitely an issue with using the .sum() function over boolean values Core problem: using the .sum() function over a groupby of a dataframe of boolean values returns the results as floats instead of integers. I can quick fix it by doing .applymap(int) but I would rather not. Any suggestions? @TomAugspurger <https://github.com/TomAugspurger> See my example code below: [image: image] <https://user-images.githubusercontent.com/8127024/90842940-7938fc80-e32e-11ea-8424-ed39b2e39e8f.png> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#26219 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAKAOISRQ2OQG3P4LF2QLTTSBXHF5ANCNFSM4HI2IO3Q> .

edmundhighcock · 2021-08-25T21:18:12Z

After reading through this I have a workaround. If I want to sum a dataframe with some int and some float columns, I add a new dummy column containing text, then call sum(), then remove the dummy element from the resulting series... abracadabra! Thanks for the discussion which showed me what was going on!

rswgnu changed the title ~~DataFrame.sum(numeric_only=True) changes int columns to float~~ BUG: DataFrame.sum(numeric_only=True) changes int columns to float Apr 26, 2019

TomAugspurger added the Needs Info Clarification about behavior needed to assess issue label Apr 26, 2019

rswgnu changed the title ~~BUG: DataFrame.sum(numeric_only=True) changes int columns to float~~ BUG: DataFrame.sum() changes int columns to float when only int and float columns Apr 27, 2019

TomAugspurger closed this as completed May 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: DataFrame.sum() changes int columns to float when only int and float columns #26219

BUG: DataFrame.sum() changes int columns to float when only int and float columns #26219

rswgnu commented Apr 26, 2019 •

edited

Loading

TomAugspurger commented Apr 26, 2019

rswgnu commented Apr 27, 2019

rswgnu commented Apr 27, 2019 •

edited

Loading

vadella commented May 2, 2019

rswgnu commented May 2, 2019 via email

TomAugspurger commented May 3, 2019

jrmistry commented Aug 21, 2020

TomAugspurger commented Aug 21, 2020 via email

edmundhighcock commented Aug 25, 2021

BUG: DataFrame.sum() changes int columns to float when only int and float columns #26219

BUG: DataFrame.sum() changes int columns to float when only int and float columns #26219

Comments

rswgnu commented Apr 26, 2019 • edited Loading

TomAugspurger commented Apr 26, 2019

rswgnu commented Apr 27, 2019

rswgnu commented Apr 27, 2019 • edited Loading

vadella commented May 2, 2019

rswgnu commented May 2, 2019 via email

TomAugspurger commented May 3, 2019

jrmistry commented Aug 21, 2020

TomAugspurger commented Aug 21, 2020 via email

edmundhighcock commented Aug 25, 2021

rswgnu commented Apr 26, 2019 •

edited

Loading

rswgnu commented Apr 27, 2019 •

edited

Loading