possible bug when calculating mean of DataFrame? #11670

YAmikep · 2015-11-20T21:27:24Z

I'm trying to calculate the mean of all the columns of a DataFrame, but it looks like having a value in the B column of row 6 (index 5) prevents the mean of the C column from being calculated. Possible bug? (pandas: 0.17.1)

import pandas as pd
from decimal import Decimal
d = [
    {'A': 2, 'B': None, 'C': Decimal('628.00')},
    {'A': 1, 'B': None, 'C': Decimal('383.00')},
    {'A': 3, 'B': None, 'C': Decimal('651.00')},
    {'A': 2, 'B': None, 'C': Decimal('575.00')},
    {'A': 4, 'B': None, 'C': Decimal('1114.00')},
    {'A': 1, 'B': 'TEST', 'C': Decimal('241.00')},
    {'A': 2, 'B': None, 'C': Decimal('572.00')},
    {'A': 4, 'B': None, 'C': Decimal('609.00')},
    {'A': 3, 'B': None, 'C': Decimal('820.00')},
    {'A': 5, 'B': None, 'C': Decimal('1223.00')}
]

df = pd.DataFrame(d)

In : df
Out:
   A     B        C
0  2  None   628.00
1  1  None   383.00
2  3  None   651.00
3  2  None   575.00
4  4  None  1114.00
5  1  TEST   241.00
6  2  None   572.00
7  4  None   609.00
8  3  None   820.00
9  5  None  1223.00

The dtypes are the same with and without that row:

In : df.dtypes
Out:
A     int64
B    object
C    object
dtype: object

In : df.head(5).dtypes
Out:
A     int64
B    object
C    object
dtype: object

But calling mean on the dataframe does not work when row 6 is present:

# no mean for C column: row 6 is present
In : df.mean()
Out:
A    2.7
dtype: float64

# mean for C column when row 6 is left out of the DF
In : df.head(5).mean()
Out:
A      2.4
B      NaN
C    670.2
dtype: float64

# no mean for C column when row 6 is part of the DF
In : df.head(6).mean()
Out:
A    2.166667
dtype: float64

Also, it works when I explicitly leave out column B when calculating the mean:

In : df[['A','B','C']].mean()
Out:
A    2.7
dtype: float64

In : df[['A','C']].mean()
Out:
A      2.7
C    681.6
dtype: float64

jreback · 2015-11-20T22:05:28Z

We would have to jump through lots of hoops for this to work, I think. In general, object dtype is really meant for strings; putting Decimal, which is 'float-like', in it is just asking for trouble. What is happening is that the mean errors when it is tried on the B-C block, so that whole block is excluded.

I'll mark it, but if you really want it fixed, then please submit a pull request.

YAmikep · 2015-11-20T22:23:21Z

I receive the data with Decimal objects. That's not my choice.
So should Decimal not be used with pandas, and should I first manually convert each column that has Decimal objects?
df['C'] = df['C'].astype(float)

I think it gets confusing when df[['A','C']].mean() works and df[['A','B','C']].mean() does not.
If Decimal objects are not supported, I'd expect them not to work no matter what I do, and I would especially not expect the behavior to change based on the values of another column.

Why isn't the mean function applied per column? Why is it applied on the B-C block and not on each column separately?

Thanks

TomAugspurger · 2015-11-20T22:31:34Z

Columns are internally consolidated into blocks, one per dtype (usually). The object dtypes are all lumped together into one block here. When the .mean() for that block fails all the columns in it ('B' & 'C') are treated as nuisance columns and excluded from the result.

You are better off sticking to numpy dtypes + the ones pandas has added on (datetimes, categoricals) if your application allows it. You might not want to cast to floats without checking whether that will mess up the people you're giving the data back to :)

edit: as for why it's done at the block level vs once per column, I think the biggest reason is performance. We do have an issue about not consolidating blocks automatically, but that's not really meant to be exposed to a user and used in this kind of scenario (I think).
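To make the block-level behaviour described above concrete, here is a toy model in plain Python. It is not pandas' actual implementation: `column_mean` and `blockwise_means` are hypothetical helpers made up for this illustration, and the grouping of columns into blocks is passed in by hand rather than derived from dtypes.

```python
from decimal import Decimal


def column_mean(values):
    """Mean of one column, skipping None; raises TypeError on non-numeric data."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # nothing to average
    return sum(present) / len(present)


def blockwise_means(columns, blocks):
    """Toy model only: if any column in a block fails, the whole block is dropped."""
    result = {}
    for block in blocks:
        try:
            block_result = {name: column_mean(columns[name]) for name in block}
        except TypeError:
            continue  # every column in this block is treated as a nuisance column
        result.update(block_result)
    return result


columns = {
    "A": [2, 1, 3],
    "B": [None, None, "TEST"],
    "C": [Decimal("628.00"), Decimal("383.00"), Decimal("651.00")],
}

# B and C share one "object" block, so the failure on 'TEST' also drops C.
shared = blockwise_means(columns, [["A"], ["B", "C"]])

# One column per block: only B is dropped.
separate = blockwise_means(columns, [["A"], ["B"], ["C"]])

print(shared)
print(separate)
```

In this sketch the only difference between the two calls is how the columns are grouped, which mirrors why `df[['A','C']].mean()` and `df[['A','B','C']].mean()` disagreed in the report.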

edit2: and if you really need to support this, you could try

import numpy as np

def func(x):
    try:
        return np.nanmean(x)
    except TypeError:  # assumed error type; catch the error you actually see
        return None  # no mean for this column

df.apply(func)

but this will be slower

YAmikep · 2015-11-20T22:43:27Z

Thanks for the explanation for what happens under the hood.
I guess it will require some development to make mean work per column instead of per dtype block.

So are you suggesting to cast to a numpy float type for now?
df['C'] = df['C'].astype(numpy.float64)

jreback · 2015-11-20T22:53:53Z

yes, I would never use Decimal, it is completely non-performant. I understand there are people who need a fixed decimal type, but to be honest, that is just for display purposes.
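A minimal sketch of the cast discussed above, assuming float precision is acceptable for whoever consumes the result. The three-row frame is a shortened, illustrative version of the data in the report, and column B is left out of the selection explicitly.

```python
from decimal import Decimal

import pandas as pd

df = pd.DataFrame({
    "A": [2, 1, 3],
    "B": [None, None, "TEST"],
    "C": [Decimal("628.00"), Decimal("383.00"), Decimal("651.00")],
})

# Convert the Decimal objects to a native float64 column.
df["C"] = df["C"].astype(float)

# Select only the numeric columns so the string column cannot interfere.
means = df[["A", "C"]].mean()
print(means)
```

After the cast, column C no longer has object dtype, so it is not grouped with column B.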

kokes · 2018-12-16T15:26:08Z

Trying to reproduce OP's behaviour in 0.23.4 and failing to do so

In [12]: df.dtypes
Out[12]:
A     int64
B    object
C    object
dtype: object

In [13]: df.mean()
Out[13]:
A      2.7
C    681.6
dtype: float64

Has this block treatment of object columns changed? Because it now works as expected (well, almost, the mean itself is a float, but I guess that's fine).

(cc @jreback @TomAugspurger)
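If the float result noted above is a concern, one option is to compute the mean of the Decimal values with Decimal arithmetic outside of pandas. This is only a sketch using the standard library on the C values from the original report; it is not something the thread proposes.

```python
from decimal import Decimal

# The C column from the original report.
values = [
    Decimal("628.00"), Decimal("383.00"), Decimal("651.00"),
    Decimal("575.00"), Decimal("1114.00"), Decimal("241.00"),
    Decimal("572.00"), Decimal("609.00"), Decimal("820.00"),
    Decimal("1223.00"),
]

# sum() and division by an int both stay in Decimal arithmetic.
decimal_mean = sum(values) / len(values)
print(decimal_mean)
```

The result is a Decimal equal to the 681.6 that pandas reports as a float.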

mroeschke · 2019-10-20T20:59:37Z

Yes this looks to work on master. Could use a test:

In [51]: df.mean()
Out[51]:
A      2.7
C    681.6
dtype: float64

In [54]: pd.__version__
Out[54]: '0.26.0.dev0+593.g9d45934af'

jreback added the Bug, Dtype Conversions, Numeric Operations, and Difficulty Intermediate labels Nov 20, 2015

jreback added this to the Next Major Release milestone Nov 20, 2015

mroeschke added the good first issue and Needs Tests labels and removed the Bug, Difficulty Intermediate, Dtype Conversions, and Numeric Operations labels Oct 20, 2019

pv8473h12 mentioned this issue Nov 2, 2019 in GH:11670: possible bug when calculating mean of DataFrame? #29369 (merged)

jreback added the Numeric Operations label Nov 3, 2019

jreback modified the milestones: Contributions Welcome, 1.0 Nov 3, 2019

jreback closed this as completed in #29369 Nov 3, 2019
