possible bug when calculating mean of DataFrame? #11670

Closed
YAmikep opened this issue Nov 20, 2015 · 7 comments · Fixed by #29369
Labels
good first issue; Needs Tests (Unit test(s) needed to prevent regressions); Numeric Operations (Arithmetic, Comparison, and Logical operations)
Comments

@YAmikep
YAmikep commented Nov 20, 2015

I'm trying to calculate the mean of all the columns of a DataFrame, but it looks like having a value in column B of row 6 prevents the mean of column C from being calculated. Possible bug? (pandas: 0.17.1)

import pandas as pd
from decimal import Decimal
d = [
    {'A': 2, 'B': None, 'C': Decimal('628.00')},
    {'A': 1, 'B': None, 'C': Decimal('383.00')},
    {'A': 3, 'B': None, 'C': Decimal('651.00')},
    {'A': 2, 'B': None, 'C': Decimal('575.00')},
    {'A': 4, 'B': None, 'C': Decimal('1114.00')},
    {'A': 1, 'B': 'TEST', 'C': Decimal('241.00')},
    {'A': 2, 'B': None, 'C': Decimal('572.00')},
    {'A': 4, 'B': None, 'C': Decimal('609.00')},
    {'A': 3, 'B': None, 'C': Decimal('820.00')},
    {'A': 5, 'B': None, 'C': Decimal('1223.00')}
]

df = pd.DataFrame(d)

In : df
Out:
   A     B        C
0  2  None   628.00
1  1  None   383.00
2  3  None   651.00
3  2  None   575.00
4  4  None  1114.00
5  1  TEST   241.00
6  2  None   572.00
7  4  None   609.00
8  3  None   820.00
9  5  None  1223.00

dtypes are equivalent:

In : df.dtypes
Out:
A     int64
B    object
C    object
dtype: object

In : df.head(5).dtypes
Out:
A     int64
B    object
C    object
dtype: object

But calling mean on the dataframe does not work when row 6 is present:

# no mean for C column: row 6 is present
In : df.mean()
Out:
A    2.7
dtype: float64

# mean for C column when row 6 is left out of the DF
In : df.head(5).mean()
Out:
A      2.4
B      NaN
C    670.2
dtype: float64

# no mean for C column when row 6 is part of the DF
In : df.head(6).mean()
Out:
A    2.166667
dtype: float64

Also, it works when I explicitly leave out column B when calculating the mean:

In : df[['A','B','C']].mean()
Out:
A    2.7
dtype: float64

In : df[['A','C']].mean()
Out:
A      2.7
C    681.6
dtype: float64
@jreback
Contributor

jreback commented Nov 20, 2015

We would have to jump through lots of hoops for this to work, I think. In general, object dtype is really meant for strings; putting Decimal, which is 'float-like', in it is just asking for trouble. What is happening is that when the mean is attempted on the B-C block, it errors, so that block is excluded.

I'll mark it, but if you really want it fixed, then please submit a pull request.

@jreback added the Bug, Dtype Conversions (Unexpected or buggy dtype conversions), Numeric Operations (Arithmetic, Comparison, and Logical operations), and Difficulty Intermediate labels Nov 20, 2015
@jreback added this to the Next Major Release milestone Nov 20, 2015
@YAmikep
Author

YAmikep commented Nov 20, 2015

I receive the data with Decimal objects. That's not my choice.
So should Decimal not be used with pandas, and should I first manually convert each column that holds Decimal objects?
df['C'] = df['C'].astype(float)

I think it gets confusing when df[['A','C']].mean() works and df[['A','B','C']].mean() does not.
If Decimal objects are not supported, I'd expect them to fail no matter what I do, and I especially wouldn't expect the behavior to change based on the values in another column.

Why isn't the mean function applied per column? Why is it applied on the B-C block and not on each column separately?

Thanks
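For the manual conversion asked about here, one possible sketch is a small helper that casts every object column holding only Decimal values to float64. The helper name decimal_columns_to_float and the small three-row frame are hypothetical, made up for illustration; this is not a pandas API, and the cast trades Decimal's exactness for binary floating point.

```python
from decimal import Decimal

import pandas as pd


def decimal_columns_to_float(df):
    """Return a copy of df where object columns holding only Decimals become float64.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue  # leave non-object columns untouched
        non_null = col.dropna()
        # Convert only when every non-missing value is a Decimal.
        if len(non_null) > 0 and all(isinstance(v, Decimal) for v in non_null):
            out[name] = col.astype(float)
    return out


# Small illustrative frame (not the full data from the report).
df = pd.DataFrame({
    "A": [2, 1, 3],
    "B": [None, "TEST", None],
    "C": [Decimal("628.00"), Decimal("383.00"), Decimal("651.00")],
})
converted = decimal_columns_to_float(df)
c_mean = converted["C"].mean()
```

Column B is left alone because it contains a string, so only C changes dtype. Whether losing Decimal precision is acceptable depends on who consumes the data afterwards.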

@TomAugspurger
Contributor

Columns are internally consolidated into blocks, one per dtype (usually). The object dtypes are all lumped together into one block here. When the .mean() for that block fails all the columns in it ('B' & 'C') are treated as nuisance columns and excluded from the result.
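The effect described above can be imitated with a toy model in plain Python. This is only a sketch of the idea (columns grouped by a coarse type, with one failure dropping the whole group); it is not how pandas is implemented, and the function names are made up for illustration.

```python
from decimal import Decimal


def mean_or_fail(values):
    """Average the non-None values; raises TypeError on non-numeric entries."""
    present = [v for v in values if v is not None]
    if not present:
        return float("nan")  # an all-missing column yields NaN
    return sum(present) / len(present)


def block_level_means(columns):
    """Toy model: group columns by a coarse type and reduce each group as a unit."""
    blocks = {}
    for name, values in columns.items():
        kind = "int" if all(isinstance(v, int) for v in values) else "object"
        blocks.setdefault(kind, []).append(name)
    result = {}
    for names in blocks.values():
        try:
            partial = {n: mean_or_fail(columns[n]) for n in names}
        except TypeError:
            continue  # one bad column drops every column in its block
        result.update(partial)
    return result


def per_column_means(columns):
    """Alternative: reduce each column on its own and skip only the failures."""
    result = {}
    for name, values in columns.items():
        try:
            result[name] = mean_or_fail(values)
        except TypeError:
            pass
    return result


columns = {
    "A": [2, 1, 3],
    "B": [None, "TEST", None],
    "C": [Decimal("628.00"), Decimal("383.00"), Decimal("651.00")],
}
block_result = block_level_means(columns)
column_result = per_column_means(columns)
```

In this model the block-level version returns a mean only for A, because the string in B makes the shared "object" group fail, while the per-column version also returns a mean for C.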

You are better off sticking to numpy dtypes + the ones pandas has added on (datetimes, categoricals) if your application allows it. You might not want to cast to floats without checking whether that will mess things up for the people you're giving the data back to :)
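As a small illustration of why a float cast deserves a check first, the standard-library sketch below compares an exact Decimal sum with the same sum done in binary floating point. The values are made up for the example.

```python
from decimal import Decimal

prices = [Decimal("0.10"), Decimal("0.20")]

# Decimal arithmetic keeps the values exact.
exact_total = sum(prices, Decimal("0"))

# Converting to float first introduces binary rounding error.
float_total = sum(float(p) for p in prices)

decimal_matches = exact_total == Decimal("0.30")
float_matches = float_total == 0.30
```

Here decimal_matches is True and float_matches is False, which is the kind of discrepancy a downstream consumer expecting exact amounts might notice.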

edit: as for why it's done at the block level vs once per column, I think the biggest reason is performance. We do have an issue about not consolidating blocks automatically, but that's not really meant to be exposed to a user and used in this kind of scenario (I think).

edit2: and if you really need to support this, you could try

import numpy as np

def func(x):
    try:
        return np.nanmean(x)
    except Exception:  # ideally, catch only the actual error raised
        return None  # skip columns that cannot be averaged

df.apply(func)

but this will be slower

@YAmikep
Author

YAmikep commented Nov 20, 2015

Thanks for the explanation for what happens under the hood.
I guess it will require some development to make mean work per column instead of per dtype block.

So are you suggesting casting to a numpy float type for now?
df['C'] = df['C'].astype(numpy.float64)

@jreback
Contributor

jreback commented Nov 20, 2015

yes, I would never use Decimal, it is completely non-performant. I understand there are people who need a fixed decimal type, but to be honest, that is just for display purposes.

@kokes
Contributor

kokes commented Dec 16, 2018

Trying to reproduce OP's behaviour in 0.23.4 and failing to do so

In [12]: df.dtypes
Out[12]:
A     int64
B    object
C    object
dtype: object

In [13]: df.mean()
Out[13]:
A      2.7
C    681.6
dtype: float64

Has this block treatment of object columns changed? Because it now works as expected (well, almost, the mean itself is a float, but I guess that's fine).

(cc @jreback @TomAugspurger)
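If a Decimal result is wanted rather than a float, one option outside pandas is the standard library's statistics.mean, which accepts Decimal inputs. The sketch below uses the C values from the original report; it is an alternative for illustration, not what pandas does internally.

```python
from decimal import Decimal
from statistics import mean

# Column C from the original report.
values = [
    Decimal("628.00"), Decimal("383.00"), Decimal("651.00"),
    Decimal("575.00"), Decimal("1114.00"), Decimal("241.00"),
    Decimal("572.00"), Decimal("609.00"), Decimal("820.00"),
    Decimal("1223.00"),
]

# statistics.mean keeps the Decimal type instead of falling back to float.
decimal_mean = mean(values)
```

The result compares equal to Decimal("681.6"), matching the float mean shown above while staying in Decimal.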

@mroeschke added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels and removed the Bug, Difficulty Intermediate, Dtype Conversions (Unexpected or buggy dtype conversions), and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels Oct 20, 2019
@mroeschke
Member

Yes this looks to work on master. Could use a test:

In [51]: df.mean()
Out[51]:
A      2.7
C    681.6
dtype: float64

In [54]: pd.__version__
Out[54]: '0.26.0.dev0+593.g9d45934af'

@jreback added the Numeric Operations (Arithmetic, Comparison, and Logical operations) label Nov 3, 2019
@jreback modified the milestones: Contributions Welcome, 1.0 Nov 3, 2019
5 participants