possible bug when calculating mean of DataFrame? #11670
Comments
We would have to jump through lots of hoops for this to work, I think. In general I'll mark it, but if you really want it fixed, then please submit a pull request. |
I receive the data with Decimal objects. That's not my choice. I think it gets confusing when Why isn't the Thanks |
Columns are internally consolidated into blocks, one per dtype (usually). The object dtypes are all lumped together into one block here. When the

You are better off sticking to numpy dtypes plus the ones pandas has added on (datetimes, categoricals) if your application allows it. You might not want to cast to floats without checking whether that will mess up the people you're giving the data back to :)

edit: as for why it's done at the block level vs. once per column, I think the biggest reason is performance. We do have an issue about not consolidating blocks automatically, but that's not really meant to be exposed to a user and used in this kind of scenario (I think).

edit2: and if you really need to support this, you could try

    import numpy as np

    def func(x):
        try:
            return np.nanmean(x)
        except Exception:  # narrow this to the actual error
            pass

    df.apply(func)

but this will be slower |
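For anyone landing here later, a minimal sketch of the per-column idea above, under some assumptions: the frame, the column names `B` and `label`, and the helper `safe_mean` are all made up for illustration, and the helper uses Python's built-in `sum` rather than `np.nanmean` so that `Decimal` values stay `Decimal`. A column that cannot be averaged yields `None` instead of breaking the other columns.

```python
from decimal import Decimal

import pandas as pd


def safe_mean(col):
    """Mean of one column, or None if its values cannot be summed."""
    values = col.dropna()
    if len(values) == 0:
        return None
    try:
        # Built-in sum starts from 0, so Decimal values stay Decimal.
        return sum(values) / len(values)
    except TypeError:
        # Raised for values that cannot be added to 0, e.g. strings.
        return None


# Hypothetical data: names and values are illustrative only.
df = pd.DataFrame(
    {
        "B": [Decimal("1.5"), Decimal("2.5"), Decimal("3.5")],
        "label": ["x", "y", "z"],
    }
)

# apply calls safe_mean once per column, so one bad column
# does not affect the result for the others.
column_means = df.apply(safe_mean)
```

As noted above, going column by column in Python is expected to be slower than a block-level reduction, so this is probably only worth it when the `Decimal` values have to be preserved.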
Thanks for the explanation for what happens under the hood. So are you suggesting to cast to a numpy float type for now? |
yes, I would never use |
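A rough sketch of the cast-to-float route discussed here, assuming every column holds only `Decimal` values that convert cleanly (the frame and its values are hypothetical). The conversion can lose `Decimal` precision, so it is worth checking that whoever consumes the result is fine with floats.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical data: names and values are illustrative only.
df = pd.DataFrame(
    {
        "B": [Decimal("1.5"), Decimal("2.5"), Decimal("3.5")],
        "C": [Decimal("10"), Decimal("20"), Decimal("30")],
    }
)

# Both columns start out as object dtype because they hold Decimal objects.
# astype(float) converts each value with float(), giving float64 columns.
as_float = df.astype(float)

# With plain float64 columns, mean is computed per column as usual.
means = as_float.mean()
```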
Trying to reproduce OP's behaviour in 0.23.4 and failing to do so
Has this block treatment of object columns changed? Because it now works as expected (well, almost, the mean itself is a float, but I guess that's fine). (cc @jreback @TomAugspurger) |
Yes this looks to work on master. Could use a test:
|
I'm trying to calculate the mean of all the columns of a DataFrame, but it looks like having a value in the B column of row 6 prevents the mean from being calculated on the C column. Possible bug? (pandas: 0.17.1)
dtypes are equivalent:
But calling mean on the dataframe does not work when row 6 is present:
Also, it works when I explicitly leave out column B when calculating the mean: