row sum and column mean works but row mean gives all NaN, on heterogenous data types #33202

Closed
xuancong84 opened this issue Apr 1, 2020 · 4 comments

Comments

@xuancong84

When different DataFrame rows are of different types, row sum works but row mean gives all NaN values.

import numpy as np
import pandas as pd  # display() below is the Jupyter/IPython notebook helper

df = pd.DataFrame({'a': [1, pd.to_timedelta('1D'), 3.], 'b': [4, pd.to_timedelta('1.5D'), 6.], 'c': [7, pd.to_timedelta('0.5D'), np.nan]})
# df = pd.DataFrame({'a': [1, 2, 3.], 'b': [4, 5, 6.], 'c': [7, 8, np.nan]})  # same-type variant
display(df.transpose())
display(df.transpose().sum(axis=0))   # column sum of the transpose
display(df.transpose().mean(axis=0))  # column mean of the transpose
display(df)
display(df.sum(axis=1))   # row sum
display(df.mean(axis=1))  # bug

[image: screenshot of the output]

As shown above, the last statement, df.mean(axis=1), exposes the bug: it returns all NaN values, which is incorrect. Interestingly, if the DataFrame entries are all of the same type (uncomment the commented-out df line), the bug does not occur.

@jreback
Contributor

jreback commented Apr 1, 2020

sum of object types generally might work, but mean does not. try calling .infer_objects() after the transpose

why would you do this in any event? transposing mixed types is really odd

@xuancong84
Author

xuancong84 commented Apr 2, 2020

@jreback Thanks for your reply! I ran into this because I am doing statistical analysis of clinical data. In clinical data, some values are floating point, some are integers, some are time durations, etc., so when I compute the average value over some period of time, this bug surfaces.

Currently, transpose => column mean => transpose back is my workaround for the row mean, because on exactly the same data, averaging the columns has no problem; only averaging the rows does.
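The workaround described above can be sketched roughly as follows. The helper name `row_mean_via_transpose` is hypothetical, and the sketch adds an `infer_objects()` call and averages each transposed column separately so that it does not depend on how a given pandas version reduces object-dtype data; it assumes every row holds values of a single kind.

```python
import numpy as np
import pandas as pd


def row_mean_via_transpose(frame):
    """Hypothetical helper: row means computed as column means of the transpose."""
    # Transpose, then let pandas re-infer a proper dtype for each column.
    typed = frame.T.infer_objects()
    # Average each (former) row on its own; Series.mean skips NaN by default.
    means = {label: column.mean() for label, column in typed.items()}
    # Keep object dtype so numbers and Timedelta values can coexist.
    return pd.Series(means, dtype=object)


df = pd.DataFrame({
    "a": [1, pd.to_timedelta("1D"), 3.0],
    "b": [4, pd.to_timedelta("1.5D"), 6.0],
    "c": [7, pd.to_timedelta("0.5D"), np.nan],
})

row_means = row_mean_via_transpose(df)
print(row_means)
```

For the data in the report this gives 4.0 for the integer row, a Timedelta of one day for the duration row, and 4.5 for the float row (the NaN is skipped).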

@xuancong84 xuancong84 changed the title row sum works but row mean gives all NaN, on heterogenous data types row sum and column mean works but row mean gives all NaN, on heterogenous data types Apr 2, 2020
@jreback
Contributor

jreback commented Apr 2, 2020

pandas is column based; mixing dtypes, while possible, is not recommended; column-based dtypes are efficient for both storage and computation

doing numeric operations on heterogeneous data doesn’t make any sense
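As an illustration of the column-based layout recommended here, the following sketch stores the values from the report with one dtype per column. The column names `count`, `duration`, and `value` are assumptions made for the example, not names from the original data.

```python
import numpy as np
import pandas as pd

# One variable per column, so each column gets a single, specific dtype.
# Column names are hypothetical; the values are those from the report.
records = pd.DataFrame(
    {
        "count": [1, 4, 7],
        "duration": [pd.Timedelta("1D"), pd.Timedelta("1.5D"), pd.Timedelta("0.5D")],
        "value": [3.0, 6.0, np.nan],
    },
    index=["a", "b", "c"],
)

print(records.dtypes)

# Column-wise reductions now operate on typed data.
numeric_means = records[["count", "value"]].mean()
duration_mean = records["duration"].mean()

print(numeric_means)
print(duration_mean)
```

With this layout the averages are ordinary column reductions, and no object-dtype columns are involved.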

@jreback jreback closed this as completed Apr 2, 2020
@xuancong84
Author

xuancong84 commented Apr 2, 2020

So the issue is closed without getting the bug solved.

That is why I suggested in #6963, "for maximum usage compatibility, we should treat a table as a symmetric rank-2 tensor". Because of low-level CPU optimization constraints, sum/mean/std/max/min/etc. along the row direction may incur a performance penalty. That is acceptable, but the operation should not fail and should behave consistently with the column operations.

Thus, incompetent programming and an inherently deficient/defective architectural design do not allow this bug to be solved that easily ^_^
Anyway, you guys are very busy; you only need to solve this when more people report encountering it in their work.
