row sum and column mean works but row mean gives all NaN, on heterogenous data types #33202

xuancong84 · 2020-04-01T09:57:58Z

When different DataFrame rows are of different types, row sum works but row mean gives all NaN values.

import numpy as np
import pandas as pd
from IPython.display import display  # already available inside Jupyter

df = pd.DataFrame({'a': [1, pd.to_timedelta('1D'), 3.], 'b': [4, pd.to_timedelta('1.5D'), 6.], 'c': [7, pd.to_timedelta('0.5D'), np.nan]})
# df = pd.DataFrame({'a': [1, 2, 3.], 'b': [4, 5, 6.], 'c': [7, 8, np.nan]})  # same-type variant
display(df.transpose())
display(df.transpose().sum(axis=0))
display(df.transpose().mean(axis=0))
display(df)
display(df.sum(axis=1))
display(df.mean(axis=1))  # bug: all NaN

As shown above, the last statement exposes the bug: it returns all NaN values, which is incorrect. Interestingly, if all DataFrame entries are of the same type (uncomment the commented-out df definition), the bug does not occur.


jreback · 2020-04-01T10:41:10Z

sum of object dtypes might generally work, but mean does not. try calling .infer_objects() after the transpose

why would you do this in any event? transpose in mixed types is really odd
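A minimal sketch of the suggestion above, reusing the data from the report. It only inspects dtypes; the variable name typed is introduced here for illustration, and the exact dtypes printed may vary with the pandas version.

```python
import numpy as np
import pandas as pd

# Same data as in the report: every column mixes int, Timedelta and float.
df = pd.DataFrame({
    'a': [1, pd.to_timedelta('1D'), 3.],
    'b': [4, pd.to_timedelta('1.5D'), 6.],
    'c': [7, pd.to_timedelta('0.5D'), np.nan],
})

# Each original column holds mixed Python objects, so its dtype is object.
print(df.dtypes)

# After the transpose, each column holds a single kind of value.
# infer_objects() asks pandas to pick a concrete dtype for each column.
typed = df.T.infer_objects()
print(typed.dtypes)
```

Once each column has a concrete dtype, reductions run on typed data instead of falling back to generic Python objects.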

xuancong84 · 2020-04-02T02:17:40Z

@jreback Thanks for your reply! I ran into this while doing statistical analysis of clinical data. In clinical data, some values are floating point, some are integers, some are time durations, etc., so when I compute the average value over some period of time, this bug surfaces.

Currently, transpose => column mean => transpose back is my workaround for the row mean, because on exactly the same data, averaging over columns works without problems; only averaging over rows fails.
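The workaround described above might look roughly like the following sketch. It is not the reporter's actual code: the names typed and row_means are hypothetical, and the per-column loop is used here only because Series.mean is well defined for each of the int, timedelta and float columns.

```python
import numpy as np
import pandas as pd

# Same data as in the report: every column mixes int, Timedelta and float.
df = pd.DataFrame({
    'a': [1, pd.to_timedelta('1D'), 3.],
    'b': [4, pd.to_timedelta('1.5D'), 6.],
    'c': [7, pd.to_timedelta('0.5D'), np.nan],
})

# Transpose so each original row becomes a column, then infer a dtype per column.
typed = df.T.infer_objects()

# Average each column separately; keys are the original row labels.
# NaN is skipped by default, as with any Series.mean() call.
row_means = pd.Series({label: col.mean() for label, col in typed.items()})
print(row_means)
```

The result is an object-dtype Series because it mixes floats with a Timedelta, which is the same heterogeneity the original frame has along its rows.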

jreback · 2020-04-02T02:24:09Z

pandas is column based; mixing dtypes, while possible, is not recommended; column-based dtypes are efficient for both storage and computation

doing numeric operations in heterogeneous data doesn’t make any sense

xuancong84 · 2020-04-02T02:37:40Z

So the issue is closed without getting the bug solved.

That is why I suggested in #6963, "for maximum usage compatibility, we should treat a table as a symmetric rank-2 tensor". Because of low-level CPU optimization constraints, sum/mean/std/max/min/etc. along the row direction may incur a performance penalty. That is acceptable, but the operation should not fail and should behave consistently with column operations.

Thus, incompetent programming and inherently deficient/defective architectural design do not allow this bug to be solved that easily ^_^
Anyway, you guys are very busy, you only need to solve this when more people report encountering this in their work.

xuancong84 changed the title ~~row sum works but row mean gives all NaN, on heterogenous data types~~ row sum and column mean works but row mean gives all NaN, on heterogenous data types Apr 2, 2020

jreback closed this as completed Apr 2, 2020

MarcoGorelli mentioned this issue Aug 19, 2021

Can't use dt.ceil, dt.floor, dt.round with coarser target frequencies #15303

Open
