import numpy
import pandas as pd

idx = pd.date_range('1/1/2016', periods=100, freq='d')
z = pd.DataFrame({"z": numpy.zeros(len(idx))}, index=idx)
obj = pd.DataFrame({"obj": numpy.zeros(len(idx), dtype=numpy.dtype('O'))}, index=idx)
nan = pd.DataFrame({"nan": numpy.zeros(len(idx)) + float('nan')}, index=idx)
objnan = pd.DataFrame([], columns=['objnan'], index=idx[0:0])  # dtype('O') nan after join
df_list = [z, obj, nan, objnan]

def r1w(x):
    return x.resample('1w').sum()

# Joined frame before resampling; used below to report the dtype of each missing column.
join_result = pd.DataFrame().join(df_list, how='outer')
resample_join_result = r1w(join_result)
print(resample_join_result.shape)  # (15, 2) --- I thought this should be (15, 4)
join_resample_result = pd.DataFrame().join([r1w(x) for x in df_list], how='outer')
print(join_resample_result.shape)  # (15, 4)
for columnname in join_resample_result.columns:
    if columnname not in resample_join_result.columns:
        print("DataFrame.resample missing column: " + str(columnname) + " (" + str(join_result[columnname].dtype) + ")")
Expected Output
I would have expected resample_join_result to have all four columns and to match join_resample_result, but it does not: it seems pandas.DataFrame.resample drops dtype('O') (object) columns, while pandas.Series.resample converts those columns into numeric dtypes.
For upsampling this would work, but how would you do downsampling? You could maybe use max/min/first/last, but other functions like sum/mean/var won't work at all, because it's impossible to select a correct row for the non-numerics. So it is better to simply exclude them. groupby does the same (except for sum, which is weird because it works on strings).
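To make the max/min/first/last suggestion concrete, here is a minimal sketch that picks an aggregation per column when downsampling, so the non-numeric column is handled explicitly instead of being left to the default. The frame and its column names (value, label) are made up for illustration and are not from the original report.

```python
import pandas as pd

# Two weeks of daily data: one numeric column and one string column.
idx = pd.date_range("2016-01-01", periods=14, freq="D")
df = pd.DataFrame(
    {"value": list(range(14)), "label": list("abcdefghijklmn")},
    index=idx,
)

# Choose an aggregation per column: sum is meaningful for numbers,
# while "first" simply keeps the first label seen in each weekly bin.
weekly = df.resample("W").agg({"value": "sum", "label": "first"})
print(weekly)
```

Because the aggregation for each column is stated outright, the result does not depend on how a given pandas version treats object columns by default.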
Wow, thanks for your quick reply. I'm not sure I understand it all, but I think I will after I read it a few more times.
As a newbie, I am very surprised that a DataFrame method could ever return a different result than an index-based DataFrame.join of [df[x].method() for x in df], where the identically-named Series method has been applied to each column of the DataFrame. I haven't used pandas enough to understand why this should be. Based on the cross-referenced issue #12537 and comment about groupby, I am guessing that resample is not the only case that will exhibit this. Is there a way I can guess when that might occur?
I'd rather have an explicit NaN or a ValueError exception due to an unsupported data type instead of a silent drop. This may be related to issues of control over the treatment of NaN. The pandas default (exclude/ignore) is usually pleasantly convenient, but there are times when I really do want NaN to propagate (and I am fine with being told to go away and use numpy). I can always explicitly drop non-numeric columns.
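One way to get the "fail loudly" behaviour in user code is to check dtypes before resampling. The sketch below assumes a hypothetical helper, strict_resample_sum, which is not part of pandas. It raises ValueError when any column is non-numeric, and the caller can convert or drop such columns explicitly.

```python
import numpy as np
import pandas as pd


def strict_resample_sum(df, rule):
    """Resample and sum, refusing to run if any column is non-numeric.

    Hypothetical helper for illustration; not a pandas API.
    """
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise ValueError(f"non-numeric columns cannot be summed: {non_numeric}")
    return df.resample(rule).sum()


idx = pd.date_range("2016-01-01", periods=14, freq="D")
numeric = pd.DataFrame({"z": np.zeros(len(idx))}, index=idx)
mixed = numeric.assign(obj=np.zeros(len(idx), dtype=object))

weekly = strict_resample_sum(numeric, "W")  # fine: every column is numeric

try:
    strict_resample_sum(mixed, "W")
    error_message = None
except ValueError as exc:
    error_message = str(exc)  # names the offending column

# Explicit opt-in: convert the object column to float, then resample.
converted = strict_resample_sum(mixed.astype({"obj": float}), "W")
```

Note that select_dtypes(exclude="number") also flags boolean and datetime columns, which may or may not be what you want.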
Re-reading the documentation, I see that you're careful not to encourage users to think of DataFrame as just an aligned list of Series. I need to think more about that.