resample drops dtype('O') columns from DataFrame (but not Series) #14084

rcyeh · 2016-08-24T22:25:55Z

Code Sample, a copy-pastable example if possible

import numpy
import pandas as pd

idx = pd.date_range('1/1/2016', periods=100, freq='d')

z = pd.DataFrame({"z": numpy.zeros(len(idx))}, index=idx)
obj = pd.DataFrame({"obj": numpy.zeros(len(idx), dtype=numpy.dtype('O'))}, index=idx)
nan = pd.DataFrame({"nan": numpy.zeros(len(idx)) + float('nan')}, index=idx)
objnan = pd.DataFrame([], columns=['objnan'], index=idx[0:0])  # dtype('O') nan after join

df_list = [z, obj, nan, objnan]

def r1w(x):
    return x.resample('1w').sum()

resample_join_result = r1w(pd.DataFrame().join(df_list, how='outer'))
print(resample_join_result.shape)  # (15, 2) --- I thought this should be (15, 4)

join_resample_result = pd.DataFrame().join([r1w(x) for x in df_list], how='outer')
print(join_resample_result.shape)  # (15, 4)

for columnname in join_resample_result.columns:
    if columnname not in resample_join_result.columns:
        # Look up the dtype in join_resample_result, the frame that still has the column.
        print("DataFrame.resample missing column: " + str(columnname) + " (" + str(join_resample_result[columnname].dtype) + ")")

Expected Output

I would have expected resample_join_result to have all four columns and be the same as join_resample_result, but they are not, because it seems pandas.DataFrame.resample drops dtype('O') (object) columns; while pandas.Series.resample converts those columns into numeric dtypes.

output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.22.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: 0.2.0


jreback · 2016-08-24T22:36:49Z

xref #12537

For upsampling this would work, but how would you do downsampling? You could maybe do max/min/first/last, but other functions like sum/mean/var won't work at all, because it's impossible to select a correct row for the non-numerics. So it's better to simply exclude them. groupby does the same (except for sum, which is weird because it works on strings).
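A minimal sketch of the downsampling point, not taken from the thread: the column names `value` and `label` and the 7-day rule are illustrative assumptions. It chooses a reduction per column explicitly (mean for the numeric column, first for the string column), so no column has to be excluded silently.

```python
import pandas as pd

# Hypothetical data: 14 daily rows with one numeric and one string column.
idx = pd.date_range("2016-01-01", periods=14, freq="D")
df = pd.DataFrame(
    {"value": [float(i) for i in range(14)], "label": list("abcdefghijklmn")},
    index=idx,
)

# Pick a reduction that is meaningful for each column's dtype:
# a mean for numbers, the first observed entry for strings.
weekly = df.resample("7D").agg({"value": "mean", "label": "first"})
print(weekly)
```

A mean of `label` has no sensible definition, which is why a blanket `.mean()` or `.var()` cannot keep that column.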

rcyeh · 2016-08-31T23:23:29Z

Wow, thanks for your quick reply. I'm not sure I understand it all, but I think I will after I read it a few more times.

As a newbie, I am very surprised that a DataFrame method could ever return a different result than an index-based DataFrame.join of [df[x].method() for x in df], where the identically-named Series method has been applied to each column of the DataFrame. I haven't used pandas enough to understand why this should be. Based on the cross-referenced issue #12537 and comment about groupby, I am guessing that resample is not the only case that will exhibit this. Is there a way I can guess when that might occur?

I'd rather have an explicit NaN or a ValueError exception for an unsupported data type than a silent drop. This may be related to the question of control over the treatment of NaN. The pandas default (exclude/ignore) is usually pleasantly convenient, but there are times when I really do want NaN to propagate (and I am fine with being told to go away and use numpy). I can always explicitly drop non-numeric columns.
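One way to get the explicit failure described above is to check dtypes before resampling. This is only a sketch: `resample_sum_strict` is a hypothetical helper, not part of pandas, and it treats every column that is not a NumPy numeric dtype (including object and bool) as unsupported.

```python
import numpy as np
import pandas as pd


def resample_sum_strict(df, rule):
    """Hypothetical helper: resample-and-sum that refuses non-numeric columns."""
    # select_dtypes(exclude=[np.number]) keeps object, bool, datetime, etc.
    non_numeric = df.select_dtypes(exclude=[np.number]).columns.tolist()
    if non_numeric:
        raise ValueError("non-numeric columns present: " + ", ".join(map(str, non_numeric)))
    return df.resample(rule).sum()


idx = pd.date_range("2016-01-01", periods=14, freq="D")
numeric_only = pd.DataFrame({"z": np.ones(len(idx))}, index=idx)
mixed = numeric_only.assign(obj=np.zeros(len(idx), dtype=object))

print(resample_sum_strict(numeric_only, "7D"))

try:
    resample_sum_strict(mixed, "7D")
except ValueError as exc:
    print("refused:", exc)
```

A caller who wants the object column summed can convert it first, for example with `pd.to_numeric`, and then call the helper.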

Re-reading the documentation, I see that you're careful not to encourage users to think of DataFrame as just an aligned list of Series. I need to think more about that.

jreback · 2016-08-31T23:34:57Z

http://pandas.pydata.org/pandas-docs/stable/groupby.html#automatic-exclusion-of-nuisance-columns

this is standard practice on groupby as well (resample is just a time based grouper)

your view is incorrect and actually non obvious

join of op on Series only makes sense if the shape of the result is consistent

think about what a resample / groupby is actually doing when reducing and it will become clear
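The remark that resample is a time-based grouper can be illustrated with a small sketch. The data and the 7-day rule are assumptions chosen for illustration; `pd.Grouper(freq=...)` is the grouping key that bins rows by time.

```python
import pandas as pd

# Hypothetical data: 14 daily integer observations.
idx = pd.date_range("2016-01-01", periods=14, freq="D")
df = pd.DataFrame({"value": list(range(14))}, index=idx)

# Both calls reduce each 7-day bin of rows to a single row.
via_resample = df.resample("7D").sum()
via_groupby = df.groupby(pd.Grouper(freq="7D")).sum()

print(via_resample)
print(via_groupby)
```

Each output row summarizes a whole bin of input rows, so a column is only kept if the chosen reduction is defined for its values.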

jreback closed this as completed Aug 24, 2016

jreback added the Dtype Conversions (unexpected or buggy dtype conversions), API Design, and Resample (resample method) labels Aug 24, 2016

jreback added this to the No action milestone Aug 24, 2016
