resample drops dtype('O') columns from DataFrame (but not Series) #14084

Closed
rcyeh opened this issue Aug 24, 2016 · 3 comments
Labels
API Design Dtype Conversions Unexpected or buggy dtype conversions Resample resample method

Comments

@rcyeh
rcyeh commented Aug 24, 2016

Code Sample, a copy-pastable example if possible

import numpy
import pandas as pd

idx = pd.date_range('1/1/2016', periods=100, freq='d')

z = pd.DataFrame({"z": numpy.zeros(len(idx))}, index=idx)
obj = pd.DataFrame({"obj": numpy.zeros(len(idx), dtype=numpy.dtype('O'))}, index=idx)
nan = pd.DataFrame({"nan": numpy.zeros(len(idx)) + float('nan')}, index=idx)
objnan = pd.DataFrame([], columns=['objnan'], index=idx[0:0])  # dtype('O') nan after join

df_list = [z, obj, nan, objnan]

def r1w(x):
    return x.resample('1w').sum()

resample_join_result = r1w(pd.DataFrame().join(df_list, how='outer'))
print(resample_join_result.shape)  # (15, 2) --- I thought this should be (15, 4)

join_resample_result = pd.DataFrame().join([r1w(x) for x in df_list], how='outer')
print(join_resample_result.shape)  # (15, 4)

for columnname in join_resample_result.columns:
    if columnname not in resample_join_result.columns:
        print("DataFrame.resample missing column: " + str(columnname) + " (" + str(join_resample_result[columnname].dtype) + ")")

Expected Output

I expected resample_join_result to have all four columns and to match join_resample_result. It does not: pandas.DataFrame.resample appears to drop dtype('O') (object) columns, whereas pandas.Series.resample converts those columns to numeric dtypes.

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.22.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: 0.2.0

@jreback
Contributor

jreback commented Aug 24, 2016

xref #12537

For upsampling this would work, but how would you do downsampling? You could maybe do max/min/first/last, but other functions like sum/mean/var won't work at all, because it's impossible to select a correct row for the non-numerics. So it's better to simply exclude them. groupby does the same (except for sum, which is weird because it works on strings).
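To illustrate the point about choosing a function per column when downsampling, here is a minimal sketch. The column names `value` and `label` are made up for the example; it sums the numeric column and takes the first entry of the non-numeric one in each weekly bin.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", periods=14, freq="D")
df = pd.DataFrame(
    {
        "value": np.arange(14.0),                  # numeric: a sum is meaningful
        "label": [f"day{i}" for i in range(14)],   # strings: pick one row instead
    },
    index=idx,
)

# A dict passed to agg picks one reduction per column.
weekly = df.resample("W").agg({"value": "sum", "label": "first"})
print(weekly)
```

Each column gets a reduction that makes sense for its dtype, so nothing has to be excluded.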

@jreback jreback closed this as completed Aug 24, 2016
@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions API Design Resample resample method labels Aug 24, 2016
@jreback jreback added this to the No action milestone Aug 24, 2016
@rcyeh
Author

rcyeh commented Aug 31, 2016

Wow, thanks for your quick reply. I'm not sure I understand it all, but I think I will after I read it a few more times.

As a newbie, I am very surprised that a DataFrame method could ever return a different result than an index-based DataFrame.join of [df[x].method() for x in df], where the identically-named Series method has been applied to each column of the DataFrame. I haven't used pandas enough to understand why this should be. Based on the cross-referenced issue #12537 and comment about groupby, I am guessing that resample is not the only case that will exhibit this. Is there a way I can guess when that might occur?

I'd rather have an explicit NaN or a ValueError exception due to unsupported data type instead of a silent drop. This may be related to issues of control over the treatment of NaN. The pandas default (exclude/ignore) is usually pleasantly convenient, but there are times when I really do want NaN to propagate (and I am fine with saying to go away and use numpy). I can always explicitly drop non-numeric columns.
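A user who prefers an explicit error can check dtypes before resampling. The sketch below uses a hypothetical helper, `strict_resample_sum`, which is not part of pandas; it raises ValueError when any column is non-numeric instead of relying on the library's default handling.

```python
import numpy as np
import pandas as pd

def strict_resample_sum(df, rule):
    """Hypothetical helper: resample-and-sum that refuses non-numeric columns."""
    # select_dtypes(exclude="number") keeps object, string, bool and
    # datetime columns, i.e. everything that is not a NumPy number dtype.
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise ValueError(f"non-numeric columns cannot be summed: {non_numeric}")
    return df.resample(rule).sum()

idx = pd.date_range("2016-01-01", periods=100, freq="D")
good = pd.DataFrame({"z": np.zeros(len(idx))}, index=idx)
bad = good.assign(obj=np.zeros(len(idx), dtype=object))

print(strict_resample_sum(good, "W").shape)
try:
    strict_resample_sum(bad, "W")
except ValueError as exc:
    print("rejected:", exc)
```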

Re-reading the documentation, I see that you're careful not to encourage users to think of DataFrame as just an aligned list of Series. I need to think more about that.

@jreback
Contributor

jreback commented Aug 31, 2016

http://pandas.pydata.org/pandas-docs/stable/groupby.html#automatic-exclusion-of-nuisance-columns

this is standard practice on groupby as well (resample is just a time based grouper)

your view is incorrect and actually non-obvious

join of op on Series only makes sense if the shape of the result is consistent

think about what a resample / groupby is actually doing when reducing and it will become clear
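The remark that resample is a time-based grouper can be checked with a small sketch: on a numeric frame (illustrative data, not from the issue), `resample("W")` and `groupby(pd.Grouper(freq="W"))` should produce the same bins and the same sums.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", periods=100, freq="D")
df = pd.DataFrame({"z": np.ones(len(idx))}, index=idx)

# The same weekly reduction expressed two ways.
by_resample = df.resample("W").sum()
by_grouper = df.groupby(pd.Grouper(freq="W")).sum()

# Compare bin labels and values directly.
print(by_resample.index.equals(by_grouper.index))
print(np.array_equal(by_resample.to_numpy(), by_grouper.to_numpy()))
```

Since both paths perform a grouped reduction, the groupby rules for which columns can be reduced carry over to resample.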
