resample.apply flattens column index when more than 3 levels #16231

Closed
jmarrec opened this issue May 4, 2017 · 3 comments · Fixed by #31171
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions Resample resample method

@jmarrec
Contributor

jmarrec commented May 4, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

##### Case MultiIndex levels = 2
cols = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two')])
ind = pd.DatetimeIndex(start='2017-01-01', freq='15Min', periods=8)
df = pd.DataFrame(np.random.randn(8,2), index=ind, columns=cols)

# Creates an agg dict to map the column to different functions
agg_dict = { col:(np.sum if col[1] == 'one' else np.mean) for col in df.columns }
resampled = df.resample('H').apply(lambda x: agg_dict[x.name](x))
try:
    assert isinstance(resampled.columns, pd.MultiIndex)
except AssertionError as e:
    e.args += ('Case nlevels={}'.format(df.columns.nlevels),)
    raise

##### Case MultiIndex levels = 3
cols = pd.MultiIndex.from_tuples([('A', 'i','one'), ('A', 'ii','two')])
ind = pd.DatetimeIndex(start='2017-01-01', freq='15Min', periods=8)
df = pd.DataFrame(np.random.randn(8,2), index=ind, columns=cols)

# Creates an agg dict to map the column to different functions
agg_dict = { col:(np.sum if col[2] == 'one' else np.mean) for col in df.columns }
resampled = df.resample('H').apply(lambda x: agg_dict[x.name](x))
try:
    assert isinstance(resampled.columns, pd.MultiIndex)
except AssertionError as e:
    e.args += ('Case nlevels={}'.format(df.columns.nlevels),)
    raise

##### Case MultiIndex levels = 4
cols = pd.MultiIndex.from_tuples([('A', 'a', '', 'one'), ('B', 'b', 'i', 'two')])
ind = pd.DatetimeIndex(start='2017-01-01', freq='15Min', periods=8)
df = pd.DataFrame(np.random.randn(8,2), index=ind, columns=cols)

agg_dict = { col:(np.sum if col[3] == 'one' else np.mean) for col in df.columns }
resampled = df.resample('H').apply(lambda x: agg_dict[x.name](x))
try:
    assert isinstance(resampled.columns, pd.MultiIndex)
except AssertionError as e:
    e.args += ('Case nlevels={}'.format(df.columns.nlevels),)
    raise

Problem description

With MultiIndex columns that have 2 or 3 levels, resample().apply() returns the same MultiIndex columns. With 4 levels, it returns single-level columns in which only the first level is kept.
In the code above, only the nlevels=4 case raises.

Expected Output

The above code shouldn't raise.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 34.3.0
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.18.1
statsmodels: 0.8.0
xarray: 0.9.1
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.9999999
httplib2: 0.10.3
apiclient: 1.6.2
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: 0.3.0.post

@jreback
Contributor

jreback commented May 4, 2017

This probably also applies to groupby, since both use the same machinery.

You're welcome to have a look. This is probably pretty deep in the code.
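
One way to check whether groupby is affected in the same way is to run the same per-column aggregation through groupby with a time-based grouper. This is only a sketch and is not taken from the thread: the use of `pd.Grouper(freq="60min")` as a stand-in for hourly resampling, the constant data, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# 4-level column MultiIndex, as in the failing case reported above
cols = pd.MultiIndex.from_tuples([("A", "a", "", "one"), ("B", "b", "i", "two")])
ind = pd.date_range(start="2017-01-01", freq="15min", periods=8)
df = pd.DataFrame(np.ones((8, 2)), index=ind, columns=cols)

# Same idea as the report: pick the aggregation from the last column level
agg_dict = {col: (np.sum if col[3] == "one" else np.mean) for col in df.columns}

# Assumption: grouping on an hourly pd.Grouper mirrors resampling to hourly bins
grouped = df.groupby(pd.Grouper(freq="60min")).agg(lambda x: agg_dict[x.name](x))

# Inspect what kind of column index came back and how many levels it has
print(type(grouped.columns).__name__, grouped.columns.nlevels)
```

If the groupby path shares the problem on an affected version, the printed column type would presumably be a flat index rather than a MultiIndex.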

@jreback jreback added Bug Difficulty Advanced Groupby Resample resample method Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 4, 2017
@jreback jreback added this to the Next Major Release milestone May 4, 2017
@jmarrec
Contributor Author

jmarrec commented May 4, 2017

I honestly have no idea where to start on this one.

(For the record, a workaround is simply to do resampled.columns = df.columns after the resampling is done.)
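
A minimal sketch of that workaround, for illustration only. It uses `pd.date_range` and a `"60min"` rule so that it also runs on recent pandas versions, and constant data instead of random values; those choices and the length guard are assumptions, not part of the original report. Reassigning the columns is only safe when the resampled frame has the same columns in the same order as the input.

```python
import numpy as np
import pandas as pd

# 4-level column MultiIndex, the case that gets flattened on affected versions
cols = pd.MultiIndex.from_tuples([("A", "a", "", "one"), ("B", "b", "i", "two")])
ind = pd.date_range(start="2017-01-01", freq="15min", periods=8)
df = pd.DataFrame(np.ones((8, 2)), index=ind, columns=cols)

# Map each column to its aggregation function based on the last level
agg_dict = {col: (np.sum if col[3] == "one" else np.mean) for col in df.columns}
resampled = df.resample("60min").apply(lambda x: agg_dict[x.name](x))

# Workaround: put the original column index back after resampling.
# Guard (an added assumption): only do so if the column count is unchanged.
if len(resampled.columns) == len(df.columns):
    resampled.columns = df.columns

assert isinstance(resampled.columns, pd.MultiIndex)
```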

@mroeschke
Member

The code above no longer raises an error on master. This could use a simplified regression test.

In [178]: cols = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two')])
     ...: ind = pd.DatetimeIndex(start='2017-01-01', freq='15Min', periods=8)
     ...: df = pd.DataFrame(np.random.randn(8,2), index=ind, columns=cols)
     ...:
     ...: # Creates an agg dict to map the column to different functions
     ...: agg_dict = { col:(np.sum if col[1] == 'one' else np.mean) for col in df.columns }
     ...: resampled = df.resample('H').apply(lambda x: agg_dict[x.name](x))
     ...: try:
     ...:     assert isinstance(resampled.columns, pd.MultiIndex)
     ...: except AssertionError as e:
     ...:     e.args += ('Case nlevels={}'.format(df.columns.nlevels),)
     ...:     raise
     ...:
     ...: ##### Case MultiIndex levels = 3
     ...: cols = pd.MultiIndex.from_tuples([('A', 'i','one'), ('A', 'ii','two')])
     ...: ind = pd.DatetimeIndex(start='2017-01-01', freq='15Min', periods=8)
     ...: df = pd.DataFrame(np.random.randn(8,2), index=ind, columns=cols)
     ...:
     ...: # Creates an agg dict to map the column to different functions
     ...: agg_dict = { col:(np.sum if col[2] == 'one' else np.mean) for col in df.columns }
     ...: resampled = df.resample('H').apply(lambda x: agg_dict[x.name](x))
     ...: try:
     ...:     assert isinstance(resampled.columns, pd.MultiIndex)
     ...: except AssertionError as e:
     ...:     e.args += ('Case nlevels={}'.format(df.columns.nlevels),)
     ...:     raise
     ...:
     ...: ##### Case MultiIndex levels = 4
     ...: cols = pd.MultiIndex.from_tuples([('A', 'a', '', 'one'), ('B', 'b', 'i', 'two')])
     ...: ind = pd.DatetimeIndex(start='2017-01-01', freq='15Min', periods=8)
     ...: df = pd.DataFrame(np.random.randn(8,2), index=ind, columns=cols)
     ...:
     ...: agg_dict = { col:(np.sum if col[3] == 'one' else np.mean) for col in df.columns }
     ...: resampled = df.resample('H').apply(lambda x: agg_dict[x.name](x))
     ...: try:
     ...:     assert isinstance(resampled.columns, pd.MultiIndex)
     ...: except AssertionError as e:
     ...:     e.args += ('Case nlevels={}'.format(df.columns.nlevels),)
     ...:     raise
     ...:
/anaconda3/envs/pandas-dev/bin/ipython:2: FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated.  Use `pandas.date_range` instead.

/anaconda3/envs/pandas-dev/bin/ipython:16: FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated.  Use `pandas.date_range` instead.
/anaconda3/envs/pandas-dev/bin/ipython:30: FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated.  Use `pandas.date_range` instead.

In [179]: pd.__version__
Out[179]: '0.26.0.dev0+555.gf7d162b18'
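
A simplified regression test along those lines might look like the sketch below. It is not the test that was eventually merged. The function names, the zero-filled data, and the choice to loop over the 2-, 3- and 4-level cases are assumptions for illustration. It uses `pd.date_range`, which the FutureWarning above recommends, and a `"60min"` rule in place of `'H'`.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def check_resample_apply_keeps_columns(tuples):
    """Hypothetical helper: resample with a per-column function and
    check that the column MultiIndex comes back unchanged."""
    cols = pd.MultiIndex.from_tuples(tuples)
    ind = pd.date_range(start="2017-01-01", freq="15min", periods=8)
    # Deterministic data so the test does not depend on random values
    df = pd.DataFrame(np.zeros((8, 2)), index=ind, columns=cols)
    # Choose the aggregation from the last level of each column label
    agg_dict = {col: (np.sum if col[-1] == "one" else np.mean) for col in df.columns}
    result = df.resample("60min").apply(lambda x: agg_dict[x.name](x))
    tm.assert_index_equal(result.columns, cols)
    return result


def test_resample_apply_multilevel_columns():
    # The three cases from the original report: 2, 3 and 4 column levels
    cases = [
        [("A", "one"), ("A", "two")],
        [("A", "i", "one"), ("A", "ii", "two")],
        [("A", "a", "", "one"), ("B", "b", "i", "two")],
    ]
    for tuples in cases:
        check_resample_apply_keeps_columns(tuples)


test_resample_apply_multilevel_columns()
```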

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Difficulty Advanced Groupby Resample resample method Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 14, 2019
@jbrockmendel jbrockmendel added the Resample resample method label Oct 16, 2019
@jreback jreback modified the milestones: Contributions Welcome, 1.1 Jan 21, 2020