Grouping, then resampling empty DataFrame leads to it losing column names and index #26411

Closed
andrej opened this issue May 15, 2019 · 3 comments · Fixed by #39940

andrej commented May 15, 2019

Code Sample

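import pandas as pd

# Empty frame that still has named columns and a TimedeltaIndex.
empty_df = pd.DataFrame([], columns=["a", "b"], index=pd.TimedeltaIndex([]))

# Group by "a", then resample each group into 1-second bins.
resampled_df = empty_df.groupby("a").resample(rule=pd.to_timedelta("00:00:01")).mean()

print(resampled_df)
print(resampled_df["b"])  # raises KeyError: 'b' on pandas 0.24.2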

Problem description

After grouping and then resampling as shown above, the returned DataFrame loses its metadata, including the column names and the index type. Accessing a column that existed (but was empty) before the grouping and resampling then raises a KeyError.

Perhaps somewhere in the code a "generic" empty data frame is returned without the attached information about column names and indices?

Expected Output

Empty DataFrame
Columns: [a, b]
Index: []
Series([], Name: b, dtype: object) 

or similar. It would also make sense to me for empty_df["b"] and resampled_df["b"] to compare equal.

Actual Output

Empty DataFrame
Columns: []
Index: []
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2927, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexes/base.py", line 2659, in get_lo
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'b'  
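
On affected versions, one possible caller-side guard is to reindex the result against the columns that were expected. This is only a sketch and not part of pandas: `restore_columns` is a hypothetical helper name, and a bare `pd.DataFrame()` is used as a stand-in for the column-less result shown in the actual output.

```python
import pandas as pd


def restore_columns(result, expected_columns):
    """Hypothetical helper: conform result to the expected column labels."""
    # reindex creates any missing column (filled with NaN if there are rows);
    # on an empty frame it simply restores the labels.
    return result.reindex(columns=list(expected_columns))


# Stand-in for the column-less frame returned in the report above.
stripped = pd.DataFrame()
restored = restore_columns(stripped, ["a", "b"])

print(list(restored.columns))
print(len(restored["b"]))  # column access no longer raises KeyError
```

Note that this restores only the column labels; the restored columns do not recover their original dtypes or the original index type.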

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.93-linuxkit-aufs
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: None.None
pandas: 0.24.2
pytest: None
pip: 8.1.1
setuptools: 20.7.0
Cython: None
numpy: 1.16.2
scipy: 0.17.0
pyarrow: None
xarray: None
IPython: None
sphinx: 1.3.6
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 3.5.0
bs4: 4.4.1
html5lib: 0.999
sqlalchemy: 1.3.1
pymysql: 0.7.2.None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
TomAugspurger added the Groupby and Resample labels May 15, 2019
TomAugspurger added this to the Contributions Welcome milestone May 15, 2019
@TomAugspurger
Contributor

Confirmed the bug on master, haven't looked at it any further.

@krsnik93
Contributor

I have traced this down to this bit of code in core/groupby/generic.py (line 273):

if len(keys) == 0:
    return DataFrame(index=keys)

I have managed to preserve columns and dtypes by replacing it with this:

if len(keys) == 0:
    result = DataFrame(index=keys, columns=self.obj.columns)
    result = result.astype(self.obj.dtypes.to_dict())
    return result
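
The replacement above can be tried outside of pandas internals. The following is a minimal standalone sketch of the same idea, assuming a plain function in place of the groupby method: `empty_result_like`, `source`, and `patched` are hypothetical names used only for illustration, and `obj` plays the role of `self.obj`.

```python
import pandas as pd


def empty_result_like(obj, keys):
    """Hypothetical stand-in for the patched branch shown above."""
    result = pd.DataFrame(index=keys, columns=obj.columns)
    # The new columns start out as object dtype; cast each one back
    # to the dtype it had in the source frame.
    return result.astype(obj.dtypes.to_dict())


source = pd.DataFrame(
    {"a": pd.Series([], dtype="float64"), "b": pd.Series([], dtype="int64")}
)
patched = empty_result_like(source, keys=pd.Index([]))

print(patched.dtypes)
```

The sketch only covers columns and dtypes; the index still comes from `keys`, which is the part discussed next.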

Regarding the index, it could be kept with something like:

result = DataFrame(index=self.obj.index[:0], columns=self.obj.columns)

but this (even after modifications) breaks test_agg_apply_corner() in tests/groupby/aggregate/test_aggregate.py which expects groupby performed on float64 values to result in a Float64Index. In other words, it expects the index to be consistent with the values, not with the index of the starting DataFrame.
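
The two candidate index types in this trade-off can be compared directly. This is an illustrative sketch only, not the pandas implementation: the frame `source` and the names `from_original` and `from_values` are assumptions made for the example, with column "a" standing in for the grouping key.

```python
import pandas as pd

# Empty float64 frame indexed by an empty TimedeltaIndex.
source = pd.DataFrame(
    columns=["a", "b"], index=pd.TimedeltaIndex([]), dtype="float64"
)

# Option 1: slice the original index, which keeps its type.
from_original = source.index[:0]

# Option 2: derive the index from the grouping values, which keeps their dtype.
from_values = pd.Index(source["a"].unique())

print(type(from_original).__name__)
print(from_values.dtype)
```

The first option keeps the index consistent with the starting DataFrame, while the second keeps it consistent with the grouped values, which is what the existing test expects.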

Feels like a trade-off and I am not quite sure how to proceed. Any thoughts?

@TomAugspurger
Contributor

@krsnik93 thanks for digging into this. It may be easiest to discuss over a PR.
