Grouping, then resampling empty DataFrame leads to it losing column names and index #26411

andrej · 2019-05-15T11:13:04Z

Code Sample

import pandas as pd

empty_df = pd.DataFrame([], columns=["a", "b"], index=pd.TimedeltaIndex([]))

resampled_df = empty_df.groupby("a").resample(rule=pd.to_timedelta("00:00:01")).mean()

print(resampled_df)
print(resampled_df["b"])

Problem description

After grouping and then resampling as shown above, the returned DataFrame appears to lose its metadata, such as the column names and the type of its index. This leads to a KeyError when accessing columns that existed (but were empty) before the grouping and resampling.

Perhaps somewhere in the code a "generic" empty data frame is returned without the attached information about column names and indices?

Expected Output

Empty DataFrame
Columns: [a, b]
Index: []
Series([], Name: b, dtype: object)

or similar, and empty_df["b"] == resampled_df["b"] would make sense to me.

Actual Output

Empty DataFrame
Columns: []
Index: []
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2927, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexes/base.py", line 2659, in get_lo
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'b'
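
Until the result keeps its columns, a caller can guard against this KeyError by reindexing the result against the original frame's columns. The sketch below is a caller-side workaround, not a fix in pandas. `restore_columns` is a hypothetical helper, and `bare_result` only imitates the column-less frame shown in the actual output above (an assumption) rather than calling `groupby().resample()`, whose behavior depends on the installed pandas version.

```python
import pandas as pd


def restore_columns(result, template):
    """Give `result` exactly the columns of `template` (hypothetical helper)."""
    # reindex returns a frame with template's columns; columns that are
    # missing from `result` are created and filled with NaN.
    return result.reindex(columns=template.columns)


empty_df = pd.DataFrame(columns=["a", "b"], index=pd.TimedeltaIndex([]))

# Stand-in for the column-less result reported above (assumption: an empty
# frame with no columns).
bare_result = pd.DataFrame(index=pd.TimedeltaIndex([]))

fixed = restore_columns(bare_result, empty_df)
print(fixed["b"])  # no KeyError; the Series is simply empty
```

This restores only the column names. The dtypes and the index type of the original frame are not recovered by `reindex` alone.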

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.93-linuxkit-aufs
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: None.None
pandas: 0.24.2
pytest: None
pip: 8.1.1
setuptools: 20.7.0
Cython: None
numpy: 1.16.2
scipy: 0.17.0
pyarrow: None
xarray: None
IPython: None
sphinx: 1.3.6
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 3.5.0
bs4: 4.4.1
html5lib: 0.999
sqlalchemy: 1.3.1
pymysql: 0.7.2.None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

TomAugspurger · 2019-05-15T20:32:01Z

Confirmed the bug on master; I haven't looked into it any further.

krsnik93 · 2019-05-18T11:52:34Z

I have traced this down to this bit of code in core/groupby/generic.py (line 273):

if len(keys) == 0:
    return DataFrame(index=keys)

I have managed to preserve columns and dtypes by replacing it with this:

if len(keys) == 0:
    result = DataFrame(index=keys, columns=self.obj.columns)
    result = result.astype(self.obj.dtypes.to_dict())
    return result
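
The replacement above can be tried outside of pandas internals. The following is a minimal standalone sketch, not the actual code in core/groupby/generic.py: `obj` stands in for `self.obj`, `keys` stands in for the empty group keys, and the float64/int64 column dtypes are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for self.obj: an empty frame whose columns have non-object dtypes.
obj = pd.DataFrame(
    {"a": np.array([], dtype="float64"), "b": np.array([], dtype="int64")},
    index=pd.TimedeltaIndex([]),
)
keys = pd.Index([], dtype="float64")  # stand-in for the empty group keys

# Behavior traced above: column names and dtypes are dropped.
bare = pd.DataFrame(index=keys)

# Proposed replacement: carry the columns over, then restore their dtypes.
result = pd.DataFrame(index=keys, columns=obj.columns)
result = result.astype(obj.dtypes.to_dict())

print(list(bare.columns))
print(result.dtypes)
```

The `astype` step matters because a frame created with only `index` and `columns` starts out with object-dtype columns.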

Regarding the index, it could be kept with something like:

result = DataFrame(index=self.obj.index[:0], columns=self.obj.columns)

but this (even after modifications) breaks test_agg_apply_corner() in tests/groupby/aggregate/test_aggregate.py which expects groupby performed on float64 values to result in a Float64Index. In other words, it expects the index to be consistent with the values, not with the index of the starting DataFrame.

Feels like a trade-off and I am not quite sure how to proceed. Any thoughts?
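
The trade-off described above comes down to which empty index the result should carry. As a rough illustration only: this thread does not show how pandas derives the result index internally, so `pd.Index(obj["a"])` is merely an assumed stand-in for an index built from the grouping values. The two candidates differ in type:

```python
import numpy as np
import pandas as pd

obj = pd.DataFrame(
    {"a": np.array([], dtype="float64"), "b": np.array([], dtype="float64")},
    index=pd.TimedeltaIndex([]),
)

# Option 1: slice the original index, which keeps its type (TimedeltaIndex).
from_original = obj.index[:0]

# Option 2: build the index from the grouping values, which follows their dtype.
from_values = pd.Index(obj["a"])

print(type(from_original).__name__, from_original.dtype)
print(type(from_values).__name__, from_values.dtype)
```

Both are empty, so the difference shows up only in the index type and dtype, which is what the failing test checks.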

TomAugspurger · 2019-06-13T13:13:49Z

@krsnik93 thanks for digging into this. It may be easiest to discuss over a PR.

TomAugspurger added the Groupby and Resample labels May 15, 2019

TomAugspurger added this to the Contributions Welcome milestone May 15, 2019

randomstuff mentioned this issue Sep 13, 2019

Resampling and counting empty Series does not have a correct dtype #28427

Closed

mroeschke added the Bug label May 11, 2020

This was referenced Feb 14, 2021

BUG: Empty result in df.groupby.agg on multiple keys has no columns #39809

Closed

BUG: Groupby ops on empty objects loses index, columns, dtypes #39940

Merged

jreback modified the milestones: Contributions Welcome, 1.3 Feb 21, 2021

jreback closed this as completed in #39940 Feb 24, 2021

rhshadrach mentioned this issue Mar 25, 2021

BUG: I got empty Dataframe with index from the summation of empty Dataframe with MultiIndex #40626

Closed
