
Groupby + sum by multiple columns on an empty DataFrame drops list of columns #15106

Closed
karatheodory opened this issue Jan 11, 2017 · 9 comments · Fixed by #41493
Labels
good first issue, Needs Tests (unit test(s) needed to prevent regressions)
Milestone

Comments

@karatheodory

Code Sample

import pandas as pd

# Case 1: non-empty frame, grouped by two columns
print(pd.DataFrame(data=[[1, 2, 3]], columns=['A', 'B', 'C'])
      .groupby(['A', 'B'])
      .sum()
      .reset_index()
      .columns
      .tolist())
# ['A', 'B', 'C']

# Case 2: empty frame, grouped by one column
print(pd.DataFrame(data=[], columns=['A', 'B', 'C'])
      .groupby(['A'])
      .sum()
      .reset_index()
      .columns
      .tolist())
# ['A', 'B', 'C']

# Case 3: empty frame, grouped by two columns (original columns are lost)
print(pd.DataFrame(data=[], columns=['A', 'B', 'C'])
      .groupby(['A', 'B'])
      .sum()
      .reset_index()
      .columns
      .tolist())
# ['index']

Problem description

Because the original list of columns is lost in the last example (an empty frame grouped by multiple columns), I have to either handle empty data frames separately or add the columns back myself, both of which are inconvenient.

Expected Output

The list of columns should be the same as in the original data frame.
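
Until the behavior is fixed, one way to "add the columns back" as described above is to reindex the result against the input's columns. This is only a minimal sketch: the helper name `grouped_sum_keep_columns` is hypothetical (not part of pandas), and any column the aggregation dropped comes back filled with NaN.

```python
import pandas as pd


def grouped_sum_keep_columns(df, keys):
    """Hypothetical helper: group, sum, and restore the input's column list."""
    out = df.groupby(keys).sum().reset_index()
    # reindex keeps existing columns, adds missing ones (as NaN),
    # and drops extras such as a stray 'index' column.
    return out.reindex(columns=df.columns)


empty = pd.DataFrame(data=[], columns=['A', 'B', 'C'])
print(grouped_sum_keep_columns(empty, ['A', 'B']).columns.tolist())
```

The same helper leaves non-empty frames unchanged apart from the column order being forced to match the input.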

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-57-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Jan 11, 2017

so to simplify

# this is ok
In [18]: pd.DataFrame(data=[], columns=['A', 'B', 'C']).groupby(['A','B']).C.sum()
Out[18]: Series([], Name: C, dtype: float64)

# this should be Index(['C'])
In [19]: pd.DataFrame(data=[], columns=['A', 'B', 'C']).groupby(['A','B']).sum().columns
Out[19]: Index([], dtype='object')

If you would like to trace this and submit a PR with a fix, that would be great!

@jreback added the Bug, Difficulty Intermediate, Groupby, Indexing, and Reshaping labels Jan 11, 2017
@jreback added this to the 0.20.0 milestone Jan 11, 2017
@TomAugspurger
Contributor

TomAugspurger commented Jan 11, 2017

As a pointer, this is probably related to our automatically dropping nuisance columns (non-numeric cols like object) in numeric aggregations. Explicitly setting the dtypes works:

In [74]: pd.DataFrame([], columns=["A", "B", "C"]).astype(np.int64).groupby(['A', 'B']).sum().reset_index().columns.tolist()
Out[74]: ['A', 'B', 'C']

I wonder if this means it's not actually a bug? We are "correctly" dropping an object column after all.
But the original examples 2 and 3 do seem inconsistent.
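
A variation on the explicit-dtype approach above is to give each column a dtype when the empty frame is built, so no column is object-typed in the first place. This is a sketch only; the choice of `int64` for all three columns is an assumption for illustration.

```python
import pandas as pd

# Build an empty frame whose columns already carry a numeric dtype.
typed_empty = pd.DataFrame({
    'A': pd.Series(dtype='int64'),
    'B': pd.Series(dtype='int64'),
    'C': pd.Series(dtype='int64'),
})

cols = typed_empty.groupby(['A', 'B']).sum().reset_index().columns.tolist()
print(cols)
```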

@jreback
Contributor

jreback commented Jan 11, 2017

@TomAugspurger

So this is a special case for sum. For comparison, mean raises:

In [4]: pd.DataFrame([], columns=["A", "B", "C"]).groupby(['A','B']).mean()
DataError: No numeric types to aggregate

sum, on the other hand, 'works' on object dtypes, so they are in fact NOT nuisance columns.

The reason for this is the fallback to np.sum, which again 'works' on object dtypes.

So this is 'correct', though not helpful.

@chrisaycock
Contributor

chrisaycock commented Jan 30, 2017

Agreed with @jreback. mean() catches GroupByError and re-raises:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby.py#L1031

sum() does not:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby.py#L123

DataError inherits from GroupByError, which in turn inherits from Exception. So sum() invokes the fall-back np.sum() while mean() exits immediately. Should _groupby_function() explicitly catch and re-raise GroupByError as well?
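
The difference in control flow described above can be illustrated with a toy model. This is not the pandas source: the exception classes below are local stand-ins that only mirror the inheritance chain, and `fast_path`, `sum_like`, and `mean_like` are hypothetical names.

```python
import numpy as np


class GroupByError(Exception):
    """Stand-in mirroring the hierarchy described in the comment above."""


class DataError(GroupByError):
    pass


def fast_path(values):
    # Stand-in for a numeric-only fast path that rejects object arrays.
    if values.dtype == object:
        raise DataError("No numeric types to aggregate")
    return values.sum()


def sum_like(values):
    try:
        return fast_path(values)
    except Exception:
        # DataError is an Exception, so it is swallowed here
        # and the np.sum fallback runs.
        return np.sum(values)


def mean_like(values):
    try:
        return fast_path(values)
    except GroupByError:
        # Re-raised before any generic fallback gets a chance.
        raise
    except Exception:
        return np.mean(values)


obj = np.array([1, 2], dtype=object)
print(sum_like(obj))  # the fallback handles the object array
```

Catching `GroupByError` first, as `mean_like` does, is the change the question above proposes for the sum path.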

@jreback
Contributor

jreback commented Jan 30, 2017

@chrisaycock yeah I tried to fix this at some point. The issue is that we rely on np.sum in some cases. So this should be re-engineered a bit.

@jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@jreback modified the milestones: Interesting Issues, Next Major Release May 24, 2017
@jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
@valentas-kurauskas

More generally, the behavior is also inconsistent with apply (pandas 0.23.4):

>>> pd.DataFrame(data=[[1,2,3]], columns=['A', 'B', 'C']).groupby(['A', 'B']).apply(lambda x:x)
   A  B  C
0  1  2  3
>>> pd.DataFrame(data=[], columns=['A', 'B', 'C']).groupby(['A', 'B']).apply(lambda x:x)
Empty DataFrame
Columns: []
Index: []

@kuraga

kuraga commented Apr 16, 2019

Any news? I'm running into the same issue as @valentas-kurauskas.

@majiang
Contributor

majiang commented Dec 24, 2019

The problem @valentas-kurauskas and @kuraga reported seems to be that, since the dataframe is empty, the function passed to apply is never called.
apply can't know the function's return type (which can be a scalar, a Series, or a DataFrame), so it can only drop the structure.
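
Since apply cannot infer the result shape from an empty input, a caller who already knows the function returns a frame with the input's columns can short-circuit the empty case. A minimal sketch under that assumption; `apply_or_empty` is a hypothetical helper, not a pandas API.

```python
import pandas as pd


def apply_or_empty(df, keys, func):
    """Hypothetical guard: skip groupby.apply when there is nothing to group.

    Assumes func returns a frame with the same columns as its input.
    """
    if df.empty:
        # Zero rows, but the original column list is preserved.
        return df.iloc[0:0].copy()
    return df.groupby(keys).apply(func)


empty = pd.DataFrame(data=[], columns=['A', 'B', 'C'])
print(apply_or_empty(empty, ['A', 'B'], lambda x: x).columns.tolist())
```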

@mroeschke
Member

The simplified version of OP's report seems to work now. It could use a test.

In [11]: pd.DataFrame(data=[], columns=['A', 'B', 'C']).groupby(['A','B']).sum().columns
Out[11]: Index(['C'], dtype='object')

In [12]: pd.__version__
Out[12]: '1.3.0.dev0+1567.g67c9385787'
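
A regression test along these lines might look like the sketch below. It is not the test that was actually merged; the function name is hypothetical, and the assertions assume the behavior shown in the session above (pandas 1.3 or later).

```python
import pandas as pd


def test_groupby_sum_empty_frame_keeps_columns():
    # Hypothetical test name; mirrors the simplified example from this thread.
    df = pd.DataFrame(data=[], columns=['A', 'B', 'C'])
    result = df.groupby(['A', 'B']).sum()

    # The non-key column must survive the aggregation...
    assert result.columns.tolist() == ['C']
    # ...so that reset_index restores the full original column list.
    assert result.reset_index().columns.tolist() == ['A', 'B', 'C']


test_groupby_sum_empty_frame_keeps_columns()
```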

@mroeschke removed the Bug, Groupby, Indexing, and Reshaping labels May 8, 2021
@mroeschke added the good first issue and Needs Tests labels May 8, 2021
@mroeschke modified the milestones: Contributions Welcome, 1.3 May 16, 2021

9 participants