BUG: Grouby column naming #53128

matthewgilbert · 2023-05-07T14:30:03Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# this collapses the columns to the level names
midx = pd.MultiIndex.from_tuples([("A", "", "a"), ("B", "", "b")], names=["l1", "l2", "l3"])
df = pd.DataFrame([(1, 3), (1, 2), (6, 1)], columns=midx)
print(df)
print()
print(df.groupby(by=["l1", "l3"], axis=1).sum())
print([i for i, _ in df.groupby(by=["l1", "l3"], axis=1)])

# this does not
midx = pd.MultiIndex.from_tuples([("A", "", "a"), ("B", "", "b"), ("B", "", "b1")], names=["l1", "l2", "l3"])
df = pd.DataFrame([(1, 3, 1), (1, 2, 4), (6, 1, 10)], columns=midx)
print(df)
print()
print(df.groupby(by=["l1", "l3"], axis=1).sum())
print([i for i, _ in df.groupby(by=["l1", "l3"], axis=1)])

Issue Description

Whether the group names come from the column level values or from the column level names depends on the number of values in the groupby categories. I believe this is a bug, or at least very unexpected behaviour. It can be avoided by using level= instead of by= in the groupby, but it definitely seems like a gotcha.

The first example above returns the following output, which uses the level names as the output columns:

# df
l1  A  B
l2      
l3  a  b
0   1  3
1   1  2
2   6  1

# groupby result
   l1  l3
0   1   3
1   1   2
2   6   1

# DataFrameGroupBy keys
['l1', 'l3']

Whereas the second example uses the column values:

# df
l1  A  B    
l2          
l3  a  b  b1
0   1  3   1
1   1  2   4
2   6  1  10

# groupby result
l1  A  B    
l3  a  b  b1
0   1  3   1
1   1  2   4
2   6  1  10

# DataFrameGroupBy keys
[('A', 'a'), ('B', 'b'), ('B', 'b1')]

Expected Behavior

I would expect the groupby on column levels to always return groups based on what the column values are, i.e. the same behaviour as df.groupby(level=["l1", "l3"], axis=1).
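For illustration, here is a minimal sketch of the expected result for the first (two-column) example. It groups by level, as suggested above, but on the transposed frame. The transpose is an assumption added here so the snippet does not rely on the axis=1 keyword; the names `result` and `keys` are only illustrative.

```python
import pandas as pd

# Same two-column frame as the first example in the report.
midx = pd.MultiIndex.from_tuples(
    [("A", "", "a"), ("B", "", "b")], names=["l1", "l2", "l3"]
)
df = pd.DataFrame([(1, 3), (1, 2), (6, 1)], columns=midx)

# Group the transposed frame by index levels, then transpose back.
# Grouping by level uses the level values, never the level names.
result = df.T.groupby(level=["l1", "l3"]).sum().T
print(result)

# The group keys are tuples of level values.
keys = [key for key, _ in df.T.groupby(level=["l1", "l3"])]
print(keys)
```

With level-based grouping the output columns are built from the values in l1 and l3, which is the behaviour the report expects from by= as well.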

Installed Versions

In [9]: pd.show_versions()

INSTALLED VERSIONS

commit : 86a4ee0
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.14.0-1059-oem
Version : #67-Ubuntu SMP Mon Mar 13 14:22:10 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0.dev0+707.g86a4ee01c7
numpy : 1.25.0.dev0+1357.ga2d21d8ac
pytz : 2023.3
dateutil : 2.8.2
setuptools : 58.0.4
pip : 21.2.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.29.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None


topper-123 · 2023-05-12T11:41:47Z

Groupby with axis=1 will be deprecated in the upcoming v2.1, see #51203.

On that basis IMO this issue is not relevant any more, but I'll let it stay open for a while, if you/someone has additional comments.

rhshadrach · 2023-07-11T20:50:22Z

Agreed @topper-123 - closing.

matthewgilbert added the Bug and Needs Triage labels on May 7, 2023

topper-123 added the Groupby and Closing Candidate labels and removed the Needs Triage label on May 12, 2023

rhshadrach closed this as completed Jul 11, 2023
