BUG: Inconsistent behaviour in DataFrameGroupBy when selecting a subset of columns #44821

Closed
3 tasks done
maurosilber opened this issue Dec 8, 2021 · 3 comments · Fixed by #44947

@maurosilber
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({k: range(10) for k in "ABC"})

# Group by and select columns
dg = df.groupby(df.A < 4)[["A", "B"]]


# Apply on GroupBy
print(dg.apply(lambda x: [x.shape, x.columns]))

# A
# False    [(6, 2), [A, B]]
# True     [(4, 2), [A, B]]
# dtype: object


# Iterate on GroupBy
for ix, dg_ix in dg:
    print(ix, dg_ix.shape, dg_ix.columns)

# False (6, 3) Index(['A', 'B', 'C'], dtype='object')
# True (4, 3) Index(['A', 'B', 'C'], dtype='object')

Issue Description

Selecting a subset of columns with DataFrameGroupBy.__getitem__ is not applied consistently: the selection is honoured by DataFrameGroupBy.apply but ignored by DataFrameGroupBy.__iter__, which still yields every column.

Expected Behavior

I expected dg_ix to contain only A and B as columns.

# Iterate on GroupBy
for ix, dg_ix in dg:
    print(ix, dg_ix.shape, dg_ix.columns)

# False (6, 2) Index(['A', 'B'], dtype='object')
# True (4, 2) Index(['A', 'B'], dtype='object')
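
Until iteration honours the selection, one workaround that does not depend on the pandas version is to iterate over the unselected GroupBy and take the column subset inside the loop. This is a minimal sketch reusing the frame from the reproducible example; the names `cols`, `subset` and `shapes` are illustrative and not part of the report.

```python
import pandas as pd

df = pd.DataFrame({k: range(10) for k in "ABC"})
cols = ["A", "B"]  # the columns we actually want in each group

shapes = {}
# Iterate over the plain GroupBy and subset each group explicitly,
# so the result does not depend on how __iter__ treats a selection.
for key, group in df.groupby(df.A < 4):
    subset = group[cols]
    shapes[key] = (subset.shape, list(subset.columns))
    print(key, subset.shape, list(subset.columns))
```

Each `subset` then contains only A and B, which matches the expected behaviour described above.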

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.14.18-300.fc35.x86_64
Version : #1 SMP Fri Nov 12 16:43:17 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 59.1.0
Cython : None
pytest : 5.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.29.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.2
sqlalchemy : None
tables : None
tabulate : None
xarray : 0.20.1
xlrd : None
xlwt : None
numba : 0.53.1

@maurosilber added the Bug and Needs Triage labels Dec 8, 2021
@maurosilber
Contributor Author

I found these old issues discussing the selection of multiple columns in GroupBy objects:

@simonjayhawkins
Member

Thanks @maurosilber for the report and investigation.

I expected dg_ix to contain only A and B as columns.

makes sense.

A subset of columns is not (always) selected with DataFrameGroupBy.__getitem__.

This looks like a bug to me with just the __iter__ method? (that was perhaps missed in the previous issue #5264)

PR to fix or further investigation welcome.

@simonjayhawkins added the Groupby label and removed the Needs Triage label Dec 17, 2021
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Dec 17, 2021
maurosilber added a commit to maurosilber/pandas that referenced this issue Dec 17, 2021
…#44821)

Fixes issue pandas-dev#44821.

When trying to iterate on a subset of columns in a GroupBy object,
it returned all columns, instead of the selected subset.

GroupBy.__iter__ used self.obj instead of self._selected_obj (see
PR pandas-dev#6570).
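
The difference between iterating over the full object and over the selected object can be illustrated with a toy model. This is not pandas source code: `ToyGroupBy`, `selected_obj`, `iter_full` and `iter_selected` are hypothetical names used only to mimic the roles that the commit message assigns to `self.obj` and `self._selected_obj`.

```python
import pandas as pd


class ToyGroupBy:
    """Toy stand-in for a GroupBy object; not the pandas implementation."""

    def __init__(self, obj, keys, selection=None):
        self.obj = obj              # the full, original frame
        self.keys = keys            # Series holding one grouping label per row
        self.selection = selection  # columns picked via [...], or None

    @property
    def selected_obj(self):
        # The frame restricted to the selected columns, if any were given.
        if self.selection is None:
            return self.obj
        return self.obj[self.selection]

    def iter_full(self):
        # Mirrors the reported behaviour: the selection is ignored.
        for key in sorted(self.keys.unique()):
            yield key, self.obj[self.keys == key]

    def iter_selected(self):
        # Mirrors the described fix: groups come from the selected frame.
        for key in sorted(self.keys.unique()):
            yield key, self.selected_obj[self.keys == key]


df = pd.DataFrame({k: range(10) for k in "ABC"})
toy = ToyGroupBy(df, df.A < 4, selection=["A", "B"])

full_cols = {key: list(g.columns) for key, g in toy.iter_full()}
selected_cols = {key: list(g.columns) for key, g in toy.iter_selected()}
print(full_cols)
print(selected_cols)
```

In this sketch `iter_full` yields all three columns per group, while `iter_selected` yields only A and B.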
@maurosilber
Contributor Author

I opened a pull request, but the following test

@pytest.mark.parametrize(
    "group_keys",
    [
        (1,),
        (1, 2),
        (2, 1),
        (1, 1, 2),
        (1, 2, 1),
        (1, 1, 2, 2),
        (1, 2, 3, 2, 3),
        (1, 1, 2) * 4,
        (1, 2, 3) * 5,
    ],
)
@pytest.mark.parametrize("window_size", [1, 2, 3, 4, 5, 8, 20])
def test_rolling_groupby_with_fixed_forward_many(group_keys, window_size):
    # GH 43267
    df = DataFrame(
        {
            "a": np.array(list(group_keys)),
            "b": np.arange(len(group_keys), dtype=np.float64) + 17,
            "c": np.arange(len(group_keys), dtype=np.int64),
        }
    )
    indexer = FixedForwardWindowIndexer(window_size=window_size)
    result = df.groupby("a")["b"].rolling(window=indexer, min_periods=1).sum()
    result.index.names = ["a", "c"]
    groups = df.groupby("a")[["a", "b"]]
    manual = concat(
        [
            g.assign(
                b=[
                    g["b"].iloc[i : i + window_size].sum(min_count=1)
                    for i in range(len(g))
                ]
            )
            for _, g in groups
        ]
    )
    manual = manual.set_index(["a", "c"])["b"]
    tm.assert_series_equal(result, manual)

fails in line 452
manual = manual.set_index(["a", "c"])["b"]

with

KeyError: "None of ['c'] are in the columns"

This is expected, since column "c" was not selected in line 440:

groups = df.groupby("a")[["a", "b"]]

Should I correct the test by also selecting "c"? Note that the change to __iter__ might break someone's code if it relies on the old behaviour.

maurosilber added a commit to maurosilber/pandas that referenced this issue Dec 19, 2021
Fixes test due to changes in GroupBy.__iter__ (see pandas-dev#44821).

As the column `c` wasn't selected on the manual computation,
it failed when trying to set it as an index.
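
One way the manual computation in that test could be made independent of how iteration treats the selection is to include "c" in the selected columns. The sketch below is a standalone reduction of the test, with one parametrized case fixed by hand (`group_keys = (1, 2, 1)`, `window_size = 2`) and `pandas.testing` used in place of the internal `tm` helper. It is not necessarily the exact change made in the commit.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

group_keys = (1, 2, 1)  # one of the parametrized cases, fixed by hand
window_size = 2

df = pd.DataFrame(
    {
        "a": np.array(list(group_keys)),
        "b": np.arange(len(group_keys), dtype=np.float64) + 17,
        "c": np.arange(len(group_keys), dtype=np.int64),
    }
)

# Forward-looking rolling sum per group, computed by pandas.
indexer = FixedForwardWindowIndexer(window_size=window_size)
result = df.groupby("a")["b"].rolling(window=indexer, min_periods=1).sum()
result.index.names = ["a", "c"]

# Select "c" as well, so it is present in every group regardless of
# whether iteration yields all columns or only the selected ones.
groups = df.groupby("a")[["a", "b", "c"]]
manual = pd.concat(
    [
        g.assign(
            b=[
                g["b"].iloc[i : i + window_size].sum(min_count=1)
                for i in range(len(g))
            ]
        )
        for _, g in groups
    ]
)
manual = manual.set_index(["a", "c"])["b"]

pd.testing.assert_series_equal(result, manual)
```

With "c" selected explicitly, `set_index(["a", "c"])` no longer depends on unselected columns leaking into the iterated groups.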
@jreback jreback modified the milestones: Contributions Welcome, 1.4 Dec 23, 2021