BUG: GroupBy.apply() throws erroneous ValueError with duplicate axes #35441

Merged: 13 commits, Aug 6, 2020
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -132,6 +132,7 @@ Plotting
 Groupby/resample/rolling
 ^^^^^^^^^^^^^^^^^^^^^^^^

+- Bug in :meth:`DataFrameGroupBy.apply` that would sometimes throw an erroneous ``ValueError`` if the grouping axis had duplicate entries (:issue:`16646`)
 -
 -

8 changes: 4 additions & 4 deletions pandas/core/groupby/ops.py
@@ -211,7 +211,7 @@ def apply(self, f: F, data: FrameOrSeries, axis: int = 0):
             # group might be modified
             group_axes = group.axes
             res = f(group)
-            if not _is_indexed_like(res, group_axes):
+            if not _is_indexed_like(res, group_axes, axis):
                 mutated = True
             result_values.append(res)

@@ -897,13 +897,13 @@ def agg_series(
         return grouper.get_result()


-def _is_indexed_like(obj, axes) -> bool:
+def _is_indexed_like(obj, axes, axis: int) -> bool:
     if isinstance(obj, Series):
         if len(axes) > 1:
             return False
-        return obj.index.equals(axes[0])
+        return obj.axes[axis].equals(axes[axis])
     elif isinstance(obj, DataFrame):
-        return obj.index.equals(axes[0])
+        return obj.axes[axis].equals(axes[axis])

     return False
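
The effect of the change in `_is_indexed_like` can be sketched outside of pandas internals. The snippet below is only an illustration, not the library's code path: `is_indexed_like_old` and `is_indexed_like_new` are hypothetical copies of the removed and added logic, and the sample frame and the single `group` slice are assumptions standing in for one group produced by grouping on `axis=1`.

```python
import numpy as np
import pandas as pd


def is_indexed_like_old(obj, axes) -> bool:
    # Hypothetical copy of the pre-fix logic: always compares the row index.
    if isinstance(obj, pd.Series):
        if len(axes) > 1:
            return False
        return obj.index.equals(axes[0])
    elif isinstance(obj, pd.DataFrame):
        return obj.index.equals(axes[0])
    return False


def is_indexed_like_new(obj, axes, axis: int) -> bool:
    # Hypothetical copy of the post-fix logic: compares the grouped axis.
    if isinstance(obj, pd.Series):
        if len(axes) > 1:
            return False
        return obj.axes[axis].equals(axes[axis])
    elif isinstance(obj, pd.DataFrame):
        return obj.axes[axis].equals(axes[axis])
    return False


# Illustrative data (not taken from the PR): column groups "A" and "B".
df = pd.DataFrame(
    np.arange(24.0).reshape(6, 4),
    columns=pd.MultiIndex.from_product([["A", "B"], [1, 2]]),
)

# Stand-in for one group when grouping on axis=1, level=0.
group = df.iloc[:, :2]
# The applied function relabels the columns but leaves the rows untouched.
res = group.droplevel(level=0, axis=1)

old_says_unchanged = is_indexed_like_old(res, group.axes)
new_says_unchanged = is_indexed_like_new(res, group.axes, axis=1)
print(old_says_unchanged, new_says_unchanged)
```

In this sketch the old check looks only at the rows, so it reports the result as indexed like the group even though the column labels changed; the axis-aware check compares the columns and reports the difference.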

27 changes: 20 additions & 7 deletions pandas/tests/groupby/test_apply.py
@@ -63,15 +63,8 @@ def test_apply_trivial():
     tm.assert_frame_equal(result, expected)


-@pytest.mark.xfail(
-    reason="GH#20066; function passed into apply "
-    "returns a DataFrame with the same index "
-    "as the one to create GroupBy object."
-)
 def test_apply_trivial_fail():
     # GH 20066
-    # trivial apply fails if the constant dataframe has the same index
-    # with the one used to create GroupBy object.
     df = pd.DataFrame(
         {"key": ["a", "a", "b", "b", "a"], "data": [1.0, 2.0, 3.0, 4.0, 5.0]},
         columns=["key", "data"],
@@ -1044,3 +1037,23 @@ def test_apply_with_date_in_multiindex_does_not_convert_to_timestamp():
     tm.assert_frame_equal(result, expected)
     for val in result.index.levels[1]:
         assert type(val) is date
+
+
+def test_apply_by_cols_equals_apply_by_rows_transposed():
+    # GH 16646
+    # Operating on the columns, or transposing and operating on the rows
+    # should give the same result. There was previously a bug where the
+    # by_rows operation would work fine, but by_cols would throw a ValueError
+
+    df = pd.DataFrame(
+        np.random.random([6, 4]),
+        columns=pd.MultiIndex.from_product([["A", "B"], [1, 2]]),
+    )
+
+    by_rows = df.T.groupby(axis=0, level=0).apply(
+        lambda x: x.droplevel(axis=0, level=0)
+    )
+    by_cols = df.groupby(axis=1, level=0).apply(lambda x: x.droplevel(axis=1, level=0))
+
+    tm.assert_frame_equal(by_cols, by_rows.T)
+    tm.assert_frame_equal(by_cols, df)
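
The row-wise half of the new test can be tried on its own as a rough sketch. Later pandas releases deprecate `groupby(axis=1)`, so this example only exercises the transposed path (the one the test comment says already worked) and checks that transposing back recovers the original frame. The seeded generator and the `round_trip` name are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the illustration is reproducible (an assumption;
# the test itself uses np.random.random).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((6, 4)),
    columns=pd.MultiIndex.from_product([["A", "B"], [1, 2]]),
)

# Transpose, group the rows by the outer level, and drop that level
# inside each group; apply adds the group key back as the outer level.
by_rows = df.T.groupby(level=0).apply(lambda x: x.droplevel(level=0, axis=0))

# Transposing back should give the original column layout and values.
round_trip = by_rows.T
print(round_trip.columns.tolist())
```

If the round trip holds, `round_trip` carries the same column labels and values as `df`, which is the property the test asserts for both the row-wise and column-wise paths.
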
4 changes: 0 additions & 4 deletions pandas/tests/groupby/test_function.py
@@ -940,10 +940,6 @@ def test_frame_describe_multikey(tsframe):
     groupedT = tsframe.groupby({"A": 0, "B": 0, "C": 1, "D": 1}, axis=1)
     result = groupedT.describe()
     expected = tsframe.describe().T
-    expected.index = pd.MultiIndex(
Contributor

> One existing test previously had to manipulate the expected index, but that is no longer necessary.

@smithto1 I don't understand the change to the expected here, can you expand on why the new value is correct?

Member Author

With the bugfix in this PR, I think .describe() on axis=1 is now encountering the bug reported in #26805, #22848, and #22546, where group_keys=True is ignored.

The behaviour of the new code with group_keys=True/False is the same as the old code with group_keys=False.

I'll try to take a look at the 'group_keys' bug; if that is fixed I think this test will probably be reverted.

Member Author

I think this may be addressed by #34998 but for the axis=1 case.

-        levels=[[0, 1], expected.index],
-        codes=[[0, 0, 1, 1], range(len(expected.index))],
-    )
     tm.assert_frame_equal(result, expected)

