Bug 29764 groupby loses index name sometimes #33111

Closed
3 changes: 1 addition & 2 deletions doc/source/whatsnew/v1.2.0.rst
@@ -132,9 +132,8 @@ Plotting
 Groupby/resample/rolling
 ^^^^^^^^^^^^^^^^^^^^^^^^

+- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
--
--


 Reshaping
 ^^^^^^^^^
6 changes: 4 additions & 2 deletions pandas/core/groupby/generic.py
@@ -1728,9 +1728,10 @@ def _wrap_aggregated_output(
         -------
         DataFrame
         """
+        idx_name = output.pop("idx_name", None)
         indexed_output = {key.position: val for key, val in output.items()}
         name = self._obj_with_exclusions._get_axis(1 - self.axis).name
-        columns = Index([key.label for key in output], name=name)
+        columns = Index([key.label for key in output], name=idx_name)

         result = self.obj._constructor(indexed_output)
         result.columns = columns
@@ -1762,8 +1763,9 @@ def _wrap_transformed_output(
         -------
         DataFrame
         """
+        idx_name = output.pop("idx_name", None)
         indexed_output = {key.position: val for key, val in output.items()}
-        columns = Index(key.label for key in output)
+        columns = Index([key.label for key in output], name=idx_name)

         result = self.obj._constructor(indexed_output)
         result.columns = columns
5 changes: 4 additions & 1 deletion pandas/core/groupby/groupby.py
@@ -2450,8 +2450,11 @@ def _get_cythonized_result(
         grouper = self.grouper

         labels, _, ngroups = grouper.group_info
-        output: Dict[base.OutputKey, np.ndarray] = {}
+        output: Dict[Union[base.OutputKey, str], Union[np.ndarray, str]] = {}
Contributor

I don't like this at all: output is very well defined here, and we are now using it for multiple different things.

Can we just pass the index name to _wrap_aggregated_output and _wrap_transformed_output?
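
The suggestion above can be sketched in plain Python. This is a minimal illustration, not pandas code: `OutputKey` is a simplified stand-in for the internal `base.OutputKey`, lists stand in for NumPy arrays, and the `wrap_output` function and its `index_name` parameter are hypothetical names chosen for the example.

```python
from typing import Dict, Hashable, List, NamedTuple, Optional, Tuple


class OutputKey(NamedTuple):
    # Simplified stand-in for pandas' internal base.OutputKey (assumption).
    label: Hashable
    position: int


def wrap_output(
    output: Dict[OutputKey, List[float]],
    index_name: Optional[Hashable] = None,
) -> Tuple[List[Hashable], Optional[Hashable], Dict[int, List[float]]]:
    """Hypothetical wrapper: the column index name is its own argument,
    so every entry of ``output`` stays an OutputKey -> values mapping."""
    indexed_output = {key.position: val for key, val in output.items()}
    labels = [key.label for key in output]
    return labels, index_name, indexed_output


output = {OutputKey("a", 0): [1.0, 2.0], OutputKey("b", 1): [-1.0, -2.0]}
labels, name, data = wrap_output(output, index_name="idx")
print(labels, name, data)
```

With this shape the annotation of `output` does not need to be widened with `Union`, because the name never travels inside the dictionary.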

Member Author

@phofl phofl Jun 16, 2020

I'll try to implement this today or tomorrow.

We have to change the Series code too in this case, if I remember correctly. But I can tell you more after implementing this. Thanks for the feedback.

         base_func = getattr(libgroupby, how)
+        obj = self._selected_obj
+        if isinstance(obj, DataFrame):
Contributor

What are you doing here? This is very odd.

Member Author

@phofl phofl Apr 7, 2020

The function call self._wrap_aggregated_output(output) or self._wrap_transformed_output(output) converts the array containing the results into a DataFrame (or a Series, but that is not important in our case). This happens below the marked section. The name of the index is already lost at that point, so I added the name to the output dictionary, which is passed as input to these functions. This is not relevant if the result is a Series, so I check whether we have a DataFrame.

My first idea was to check whether the original object was of type DataFrameGroupBy, but I could not perform this check without importing DataFrameGroupBy at runtime. To avoid this, I used the approach you see above.

Does this answer your question?
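
The mechanism described in this reply can be sketched without pandas. This is a simplified illustration under stated assumptions: `OutputKey` mimics the internal `base.OutputKey`, lists replace NumPy arrays, and `build_output`, `wrap_output`, and `wrap_without_pop` are hypothetical helpers. Only the `"idx_name"` key and the `output.pop("idx_name", None)` call are taken from the diff.

```python
from typing import Dict, Hashable, List, NamedTuple, Optional, Union


class OutputKey(NamedTuple):
    # Simplified stand-in for pandas' internal base.OutputKey (assumption).
    label: Hashable
    position: int


Output = Dict[Union[OutputKey, str], Union[List[float], Optional[str]]]


def build_output(columns_name: Optional[str]) -> Output:
    """Hypothetical producer: stores the column index name next to the results."""
    output: Output = {}
    output["idx_name"] = columns_name  # extra entry that is not an OutputKey
    output[OutputKey("a", 0)] = [1.0, 2.0]
    return output


def wrap_output(output: Output):
    """Hypothetical consumer: removes the extra entry before using the keys."""
    idx_name = output.pop("idx_name", None)
    indexed_output = {key.position: val for key, val in output.items()}
    labels = [key.label for key in output]
    return labels, idx_name, indexed_output


def wrap_without_pop(output: Output):
    # Any consumer that forgets the pop hits the plain string key.
    return {key.position: val for key, val in output.items()}


labels, idx_name, data = wrap_output(build_output("idx"))
print(labels, idx_name, data)

try:
    wrap_without_pop(build_output("idx"))
    failed = False
except AttributeError:
    failed = True
print(failed)
```

The second call illustrates the coupling the reviewer objects to: once the dictionary mixes result entries with metadata, every consumer has to know about the special key.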

+            output["idx_name"] = obj.columns.name

         error_msg = ""
         for idx, obj in enumerate(self._iterate_slices()):
23 changes: 23 additions & 0 deletions pandas/tests/groupby/test_groupby.py
@@ -2069,3 +2069,26 @@ def test_group_on_two_row_multiindex_returns_one_tuple_key():
     assert len(result) == 1
     key = (1, 2)
     assert (result[key] == expected[key]).all()
+
+
+@pytest.mark.parametrize("func", ["sum", "any", "shift"])
+def test_groupby_column_index_name_lost(func):
+    # GH: 29764 groupby loses index sometimes
+    expected = pd.Index(["a"], name="idx")
+    df = pd.DataFrame([[1]], columns=expected)
+    df_grouped = df.groupby([1])
+    result = getattr(df_grouped, func)().columns
+    tm.assert_index_equal(result, expected)
+
+
+@pytest.mark.parametrize("func", ["ffill", "bfill"])
+def test_groupby_column_index_name_lost_fill_funcs(func):
+    # GH: 29764 groupby loses index sometimes
+    df = pd.DataFrame(
+        [[1, 1.0, -1.0], [1, np.nan, np.nan], [1, 2.0, -2.0]],
+        columns=pd.Index(["type", "a", "b"], name="idx"),
+    )
+    df_grouped = df.groupby(["type"])[["a", "b"]]
+    result = getattr(df_grouped, func)().columns
+    expected = pd.Index(["a", "b"], name="idx")
+    tm.assert_index_equal(result, expected)
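
The behaviour these tests pin down can also be checked interactively. The snippet below is a small sketch that reuses the data from the second test and assumes a pandas release that already includes a fix for issue 29764; on an affected version some of the reported names would be `None`. Grouping by the `"type"` column for all five methods is a choice made for this example, not something the PR prescribes.

```python
import numpy as np
import pandas as pd

# Same frame as in the fill-function test: the column Index is named "idx".
df = pd.DataFrame(
    [[1, 1.0, -1.0], [1, np.nan, np.nan], [1, 2.0, -2.0]],
    columns=pd.Index(["type", "a", "b"], name="idx"),
)
grouped = df.groupby("type")[["a", "b"]]

# Collect the name of the result's column Index for each groupby method.
column_names = {
    func: getattr(grouped, func)().columns.name
    for func in ["sum", "any", "shift", "ffill", "bfill"]
}
print(column_names)
```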