Skip to content

Bug 29764 groupby loses index name sometimes #33111

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

phofl
Copy link
Member

@phofl phofl commented Mar 29, 2020

Some group by functions lost the column index name. For the functions running into _get_cythonized_result in groupby.py, the index name was just ignored when defining the column names. So the following functions _wrap_aggregated_output and _wrap_transformed_output in the class DataFrameGroupBy had not acces to this information, because it was already lost there. I collected the information beforehand and defined the Index Name accordingly.

We could refactor both methods a bit at a later stage, because at it is right now (and was before) the first few lines are duplicates.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were any other aggregation functions affected by the loss of index names?

base_func = getattr(libgroupby, how)
obj = self._selected_obj
if isinstance(obj, DataFrame):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you doing here? this is very odd

Copy link
Member Author

@phofl phofl Apr 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function call self._wrap_aggregated_output or self._wrap_transformed_output(output) convert the array containing the results in a DataFrame (or Series, but not important in our case). This is below the marked section. The name of the index is already lost at this point, so I added the name to the output dictionary, which is given as input for this functions. This is not relevant, if the result is a Series, so I check, if we have a DataFrame.

My first idea was to check if the original object was from the type DataFrameGroupBy, but I could not perform this check without importing DataFrameGroupBy during runtime. To avoid this, I used the method you see above.

Does this answer your question?

@phofl phofl requested a review from jreback April 19, 2020 21:17
@jreback
Copy link
Contributor

jreback commented Jun 14, 2020

can you merge master and will have a look

@@ -2431,8 +2431,11 @@ def _get_cythonized_result(
grouper = self.grouper

labels, _, ngroups = grouper.group_info
output: Dict[base.OutputKey, np.ndarray] = {}
output: Dict[Union[base.OutputKey, str], Union[np.ndarray, str]] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this at all, output is very well defined here and we are now using it for multiple different things.

can we just pass the index name to _wrap_aggregated_output and _wrap_transformed_output ?

Copy link
Member Author

@phofl phofl Jun 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to implement this today or tomorrow.

We have to change the Series code too in this case, if I remember correctly. But I can tell you more after implementing this. Thanks for the feedback,

@jreback
Copy link
Contributor

jreback commented Jul 17, 2020

if u can merge master and update will look

@simonjayhawkins
Copy link
Member

@phofl can you move release note to 1.2 and merge upstream/master to resolve conflict

phofl added 2 commits August 4, 2020 15:37
� Conflicts:
�	doc/source/whatsnew/v1.1.0.rst
�	pandas/core/groupby/generic.py
�	pandas/tests/groupby/test_groupby.py
@phofl
Copy link
Member Author

phofl commented Aug 4, 2020

Merged master and moved whats new entry. I hope, that I will be able to change the functions as requested by jereback in the next few days. I have not that much time currently

@phofl
Copy link
Member Author

phofl commented Sep 4, 2020

Issue was fixed for sum and any with c9144ca. I close this PR and open an new one for the rest

@phofl phofl closed this Sep 4, 2020
@phofl phofl deleted the BUG_29764_groupby_loses_index_name_sometimes branch September 5, 2020 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GroupBy Doesn't Always Maintain Column Index Name
4 participants