BUG: DataFrameGroupBy.agg with lists doesn't respect as_index=False #52849

Closed
rhshadrach opened this issue Apr 22, 2023 · 9 comments · Fixed by #53237
@rhshadrach
Member

rhshadrach commented Apr 22, 2023

from pandas import DataFrame

df = DataFrame({"a1": [0, 0, 1], "a2": [2, 3, 3], "b": [4, 5, 6]})
# df = df.astype({"a1": "category", "a2": "category"})
gb = df.groupby(by=["a1", "a2"], as_index=False)
result = gb.agg(["sum"])
print(result)
#         b
#       sum
# a1 a2
# 0  2    4
#    3    5
# 1  3    6

The expected output is:

  a1 a2   b
        sum
0  0  2   4
1  0  3   5
2  1  3   6

In particular, the index is a RangeIndex and a1 as well as a2 are columns.
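
For comparison, a frame of the expected shape can be built by grouping with the default as_index=True and then resetting the index. This is only a sketch of the target layout, not the fix itself; the names indexed and expected are illustrative and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"a1": [0, 0, 1], "a2": [2, 3, 3], "b": [4, 5, 6]})

# Group with the default as_index=True, so the keys land in the index.
indexed = df.groupby(by=["a1", "a2"]).agg(["sum"])

# Move the keys back into columns. With MultiIndex columns, the key names
# go in the top level and the lower level is filled with "".
expected = indexed.reset_index()

print(expected)
print(type(expected.index).__name__)
print(list(expected.columns))
```

The result has a RangeIndex, keeps a1 and a2 as columns, and retains the two-level columns for the aggregated values.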

@rhshadrach rhshadrach added Bug Groupby Apply Apply, Aggregate, Transform, Map labels Apr 22, 2023
@szczekulskij

This looks a bit hard, but I'll give it a try. If I can't solve it within a week, I'll unassign myself.

@szczekulskij

take

@szczekulskij

There is also missing data in a1

@szczekulskij

szczekulskij commented Apr 24, 2023

The single missing number aside (from a1 column) - as this seems like a separate issue.

What should be the expected output for gb.agg(["sum", "min"]) ?
It outputs something like this, which looks reasonable, since how else could we output multiple agg columns?

#         b    
#       sum min
# a1 a2        
# 0  2    4   4
#    3    5   5
# 1  3    6   6

Therefore, if I understand correctly: if we pass a list with only one element, we want to override the list behaviour and output something similar to gb.agg("sum"), right? My concern is that some could argue that the current output for gb.agg(["sum"]) is correct: if you pass in a list, you do want the output df to include the "sum" annotation.

@rhshadrach
Member Author

The single missing number aside (from a1 column) - as this seems like a separate issue.

Are you meaning something like the following?

#         b    
#       sum min
# a1 a2        
# 0  2    4   4
# 0  3    5   5
# 1  3    6   6

pandas prints a sparse representation for the index. A missing entry implicitly means it is repeated from above.
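
One way to confirm that nothing is missing is to inspect the index directly, or to turn off the sparse display. A minimal sketch, assuming the display.multi_sparse option is the setting that controls this behaviour in the installed pandas version; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a1": [0, 0, 1], "a2": [2, 3, 3], "b": [4, 5, 6]})
result = df.groupby(by=["a1", "a2"]).agg(["sum", "min"])

# The underlying MultiIndex stores every key pair in full.
print(result.index.tolist())

# With sparse display turned off, the outer key is repeated on each row.
with pd.option_context("display.multi_sparse", False):
    dense_text = repr(result)
print(dense_text)
```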

What should be the expected output for gb.agg(["sum", "min"]) ?

I originally had the incorrect columns in the OP; it should be a MultiIndex. The OP has been updated.

@szczekulskij

Hey, I'm sorry, but I'll need to drop this issue. pandas is a bit too much for me right now; I'll come back to this in the future.

@Charlie-XIAO
Contributor

I believe this happens because the function returns too early. As shown below, the result never goes through the as_index check. I will postpone the return until after as_index has been checked.

result = op.agg()
if not is_dict_like(func) and result is not None:
    return result

@Charlie-XIAO
Contributor

take

@Charlie-XIAO
Contributor

But Categorical data seems to be tricky. The current

result = self._insert_inaxis_grouper(result)
result.index = default_index(len(result))

seems unable to deal with that.
