-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Enh/group by.transform should accept similar arguments to group by.agg #58773
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enh/group by.transform should accept similar arguments to group by.agg #58773
Conversation
f49fa6d
to
e8ae75f
Compare
pandas/core/groupby/generic.py
Outdated
|
||
@staticmethod | ||
def _named_agg_to_dict(*named_aggs: tuple[str, NamedAgg]) -> dict[str, NamedAgg]: | ||
valid_items = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you inline this function since it's only use once in this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, simple is best, will Inline it.
pandas/core/groupby/generic.py
Outdated
# Convert named aggregations to dictionary format | ||
transformed_func = self._named_agg_to_dict(*kwargs.items()) | ||
kwargs = {} | ||
if isinstance(transformed_func, dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this if
statement needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this check is redundant, will remove it.
pandas/core/groupby/generic.py
Outdated
) | ||
results.append(result) | ||
col_names.extend([(col, name) for col in result.columns]) | ||
output = concat(results, axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output = concat(results, axis=1) | |
output = concat(results, ignore_index=True, axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the concatenation of results using concat
with axis=1
align the columns correctly without needing to reset the index. The original index of the DataFrame will be preserved right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The resulting .index
should be preserved, but the since you're overwriting the .columns
2 lines down, we don't need concat
to come up with the merged .column
result and do less work by just returning a RangeIndex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, makes sense, understood.
pandas/core/groupby/generic.py
Outdated
results.append(result) | ||
col_names.extend([(col, name) for col in result.columns]) | ||
output = concat(results, axis=1) | ||
output.columns = MultiIndex.from_tuples(col_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use from_arrays
or use the MultiIndex
constructor instead? from_tuples
is slow and needs to do inference on the data types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to know, will go with from_arrays
.
pandas/core/groupby/generic.py
Outdated
col_names.extend([(col, name) for col in result.columns]) | ||
output = concat(results, axis=1) | ||
output.columns = MultiIndex.from_tuples(col_names) | ||
output.sort_index(axis=1, level=[0], sort_remaining=False, inplace=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you avoid inplace=False
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, i will just return a new sorted DataFrame.
e63be2e
to
e340efb
Compare
pandas/core/groupby/generic.py
Outdated
transformed_func = { | ||
name: aggfunc | ||
for name, aggfunc in kwargs.items() | ||
if not isinstance(aggfunc[1], DataFrame) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what case is this a DataFrame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, i guess aggfunc
should always be a callable or a string, so that check is not needed.
pandas/core/groupby/generic.py
Outdated
for name, aggfunc in kwargs.items() | ||
if not isinstance(aggfunc[1], DataFrame) | ||
} | ||
kwargs = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can just not pass kwargs below.
pandas/core/groupby/generic.py
Outdated
if isinstance(func, dict): | ||
for name, named_agg in func.items(): | ||
column_name = named_agg.column | ||
agg_func = named_agg.aggfunc | ||
result = self._transform_single_column( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also support, e.g. .transform({"a": "sum", "b": "mean"})
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify @rhshadrach
that would do the same thing as .transform(["sum", "min"])
?
but with columns names "a" and "b" instead of "sum" and "min"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, .transform(["sum", "min"])
would also act on other columns if there are any; .transform({"a": "sum", "b": "mean"})
will only act on columns a
and b
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas/core/groupby/generic.py
Outdated
col_names = [] | ||
columns = [com.get_callable_name(f) or f for f in func] | ||
func_pairs = zip(columns, func) | ||
for idx, (name, func_item) in enumerate(func_pairs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idx is unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, actually, i will remove it
pandas/core/groupby/generic.py
Outdated
output = concat(results, ignore_index=True, axis=1) | ||
arrays = [list(x) for x in zip(*col_names)] | ||
output.columns = MultiIndex.from_arrays(arrays) | ||
output = output.sort_index(axis=1, level=[0], sort_remaining=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want the order to reflect that of the input. Why sort?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess not actually needed, will remove that logic.
pandas/core/groupby/generic.py
Outdated
data = self._obj_with_exclusions[column_name] | ||
groupings = self._grouper.groupings | ||
result = data.groupby(groupings).transform( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use self._gotitem
instead of recreating the groupby object. Also, I think this function will return a DataFrame when there are duplicate column names. Can you add tests for the duplicate column name cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will go with self._gotitem
then instead and added the additional test.
e340efb
to
f2abbdf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abeltavares - force pushing to this branch makes GitHub lose track of the changes that have been reviewed, so the reviewer has to review everything again. Please don't force push if at all possible.
5d02596
to
78b6e90
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two other cases we will need to handle to achieve feature parity with the agg
equivalent:
- dict of lists, e.g.
{"a": ["sum", "min"], "b": "max"}
- SeriesGroupBy.transform with a list
While it'd be great to implement here, I'd be okay with raising a NotImplementedError
for these cases - but we should have a clear error message that these are intended to be implemented in the future.
keys_list = list(self.keys) if isinstance(self.keys, list) else [self.keys] | ||
for column in self.obj.columns: | ||
if column in keys_list: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iterate over self._obj_with_exclusions
instead. Then you don't need keys_list
.
engine_kwargs: dict | None = None, | ||
**kwargs, | ||
) -> Series: | ||
data = self._gotitem(column_name, ndim=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail if the input has duplicate column names. I would be okay with not supporting duplicate columns here (and raising a clear error message), but this would need further support from other maintainers. The other option is to make this work on duplicate column names.
def test_transform_with_list_like(): | ||
df = DataFrame({"col": list("aab"), "val": range(3), "another": range(3)}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add references to each test, e.g.
def test_transform_with_list_like():
# GH#58318
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.