-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
TYP: Groupby sum|prod|min|max|first|last methods #32302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On board with the concept; a few things
pandas/core/groupby/generic.py
Outdated
@@ -789,6 +811,36 @@ def count(self) -> Series: | |||
) | |||
return self._reindex_output(result, fill_value=0) | |||
|
|||
@Substitution(f="sum", no=True, mc=0, return_type="Series") | |||
@Appender(_agg_template) | |||
def sum(self, numeric_only=True, min_count=0) -> Series: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you annotate numeric_only
and min_count
on these as well?
d872b8c
to
81377e9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
there's got to be a less verbose way of doing this, isnt there? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you not generating these like we do in genetic for the stat functions ?
Doing e.g. It's true that this is a bit repetitive, but always with slight variations. Getting correct parameter and return types is something I value highly in a library as complex as Pandas, so IMO this is worth it. |
cc @simonjayhawkins is this likely to improve at any point? would something like stub files help? I find this pattern annoying at debug-time, when it is an extra layer of calls to look through (since I have to double-check whether the super().foo(...) call is changing any kwargs). (core.window.expanding is another place where we do something like this, but in that case for docstrings) |
FWIW I don't find stepping through the functions as is necessarily any clearer than inheriting as done here. This is more verbose, but I don't think factory-generating these really adds all that much value and I do think would be nice to support annotations on these |
pandas/core/groupby/generic.py
Outdated
@doc(_agg_template, fname="last", no=False, mc=-1, return_type="DataFrame") | ||
def last(self, numeric_only: bool = False, min_count: int = -1) -> DataFrame: | ||
return super().last(numeric_only=numeric_only, min_count=min_count) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are duplicates from above except the return values. Can you use Generic instead? The exception would be for unique docstrings for the Series/Frame methods. Does something need to be added to make these docstrings visible in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc string parameter return_type
is also different, though. I could change these into e.g.:
@doc(_agg_template, fname="last", no=False, mc=-1, return_type="DataFrame")
def last(self, numeric_only: bool = False, min_count: int = -1) -> DataFrame:
last_compat = partial(self._get_loc, loc=-1)
return self._agg_general(
numeric_only=numeric_only,
min_count=min_count,
alias="last",
npfunc=last_compat,
)
on both SeriesGroupby and DataFrameGroupby (setting 'Dataframe' and 'Series' the relevant places). That would be repetitipus also, though, but it would avoid setting methods on the NDFrame class...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry if I wasn't clear, in the docs I see https://pandas.io/docs/reference/api/pandas.core.groupby.GroupBy.min.html but nothing for DataFrameGroupBy.min, whereas for most other methods there is a DataFrameGroupBy.
So my question is, how to make these visible, otherwise if the generic docstring is used, this duplication is unnecessary and GroupBy could be Generic to give the correct return type. see https://github.com/pandas-dev/pandas/pull/29480/files#r347723361 where the idea was rejected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see your point now, it's the issue of the doc reference.
I don't have a good solution to that. I've added doc strings to GroupBy methods and copied them over to SeriesGroupby and DataFrameGroupby. SeriesGroupby and DataFrameGroupby need methods themselves in order to get the correct type hint. feels repetitive, but I don't think we can make it better without dropping the type hint.
5247c92
to
d47c819
Compare
@topper-123 commit making _Groupby a generic class. dccbd00 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just add all of the functions here and avoid changing agg_general at all, where the changes you are attempting are non-trivial (do a pre-cursor PR).
ok conceptually with adding these methods to get better typing.
pandas/core/groupby/groupby.py
Outdated
) | ||
|
||
@staticmethod | ||
def _get_loc(x, axis: int = 0, *, loc: int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what?
a) don't use staticmethod
b) what the heck is this for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be more ok with this if its not a nested function like this. I am also -1 on any use of staticmethod.
pandas/core/groupby/groupby.py
Outdated
numeric_only: bool = True, | ||
min_count: int = -1, | ||
): | ||
def get_loc_notna(x, loc: int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you split this pr up
- a PR to refactor groupby_function so its clear what is going on
- the typing of functions
@topper-123 still active? I think this is a nice change and helpful when we think about bumping our mypy pin, so would love if there's an easy way to address comments |
I've been a bit busy lately. I can see if I can make something work over the coming days. |
can you merge master ping on green. |
d47c819
to
292896f
Compare
I've updated the PR and removed the return types, so this is only refactoring So, this does: >>> meth = df.groupby('x').sum
>>> reveal_type(meth)
Any # master
Revealed type is 'def (numeric_only: builtins.bool =, min_count: builtins.int =) -> Any' # this PR The Travis failure looks unrelated. |
292896f
to
14b071b
Compare
pandas/core/groupby/groupby.py
Outdated
) | ||
|
||
@staticmethod | ||
def _get_loc(x, axis: int = 0, *, loc: int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be more ok with this if its not a nested function like this. I am also -1 on any use of staticmethod.
14b071b
to
34fd4b8
Compare
I've made a new commit that simplifies the PR wrt. the static method. |
34fd4b8
to
279a89e
Compare
Superceded by #34200. |
This makes two things:
sum
,prod
, etc.) on GroupbyThis ensues that mypy and IDE's always know that the return type of e.g.
df.groupby('a').sum()
is a DataFrame, which is nice when chaining.