Skip to content

TYP: Groupby sum|prod|min|max|first|last methods #32302

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Feb 27, 2020

This makes two things:

  • makes mypy know the signatures of the agg methods (sum, prod, etc.) on Groupby
  • gives the agg methods on Groupby objects a return type

This ensues that mypy and IDE's always know that the return type of e.g. df.groupby('a').sum() is a DataFrame, which is nice when chaining.

@topper-123 topper-123 changed the title CLN/TYP: Groupby agg methods TYP: Groupby agg methods Feb 27, 2020
@topper-123 topper-123 changed the title TYP: Groupby agg methods TYP: Groupby sum|prod|min|max|first|last methods Feb 27, 2020
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On board with the concept; a few things

@@ -789,6 +811,36 @@ def count(self) -> Series:
)
return self._reindex_output(result, fill_value=0)

@Substitution(f="sum", no=True, mc=0, return_type="Series")
@Appender(_agg_template)
def sum(self, numeric_only=True, min_count=0) -> Series:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you annotate numeric_only and min_count on these as well?

@WillAyd WillAyd added Groupby Typing type annotations, mypy/pyright type checking labels Feb 27, 2020
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jbrockmendel
Copy link
Member

there's got to be a less verbose way of doing this, isnt there?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you not generating these like we do in genetic for the stat functions ?

@topper-123
Copy link
Contributor Author

why are you not generating these like we do in genetic for the stat functions ?

Doing e.g. reveal_type(df.var)on master gives a revealed type of Any. I don't think it's possible to programatically add methods and also have mypy infer their types, parameters or return values. Mypy isn't advance enough to do that.

It's true that this is a bit repetitive, but always with slight variations. Getting correct parameter and return types is something I value highly in a library as complex as Pandas, so IMO this is worth it.

@jbrockmendel
Copy link
Member

I don't think it's possible to programatically add methods and also have mypy infer their types, parameters or return values. Mypy isn't advance enough to do that.

cc @simonjayhawkins is this likely to improve at any point? would something like stub files help?

I find this pattern annoying at debug-time, when it is an extra layer of calls to look through (since I have to double-check whether the super().foo(...) call is changing any kwargs).

(core.window.expanding is another place where we do something like this, but in that case for docstrings)

@WillAyd
Copy link
Member

WillAyd commented Feb 29, 2020

FWIW I don't find stepping through the functions as is necessarily any clearer than inheriting as done here. This is more verbose, but I don't think factory-generating these really adds all that much value and I do think would be nice to support annotations on these

@doc(_agg_template, fname="last", no=False, mc=-1, return_type="DataFrame")
def last(self, numeric_only: bool = False, min_count: int = -1) -> DataFrame:
return super().last(numeric_only=numeric_only, min_count=min_count)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are duplicates from above except the return values. Can you use Generic instead? The exception would be for unique docstrings for the Series/Frame methods. Does something need to be added to make these docstrings visible in the documentation.

Copy link
Contributor Author

@topper-123 topper-123 Mar 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc string parameter return_type is also different, though. I could change these into e.g.:

@doc(_agg_template, fname="last", no=False, mc=-1, return_type="DataFrame")
def last(self, numeric_only: bool = False, min_count: int = -1) -> DataFrame:
        last_compat = partial(self._get_loc, loc=-1)

        return self._agg_general(
            numeric_only=numeric_only,
            min_count=min_count,
            alias="last",
            npfunc=last_compat,
        )

on both SeriesGroupby and DataFrameGroupby (setting 'Dataframe' and 'Series' the relevant places). That would be repetitipus also, though, but it would avoid setting methods on the NDFrame class...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry if I wasn't clear, in the docs I see https://pandas.io/docs/reference/api/pandas.core.groupby.GroupBy.min.html but nothing for DataFrameGroupBy.min, whereas for most other methods there is a DataFrameGroupBy.

So my question is, how to make these visible, otherwise if the generic docstring is used, this duplication is unnecessary and GroupBy could be Generic to give the correct return type. see https://github.com/pandas-dev/pandas/pull/29480/files#r347723361 where the idea was rejected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see your point now, it's the issue of the doc reference.

I don't have a good solution to that. I've added doc strings to GroupBy methods and copied them over to SeriesGroupby and DataFrameGroupby. SeriesGroupby and DataFrameGroupby need methods themselves in order to get the correct type hint. feels repetitive, but I don't think we can make it better without dropping the type hint.

@topper-123 topper-123 force-pushed the groupby_agg_methods branch 4 times, most recently from 5247c92 to d47c819 Compare March 2, 2020 23:06
@simonjayhawkins
Copy link
Member

there's got to be a less verbose way of doing this, isnt there?

@topper-123 commit making _Groupby a generic class. dccbd00

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just add all of the functions here and avoid changing agg_general at all, where the changes you are attempting are non-trivial (do a pre-cursor PR).

ok conceptually with adding these methods to get better typing.

)

@staticmethod
def _get_loc(x, axis: int = 0, *, loc: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what?

a) don't use staticmethod
b) what the heck is this for

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be more ok with this if its not a nested function like this. I am also -1 on any use of staticmethod.

numeric_only: bool = True,
min_count: int = -1,
):
def get_loc_notna(x, loc: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you split this pr up

  1. a PR to refactor groupby_function so its clear what is going on
  2. the typing of functions

@WillAyd
Copy link
Member

WillAyd commented Mar 27, 2020

@topper-123 still active? I think this is a nice change and helpful when we think about bumping our mypy pin, so would love if there's an easy way to address comments

@topper-123
Copy link
Contributor Author

I've been a bit busy lately. I can see if I can make something work over the coming days.

@jreback
Copy link
Contributor

jreback commented Apr 7, 2020

can you merge master ping on green.

@topper-123 topper-123 force-pushed the groupby_agg_methods branch from d47c819 to 292896f Compare May 8, 2020 10:56
@topper-123
Copy link
Contributor Author

I've updated the PR and removed the return types, so this is only refactoring _add_numeric_operations and make mypy discover the methods.

So, this does:

>>> meth = df.groupby('x').sum
>>> reveal_type(meth)
Any  # master
Revealed type is 'def (numeric_only: builtins.bool =, min_count: builtins.int =) -> Any'  # this PR

The Travis failure looks unrelated.

@topper-123 topper-123 force-pushed the groupby_agg_methods branch from 292896f to 14b071b Compare May 9, 2020 17:55
)

@staticmethod
def _get_loc(x, axis: int = 0, *, loc: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be more ok with this if its not a nested function like this. I am also -1 on any use of staticmethod.

@topper-123 topper-123 force-pushed the groupby_agg_methods branch from 14b071b to 34fd4b8 Compare May 15, 2020 05:29
@topper-123
Copy link
Contributor Author

I've made a new commit that simplifies the PR wrt. the static method.

@topper-123 topper-123 force-pushed the groupby_agg_methods branch from 34fd4b8 to 279a89e Compare May 15, 2020 06:04
@topper-123 topper-123 closed this May 15, 2020
@topper-123
Copy link
Contributor Author

Superceded by #34200.

@topper-123 topper-123 deleted the groupby_agg_methods branch May 15, 2020 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants