Skip to content

CLN: type annotations in groupby.grouper, groupby.ops #29456

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Nov 13, 2019

Conversation

jbrockmendel
Copy link
Member

@simonjayhawkins mypy is still giving a couple of complaints I could use your help sorting out:

pandas/core/groupby/ops.py:791: error: Signature of "groupings" incompatible with supertype "BaseGrouper"
pandas/core/groupby/ops.py:872: error: Argument 1 of "_chop" is incompatible with supertype "DataSplitter"; supertype defines the argument type as "NDFrame"
pandas/core/groupby/ops.py:884: error: Argument 1 of "_chop" is incompatible with supertype "DataSplitter"; supertype defines the argument type as "NDFrame"

For the groupings complaint, AFAICT the attribute has the same annotation, but in the subclass its a property instead of defined in __init__. For the other two, I annotated an argument with NDFrame in the base class and overrode with Series and DataFrame in the subclasses. What is the preferred idiom for this pattern?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good some comments

@@ -89,7 +89,7 @@ def __init__(

self._filter_empty_groups = self.compressed = len(groupings) != 1
self.axis = axis
self.groupings = groupings # type: Sequence[grouper.Grouping]
self.groupings = list(groupings) # type: List[grouper.Grouping]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this necessary? I think Sequence should have been allowable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an attempt to troubleshoot a (still-failing, mentioned in OP) complaint about a mismatch in the groupings annotations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonjayhawkins any thoughts on why mypy is still complaining about groupings?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right about the property and the attribute.

diff --git a/pandas/core/groupby/ops.py b/pandas/core/groupby/ops.py
index cbe012012..46a427434 100644
--- a/pandas/core/groupby/ops.py
+++ b/pandas/core/groupby/ops.py
@@ -90,12 +90,16 @@ class BaseGrouper:
 
         self._filter_empty_groups = self.compressed = len(groupings) != 1
         self.axis = axis
-        self.groupings = list(groupings)  # type: List[grouper.Grouping]
+        self._groupings = list(groupings)  # type: List[grouper.Grouping]
         self.sort = sort
         self.group_keys = group_keys
         self.mutated = mutated
         self.indexer = indexer
 
+    @property
+    def groupings(self):
+        return self._groupings
+
     @property
     def shape(self):
         return tuple(ping.ngroups for ping in self.groupings)

silences mypy if you're happy with this approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems to work, updated

@gfyoung gfyoung added Groupby Typing type annotations, mypy/pyright type checking labels Nov 9, 2019
@jbrockmendel
Copy link
Member Author

Any issues remaining here?

sort : bool, default False
whether the resulting grouper should be sorted
"""
assert obj is not None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for my own benefit in trying to reason about this code.

mutated = self.mutated
splitter = self._get_splitter(data, axis=axis)
group_keys = self._get_group_keys()
result_values = None

sdata = splitter._get_sorted_data()
sdata = splitter._get_sorted_data() # type: FrameOrSeries
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed?

shouldn't need to add a type annotation here. maybe the return type of _get_sorted_data needs to be added.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_get_sorted_data return type is annotated, but mypy complains without this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can update to py3.6 syntax in a followon

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer needed after e6c5f5a

return self.data.take(self.sort_idx, axis=self.axis)

def _chop(self, sdata, slice_obj: slice):
def _chop(self, sdata, slice_obj: slice) -> NDFrame:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is NDFrame used? is _chop not generic? should DataSplitter be a generic class?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont understand the question. Is "generic class" meaningfully different from "base class"? NDFrame is used because one subclass returns Series and the other returns DataFrame

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataSplitter.__init__ accepts FrameOrSeries. do we need to persist this type thoughout the class. i.e. make DataSplitter a generic class. see https://mypy.readthedocs.io/en/latest/generics.html#defining-generic-classes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so looking at the definition of _chop in the derived classes, i'm guessing this abstractmethod should be typed as

 def _chop(self, sdata: FrameOrSeries, slice_obj: slice) -> FrameOrSeries:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using FrameOrSeries here produces complaints:

pandas/core/groupby/ops.py:879: error: Argument 1 of "_chop" is incompatible with supertype "DataSplitter"; supertype defines the argument type as "FrameOrSeries"
pandas/core/groupby/ops.py:879: error: Return type "Series" of "_chop" incompatible with return type "FrameOrSeries" in supertype "DataSplitter"
pandas/core/groupby/ops.py:891: error: Argument 1 of "_chop" is incompatible with supertype "DataSplitter"; supertype defines the argument type as "FrameOrSeries"
pandas/core/groupby/ops.py:891: error: Return type "DataFrame" of "_chop" incompatible with return type "FrameOrSeries" in supertype "DataSplitter"

I'm getting close to saying "screw it" when dealing with this type of error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy won't be looking at the derived classes when it performs type checking. it'll be looking at the type hints on the base class when it checks other methods in the base class.

the abstractmethod should be generic since that is how the derived classes are typed Series -> Series and DataFrame -> DataFrame.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are mixing a few different paradigms here. The subclasses should probably be annotated with the type respective to the class, rather than using the TypeVar, i.e. you would never parametrize a SeriesSplitter with a DataFrame - it exclusively deals with Series objects

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you would never parametrize a SeriesSplitter with a DataFrame - it exclusively deals with Series objects

correct. but if a method of the base class is not overridden then the Series type in the derived class will become an NDFrame type after calling that method in the base class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are mixing a few different paradigms here.

There are some annotations in this PR that make it easier to reason about this code while reading it. The annotations in this sub-thread are not among them, so I do not particularly care about them. Let's focus for now on a minimal change needed to get this merged, as there are more bugfix PRs waiting in the wings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e6c5f5a fixes this.

@simonjayhawkins
Copy link
Member

Any issues remaining here?

I think should use py3.6 syntax for variable annotations going forward and would also prefer to not to see changes to .format but use f-strings instead.

@jbrockmendel
Copy link
Member Author

I think should use py3.6 syntax for variable annotations going forward and would also prefer to not to see changes to .format but use f-strings instead.

3.6 annotations ill try to get in the habit of. f-strings are going to take some getting used to

@simonjayhawkins
Copy link
Member

f-strings are going to take some getting used to

just thinking it'll reduce churn as they are likely to be updated anyway. so why makes changes twice?

@jbrockmendel
Copy link
Member Author

just thinking it'll reduce churn as they are likely to be updated anyway. so why makes changes twice?

No reason at all. That doesn't change the whole "I'm an old man and change will take some getting used to" thing.

@jbrockmendel
Copy link
Member Author

rebased+green

@jreback jreback added this to the 1.0 milestone Nov 12, 2019
@jreback
Copy link
Contributor

jreback commented Nov 12, 2019

lgtm (ex one comment), @WillAyd

@WillAyd WillAyd merged commit 4b3027f into pandas-dev:master Nov 13, 2019
@WillAyd
Copy link
Member

WillAyd commented Nov 13, 2019

Thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the cln-gb branch November 13, 2019 00:58
Reksbril pushed a commit to Reksbril/pandas that referenced this pull request Nov 18, 2019
* Annotate groupby.ops

* annotations, needs debugging

* whitespace

* types

* circular import

* fix msot mypy complaints

* fix mypy groupings

* merge cleanup
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
* Annotate groupby.ops

* annotations, needs debugging

* whitespace

* types

* circular import

* fix msot mypy complaints

* fix mypy groupings

* merge cleanup
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
* Annotate groupby.ops

* annotations, needs debugging

* whitespace

* types

* circular import

* fix msot mypy complaints

* fix mypy groupings

* merge cleanup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants