Skip to content

Allow covariance in the agg dict passed to DataFrame or Series groupby.agg() #363

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 13, 2022

Conversation

gandhis1
Copy link
Contributor

@gandhis1 gandhis1 commented Oct 6, 2022

@gandhis1 gandhis1 marked this pull request as ready for review October 6, 2022 04:05
@@ -655,21 +660,19 @@ def test_types_groupby_agg() -> None:
assert_type(df.groupby("col1").agg(["min", "max"]), pd.DataFrame), pd.DataFrame
)
check(assert_type(df.groupby("col1").agg([min, max]), pd.DataFrame), pd.DataFrame)
agg_dict1: dict[Hashable, str] = {"col2": "min", "col3": "max", 0: "sum"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it also work when the type is dict[str, str]? Might need to change the keys to HashableT (or simply Any).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

I mentioned this in my other PR, but it wasn't clear to me what the consensus type was for a column or index label. I see Hashable in some places, and Scalar in others.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large unions such as Scalar as a return value don't provide that much benefit - if you require a specific return type, you will need to cast it. When using the Hashable protocol, you also need to cast it but it feels a bit cleaner to me (especially because it also contains covers some more unusually types that are not yet added to Scalar).

As an input argument, it could make sense to use Scalar, if it helps with overlapping overloads. Using Hashable + hoping that type checkers continue to pick the first overload + ignoring the overlapping overload, might actually be an okay'ish solution.

check(assert_type(df.groupby("col1").agg(agg_dict1), pd.DataFrame), pd.DataFrame)
agg_dict2 = {"col2": min, "col3": max, 0: min}
check(assert_type(df.groupby("col1").agg(agg_dict2), pd.DataFrame), pd.DataFrame)
# Here, MyPy infers dict[object, object], so it must be explicitly annotated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could do:

def func(x):
    return x.min()

and then use func in place of lambda x: x.min() below, and then you may not need to type agg_dict3

I'm of the position that lambda funcs should be replaced with defined functions for the purpose of static typing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't seem to make a difference:

tests/test_frame.py:674: error: Argument 1 has incompatible type "Dict[object, object]"; expected "Union[Union[Callable[..., Any], str, ufunc], List[Union[Callable[..., Any], str, ufunc]], Mapping[Any, Union[Union[Callable[..., Any], str, ufunc], List[Union[Callable[..., Any], str, ufunc]]]]]" [arg-type]

I changed it to a regular function, although the pattern of passing a lambda as a dict element in agg() is pretty common, I'd say.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments here:

  1. mypy isn't inferring the types of a dict correctly, even for something as simple as mydict = { "abc" : "def", 0: "xyz" }. In this case, a reveal_type() produces "builtins.dict[builtins.object, builtins.str]"
  2. With respect to lambda, there is not much we can do here. There is not a way to annotate the parameter types or return types of a lambda, independent of pandas.

You could also have done wrapped_min: Callable[[Any], Any] = lambda x: x.min() so that the type of the lambda is known. See discussion here: https://stackoverflow.com/a/33833896/1970354

AggFuncTypeDictSeries: TypeAlias = dict[Hashable, AggFuncTypeBase]
AggFuncTypeDictFrame: TypeAlias = dict[
Hashable, Union[AggFuncTypeBase, list[AggFuncTypeBase]]
AggFuncTypeDictSeries: TypeAlias = Mapping[HashableT, AggFuncTypeBase]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR but maybe for the future: It might be good to actually use a mapping that does not inherit from dict for tests that accept any mapping. Some pandas functions specifically check for dict/list/tuple, so it mightbe good to have Sequences/Mappings that do not inherit from list/tuple/dict.

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gandhis1

@Dr-Irv Dr-Irv merged commit f051cd7 into pandas-dev:main Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GroupBy agg rejects a dict argument
3 participants