Skip to content

GroupBy agg rejects a dict argument #361

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gandhis1 opened this issue Oct 5, 2022 · 5 comments · Fixed by #363
Closed

GroupBy agg rejects a dict argument #361

gandhis1 opened this issue Oct 5, 2022 · 5 comments · Fixed by #363

Comments

@gandhis1
Copy link
Contributor

gandhis1 commented Oct 5, 2022

I'll start with the code snippet and the errors:

    import pandas as pd
    df = pd.DataFrame(data=[range(10), ['a'] * 5 + ['b'] * 5]).T
    df.columns = ["value", "category"]
    agg_dict = {col: "sum" for col in df.columns}
    df.groupby("category").agg(agg_dict)

tests/test_pandas.py:324: error: Incompatible types in assignment (expression has type "List[str]", variable has type "Index") [assignment]
tests/test_pandas.py:326: error: Argument 1 has incompatible type "Dict[Any, str]"; expected "Union[Union[Callable[..., Any], str, ufunc], List[Union[Callable[..., Any], str, ufunc]], Dict[Hashable, Union[Union[Callable[..., Any], str, ufunc], List[Union[Callable[..., Any], str, ufunc]]]]]" [arg-type]

So there are a several different things happening here:

  • For starters, I am unable to assign a List[str] to df.columns. Not really related to the issue in the title, but might as well fix this at the same time.
  • When I create agg_dict, I would expect the type to be Dict[Hashable, str], since in a lot of places we are treating the column data type as Hashable. This may be dependent on the broader issue of an Index needing to be a generic type.
  • Even if I override the type of agg_dict to Dict[Hashable, str] this still is not accepted by agg(). There may be a covariance thing at play here, it may need to be Mapping instead of Dict, but I haven't investigated this too closely yet.

There's enough things going on here that I didn't want to just start changing things around in the annotations before I could pinpoint exactly what we should be changing.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Oct 6, 2022

I think the fix here is to do the following in _typing.pyi:
Currently:

AggFuncTypeDictSeries: TypeAlias = dict[Hashable, AggFuncTypeBase]
AggFuncTypeDictFrame: TypeAlias = dict[
    Hashable, Union[AggFuncTypeBase, list[AggFuncTypeBase]]
]

Change Hashable to HashableT, and I think this will fix it.

@gandhis1
Copy link
Contributor Author

gandhis1 commented Oct 6, 2022

That didn't seem to do it - it actually appears to be my third bullet point above - Dict is invariant, so you cannot pass any Hashable, while Mapping, which is what I changed it to, is covariant, so any sub-type of Hashable will suffice. PR coming

@bashtage
Copy link
Contributor

bashtage commented Oct 6, 2022

In general, the issue with using Mapping is whether any Mapping will actually work. The alternative is to just remove the types from dict[Hashable,str] so it becomes dict.

@gandhis1
Copy link
Contributor Author

gandhis1 commented Oct 6, 2022

I don't think an untyped dict is necessary here and comes with the disadvantage of basically losing the type checking. Is it possible that someone passes a Mapping that isn't supported? I'd have to look at the code, but as long as pandas is just looping through keys, and not invoking dictionary-specific methods like .keys() or .values(), the code would likely work with any Mapping. And plus, practically speaking, not sure it is necessary to protect against something that probably isn't going to happen. If we do, we can always create a new Protocol that is covariant but sub-classes from Mapping or something like that. So TLDR - I think it's more important to have logical types for the case that is actually likely to happen, and proper type checking for that case, even if it opens up the possibility of a false negative.

The part that I think is actually worth exploring more is enumerating all of the string literals from here: https://github.com/pandas-dev/pandas/blob/main/pandas/core/groupby/base.py

@Dr-Irv - Re HashableT - turns out it was necessary, but in conjunction with a covariant type like Mapping

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Oct 6, 2022

The part that I think is actually worth exploring more is enumerating all of the string literals from here: https://github.com/pandas-dev/pandas/blob/main/pandas/core/groupby/base.py

That would be a good idea to restrict the strings.

@Dr-Irv - Re HashableT - turns out it was necessary, but in conjunction with a covariant type like Mapping

OK, thanks for checking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants