`DataFrame.groupby` should return `_DataFrameGroupByScalar` when grouping by a PeriodIndex #674

Daverball · 2023-05-02T13:58:01Z

Describe the bug
When performing a groupby on a dataframe with a PeriodIndex the group identifier should be a Period, i.e. a scalar (On that note Period appears to be missing from the definition of Scalar in _typing.pyi)

To Reproduce

import pandas as pd
df = pd.DataFrame({'date': pd.date_range('2020-01-01', '2020-12-31'), 'days': 1})
index = pd.PeriodIndex(df.date, freq='M')
for period, group in df.groupby(index):
    period.start_time  # mypy: attr-defined — Tuple[Any, ...] has no attribute "start_time"

OS: Linux (OpenSUSE Tumbleweed)
Python 3.10
mypy 1.0.1
pandas-stubs 2.0.1.230501

Additional comments
This should probably extend to most other specialized Index types apart from the base Index and MultiIndex, for the latter it's obvious that it would be non-scalar, for the former it would depend on what Python object is referenced, since converting MultiIndex to and from an Index on a builtins.tuple is quite common, it's probably the right move to treat Index as non-scalar by default or return an Any variant of DataFrameGroupBy that could either be scalar or non-scalar.

The text was updated successfully, but these errors were encountered:

Dr-Irv · 2023-05-02T14:27:36Z

I'm not sure this can be done without a lot of work to make the internal _DataFrameGroupByNonScalar a Generic class with a parameter that corresponds to the specific type of object used in the by argument of groupby . For example, you'd have the same problem if you used df.groupby(list(index)) in your example. That would require fixing __iter__() for Index or us getting Generic index working.

The main issue here is that there are so many different types that can be passed to groupby() that inferring the resulting type of the iterator on the DataFrameGroupBy object is difficult in a static typing context.

I'll keep this open in case anyone wants to give this a try, but I think it will be very difficult to get this to work.

For now, you'd have to do a cast on period in your loop.

We could consider adding Period to Scalar, although that might make Scalar too wide for some other places where it is used. And then change

pandas-stubs/pandas-stubs/core/groupby/generic.pyi

Line 153 in 6c0eed6

def __iter__(self) -> Iterator[tuple[tuple, DataFrame]]: ...

to return Iterator[tuple[tuple | Scalar, DataFrame]] instead, or maybe it's more safe to not add Period to Scalar and use Iterator[tuple[tuple | Scalar | Period, DataFrame]] but that might break some existing type checking in other people's codes (including some of my staff...)

Dr-Irv · 2023-05-02T15:21:35Z

One other note - right now we have set up groupby() to recognize the most commonly used arguments for by, namely a single column name, or a sequence of column names. Differentiating between the case of having a list of values as the by argument versus a list of column names wouldn't be possible in a static typing context. Consider this example:

>>> import pandas as pd
>>> df = pd.DataFrame({0: range(2), 1: [i*i for i in range(2)]})
>>> df
   0  1
0  0  0
1  1  1
>>> for a,b in df.groupby([0, 1]):
...     print(a, b)
...
(0, 0)    0  1
0  0  0
(1, 1)    0  1
1  1  1
>>> df.columns=["x", "y"]
>>> for a,b in df.groupby([0, 1]):
...     print(a, b)
...
0    x  y
0  0  0
1    x  y
1  1  1

Just changing the type of the column names changes the type of the result of iterating on the groupby() object. We can't detect that difference in static typing.

Daverball · 2023-05-02T15:44:12Z

The quickest and least intrusive solution to me seems to be to extend the scalar overload on DataFrame.groupby with the index types where we know for a fact that the first element in the tuple yielded through __iter__ is a Scalar, which should be at least DatetimeIndex and TimedeltaIndex and possibly RangeIndex, IntervalIndex and CategoricalIndex.

And then also add one new overload to handle PeriodIndex/Series[Period] and create a _DataFrameGroupByPeriod which overwrites __iter__ with Iterator[tuple[Period, DataFrame]]. So it doesn't change semantics of groupby/Scalar for existing code.

This would not be a complete solution, but would at least improve the situation a little bit.

Dr-Irv · 2023-05-02T15:48:39Z

This would not be a complete solution, but would at least improve the situation a little bit.

That may work. As long as it doesn't break existing tests, I'm open to a PR with this proposed change (while adding appropriate tests).

Daverball · 2023-05-02T16:24:51Z

I'm happy to take a stab at it. I should have some time tomorrow.

Daverball · 2023-05-03T10:14:54Z

It looks like the inheritance pattern of _DataFrameGroupByScalar and _DataFrameGroupByNonScalar causes issues, since mypy will consider the overlapping overloads unsafe if the return types are not compatible.

One possible way to solve this would be to make __iter__ on _DataFrameGroupByNonScalar the default implementation (DataFrameGroupBy should probably have a default implementation for this method anyways) and derive the Scalar and Period case from DataFrameGroupBy, while making the non-Scalar case a pure TypeAlias of DataFrameGroupBy.

This will however mean that we will have to type: ignore[override] on the implementations of __iter__ in the Scalar and Period case. Whether we like it or not, any solution will inherently not be type safe, due to how dynamic and magical the implementation is, so the best we can do is be somewhat type safe for the common case. So ignoring the type checker here seems fine to me. Although I suppose we could also just ignore the overlapping overloads error, without changing the inheritance structure in DataFrameGroupBy, which you seem to already be doing in some places.

Daverball · 2023-05-03T11:42:20Z

I opted for ignoring the unsafe overlapping overload error for now, since it means fewer code changes.

Dr-Irv added Groupby Period Period data type Generic Index Issues where a Generic Index would help labels May 2, 2023

Daverball mentioned this issue May 3, 2023

Extends DataFrame.groupby overloads to recognize some scalar index types #679

Merged

2 tasks

Dr-Irv closed this as completed in #679 May 6, 2023

Daverball mentioned this issue May 10, 2023

Refactors Series.groupby to match DataFrame.groupby #690

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`DataFrame.groupby` should return `_DataFrameGroupByScalar` when grouping by a PeriodIndex #674

`DataFrame.groupby` should return `_DataFrameGroupByScalar` when grouping by a PeriodIndex #674

Daverball commented May 2, 2023 •

edited

Loading

Dr-Irv commented May 2, 2023

Dr-Irv commented May 2, 2023

Daverball commented May 2, 2023 •

edited

Loading

Dr-Irv commented May 2, 2023

Daverball commented May 2, 2023

Daverball commented May 3, 2023 •

edited

Loading

Daverball commented May 3, 2023

DataFrame.groupby should return _DataFrameGroupByScalar when grouping by a PeriodIndex #674

DataFrame.groupby should return _DataFrameGroupByScalar when grouping by a PeriodIndex #674

Comments

Daverball commented May 2, 2023 • edited Loading

Dr-Irv commented May 2, 2023

Dr-Irv commented May 2, 2023

Daverball commented May 2, 2023 • edited Loading

Dr-Irv commented May 2, 2023

Daverball commented May 2, 2023

Daverball commented May 3, 2023 • edited Loading

Daverball commented May 3, 2023

`DataFrame.groupby` should return `_DataFrameGroupByScalar` when grouping by a PeriodIndex #674

`DataFrame.groupby` should return `_DataFrameGroupByScalar` when grouping by a PeriodIndex #674

Daverball commented May 2, 2023 •

edited

Loading

Daverball commented May 2, 2023 •

edited

Loading

Daverball commented May 3, 2023 •

edited

Loading