Skip to content

DataFrame.groupby should return _DataFrameGroupByScalar when grouping by a PeriodIndex #674

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Daverball opened this issue May 2, 2023 · 7 comments · Fixed by #679
Closed
Labels
Generic Index Issues where a Generic Index would help Groupby Period Period data type

Comments

@Daverball
Copy link
Contributor

Daverball commented May 2, 2023

Describe the bug
When performing a groupby on a dataframe with a PeriodIndex the group identifier should be a Period, i.e. a scalar (On that note Period appears to be missing from the definition of Scalar in _typing.pyi)

To Reproduce

import pandas as pd
df = pd.DataFrame({'date': pd.date_range('2020-01-01', '2020-12-31'), 'days': 1})
index = pd.PeriodIndex(df.date, freq='M')
for period, group in df.groupby(index):
    period.start_time  # mypy: attr-defined ­— Tuple[Any, ...] has no attribute "start_time"
  • OS: Linux (OpenSUSE Tumbleweed)
  • Python 3.10
  • mypy 1.0.1
  • pandas-stubs 2.0.1.230501

Additional comments
This should probably extend to most other specialized Index types apart from the base Index and MultiIndex, for the latter it's obvious that it would be non-scalar, for the former it would depend on what Python object is referenced, since converting MultiIndex to and from an Index on a builtins.tuple is quite common, it's probably the right move to treat Index as non-scalar by default or return an Any variant of DataFrameGroupBy that could either be scalar or non-scalar.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented May 2, 2023

I'm not sure this can be done without a lot of work to make the internal _DataFrameGroupByNonScalar a Generic class with a parameter that corresponds to the specific type of object used in the by argument of groupby . For example, you'd have the same problem if you used df.groupby(list(index)) in your example. That would require fixing __iter__() for Index or us getting Generic index working.

The main issue here is that there are so many different types that can be passed to groupby() that inferring the resulting type of the iterator on the DataFrameGroupBy object is difficult in a static typing context.

I'll keep this open in case anyone wants to give this a try, but I think it will be very difficult to get this to work.

For now, you'd have to do a cast on period in your loop.

We could consider adding Period to Scalar, although that might make Scalar too wide for some other places where it is used. And then change

def __iter__(self) -> Iterator[tuple[tuple, DataFrame]]: ...

to return Iterator[tuple[tuple | Scalar, DataFrame]] instead, or maybe it's more safe to not add Period to Scalar and use Iterator[tuple[tuple | Scalar | Period, DataFrame]] but that might break some existing type checking in other people's codes (including some of my staff...)

@Dr-Irv Dr-Irv added Groupby Period Period data type Generic Index Issues where a Generic Index would help labels May 2, 2023
@Dr-Irv
Copy link
Collaborator

Dr-Irv commented May 2, 2023

One other note - right now we have set up groupby() to recognize the most commonly used arguments for by, namely a single column name, or a sequence of column names. Differentiating between the case of having a list of values as the by argument versus a list of column names wouldn't be possible in a static typing context. Consider this example:

>>> import pandas as pd
>>> df = pd.DataFrame({0: range(2), 1: [i*i for i in range(2)]})
>>> df
   0  1
0  0  0
1  1  1
>>> for a,b in df.groupby([0, 1]):
...     print(a, b)
...
(0, 0)    0  1
0  0  0
(1, 1)    0  1
1  1  1
>>> df.columns=["x", "y"]
>>> for a,b in df.groupby([0, 1]):
...     print(a, b)
...
0    x  y
0  0  0
1    x  y
1  1  1

Just changing the type of the column names changes the type of the result of iterating on the groupby() object. We can't detect that difference in static typing.

@Daverball
Copy link
Contributor Author

Daverball commented May 2, 2023

The quickest and least intrusive solution to me seems to be to extend the scalar overload on DataFrame.groupby with the index types where we know for a fact that the first element in the tuple yielded through __iter__ is a Scalar, which should be at least DatetimeIndex and TimedeltaIndex and possibly RangeIndex, IntervalIndex and CategoricalIndex.

And then also add one new overload to handle PeriodIndex/Series[Period] and create a _DataFrameGroupByPeriod which overwrites __iter__ with Iterator[tuple[Period, DataFrame]]. So it doesn't change semantics of groupby/Scalar for existing code.

This would not be a complete solution, but would at least improve the situation a little bit.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented May 2, 2023

This would not be a complete solution, but would at least improve the situation a little bit.

That may work. As long as it doesn't break existing tests, I'm open to a PR with this proposed change (while adding appropriate tests).

@Daverball
Copy link
Contributor Author

I'm happy to take a stab at it. I should have some time tomorrow.

@Daverball
Copy link
Contributor Author

Daverball commented May 3, 2023

It looks like the inheritance pattern of _DataFrameGroupByScalar and _DataFrameGroupByNonScalar causes issues, since mypy will consider the overlapping overloads unsafe if the return types are not compatible.

One possible way to solve this would be to make __iter__ on _DataFrameGroupByNonScalar the default implementation (DataFrameGroupBy should probably have a default implementation for this method anyways) and derive the Scalar and Period case from DataFrameGroupBy, while making the non-Scalar case a pure TypeAlias of DataFrameGroupBy.

This will however mean that we will have to type: ignore[override] on the implementations of __iter__ in the Scalar and Period case. Whether we like it or not, any solution will inherently not be type safe, due to how dynamic and magical the implementation is, so the best we can do is be somewhat type safe for the common case. So ignoring the type checker here seems fine to me. Although I suppose we could also just ignore the overlapping overloads error, without changing the inheritance structure in DataFrameGroupBy, which you seem to already be doing in some places.

@Daverball
Copy link
Contributor Author

I opted for ignoring the unsafe overlapping overload error for now, since it means fewer code changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Generic Index Issues where a Generic Index would help Groupby Period Period data type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants