REGR: fix regression in groupby aggregation with out-of-bounds datetimes #38094


Merged
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.5.rst
@@ -20,6 +20,7 @@ Fixed regressions
- Fixed regression in inplace operations on :class:`Series` with ``ExtensionDtype`` with NumPy dtyped operand (:issue:`37910`)
- Fixed regression in metadata propagation for ``groupby`` iterator (:issue:`37343`)
- Fixed regression in indexing on a :class:`Series` with ``CategoricalDtype`` after unpickling (:issue:`37631`)
- Fixed regression in :meth:`DataFrame.groupby` aggregation with out-of-bounds datetime objects in an object-dtype column (:issue:`36003`)
- Fixed regression in ``df.groupby(..).rolling(..)`` with the resulting :class:`MultiIndex` when grouping by a label that is in the index (:issue:`37641`)

.. ---------------------------------------------------------------------------
4 changes: 4 additions & 0 deletions pandas/core/groupby/ops.py
@@ -639,6 +639,10 @@ def agg_series(self, obj: Series, func: F):

        try:
            return self._aggregate_series_fast(obj, func)
        except AssertionError:
            # in some cases (eg GH-36003) an internal AssertionError can be
            # raised if libreduction cannot handle this case
            pass
Member Author
@jbrockmendel I know you won't like this ... but libreduction is, for some reason I haven't yet figured out, raising the AssertionError, which a user should not see. So suppressing it here fixes the regression.

(Suppressing specifically AssertionError, and not other errors, might actually be OK: an AssertionError is almost always an indication that something went wrong internally in pandas (an internal bug), not a user error that should be bubbled up.)
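The fast-path/fallback pattern under discussion can be sketched roughly as follows (a simplified stand-in for the real `agg_series`/libreduction machinery; the helper names here are hypothetical):

```python
def aggregate_fast(values, func):
    # stand-in for the Cython fast path in libreduction; pretend it
    # hits the internal bug from GH-36003 and fails an assertion
    raise AssertionError("libreduction could not handle this case")


def aggregate_pure_python(values, func):
    # stand-in for the slower but robust pure-Python fallback path
    return func(values)


def agg_series(values, func):
    try:
        return aggregate_fast(values, func)
    except AssertionError:
        # an internal bug, not a user error: swallow it and fall
        # through to the pure-Python path instead of surfacing it
        pass
    return aggregate_pure_python(values, func)


print(agg_series([1, 2, 3], max))  # falls back and prints 3
```

The caller never sees the internal AssertionError; the aggregation simply takes the slower path.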

Contributor

yeah don't like this, can you instead handle at a lower level?

Member Author

Took a further look, and I think this is inherent to the way that libreduction does the "sliding" series trick with replacing the underlying values.

The problem with this specific case is that the first Series it creates actually gets inferred as datetime64[ns] dtype (that's normal Series(..) behaviour, since the first date is not out of bounds), but then for the next group the block's values are replaced with the object-dtype array, which means you end up with an invalid DatetimeBlock backed by an object-dtype array.

I am not sure there is a way around that, except by not using this libreduction trick.
We could maybe check that the types are equal when setting the block values:

object.__setattr__(cached_typ._mgr._block, 'values', vslider.buf)

but that might also undo the performance gain of this trick.
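The inference mismatch described above can be reproduced directly with the values from the issue (a minimal illustration of the dtype behaviour, assuming a pandas version where out-of-bounds datetimes fall back to object dtype, as the PR's test expects):

```python
import datetime

import pandas as pd

in_bounds = datetime.datetime(2005, 1, 1, 10, 30, 23, 540000)
out_of_bounds = datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)

# A Series holding only the first group's value: inference succeeds,
# so the dtype becomes datetime64 (backed by a DatetimeBlock)
first_group = pd.Series([in_bounds])

# The full column: the year-3005 value does not fit in datetime64[ns],
# so pandas keeps a plain object-dtype array instead
full_column = pd.Series([in_bounds, out_of_bounds])

print(first_group.dtype)
print(full_column.dtype)
```

Sliding the object-dtype buffer into the Series inferred from the first group is what produces the invalid DatetimeBlock.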

Member

what you're describing sounds similar to what motivated the libreduction edits in #35417. would porting those make a difference?

Member Author

Could you try it out on that branch? Assuming I only need to look at the diff in reduction.pyx, I don't directly see how that would fix this issue, as it doesn't change how the Series' block values are set within the loop?

Member

nope, doesn't fix it.

it looks like we're getting to DatetimeBlock.array_values with self.values of array([datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)], dtype=object).

Member

OK, a fix that does appear to work: in libreduction when defining self.cached_typ, pass dtype=self.buf.dtype to self.typ
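The effect of that fix can be illustrated outside libreduction (a minimal sketch; `buf` stands in for the slider's buffer, and the attribute names in the real Cython code differ):

```python
import datetime

import numpy as np
import pandas as pd

# the sliding buffer holds the object-dtype column values
buf = np.array(
    [datetime.datetime(2005, 1, 1, 10, 30, 23, 540000)], dtype=object
)

# without an explicit dtype, the constructor re-infers from the values
# and promotes the in-bounds first group to datetime64
inferred = pd.Series(buf)

# passing the buffer's own dtype pins the cached Series to object dtype,
# so swapping out-of-bounds values into the buffer later stays valid
pinned = pd.Series(buf, dtype=buf.dtype)

print(inferred.dtype, pinned.dtype)
```

With the dtype pinned to the buffer's, the cached Series and the sliding buffer can never disagree about the block type.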

Member Author

Ah, good catch. Will push that.

        except ValueError as err:
            if "Must produce aggregated value" in str(err):
                # raised in libreduction
19 changes: 19 additions & 0 deletions pandas/tests/groupby/aggregate/test_aggregate.py
@@ -1,6 +1,7 @@
"""
test .agg behavior / note that .apply is tested generally in test_groupby.py
"""
import datetime
import functools
from functools import partial

@@ -1156,3 +1157,21 @@ def test_agg_no_suffix_index():
    result = df["A"].agg(["sum", lambda x: x.sum(), lambda x: x.sum()])
    expected = Series([12, 12, 12], index=["sum", "<lambda>", "<lambda>"], name="A")
    tm.assert_series_equal(result, expected)


def test_aggregate_datetime_objects():
    # https://github.com/pandas-dev/pandas/issues/36003
    # ensure we don't raise an error but keep object dtype for out-of-bounds
    # datetimes
    df = DataFrame(
        {
            "A": ["X", "Y"],
            "B": [
                datetime.datetime(2005, 1, 1, 10, 30, 23, 540000),
                datetime.datetime(3005, 1, 1, 10, 30, 23, 540000),
            ],
        }
    )
    result = df.groupby("A").B.max()
    expected = df.set_index("A")["B"]
    tm.assert_series_equal(result, expected)