REGR: fix regression in groupby aggregation with out-of-bounds datetimes #38094
pandas/core/groupby/ops.py
Outdated
except AssertionError:
    # in some cases (eg GH-36003) an internal AssertionError can be
    # raised if libreduction cannot handle this case
    pass
@jbrockmendel I know you won't like this ... but libreduction is raising the AssertionError for a reason I haven't figured out yet, and a user should not see it. Suppressing it here fixes that.
(And doing this specifically for AssertionError, not other errors, might actually be OK: an AssertionError is almost always a sign that something went wrong internally in pandas (an internal bug), not a user error that should be bubbled up.)
yeah don't like this, can you instead handle at a lower level?
Took a further look, and I think this is inherent to the way libreduction does the "sliding" series trick of replacing the underlying values.
The problem in this specific case is that the first Series it creates gets inferred as datetime64[ns] dtype (that's normal Series(..) behaviour, since the first date is not out of bounds). For the next group, however, the block's values are replaced with the object dtype array, so you end up with an invalid DatetimeBlock holding an object dtype array.
I am not sure there is a way around that, except by not using this libreduction trick.
We could maybe check that the types are equal when setting the block values:
pandas/pandas/_libs/reduction.pyx
Line 55 in 78d1498
object.__setattr__(cached_typ._mgr._block, 'values', vslider.buf)
but that might also undo the performance gain of this trick.
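A toy sketch of the failure mode described above, in plain Python rather than the actual pandas internals. `ToyBlock`, `toy_infer`, and `checked_swap` are hypothetical names, the dtype strings are only labels, and the year cutoff is a crude stand-in for real dtype inference. It shows a container built once from the first group, then having its values swapped with `object.__setattr__`, with and without the suggested dtype check.

```python
from datetime import datetime


def toy_infer(values):
    # Crude stand-in for dtype inference: years before 2262 count as in-bounds.
    return "datetime64[ns]" if all(v.year < 2262 for v in values) else "object"


class ToyBlock:
    """Hypothetical stand-in for a block: a dtype label plus its values."""

    def __init__(self, values):
        self.dtype = toy_infer(values)
        self.values = values


def checked_swap(block, new_values):
    """Swap values in place, refusing a dtype mismatch."""
    if toy_infer(new_values) != block.dtype:
        raise TypeError("new values do not match the cached block dtype")
    object.__setattr__(block, "values", new_values)


first_group = [datetime(2005, 1, 1, 10, 30, 23, 540000)]
second_group = [datetime(3005, 1, 1, 10, 30, 23, 540000)]

cached = ToyBlock(first_group)  # dtype is inferred from the first group only
object.__setattr__(cached, "values", second_group)  # unchecked swap

# The label still says datetime64[ns], but the values no longer fit it.
is_invalid = cached.dtype != toy_infer(cached.values)
```

In this toy the check costs one extra pass over the values on every swap, which is the kind of overhead the performance concern above refers to.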
what you're describing sounds similar to what motivated the libreduction edits in #35417. would porting those make a difference?
Could you try it out on that branch? Assuming I need to only look at the diff in reduction.pyx, I don't directly see how that would fix this issue, as it doesn't change how the Series' blocks' values are set within the loop?
nope, doesn't fix it.
it looks like we're getting to DatetimeBlock.array_values with self.values of array([datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)], dtype=object).
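For context on why that value stays object dtype: datetime64[ns] stores an int64 count of nanoseconds since 1970-01-01, so its range is limited. A standard-library-only sketch of the approximate bounds follows. The names `ns_upper`, `ns_lower`, and `fits` are made up for this illustration, and the bounds are truncated to microseconds because `datetime` has no nanosecond resolution.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1  # largest nanosecond offset an int64 can hold

# datetime only has microsecond resolution, so drop the last three digits.
ns_upper = EPOCH + timedelta(microseconds=INT64_MAX // 1000)
ns_lower = EPOCH - timedelta(microseconds=INT64_MAX // 1000)

# The value from the comment above lies well past the upper bound.
value = datetime(3005, 1, 1, 10, 30, 23, 540000)
fits = ns_lower <= value <= ns_upper
```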
OK, a fix that does appear to work: in libreduction when defining self.cached_typ, pass dtype=self.buf.dtype to self.typ
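The idea behind that fix can be sketched with a toy container (plain Python, not the real `reduction.pyx` code). `ToySeries` and `toy_infer` are hypothetical, and the year cutoff is only a rough stand-in for dtype inference. Passing the buffer's dtype explicitly keeps the cached object consistent with every slice, instead of guessing from the first slice alone.

```python
from datetime import datetime


def toy_infer(values):
    # Crude stand-in for dtype inference: years before 2262 count as in-bounds.
    return "datetime64[ns]" if all(v.year < 2262 for v in values) else "object"


class ToySeries:
    """Hypothetical container: takes an explicit dtype or infers one."""

    def __init__(self, values, dtype=None):
        self.values = values
        self.dtype = toy_infer(values) if dtype is None else dtype


# The whole column is object dtype because one value is out of bounds.
column = [datetime(2005, 1, 1), datetime(3005, 1, 1)]
column_dtype = toy_infer(column)

first_slice = column[:1]
inferred = ToySeries(first_slice)                      # guesses from one slice
explicit = ToySeries(first_slice, dtype=column_dtype)  # follows the buffer
```

Only `explicit` keeps a dtype label that still matches once later slices of the same column are swapped in.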
Ah, good catch. Will push that.
Thanks for the suggestion, @jbrockmendel!
@meeseeksdev backport 1.1.x
…tion with out-of-bounds datetimes
…out-of-bounds datetimes (#38123) Co-authored-by: Joris Van den Bossche <[email protected]>
Closes #36003