REGR: fix regression in groupby aggregation with out-of-bounds datetimes #38094
Merged: jorisvandenbossche merged 3 commits into pandas-dev:master from jorisvandenbossche:regr-groupby-datetime-objects on Nov 27, 2020
@jbrockmendel I know you won't like this ... but for some reason I haven't figured out yet, libreduction is raising the AssertionError, which a user should not see. Suppressing it here fixes that.
(And specifically for AssertionError, not other errors, that might actually be OK, since an AssertionError is almost always an indication that something went wrong internally in pandas (an internal bug), not a user error that should be bubbled up.)
yeah don't like this, can you instead handle at a lower level?
Took a further look, and I think this is inherent to the way libreduction does the "sliding" Series trick of replacing the underlying values.
The problem in this specific case is that the first Series it creates gets inferred as datetime64[ns] dtype (that's normal `Series(..)` behaviour, since the first date is not out of bounds). But then for the next group, the block's values are replaced with the object-dtype array, which means you end up with an invalid DatetimeBlock holding an object-dtype array.
I am not sure there is a way around that, except by not using this libreduction trick.
We could maybe check that the types are equal when setting the block values (pandas/pandas/_libs/reduction.pyx, line 55 in 78d1498), but that might also undo the performance gain of this trick.
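To make the "out of bounds" part concrete, here is a minimal stdlib-only sketch of the range a nanosecond-resolution int64 timestamp can represent. The helper name `fits_in_datetime64_ns` and the in-bounds example date are made up for illustration and are not pandas API; the bounds are truncated to microseconds because `datetime.datetime` has no nanosecond field.

```python
import datetime

# datetime64[ns] stores a signed 64-bit count of nanoseconds since the epoch.
EPOCH = datetime.datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1

# Largest offset from the epoch, truncated to whole microseconds.
_SPAN = datetime.timedelta(microseconds=INT64_MAX // 1000)
NS_MIN = EPOCH - _SPAN
NS_MAX = EPOCH + _SPAN


def fits_in_datetime64_ns(value):
    """Hypothetical helper: True if `value` is representable at ns resolution."""
    return NS_MIN <= value <= NS_MAX


# Illustrative in-bounds date (an assumption, not taken from the thread).
in_bounds = datetime.datetime(2005, 1, 1, 10, 30, 23, 540000)
# The out-of-bounds value that shows up later in this thread.
out_of_bounds = datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)

print(NS_MAX)
print(fits_in_datetime64_ns(in_bounds), fits_in_datetime64_ns(out_of_bounds))
```

A group whose dates all fit can be inferred as datetime64[ns], while a group containing the year-3005 value has to stay object dtype, which is the mismatch described above.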
what you're describing sounds similar to what motivated the libreduction edits in #35417. would porting those make a difference?
Could you try it out on that branch? Assuming I need to only look at the diff in reduction.pyx, I don't directly see how that would fix this issue, as it doesn't change how the Series' blocks' values are set within the loop?
nope, doesn't fix it.
it looks like we're getting to `DatetimeBlock.array_values` with `self.values` of `array([datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)], dtype=object)`.
OK, a fix that does appear to work: in libreduction, when defining `self.cached_typ`, pass `dtype=self.buf.dtype` to `self.typ`.
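A rough Python-level illustration of why passing the dtype helps (this is not the Cython code in reduction.pyx; `buf`, `inferred`, and `explicit` are stand-in names, and the in-bounds date is made up): constructing the Series with the buffer's own dtype skips inference, so the cached object starts out with the same object dtype as the arrays swapped in later.

```python
import datetime

import numpy as np
import pandas as pd

# Stand-in for the object-dtype buffer holding the first group's values.
# The date is in bounds, so dtype inference is free to pick a datetime64 dtype.
buf = np.array([datetime.datetime(2005, 1, 1, 10, 30, 23, 540000)], dtype=object)

# Without an explicit dtype, the constructor infers one from the values.
inferred = pd.Series(buf)

# Passing the buffer's dtype keeps the Series as object dtype, matching
# any later group that contains out-of-bounds datetimes.
explicit = pd.Series(buf, dtype=buf.dtype)

print(inferred.dtype, explicit.dtype)
```

The trade-off, as far as this sketch goes, is only that the cached Series is object dtype from the start instead of relying on inference from the first group.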
Ah, good catch. Will push that.