REGR: fix regression in groupby aggregation with out-of-bounds datetimes #38094

jorisvandenbossche
Member

Closes #36003

@jorisvandenbossche added the Groupby and Regression (functionality that used to work in a prior pandas version) labels Nov 26, 2020
@jorisvandenbossche added this to the 1.1.5 milestone Nov 26, 2020
except AssertionError:
    # in some cases (eg GH-36003) an internal AssertionError can be
    # raised if libreduction cannot handle this case
    pass
Member Author

@jbrockmendel I know you won't like this ... but libreduction is raising the AssertionError for a reason I haven't figured out yet, and a user should not see it. Suppressing it here fixes that.

(and specifically for AssertionError (not other errors), that might actually be OK, since AssertionErrors are almost always an indication that something internally in pandas went wrong (internal bug), and not a user error that should be bubbled up)
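
For readers skimming the thread, the pattern under discussion can be sketched roughly as below. This is a toy illustration and not the pandas code: `fast_path`, `slow_path` and `aggregate` are hypothetical names, and the sketch assumes that a slower, more general path runs once the error has been suppressed.

```python
def fast_path(values):
    """Hypothetical fast path guarded by an internal sanity check."""
    # Stand-in for an internal invariant failing inside optimized code.
    assert all(isinstance(v, int) for v in values), "internal invariant violated"
    return sum(values)


def slow_path(values):
    """Hypothetical general-purpose fallback."""
    total = 0
    for v in values:
        total += v
    return total


def aggregate(values):
    try:
        return fast_path(values)
    except AssertionError:
        # Internal failure in the fast path: hide it and fall through.
        pass
    return slow_path(values)


result_int = aggregate([1, 2, 3])      # fast path succeeds
result_mixed = aggregate([1, 2.5, 3])  # fast path asserts, fallback is used
```

The trade-off is that any genuine bug that surfaces as an `AssertionError` in the fast path is hidden as well, which is the concern raised in the review.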

Contributor

yeah don't like this, can you instead handle at a lower level?

Member Author

Took a further look, and I think this is inherent to the way that libreduction does the "sliding" series trick with replacing the underlying values.

The problem with this specific case is that the first Series it creates actually gets inferred as datetime64[ns] dtype (that's normal Series(..) behaviour, the first date is not out of bounds), but then for the next group, the block's values are replaced with the object dtype array, which means you end up with an invalid DatetimeBlock with an object dtype array.

I am not sure there is a way around that, except by not using this libreduction trick.
We could maybe check that the types are equal when setting the block values:

object.__setattr__(cached_typ._mgr._block, 'values', vslider.buf)

but that might also undo the performance gain of this trick.
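
The mechanism described above can be modelled in plain Python. This is only a toy sketch, not libreduction itself: `ToyBlock`, `infer_dtype`, `swap_checked` and `NS_MAX_YEAR` are hypothetical names, and the year cutoff is a rough stand-in for the datetime64[ns] upper bound.

```python
from datetime import datetime

# Hypothetical stand-ins; none of these names exist in pandas.
NS_MAX_YEAR = 2262  # datetime64[ns] cannot represent dates much past this year


def infer_dtype(values):
    """Mimic constructor inference: in-bounds datetimes become datetime64[ns]."""
    if all(isinstance(v, datetime) and v.year < NS_MAX_YEAR for v in values):
        return "datetime64[ns]"
    return "object"


class ToyBlock:
    """Holds values plus the dtype decided when the block was created."""

    def __init__(self, values, dtype=None):
        self.dtype = dtype if dtype is not None else infer_dtype(values)
        self.values = values

    def is_consistent(self):
        # A datetime64[ns] block must only hold in-bounds datetimes.
        return self.dtype == "object" or infer_dtype(self.values) == self.dtype


def swap_checked(block, values):
    """Guarded swap: refuse values that do not match the block's dtype."""
    if infer_dtype(values) != block.dtype:
        raise TypeError("values do not match block dtype " + block.dtype)
    block.values = values


groups = [[datetime(2005, 1, 1, 10, 30)], [datetime(3005, 1, 1, 10, 30)]]

# Sliding trick: build the block once from the first group, then swap values.
cached = ToyBlock(groups[0])
cached.values = groups[1]  # bypasses inference, like object.__setattr__
broken = not cached.is_consistent()
```

In this model the unguarded swap leaves a block tagged datetime64[ns] holding an out-of-bounds value, while `swap_checked` would reject it at the cost of an extra check per group.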

Member

what you're describing sounds similar to what motivated the libreduction edits in #35417. would porting those make a difference?

Member Author

Could you try it out on that branch? Assuming I need to only look at the diff in reduction.pyx, I don't directly see how that would fix this issue, as it doesn't change how the Series' blocks' values are set within the loop?

Member

nope, doesn't fix it.

it looks like we're getting to DatetimeBlock.array_values with self.values of array([datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)], dtype=object).

Member

OK, a fix that does appear to work: in libreduction when defining self.cached_typ, pass dtype=self.buf.dtype to self.typ
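
The effect of pinning the dtype can be seen with the public `Series` constructor. This is a minimal sketch of the idea and not the actual change to reduction.pyx; the variable names are illustrative, and it assumes `buf` is an object-dtype array as in the case above.

```python
import datetime as dt

import numpy as np
import pandas as pd

# Object-dtype buffer holding a datetime that fits in datetime64[ns].
buf = np.array([dt.datetime(2005, 1, 1, 10, 30, 23, 540000)], dtype=object)

inferred = pd.Series(buf)                   # dtype left to constructor inference
explicit = pd.Series(buf, dtype=buf.dtype)  # dtype pinned to the buffer's dtype

print(inferred.dtype)  # depends on constructor inference
print(explicit.dtype)  # object
```

With the dtype pinned to object, later swapping in another object-dtype array (for example one holding an out-of-bounds datetime) no longer conflicts with the dtype the cached Series was created with.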

Member Author

Ah, good catch. Will push that.

@jorisvandenbossche jorisvandenbossche merged commit 91abd0a into pandas-dev:master Nov 27, 2020
@jorisvandenbossche jorisvandenbossche deleted the regr-groupby-datetime-objects branch November 27, 2020 20:12
@jorisvandenbossche
Member Author

Thanks for the suggestion, @jbrockmendel !

@simonjayhawkins
Member

@meeseeksdev backport 1.1.x

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Nov 27, 2020
simonjayhawkins pushed a commit that referenced this pull request Nov 27, 2020
…out-of-bounds datetimes (#38123)

Co-authored-by: Joris Van den Bossche <[email protected]>
Development

Successfully merging this pull request may close these issues.

REGR: Column with datetime values too big to be converted to pd.Timestamp leads to assertion error in groupby
4 participants