-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG/TST: ensure groupby.agg preserves extension dtype #29144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG/TST: ensure groupby.agg preserves extension dtype #29144
Conversation
pandas/core/groupby/ops.py
Outdated
@@ -655,15 +655,15 @@ def agg_series(self, obj, func): | |||
return self._aggregate_series_fast(obj, func) | |||
except AssertionError: | |||
raise | |||
except ValueError as err: | |||
except (ValueError, AttributeError) as err: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the AttributeError in reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you confirm whether 29100 makes this change unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only solves the fact that an AttributeError needs to be catched (I removed that in the last commit), it does not make the tests pass (as I mentioned yesterday)
if "No result." in str(err): | ||
# raised in libreduction | ||
pass | ||
elif "Function does not reduce" in str(err): | ||
# raised in libreduction | ||
pass | ||
else: | ||
raise | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel the whole block is now basically an except (..): pass
(so I could make it much shorter), but you might have put in those specific checks as pointers to what to clean up later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so something is being done with a DecimalArray that raises a TypeError? and i guess by not-catching it here it is getting caught somewhere above that isn't doing re-casting appropriately? can we track down where that is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like in _aggregate_series_fast there is a call to libreduction that tries to assign this DecimalArray to a name that libreduction has typed as an ndarray, which raises TypeError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that is expected I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we get to decide what is "expected"; maybe i dont understand what you're getting at
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, can you do the same str(err)
checking so that we only let through the relevant TypeErrors? One of the other PRs afoot is specifically targeting other TypeErrors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an error message coming from another library though (cython?). Do we have guarantee that that is stable?
(the other two errors that are catched that way are raised by our own code in reduction.pyx)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have guarantee that that is stable?
If nothing else, you can check that both "ndarray" and "DecimalArray" are present. I'm sure you can figure something out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not only for DecimalArray, but for any kind of internal/external EA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But can add the check for just ndarray.
pandas/core/groupby/ops.py
Outdated
@@ -655,15 +655,15 @@ def agg_series(self, obj, func): | |||
return self._aggregate_series_fast(obj, func) | |||
except AssertionError: | |||
raise | |||
except ValueError as err: | |||
except (ValueError, TypeError) as err: | |||
if "No result." in str(err): | |||
# raised in libreduction | |||
pass | |||
elif "Function does not reduce" in str(err): | |||
# raised in libreduction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel is it the intention to keep this long term, or is this also planned to be cleaned-up in a follow-up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cleaned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And what's the idea of how to clean this up?
(we could also raise a more specific internal error (subclass) to catch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(we could also raise a more specific internal error (subclass) to catch)
That's what I'm leaning towards, yah
Hmm so I think I'm a little hesitant to merge this one. Didn't follow all of the back and forth but do we know a general purpose solution is? These Exception cleanups are extremely valuable; as evidenced by this issue the amount of catching and re-wiring that goes on is impossible to reason about, so putting this back in might have subtle side effects we don't know about. |
FWIW ive copied the test this introduces into a local branch and am figuring out what "the right" way to do this is. #29103 is a blocker for that I'm pretty sure |
Well, until three days ago, this was the solution; so it can't be that bad (and the "subtle side effects we don't know about" are then the ones we have had for years ..). Yes, those clean-ups are valuable, I don't disagree with that, but all I propose is to figure out the better solution while master is restored to the previous behaviour, and while we have tests in place to guarantee the behaviour (also for other open PRs). Further, I actually think this can be a good solution. The TypeError is fully expected if a non-ndarray is passed into the fastpath. For sure, we can also check that in advance, and then don't pass it in. But both are valid python idioms. |
@jorisvandenbossche i've got a local branch about ready that copy/pastes the tests from this PR and implements a longer-term fix. does the Decimal test here sufficiently cover the Geopandas breakage you mentioned before? |
@jbrockmendel OK, that sounds good. Then will merge this PR and you can remove the catched typeerror again in your upcoming PR.
I hope, but it's difficult to be sure :), as it is a quite different EA. In any case, with the change I did that fixed the decimal test, also fixed my geopandas test, so they seem quite related. And I added the test with a custom decimal method to mimic it a bit more. |
This wasn't approved by anyone - why merge this instead of helping @jbrockmendel ? |
As I mentioned before, to have tests in place to ensure Brock's PRs pass them (there are several open PRs that touch the groupby code, not just a single one). In the discussion above, I didn't hear any objection to the tests itself, so assumed the tests itself were OK. |
To come back to this a final time: Will, you are right, I shouldn't have merged before asking again given the discussion. Sorry about that. I was a bit frustrated with the discussion, but I should not have let my frustration ignore the discussion. |
Closes #29141
@jbrockmendel this is adding the cases I found in #29141. It does a small revert of catching an error to get them passing, but then your PR can fix it properly?