Skip to content

BUG/TST: ensure groupby.agg preserves extension dtype #29144

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jorisvandenbossche
Copy link
Member

Closes #29141

@jbrockmendel this is adding the cases I found in #29141. It does a small revert of catching an error to get them passing, but then your PR can fix it properly?

@jorisvandenbossche jorisvandenbossche added Groupby ExtensionArray Extending pandas with custom dtypes or arrays. labels Oct 22, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Oct 22, 2019
@@ -655,15 +655,15 @@ def agg_series(self, obj, func):
return self._aggregate_series_fast(obj, func)
except AssertionError:
raise
except ValueError as err:
except (ValueError, AttributeError) as err:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the AttributeError in reference?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the issue #29141. Now, since #29100 is already merged, the AttributeError might not be needed anymore for our tests to pass.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you confirm whether 29100 makes this change unnecessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only solves the fact that an AttributeError needs to be catched (I removed that in the last commit), it does not make the tests pass (as I mentioned yesterday)

if "No result." in str(err):
# raised in libreduction
pass
elif "Function does not reduce" in str(err):
# raised in libreduction
pass
else:
raise
pass
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel the whole block is now basically an except (..): pass (so I could make it much shorter), but you might have put in those specific checks as pointers to what to clean up later?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so something is being done with a DecimalArray that raises a TypeError? and i guess by not-catching it here it is getting caught somewhere above that isn't doing re-casting appropriately? can we track down where that is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like in _aggregate_series_fast there is a call to libreduction that tries to assign this DecimalArray to a name that libreduction has typed as an ndarray, which raises TypeError

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that is expected I think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we get to decide what is "expected"; maybe i dont understand what you're getting at

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, can you do the same str(err) checking so that we only let through the relevant TypeErrors? One of the other PRs afoot is specifically targeting other TypeErrors

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an error message coming from another library though (cython?). Do we have guarantee that that is stable?
(the other two errors that are catched that way are raised by our own code in reduction.pyx)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have guarantee that that is stable?

If nothing else, you can check that both "ndarray" and "DecimalArray" are present. I'm sure you can figure something out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not only for DecimalArray, but for any kind of internal/external EA.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But can add the check for just ndarray.

@@ -655,15 +655,15 @@ def agg_series(self, obj, func):
return self._aggregate_series_fast(obj, func)
except AssertionError:
raise
except ValueError as err:
except (ValueError, TypeError) as err:
if "No result." in str(err):
# raised in libreduction
pass
elif "Function does not reduce" in str(err):
# raised in libreduction
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel is it the intention to keep this long term, or is this also planned to be cleaned-up in a follow-up?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleaned

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what's the idea of how to clean this up?
(we could also raise a more specific internal error (subclass) to catch)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(we could also raise a more specific internal error (subclass) to catch)

That's what I'm leaning towards, yah

@WillAyd
Copy link
Member

WillAyd commented Oct 23, 2019

Hmm so I think I'm a little hesitant to merge this one. Didn't follow all of the back and forth but do we know a general purpose solution is?

These Exception cleanups are extremely valuable; as evidenced by this issue the amount of catching and re-wiring that goes on is impossible to reason about, so putting this back in might have subtle side effects we don't know about.

@jbrockmendel
Copy link
Member

so putting this back in might have subtle side effects we don't know about.

FWIW ive copied the test this introduces into a local branch and am figuring out what "the right" way to do this is. #29103 is a blocker for that I'm pretty sure

@jorisvandenbossche
Copy link
Member Author

Hmm so I think I'm a little hesitant to merge this one. Didn't follow all of the back and forth but do we know a general purpose solution is?

Well, until three days ago, this was the solution; so it can't be that bad (and the "subtle side effects we don't know about" are then the ones we have had for years ..). Yes, those clean-ups are valuable, I don't disagree with that, but all I propose is to figure out the better solution while master is restored to the previous behaviour, and while we have tests in place to guarantee the behaviour (also for other open PRs).

Further, I actually think this can be a good solution. The TypeError is fully expected if a non-ndarray is passed into the fastpath. For sure, we can also check that in advance, and then don't pass it in. But both are valid python idioms.

@jbrockmendel
Copy link
Member

@jorisvandenbossche i've got a local branch about ready that copy/pastes the tests from this PR and implements a longer-term fix. does the Decimal test here sufficiently cover the Geopandas breakage you mentioned before?

@jorisvandenbossche
Copy link
Member Author

@jbrockmendel OK, that sounds good. Then will merge this PR and you can remove the catched typeerror again in your upcoming PR.

does the Decimal test here sufficiently cover the Geopandas breakage you mentioned before?

I hope, but it's difficult to be sure :), as it is a quite different EA. In any case, with the change I did that fixed the decimal test, also fixed my geopandas test, so they seem quite related. And I added the test with a custom decimal method to mimic it a bit more.

@jorisvandenbossche jorisvandenbossche merged commit f412191 into pandas-dev:master Oct 23, 2019
@jorisvandenbossche jorisvandenbossche deleted the GH29141-groupby-EA branch October 23, 2019 15:51
@WillAyd
Copy link
Member

WillAyd commented Oct 23, 2019

This wasn't approved by anyone - why merge this instead of helping @jbrockmendel ?

@jorisvandenbossche
Copy link
Member Author

As I mentioned before, to have tests in place to ensure Brock's PRs pass them (there are several open PRs that touch the groupby code, not just a single one). In the discussion above, I didn't hear any objection to the tests itself, so assumed the tests itself were OK.

@jorisvandenbossche
Copy link
Member Author

To come back to this a final time: Will, you are right, I shouldn't have merged before asking again given the discussion. Sorry about that. I was a bit frustrated with the discussion, but I should not have let my frustration ignore the discussion.

HawkinsBA pushed a commit to HawkinsBA/pandas that referenced this pull request Oct 29, 2019
Reksbril pushed a commit to Reksbril/pandas that referenced this pull request Nov 18, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: regression on master in groupby agg with ExtensionArray
4 participants