-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Ensure valid Block mutation in SeriesBinGrouper. #32561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure valid Block mutation in SeriesBinGrouper. #32561
Conversation
Closes pandas-dev#31802 This "fixes" pandas-dev#31802 by expanding the number of cases where we swallow an exception in libreduction. Currently, we're creating an invalid Series in SeriesBinGrouper where the `.mgr_locs` doesn't match the values. See pandas-dev#31802 (comment) for more. For now, we simply catch more cases that fall back to Python. I've gone with a minimal change which addresses only issues hitting this exact exception. We might want to go broader, but that's not clear.
"func", | ||
[ | ||
cumsum_max, | ||
pytest.param(assert_block_lengths, marks=pytest.mark.xfail(reason="debatable")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we just catch ValueError in https://github.com/pandas-dev/pandas/pull/32561/files#diff-8c0985a9fca770c2028bed688dfc043fR641. "fixing" this would essentially require an except Exception
. Do people have an opinion here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am personally -1 on catching such specific error messages like this. This only fixes the exact bug report at hand, not the general issue that things can fail inside the libreduction code in several unforeseen ways.
I would personally just broaden the exception that is swallowed (or at least in releases, in master branch I am fine with being more strict in the hope to catch some bugs)
I could go either way. My long-term hope is to get away from using exceptions as control flow like this. I'm not sure whether |
Per comment in #31082, would it be viable to check in libreduction before setting incorrectly-shaped block.values? |
I don't have a firm grip on when this occurs, but it seems like it should be any time you have a change in the group size. But I might be incorrect. |
ad746ba changed things to also mutate the Block.mgr_locs in addition to the values, so the earlier discussion on what to catch is moot (as far as this PR is concerned). |
I think some unrelated (already merged in master) edits have snuck in |
Fixed the git snafu I think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. conflict in whatsnew, merge on green.
@@ -51,6 +52,30 @@ def test_series_bin_grouper(): | |||
tm.assert_almost_equal(counts, exp_counts) | |||
|
|||
|
|||
def assert_block_lengths(x): | |||
assert len(x) == len(x._data.blocks[0].mgr_locs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this fails, the assertion errors bubbles up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah
diff --git a/pandas/tests/groupby/test_bin_groupby.py b/pandas/tests/groupby/test_bin_groupby.py
index 152086c241..b6518c1962 100644
--- a/pandas/tests/groupby/test_bin_groupby.py
+++ b/pandas/tests/groupby/test_bin_groupby.py
@@ -53,7 +53,7 @@ def test_series_bin_grouper():
def assert_block_lengths(x):
- assert len(x) == len(x._data.blocks[0].mgr_locs)
+ assert len(x) == len(x._data.blocks[0].mgr_locs) + 1
return 0
___________________________________________________________________ test_mgr_locs_updated[assert_block_lengths] ____________________________________________________________________
func = <function assert_block_lengths at 0x122354b90>
@pytest.mark.parametrize("func", [cumsum_max, assert_block_lengths])
def test_mgr_locs_updated(func):
# https://github.com/pandas-dev/pandas/issues/31802
# Some operations may require creating new blocks, which requires
# valid mgr_locs
df = pd.DataFrame({"A": ["a", "a", "a"], "B": ["a", "b", "b"], "C": [1, 1, 1]})
> result = df.groupby(["A", "B"]).agg(func)
pandas/tests/groupby/test_bin_groupby.py:71:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/groupby/generic.py:939: in aggregate
return self._python_agg_general(func, *args, **kwargs)
pandas/core/groupby/groupby.py:926: in _python_agg_general
result, counts = self.grouper.agg_series(obj, f)
pandas/core/groupby/ops.py:640: in agg_series
return self._aggregate_series_fast(obj, func)
pandas/core/groupby/ops.py:665: in _aggregate_series_fast
result, counts = grouper.get_result()
pandas/_libs/reduction.pyx:377: in pandas._libs.reduction.SeriesGrouper.get_result
res, initialized = self._apply_to_group(cached_typ, cached_ityp,
pandas/_libs/reduction.pyx:195: in pandas._libs.reduction._BaseGrouper._apply_to_group
res = self.f(cached_typ)
pandas/core/groupby/groupby.py:913: in <lambda>
f = lambda x: func(x, *args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
x = Series([], Name: C, dtype: int64)
def assert_block_lengths(x):
> assert len(x) == len(x._data.blocks[0].mgr_locs) + 1
E assert 1 == (1 + 1)
E + where 1 = len(0 1\nName: C, dtype: int64)
E + and 1 = len(BlockPlacement(slice(0, 1, 1)))
E + where BlockPlacement(slice(0, 1, 1)) = IntBlock: 1 dtype: int64.mgr_locs
pandas/tests/groupby/test_bin_groupby.py:56: AssertionError
==================================================================== 1 failed, 1 passed, 9 deselected in 0.24s =====================================================================
Has anyone seen the 32bit failure elsewhere?
I guess https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=30468&view=logs&j=3a03f79d-0b41-5610-1aa4-b4a014d0bc70&t=4d05ed0e-1ed3-5bff-dd63-1e957f2766a9&l=74 had it on the npdev build. Is that a flaky test? |
Yes. Best guess is #32571 caused this, am troubleshooting now |
…32635) Co-authored-by: Tom Augspurger <[email protected]>
Closes #31802
This "fixes" #31802 by expanding the number of cases where we swallow anexception in libreduction. Currently, we're creating an invalid Series
in SeriesBinGrouper where the
.mgr_locs
doesn't match the values. See#31802 (comment)
for more.
For now, we simply catch more cases that fall back to Python. I've gonewith a minimal change which addresses only issues hitting this exact
exception. We might want to go broader, but that's not clear.
cc @jbrockmendel & @WillAyd