-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REGR: Prevent indexes that aren't directly backed by numpy from entering libreduction code paths #31238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REGR: Prevent indexes that aren't directly backed by numpy from entering libreduction code paths #31238
Conversation
pandas/_libs/reduction.pyx
Outdated
@@ -158,7 +158,7 @@ cdef class _BaseGrouper: | |||
if util.is_array(values) and not values.flags.contiguous: | |||
# e.g. Categorical has no `flags` attribute | |||
values = values.copy() | |||
index = dummy.index.values | |||
index = dummy.index.to_numpy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for EA cases, we want to avoid going through libreduction altogether
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? We shouldn't be hitting this code path in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, yeah what does this do to the categorical grouping case. This should be a highly optimized path, is this still?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? We shouldn't be hitting this code path in the first place?
Correct. This is called from core.apply.FrameApply.apply_standard after a check:
if (
self.result_type in ["reduce", None]
and not self.dtypes.apply(is_extension_array_dtype).any()
# Disallow complex_internals since libreduction shortcut
# cannot handle MultiIndex
and not isinstance(self.agg_axis, ABCMultiIndex)
):
The MultiIndex check may need to be expanded to include EAs.
I think this is called from one other place in non-test code, so the missing check may belong there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vaguely recall a _has_complex_...
property that indicated with the Index was backed by an ndarray. Was that removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._has_complex_internals
was equivalent to to isinstance(self, MultiIndex)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue in question doesn't actually go through FrameApply.apply_standard
, but rather _aggregate_series_fast
, which dispatches through libreduction
. The point still applies that we want to avoid EA backed indexes in FrameApply.apply_standard
, so I've modified the check to use _has_complex_internals
.
pandas/_libs/reduction.pyx
Outdated
@@ -158,7 +158,7 @@ cdef class _BaseGrouper: | |||
if util.is_array(values) and not values.flags.contiguous: | |||
# e.g. Categorical has no `flags` attribute | |||
values = values.copy() | |||
index = dummy.index.values | |||
index = dummy.index.to_numpy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, yeah what does this do to the categorical grouping case. This should be a highly optimized path, is this still?
Okay, I've added back If adding back |
@@ -616,8 +616,8 @@ def agg_series(self, obj: Series, func): | |||
# TODO: can we get a performant workaround for EAs backed by ndarray? | |||
return self._aggregate_series_pure_python(obj, func) | |||
|
|||
elif isinstance(obj.index, MultiIndex): | |||
# MultiIndex; Pre-empt TypeError in _aggregate_series_fast | |||
elif obj.index._has_complex_internals: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This now excludes PeriodIndex
, which previously worked fine since .values
converted to a numpy array. It looks more performant to exclude PeriodIndex
though, since we avoid the conversion to numpy:
In [1]: import numpy as np
...: import pandas as pd
...: from string import ascii_letters
...:
...: np.random.seed(123)
...: group = np.random.choice(list(ascii_letters), 10**5)
...: value = np.random.randint(12345, size=10**5)
...: index = pd.period_range("2000", freq="D", periods=10**5)
...: df = pd.DataFrame({"group": group, "value": value}, index=index)
In [2]: %timeit df.groupby("group").agg({"value": pd.Series.nunique})
17.8 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # on this branch
95.9 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) # on master
@jreback I think this is fixing a regression. We should leave it on the 1.0.0 milestone. |
Merged master to fix the CI failure. |
Added an additional check against I don't immediately see any other areas where this check needs to occur but it's possible that there are additional cases. |
pandas/core/indexes/base.py
Outdated
""" | ||
Indicates if an index is not directly backed by a numpy array | ||
""" | ||
# used to disable groupby tricks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"tricks" -> "going through libreduction fastpath which would ..."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copied that from the original implementation but agree it's not very helpful. Updated to something that's more informative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @jschendel.
Merging later tonight if there aren't any objections.
thanks |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon! If these instruction are inaccurate, feel free to suggest an improvement. |
@jschendel can you push a backport for this, didn't automatically work. |
…tly backed by numpy from entering libreduction code paths
… by numpy from entering libreduction code paths (#31378)
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
No whatsnew since this is a regression.