REGR: Prevent indexes that aren't directly backed by numpy from entering libreduction code paths #31238

jschendel · 2020-01-23T02:38:37Z

closes GroupBy aggregation fails if DataFrame has CategoricalIndex #31223
closes REGR: CategoricalIndex and IntervalIndex are missing _index_data attribute #31248
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff

No whatsnew since this is a regression.

jbrockmendel · 2020-01-23T05:40:38Z

pandas/_libs/reduction.pyx

@@ -158,7 +158,7 @@ cdef class _BaseGrouper:
        if util.is_array(values) and not values.flags.contiguous:
            # e.g. Categorical has no `flags` attribute
            values = values.copy()
-        index = dummy.index.values
+        index = dummy.index.to_numpy()


for EA cases, we want to avoid going through libreduction altogether

What do you mean? We shouldn't be hitting this code path in the first place?

Hmm, yeah, what does this do to the categorical grouping case? This should be a highly optimized path; is it still?

> What do you mean? We shouldn't be hitting this code path in the first place?

Correct. This is called from core.apply.FrameApply.apply_standard after a check:

    if (
        self.result_type in ["reduce", None]
        and not self.dtypes.apply(is_extension_array_dtype).any()
        # Disallow complex_internals since libreduction shortcut
        # cannot handle MultiIndex
        and not isinstance(self.agg_axis, ABCMultiIndex)
    ):

The MultiIndex check may need to be expanded to include EAs.

I think this is called from one other place in non-test code, so the missing check may belong there.

I vaguely recall a _has_complex_... property that indicated whether the Index was backed by an ndarray. Was that removed?

self._has_complex_internals was equivalent to isinstance(self, MultiIndex)

The issue in question doesn't actually go through FrameApply.apply_standard, but rather _aggregate_series_fast, which dispatches through libreduction. The point still applies that we want to avoid EA backed indexes in FrameApply.apply_standard, so I've modified the check to use _has_complex_internals.
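For readers following the one-line diff in this thread, here is a minimal sketch of the difference between `.values` and `.to_numpy()` on a few index types. It uses only public pandas API; the variable names are illustrative and are not taken from the PR.

```python
import numpy as np
import pandas as pd

# Three small indexes; only the first is stored directly as a NumPy array.
dt_index = pd.date_range("2000-01-01", periods=3, freq="D")
cat_index = pd.CategoricalIndex(["a", "b", "a"])
interval_index = pd.interval_range(start=0, end=3)

for idx in (dt_index, cat_index, interval_index):
    values = idx.values          # may be an ExtensionArray (e.g. Categorical)
    converted = idx.to_numpy()   # always a NumPy array, possibly after a conversion
    print(
        type(idx).__name__,
        type(values).__name__,
        isinstance(values, np.ndarray),
        converted.dtype,
    )
```

The conversion performed by `.to_numpy()` is what makes the change safe for the Cython code, and it is also the cost that the later review comments are concerned about.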

jreback · 2020-01-24T01:01:34Z

pandas/_libs/reduction.pyx

@@ -158,7 +158,7 @@ cdef class _BaseGrouper:
        if util.is_array(values) and not values.flags.contiguous:
            # e.g. Categorical has no `flags` attribute
            values = values.copy()
-        index = dummy.index.values
+        index = dummy.index.to_numpy()


Hmm, yeah, what does this do to the categorical grouping case? This should be a highly optimized path; is it still?

jschendel · 2020-01-26T21:35:36Z

Okay, I've added back Index._has_complex_internals but modified it to also be True for CategoricalIndex, IntervalIndex, and PeriodIndex since those indexes require (potentially expensive) conversion to get a numpy array of values. I don't want to exclude all extension indexes since DatetimeIndex and TimedeltaIndex don't require expensive conversion.

If adding back Index._has_complex_internals is acceptable, then I think we can also use it to fix #31248 by excluding such indexes from libreduction in a similar manner. I was planning to do that in a follow-up but can do it here too if we want.
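The check described in this comment could be approximated from outside pandas as follows. This is only a sketch: `has_complex_internals` and `choose_aggregation_path` are hypothetical helper names, and the `isinstance` test is an assumption standing in for the private `Index._has_complex_internals` property.

```python
import pandas as pd

# Index types named in the comment above as needing a (potentially expensive)
# conversion to NumPy, plus MultiIndex from the original check.
_COMPLEX_INDEX_TYPES = (
    pd.MultiIndex,
    pd.CategoricalIndex,
    pd.IntervalIndex,
    pd.PeriodIndex,
)


def has_complex_internals(index: pd.Index) -> bool:
    """Hypothetical public-API stand-in for Index._has_complex_internals."""
    return isinstance(index, _COMPLEX_INDEX_TYPES)


def choose_aggregation_path(index: pd.Index) -> str:
    """Return which code path a caller would pick under this sketch."""
    return "pure_python" if has_complex_internals(index) else "libreduction"


print(choose_aggregation_path(pd.CategoricalIndex(["a", "b"])))
print(choose_aggregation_path(pd.date_range("2000-01-01", periods=2, freq="D")))
```

DatetimeIndex and TimedeltaIndex are deliberately left out of the tuple, matching the reasoning above that they do not require an expensive conversion.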

jschendel · 2020-01-26T21:42:30Z

pandas/core/groupby/ops.py

@@ -616,8 +616,8 @@ def agg_series(self, obj: Series, func):
            # TODO: can we get a performant workaround for EAs backed by ndarray?
            return self._aggregate_series_pure_python(obj, func)

-        elif isinstance(obj.index, MultiIndex):
-            # MultiIndex; Pre-empt TypeError in _aggregate_series_fast
+        elif obj.index._has_complex_internals:


This now excludes PeriodIndex, which previously worked fine since .values converted to a numpy array. It looks more performant to exclude PeriodIndex though, since we avoid the conversion to numpy:

    In [1]: import numpy as np
       ...: import pandas as pd
       ...: from string import ascii_letters
       ...:
       ...: np.random.seed(123)
       ...: group = np.random.choice(list(ascii_letters), 10**5)
       ...: value = np.random.randint(12345, size=10**5)
       ...: index = pd.period_range("2000", freq="D", periods=10**5)
       ...: df = pd.DataFrame({"group": group, "value": value}, index=index)

    In [2]: %timeit df.groupby("group").agg({"value": pd.Series.nunique})
    17.8 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # on this branch
    95.9 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # on master
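One plausible reason the conversion is costly, sketched with public pandas API only: converting a PeriodIndex to NumPy without specifying a dtype boxes every element into a `Period` object, whereas the integer ordinals are already available through `asi8`. The variable names are illustrative, and this is not a measurement of the PR's code path.

```python
import numpy as np
import pandas as pd

periods = pd.period_range("2000-01-01", periods=5, freq="D")

# Default conversion: one Python-level Period object per element.
as_objects = periods.to_numpy()
print(as_objects.dtype, type(as_objects[0]).__name__)

# The underlying integer ordinals, for comparison.
ordinals = periods.asi8
print(ordinals.dtype)
```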

TomAugspurger · 2020-01-27T14:01:37Z

@jreback I think this is fixing a regression. We should leave it on the 1.0.0 milestone.

TomAugspurger · 2020-01-27T14:42:59Z

Merged master to fix the CI failure.

jschendel · 2020-01-27T16:06:27Z

Added an additional check against _has_complex_internals that fixes #31248 and the associated test cases.

I don't immediately see any other areas where this check needs to occur but it's possible that there are additional cases.

jbrockmendel · 2020-01-27T16:28:34Z

pandas/core/indexes/base.py

+        """
+        Indicates if an index is not directly backed by a numpy array
+        """
+        # used to disable groupby tricks


"tricks" -> "going through libreduction fastpath which would ..."?

Copied that from the original implementation but agree it's not very helpful. Updated to something that's more informative.
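The property discussed in this thread follows a common override pattern: the base class reports that the fast path is safe, and subclasses whose data is not a plain NumPy array opt out. A minimal pure-Python sketch follows; the class and function names are hypothetical and do not mirror the pandas source.

```python
class BaseIndexSketch:
    """Hypothetical base class: assumed to be backed directly by a NumPy array."""

    @property
    def has_complex_internals(self) -> bool:
        # Default: safe to hand the underlying array to a fast path.
        return False


class CategoricalIndexSketch(BaseIndexSketch):
    """Hypothetical subclass whose values need conversion before fast-path use."""

    @property
    def has_complex_internals(self) -> bool:
        return True


def can_use_fast_path(index: BaseIndexSketch) -> bool:
    """Callers check the flag instead of enumerating index types themselves."""
    return not index.has_complex_internals


print(can_use_fast_path(BaseIndexSketch()))
print(can_use_fast_path(CategoricalIndexSketch()))
```

Keeping the decision on the index class means that a new index type only needs to override one property, rather than every call site needing a longer `isinstance` check.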

TomAugspurger

LGTM. Thanks @jschendel.

Merging later tonight if there aren't any objections.

jreback · 2020-01-28T01:57:22Z

thanks

lumberbot-app · 2020-01-28T01:57:26Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

$ git checkout 1.0.x
$ git pull

Cherry-pick the first parent branch of this PR on top of the older branch:

$ git cherry-pick -m1 82acdeef8790ba0e30b671a830f9efcfc17e8479

You will likely have some merge/cherry-pick conflict here, fix them and commit:

$ git commit -am "Backport PR #31238: REGR: Prevent indexes that aren't directly backed by numpy from entering libreduction code paths"

Push to a named branch:

$ git push YOURFORK 1.0.x:auto-backport-of-pr-31238-on-1.0.x

Create a PR against branch 1.0.x, I would have named this PR:

"Backport PR #31238 on branch 1.0.x"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.

jreback · 2020-01-28T02:01:30Z

@jschendel can you push a backport for this? It didn't work automatically.

REGR: Fix GroupBy aggregation with ExtensionArray backed index

dea76c1

jschendel added the Groupby, Regression (functionality that used to work in a prior pandas version), and ExtensionArray (extending pandas with custom dtypes or arrays) labels Jan 23, 2020

jschendel added this to the 1.0.0 milestone Jan 23, 2020

jbrockmendel reviewed Jan 23, 2020

jreback requested changes Jan 24, 2020

jreback removed this from the 1.0.0 milestone Jan 26, 2020

add back _has_complex_internals

ed3a9a3

jschendel commented Jan 26, 2020

TomAugspurger added this to the 1.0.0 milestone Jan 27, 2020

Merge remote-tracking branch 'upstream/master' into regr-groupby-extension-index

2efd22b

TomAugspurger mentioned this pull request Jan 27, 2020

Regression bugs when applying GroupBy Aggregations to Categorical columns #31256

Closed

jschendel mentioned this pull request Jan 27, 2020

REGR: CategoricalIndex and IntervalIndex are missing _index_data attribute #31248

Closed

prevent fast_apply case for _has_complex_internals

1018ca9

jschendel changed the title from "REGR: Fix GroupBy aggregation with ExtensionArray backed index" to "REGR: Prevent indexes that aren't directly backed by numpy from entering libreduction code paths" Jan 27, 2020

jbrockmendel reviewed Jan 27, 2020

TomAugspurger approved these changes Jan 27, 2020

update comment

cc400cd

jreback approved these changes Jan 28, 2020

jreback merged commit 82acdee into pandas-dev:master Jan 28, 2020

lumberbot-app bot added the Still Needs Manual Backport label Jan 28, 2020

jschendel added a commit to jschendel/pandas that referenced this pull request Jan 28, 2020

Backport PR pandas-dev#31238: REGR: Prevent indexes that aren't directly backed by numpy from entering libreduction code paths

dc61dd7

jschendel deleted the regr-groupby-extension-index branch January 28, 2020 03:54

jschendel mentioned this pull request Jan 28, 2020

Backport PR #31238: REGR: Prevent indexes that aren't directly backed by numpy from entering libreduction code paths #31378

Merged

jreback pushed a commit that referenced this pull request Jan 28, 2020

Backport PR #31238: REGR: Prevent indexes that aren't directly backed by numpy from entering libreduction code paths (#31378)

b905f2b

simonjayhawkins removed the Still Needs Manual Backport label Jan 30, 2020
