Skip to content

REGR: Prevent indexes that aren't directly backed by numpy from entering libreduction code paths #31238

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 28, 2020

Conversation

jschendel
Copy link
Member

@jschendel jschendel commented Jan 23, 2020

No whatsnew since this is a regression.

@jschendel jschendel added Groupby Regression Functionality that used to work in a prior pandas version ExtensionArray Extending pandas with custom dtypes or arrays. labels Jan 23, 2020
@jschendel jschendel added this to the 1.0.0 milestone Jan 23, 2020
@@ -158,7 +158,7 @@ cdef class _BaseGrouper:
if util.is_array(values) and not values.flags.contiguous:
# e.g. Categorical has no `flags` attribute
values = values.copy()
index = dummy.index.values
index = dummy.index.to_numpy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for EA cases, we want to avoid going through libreduction altogether

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? We shouldn't be hitting this code path in the first place?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yeah what does this do to the categorical grouping case. This should be a highly optimized path, is this still?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? We shouldn't be hitting this code path in the first place?

Correct. This is called from core.apply.FrameApply.apply_standard after a check:

        if (
            self.result_type in ["reduce", None]
            and not self.dtypes.apply(is_extension_array_dtype).any()
            # Disallow complex_internals since libreduction shortcut
            #  cannot handle MultiIndex
            and not isinstance(self.agg_axis, ABCMultiIndex)
        ):

The MultiIndex check may need to be expanded to include EAs.

I think this is called from one other place in non-test code, so the missing check may belong there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vaguely recall a _has_complex_... property that indicated with the Index was backed by an ndarray. Was that removed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._has_complex_internals was equivalent to to isinstance(self, MultiIndex)

Copy link
Member Author

@jschendel jschendel Jan 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue in question doesn't actually go through FrameApply.apply_standard, but rather _aggregate_series_fast, which dispatches through libreduction. The point still applies that we want to avoid EA backed indexes in FrameApply.apply_standard, so I've modified the check to use _has_complex_internals.

@@ -158,7 +158,7 @@ cdef class _BaseGrouper:
if util.is_array(values) and not values.flags.contiguous:
# e.g. Categorical has no `flags` attribute
values = values.copy()
index = dummy.index.values
index = dummy.index.to_numpy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yeah what does this do to the categorical grouping case. This should be a highly optimized path, is this still?

@jreback jreback removed this from the 1.0.0 milestone Jan 26, 2020
@jschendel
Copy link
Member Author

jschendel commented Jan 26, 2020

Okay, I've added back Index._has_complex_internals but modified it to also be True for CategoricalIndex, IntervalIndex, and PeriodIndex since those indexes require (potentially expensive) conversion to get a numpy array of values. I don't want to exclude all extension indexes since DatetimeIndex and TimedeltaIndex don't require expensive conversion.

If adding back Index._has_complex_internals is acceptable, then I think we can also use it fix #31248 by using it to exclude things from libreduction in a similar manner. Was planning to do it in a follow-up but can do it here too if we want.

@@ -616,8 +616,8 @@ def agg_series(self, obj: Series, func):
# TODO: can we get a performant workaround for EAs backed by ndarray?
return self._aggregate_series_pure_python(obj, func)

elif isinstance(obj.index, MultiIndex):
# MultiIndex; Pre-empt TypeError in _aggregate_series_fast
elif obj.index._has_complex_internals:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now excludes PeriodIndex, which previously worked fine since .values converted to a numpy array. It looks more performant to exclude PeriodIndex though, since we avoid the conversion to numpy:

In [1]: import numpy as np 
   ...: import pandas as pd 
   ...: from string import ascii_letters 
   ...:  
   ...: np.random.seed(123) 
   ...: group = np.random.choice(list(ascii_letters), 10**5) 
   ...: value = np.random.randint(12345, size=10**5) 
   ...: index = pd.period_range("2000", freq="D", periods=10**5) 
   ...: df = pd.DataFrame({"group": group, "value": value}, index=index)

In [2]: %timeit df.groupby("group").agg({"value": pd.Series.nunique})
17.8 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # on this branch
95.9 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # on master

@TomAugspurger TomAugspurger added this to the 1.0.0 milestone Jan 27, 2020
@TomAugspurger
Copy link
Contributor

@jreback I think this is fixing a regression. We should leave it on the 1.0.0 milestone.

@TomAugspurger
Copy link
Contributor

Merged master to fix the CI failure.

@jschendel
Copy link
Member Author

jschendel commented Jan 27, 2020

Added an additional check against _has_complex_internals that fixes #31248 and the associated test cases.

I don't immediately see any other areas where this check needs to occur but it's possible that there are additional cases.

@jschendel jschendel changed the title REGR: Fix GroupBy aggregation with ExtensionArray backed index REGR: Prevent indexes that aren't directly backed by numpy from entering libreduction code paths Jan 27, 2020
"""
Indicates if an index is not directly backed by a numpy array
"""
# used to disable groupby tricks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"tricks" -> "going through libreduction fastpath which would ..."?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copied that from the original implementation but agree it's not very helpful. Updated to something that's more informative.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @jschendel.

Merging later tonight if there aren't any objections.

@jreback jreback merged commit 82acdee into pandas-dev:master Jan 28, 2020
@jreback
Copy link
Contributor

jreback commented Jan 28, 2020

thanks

@lumberbot-app
Copy link

lumberbot-app bot commented Jan 28, 2020

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
$ git checkout 1.0.x
$ git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
$ git cherry-pick -m1 82acdeef8790ba0e30b671a830f9efcfc17e8479
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
$ git commit -am "Backport PR #31238: REGR: Prevent indexes that aren't directly backed by numpy from entering libreduction code paths"
  1. Push to a named branch :
git push YOURFORK 1.0.x:auto-backport-of-pr-31238-on-1.0.x
  1. Create a PR against branch 1.0.x, I would have named this PR:

"Backport PR #31238 on branch 1.0.x"

And apply the correct labels and milestones.

Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instruction are inaccurate, feel free to suggest an improvement.

@jreback
Copy link
Contributor

jreback commented Jan 28, 2020

@jschendel can you push a backport for this, didn't automatically work.

jschendel added a commit to jschendel/pandas that referenced this pull request Jan 28, 2020
…tly backed by numpy from entering libreduction code paths
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
5 participants