PERF: Use Indexers to implement groupby rolling #34052
Here are preliminary benchmarks. The performance so far is fairly similar. I suspect that having to reconstruct the resulting index is killing the performance. With this benchmark:
try with a longer window as well
I was accidentally still dispatching to the old implementation in this PR; here are the performance results with that removed:
can you enhance the benchmarks for this (if we don't have them already)?
also if you can add a whatsnew note.
Hello @mroeschke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-05-21 04:47:41 UTC
@jreback ready for another look
asv_bench/benchmarks/rolling.py
Outdated
df = pd.DataFrame(
    {"A": [str(i) for i in range(N)] * 10, "B": list(range(N)) * 10}
)
self.groupby_roll = df.groupby("A").rolling(window=2)
can you add a timebased one as well
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -611,7 +611,7 @@ Performance improvements
and :meth:`~pandas.core.groupby.groupby.Groupby.last` (:issue:`34178`)
- Performance improvement in :func:`factorize` for nullable (integer and boolean) dtypes (:issue:`33064`).
- Performance improvement in reductions (sum, prod, min, max) for nullable (integer and boolean) dtypes (:issue:`30982`, :issue:`33261`, :issue:`33442`).

- Performance improvement in ``groupby(..).rolling(..)`` (:issue:`34052`)
do we have a way of hitting the api here?
pandas/core/window/rolling.py
Outdated
np.concatenate(list(self._groupby.grouper.indices.values()))
)

# filter out the on from the object
maybe better to call super()._create_blocks(obj) (e.g. add obj as an optional arg that defaults to self._selected_obj)
Final benchmarks:
looks good, some doc comment requests. ping on green.
    center: Optional[bool] = None,
    closed: Optional[str] = None,
) -> Tuple[np.ndarray, np.ndarray]:
    start_arrays = []
can you add some comments here on what you are doing
pandas/core/window/rolling.py
Outdated
@@ -147,12 +148,10 @@ def _validate_get_window_bounds_signature(window: BaseIndexer) -> None:
f"get_window_bounds"
)

def _create_blocks(self):
def _create_blocks(self, obj):
if you can type: Union[Series, DataFrame] (I think we have an annotation for that).
groupby_keys = [grouping.name for grouping in self._groupby.grouper._groupings]
result_index_names = groupby_keys + grouped_index_name

result_index_data = []
if you can add some comments here on what you are doing
pandas/core/window/rolling.py
Outdated
@property
def _constructor(self):
return Rolling

def _gotitem(self, key, ndim, subset=None):
def _create_blocks(self, obj):
if you can type
@jreback Ping all green
thanks @mroeschke very nice!
Does this also affect "Expanding"?
No, this currently doesn't apply to expanding. PRs to make it apply to expanding welcome!
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Currently, `groupby.rolling` is implemented essentially as `groupby.apply(lambda x: x.rolling())`, which can be slow. This PR implements `groupby.rolling` by calculating window bounds with a `GroupbyRollingIndexer` and using the rolling aggregations in cython to compute the results.
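For readers who want the idea in isolation, here is a minimal pure-Python sketch of group-aware window bounds. It is not the pandas implementation: the names `grouped_window_bounds` and `grouped_rolling_sum` are hypothetical, it only handles a fixed-size trailing window with `min_periods` equal to the window, and it visits groups in first-appearance order rather than sorted order.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def grouped_window_bounds(
    group_positions: Dict[Hashable, List[int]], window: int
) -> Tuple[List[int], List[int]]:
    """Compute [start, end) bounds that never cross a group boundary.

    The bounds index into the data *after* it has been reordered so that
    each group's rows are contiguous. (Hypothetical helper, for illustration.)
    """
    start: List[int] = []
    end: List[int] = []
    offset = 0  # position of the current group's first row in the reordered data
    for positions in group_positions.values():
        for i in range(len(positions)):
            end.append(offset + i + 1)
            # Clamp at the group's first row so the window stays inside the group.
            start.append(offset + max(0, i + 1 - window))
        offset += len(positions)
    return start, end


def grouped_rolling_sum(
    values: Sequence[float], keys: Sequence[Hashable], window: int
) -> List[Optional[float]]:
    """Rolling sum per group using one pass over precomputed bounds."""
    # Map each group key to the row positions it owns.
    group_positions: Dict[Hashable, List[int]] = {}
    for pos, key in enumerate(keys):
        group_positions.setdefault(key, []).append(pos)

    # Reorder the values group by group, analogous to concatenating the
    # per-group index arrays.
    order = [pos for positions in group_positions.values() for pos in positions]
    reordered = [values[pos] for pos in order]

    start, end = grouped_window_bounds(group_positions, window)
    results: List[Optional[float]] = []
    for s, e in zip(start, end):
        # None marks windows with fewer than `window` observations.
        results.append(sum(reordered[s:e]) if e - s == window else None)
    return results


print(grouped_rolling_sum([1, 2, 3, 4, 5, 6], ["a", "b", "a", "b", "a", "b"], 2))
# [None, 4, 8, None, 6, 10]
```

The point of the sketch is that all bounds are computed up front and a single aggregation loop runs over the reordered data, instead of creating a separate rolling object per group as the `apply`-based approach does.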