-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Cythonized GroupBy any #19722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cythonized GroupBy any #19722
Changes from all commits
06f8840
c5809cb
0da6ece
a225c4d
c873e14
8aaaf32
fd5faf9
1237637
2ad0b50
b32b906
ae9126f
e1cdb82
aa338da
1dc67d9
977275b
055d8bf
0754294
3cd500f
1bd8dc9
ee4d0bd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -697,9 +697,10 @@ Performance Improvements | |
- Improved performance of :func:`DataFrame.median` with ``axis=1`` when bottleneck is not installed (:issue:`16468`) | ||
- Improved performance of :func:`MultiIndex.get_loc` for large indexes, at the cost of a reduction in performance for small ones (:issue:`18519`) | ||
- Improved performance of pairwise ``.rolling()`` and ``.expanding()`` with ``.cov()`` and ``.corr()`` operations (:issue:`17917`) | ||
- Improved performance of :func:`DataFrameGroupBy.rank` (:issue:`15779`) | ||
- Improved performance of :func:`pandas.core.groupby.GroupBy.rank` (:issue:`15779`) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For whatever reason this one wouldn't render a link. Still debugging but didn't want it to hold up the rest of the review There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hmm ok There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it may be that this needs to be DataFrameGroupBy.rank (or maybe SeriesGroupBy.rank) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea not entirely sure. It was DataFrameGroupBy.rank previously and wasn't rendering, although that was without the entire namespace so perhaps that's it. Do you want me to open an issue on its own for this or is this typically caught in a "whatsnew cleanup" prior to release? |
||
- Improved performance of variable ``.rolling()`` on ``.min()`` and ``.max()`` (:issue:`19521`) | ||
- Improved performance of ``GroupBy.ffill`` and ``GroupBy.bfill`` (:issue:`11296`) | ||
- Improved performance of :func:`pandas.core.groupby.GroupBy.ffill` and :func:`pandas.core.groupby.GroupBy.bfill` (:issue:`11296`) | ||
- Improved performance of :func:`pandas.core.groupby.GroupBy.any` and :func:`pandas.core.groupby.GroupBy.all` (:issue:`15435`) | ||
|
||
.. _whatsnew_0230.docs: | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1219,6 +1219,53 @@ class GroupBy(_GroupBy): | |
""" | ||
_apply_whitelist = _common_apply_whitelist | ||
|
||
def _bool_agg(self, val_test, skipna): | ||
"""Shared func to call any / all Cython GroupBy implementations""" | ||
|
||
def objs_to_bool(vals): | ||
try: | ||
vals = vals.astype(np.bool) | ||
except ValueError: # for objects | ||
vals = np.array([bool(x) for x in vals]) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we have an asv that hits this (the object path) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We do not but happy to add. As far as objects are concerned, I see we have existing tests for I'd suggest we group all of these into a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok to move |
||
|
||
return vals.view(np.uint8) | ||
|
||
def result_to_bool(result): | ||
return result.astype(np.bool, copy=False) | ||
|
||
return self._get_cythonized_result('group_any_all', self.grouper, | ||
aggregate=True, | ||
cython_dtype=np.uint8, | ||
needs_values=True, | ||
needs_mask=True, | ||
pre_processing=objs_to_bool, | ||
post_processing=result_to_bool, | ||
val_test=val_test, skipna=skipna) | ||
|
||
@Substitution(name='groupby') | ||
@Appender(_doc_template) | ||
def any(self, skipna=True): | ||
"""Returns True if any value in the group is truthful, else False | ||
|
||
Parameters | ||
---------- | ||
skipna : bool, default True | ||
Flag to ignore nan values during truth testing | ||
""" | ||
return self._bool_agg('any', skipna) | ||
|
||
@Substitution(name='groupby') | ||
@Appender(_doc_template) | ||
def all(self, skipna=True): | ||
"""Returns True if all values in the group are truthful, else False | ||
|
||
Parameters | ||
---------- | ||
skipna : bool, default True | ||
Flag to ignore nan values during truth testing | ||
""" | ||
return self._bool_agg('all', skipna) | ||
|
||
@Substitution(name='groupby') | ||
@Appender(_doc_template) | ||
def count(self): | ||
|
@@ -1485,6 +1532,8 @@ def _fill(self, direction, limit=None): | |
|
||
return self._get_cythonized_result('group_fillna_indexer', | ||
self.grouper, needs_mask=True, | ||
cython_dtype=np.int64, | ||
result_is_index=True, | ||
direction=direction, limit=limit) | ||
|
||
@Substitution(name='groupby') | ||
|
@@ -1873,33 +1922,81 @@ def cummax(self, axis=0, **kwargs): | |
|
||
return self._cython_transform('cummax', numeric_only=False) | ||
|
||
def _get_cythonized_result(self, how, grouper, needs_mask=False, | ||
needs_ngroups=False, **kwargs): | ||
def _get_cythonized_result(self, how, grouper, aggregate=False, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Had to add some new keywords to make this more dynamic and support aggregation in addition to the transformations it was already doing with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok for now. these should be subclasses of common routines (in a separate module), then this will be much simpler / cleaner (but that's for later) |
||
cython_dtype=None, needs_values=False, | ||
needs_mask=False, needs_ngroups=False, | ||
result_is_index=False, | ||
pre_processing=None, post_processing=None, | ||
**kwargs): | ||
"""Get result for Cythonized functions | ||
|
||
Parameters | ||
---------- | ||
how : str, Cythonized function name to be called | ||
grouper : Grouper object containing pertinent group info | ||
aggregate : bool, default False | ||
Whether the result should be aggregated to match the number of | ||
groups | ||
cython_dtype : default None | ||
Type of the array that will be modified by the Cython call. If | ||
`None`, the type will be inferred from the values of each slice | ||
needs_values : bool, default False | ||
Whether the values should be a part of the Cython call | ||
signature | ||
needs_mask : bool, default False | ||
Whether boolean mask needs to be part of the Cython call signature | ||
Whether boolean mask needs to be part of the Cython call | ||
signature | ||
needs_ngroups : bool, default False | ||
Whether number of groups part of the Cython call signature | ||
Whether number of groups is part of the Cython call signature | ||
result_is_index : bool, default False | ||
Whether the result of the Cython operation is an index of | ||
values to be retrieved, instead of the actual values themselves | ||
pre_processing : function, default None | ||
Function to be applied to `values` prior to passing to Cython | ||
Raises if `needs_values` is False | ||
post_processing : function, default None | ||
Function to be applied to result of Cython function | ||
**kwargs : dict | ||
Extra arguments to be passed back to Cython funcs | ||
|
||
Returns | ||
------- | ||
`Series` or `DataFrame` with filled values | ||
""" | ||
if result_is_index and aggregate: | ||
raise ValueError("'result_is_index' and 'aggregate' cannot both " | ||
"be True!") | ||
if post_processing: | ||
if not callable(pre_processing): | ||
raise ValueError("'post_processing' must be a callable!") | ||
if pre_processing: | ||
if not callable(pre_processing): | ||
raise ValueError("'pre_processing' must be a callable!") | ||
if not needs_values: | ||
raise ValueError("Cannot use 'pre_processing' without " | ||
"specifying 'needs_values'!") | ||
|
||
labels, _, ngroups = grouper.group_info | ||
output = collections.OrderedDict() | ||
base_func = getattr(libgroupby, how) | ||
|
||
for name, obj in self._iterate_slices(): | ||
indexer = np.zeros_like(labels, dtype=np.int64) | ||
func = partial(base_func, indexer, labels) | ||
if aggregate: | ||
result_sz = ngroups | ||
else: | ||
result_sz = len(obj.values) | ||
|
||
if not cython_dtype: | ||
cython_dtype = obj.values.dtype | ||
|
||
result = np.zeros(result_sz, dtype=cython_dtype) | ||
func = partial(base_func, result, labels) | ||
if needs_values: | ||
vals = obj.values | ||
if pre_processing: | ||
vals = pre_processing(vals) | ||
func = partial(func, vals) | ||
|
||
if needs_mask: | ||
mask = isnull(obj.values).view(np.uint8) | ||
func = partial(func, mask) | ||
|
@@ -1908,9 +2005,19 @@ def _get_cythonized_result(self, how, grouper, needs_mask=False, | |
func = partial(func, ngroups) | ||
|
||
func(**kwargs) # Call func to modify indexer values in place | ||
output[name] = algorithms.take_nd(obj.values, indexer) | ||
|
||
return self._wrap_transformed_output(output) | ||
if result_is_index: | ||
result = algorithms.take_nd(obj.values, result) | ||
|
||
if post_processing: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is all ok, assuming we are going to move this out of here to a separate module/class soonish. |
||
result = post_processing(result) | ||
|
||
output[name] = result | ||
|
||
if aggregate: | ||
return self._wrap_aggregated_output(output) | ||
else: | ||
return self._wrap_transformed_output(output) | ||
|
||
@Substitution(name='groupby') | ||
@Appender(_doc_template) | ||
|
@@ -1930,7 +2037,9 @@ def shift(self, periods=1, freq=None, axis=0): | |
return self.apply(lambda x: x.shift(periods, freq, axis)) | ||
|
||
return self._get_cythonized_result('group_shift_indexer', | ||
self.grouper, needs_ngroups=True, | ||
self.grouper, cython_dtype=np.int64, | ||
needs_ngroups=True, | ||
result_is_index=True, | ||
periods=periods) | ||
|
||
@Substitution(name='groupby') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't end up removing any other tests after adding this here so you could argue there's some bloat with this. That said, my plan was to take care of #19733 first and then come back to clean up where possible. The reason for that is some of the existing object tests are wrapped up with
datetime
tests as well (seeclass FirstLast
) so it would add some complexity to try and unwind that before completing the change mentionedThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats fine