REF: avoid try/except in wrapping in cython_agg_blocks #38164
```diff
@@ -60,7 +60,7 @@
     validate_func_kwargs,
 )
 import pandas.core.algorithms as algorithms
-from pandas.core.arrays import ExtensionArray
+from pandas.core.arrays import Categorical, ExtensionArray
 from pandas.core.base import DataError, SpecificationError
 import pandas.core.common as com
 from pandas.core.construction import create_series_with_explicit_dtype
```
```diff
@@ -1028,38 +1028,64 @@ def _cython_agg_blocks(
         if numeric_only:
             data = data.get_numeric_data(copy=False)

         no_result = object()

         def cast_agg_result(result, values: ArrayLike, how: str) -> ArrayLike:
             # see if we can cast the values to the desired dtype
             # this may not be the original dtype
             assert not isinstance(result, DataFrame)
             assert result is not no_result

             dtype = maybe_cast_result_dtype(values.dtype, how)
             result = maybe_downcast_numeric(result, dtype)

-            if isinstance(values, ExtensionArray) and isinstance(result, np.ndarray):
-                # e.g. values was an IntegerArray
-                # (1, N) case can occur if values was Categorical
-                # and result is ndarray[object]
-                # TODO(EA2D): special casing not needed with 2D EAs
-                assert result.ndim == 1 or result.shape[0] == 1
-                try:
-                    # Cast back if feasible
-                    result = type(values)._from_sequence(
-                        result.ravel(), dtype=values.dtype
-                    )
-                except (ValueError, TypeError):
-                    # reshape to be valid for non-Extension Block
-                    result = result.reshape(1, -1)
+            if isinstance(values, Categorical) and isinstance(result, np.ndarray):
+                # If the Categorical op didn't raise, it is dtype-preserving
+                result = type(values)._from_sequence(result.ravel(), dtype=values.dtype)
+                # Note this will have result.dtype == dtype from above

             elif isinstance(result, np.ndarray) and result.ndim == 1:
                 # We went through a SeriesGroupByPath and need to reshape
                 # GH#32223 includes case with IntegerArray values
                 result = result.reshape(1, -1)
+                # test_groupby_duplicate_columns gets here with
+                # result.dtype == int64, values.dtype=object, how="min"

             return result

+        def py_fallback(bvalues: ArrayLike) -> ArrayLike:
```
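The new branch relies on `_from_sequence` round-tripping an ndarray result back into the original extension dtype. A minimal sketch of that round trip for a Categorical, using only public-facing pandas objects (the `result` array here is a hypothetical per-group "min" output, not produced by the actual cython op):

```python
import numpy as np
from pandas import Categorical

values = Categorical(["a", "b", "a", "c"], categories=["a", "b", "c"], ordered=True)

# A reduction like "min" on an ordered Categorical comes back as an
# object ndarray; if the op didn't raise, it is dtype-preserving, so
# casting back via _from_sequence is safe.
result = np.array(["a", "b"], dtype=object)  # hypothetical per-group minima

restored = type(values)._from_sequence(result.ravel(), dtype=values.dtype)
print(restored.dtype)  # the same CategoricalDtype as `values`
```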
**Reviewer:** Just a general note: inline functions are really hard to follow (of course inline, really long if/else blocks are even harder). Request would be to move this code to module-level functions (later is totally fine).

**Author:** Yah, there are a few follow-ups in the works.

**Reviewer:** Yah, let's do it in a follow-up rather than clog the CI.
```diff
+            # if self.grouper.aggregate fails, we fall back to a pure-python
+            # solution
+
+            # We get here with a) EADtypes and b) object dtype
+            obj: FrameOrSeriesUnion
+
+            # call our grouper again with only this block
+            if isinstance(bvalues, ExtensionArray):
+                # TODO(EA2D): special case not needed with 2D EAs
+                obj = Series(bvalues)
+            else:
+                obj = DataFrame(bvalues.T)
+                if obj.shape[1] == 1:
+                    # Avoid call to self.values that can occur in DataFrame
+                    # reductions; see GH#28949
+                    obj = obj.iloc[:, 0]
+
+            # Create SeriesGroupBy with observed=True so that it does
+            # not try to add missing categories if grouping over multiple
+            # Categoricals. This will done by later self._reindex_output()
+            # Doing it here creates an error. See GH#34951
+            sgb = get_groupby(obj, self.grouper, observed=True)
+            result = sgb.aggregate(lambda x: alt(x, axis=self.axis))
+
+            assert isinstance(result, (Series, DataFrame))  # for mypy
+            # In the case of object dtype block, it may have been split
+            # in the operation.  We un-split here.
+            result = result._consolidate()
+            assert isinstance(result, (Series, DataFrame))  # for mypy
+            assert len(result._mgr.blocks) == 1
+
+            # unwrap DataFrame to get array
+            result = result._mgr.blocks[0].values
+            return result

         def blk_func(bvalues: ArrayLike) -> ArrayLike:

             try:
```
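The `observed=True` comment above (GH#34951) is observable from the public `groupby` API: with a Categorical grouping key, `observed=False` reindexes the result to include unobserved categories, while `observed=True` keeps only the groups actually present. A small sketch of the difference:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
        "val": [1, 2, 3],
    }
)

# observed=False adds a row for the unobserved category "c";
# observed=True keeps only the groups that actually occur.
full = df.groupby("key", observed=False)["val"].sum()
seen = df.groupby("key", observed=True)["val"].sum()

print(len(full))  # 3 (includes "c")
print(len(seen))  # 2
```

Doing the reindexing once, in `self._reindex_output()`, avoids the error that arises when the intermediate SeriesGroupBy call tries to add missing categories itself.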
```diff
@@ -1077,35 +1103,7 @@ def blk_func(bvalues: ArrayLike) -> ArrayLike:
                 assert how == "ohlc"
                 raise

-            # We get here with a) EADtypes and b) object dtype
-            obj: FrameOrSeriesUnion
-            # call our grouper again with only this block
-            if isinstance(bvalues, ExtensionArray):
-                # TODO(EA2D): special case not needed with 2D EAs
-                obj = Series(bvalues)
-            else:
-                obj = DataFrame(bvalues.T)
-                if obj.shape[1] == 1:
-                    # Avoid call to self.values that can occur in DataFrame
-                    # reductions; see GH#28949
-                    obj = obj.iloc[:, 0]
-
-            # Create SeriesGroupBy with observed=True so that it does
-            # not try to add missing categories if grouping over multiple
-            # Categoricals. This will done by later self._reindex_output()
-            # Doing it here creates an error. See GH#34951
-            sgb = get_groupby(obj, self.grouper, observed=True)
-            result = sgb.aggregate(lambda x: alt(x, axis=self.axis))
-
-            assert isinstance(result, (Series, DataFrame))  # for mypy
-            # In the case of object dtype block, it may have been split
-            # in the operation.  We un-split here.
-            result = result._consolidate()
-            assert isinstance(result, (Series, DataFrame))  # for mypy
-            assert len(result._mgr.blocks) == 1
-
-            # unwrap DataFrame to get array
-            result = result._mgr.blocks[0].values
+            result = py_fallback(bvalues)

         return cast_agg_result(result, bvalues, how)
```
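The control flow this refactor isolates (try the cython path, fall back to a pure-python path when the fast path cannot handle the dtype) can be sketched generically. All names below are illustrative stand-ins, not the pandas internals:

```python
import numpy as np


def fast_agg(values: np.ndarray) -> np.ndarray:
    # stand-in for the cython aggregation: only supports numeric dtypes
    if values.dtype == object:
        raise NotImplementedError("object dtype not supported in fast path")
    return values.sum(axis=0, keepdims=True)


def py_fallback(values: np.ndarray) -> np.ndarray:
    # stand-in for the pure-python SeriesGroupBy path
    return np.array([[sum(values[:, j]) for j in range(values.shape[1])]])


def blk_func(values: np.ndarray) -> np.ndarray:
    # mirrors the refactored structure: the fallback lives in its own
    # named function instead of a long inline except-block
    try:
        result = fast_agg(values)
    except NotImplementedError:
        result = py_fallback(values)
    return result


print(blk_func(np.array([[1, 2], [3, 4]])))                # fast path
print(blk_func(np.array([[1, 2], [3, 4]], dtype=object)))  # fallback path
```

Keeping the fallback in `py_fallback` means the `try/except` in `blk_func` contains no wrapping logic, which is the point of the PR title.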
**Reviewer:** Why is this no longer needed?

**Author:** Because it's only relevant for Categorical, and never raises.
**Reviewer:** So the comment that values could be IntegerArray wasn't correct anymore? And why is `cast_agg_result` no longer called with other extension arrays?

**Author:** IntegerArray is now handled correctly within `_cython_operation`.
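That claim is visible from the public API: a grouped reduction on a nullable integer column keeps its extension dtype, with no object-dtype round trip. A quick check (behavior as in current pandas releases):

```python
import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "a", "b"], "val": pd.array([1, 2, None], dtype="Int64")}
)

# The cython path handles the IntegerArray's values and mask directly,
# so the result comes back as Int64 rather than float64 or object.
out = df.groupby("key")["val"].sum()
print(out.dtype)  # Int64
```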
**Reviewer:** OK, I see, thanks for the explanation. But that means that for a generic EA (not implemented by ourselves), we won't try to cast back? (Not sure if there are good use cases for that, but still.)

**Author:** I'm not clear what this is referring to. We fall back to the SeriesGroupBy implementation, so there shouldn't be a regression. Many of these ops are ordinal-only, so using something like `values_for_argsort` is likely to be better than casting to object.
**Reviewer:** I was referring to your "i am trying to get rid of that altogether".

**Author:** Thanks. For the functions that go through `_cython_operation`, there are only a handful of them, and we can identify which are dtype-preserving and which are not. Some custom logic is in `dtypes.cast.maybe_cast_result_dtype`, and the rest at this point is in `_ea_wrap_cython_operation`. Basically, anything that is "custom logic" represents a way we are treating pandas-internal EAs differently from 3rd-party ones, which is something we want to avoid. Identifying and isolating that custom logic will make it easier to de-custom. The only other place where the "try to cast back" that I am looking to get rid of altogether is used is in `_aggregate_series_pure_python`, and #38254 removes that (buggy) usage.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we want no custom handing of our own EAs at all?
For example what you wrote now in
_ea_wrap_cython_operation
, are this things you want to further remove?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, in the sense that we also want world peace and unlimited ice cream, yes. For the foreseeable future we're going to have some non-zero amount of custom handling.
In the short run I'm adding handling for FloatingArray and improving the handling for IntegerArray, so not removing at all.