-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: use correct dtype in groupby cython ops when it is known (without try/except) #38291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good
Is there somewhere a list of all possible values for how
in this _cython_operation
?
(just wondering if all ops with possible dtype changes are covered)
pandas/core/dtypes/cast.py
Outdated
@@ -357,12 +357,15 @@ def maybe_cast_result_dtype(dtype: DtypeObj, how: str) -> DtypeObj: | |||
The desired dtype of the result. | |||
""" | |||
from pandas.core.arrays.boolean import BooleanDtype | |||
from pandas.core.arrays.floating import Float64Dtype | |||
from pandas.core.arrays.integer import Int64Dtype | |||
|
|||
if how in ["add", "cumsum", "sum"] and (dtype == np.dtype(bool)): | |||
return np.dtype(np.int64) | |||
elif how in ["add", "cumsum", "sum"] and isinstance(dtype, BooleanDtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif how in ["add", "cumsum", "sum"] and isinstance(dtype, BooleanDtype): | |
elif how in ["add", "cumsum", "sum"] and isinstance(dtype, (BooleanDtype, IntegerDtype)): |
A bit more off-topic here, but: for int dtypes with lower precision, we actually want int64 for those (eg sum of int8 gives int64)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to the extent that we can separate fixing of these from the avoiding the try/except goal, I'd like to do that
groupby.ops L349
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obviously can move things later. Do we have sufficient testing for the changes?
return cls._from_sequence(res_values) | ||
return res_values | ||
|
||
elif is_float_dtype(values.dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so would really like to move this entire wrapping to a method on EA / generic casting. We do this in multiple places (e.g. also on _reduce operatiosn), and this is likely leading to missing functionaility in various places.
Yes.
Agreed. While it may not look it, this is a big improvement over what we had a week ago, where casting was being done in like 15 different places, and not at all obvious what we were special-casing. Now it is down to two and very explicit what we are special-casing (i.e. what we need to improve) |
ok great. i am fine with merging this tehn. |
Thanks for the pointer to the list! |
We clearly didn't have sufficient testing .. as this apparently broke some cases that worked before (now, not necessarily the fault of this PR! but so good that I added some tests) So some of the tests are still failing, because before we were using |
elif how in ["add", "cumsum", "sum"] and isinstance(dtype, BooleanDtype): | ||
return Int64Dtype() | ||
from pandas.core.arrays.floating import Float64Dtype | ||
from pandas.core.arrays.integer import Int64Dtype, _IntegerDtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the linter is going to complain about _IntegerDtype. we can either find a non-private thing to import or add it to the whitelist in scripts._validate_unwanted_patterns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the linter is going to complain about _IntegerDtype
Apparantly it's not complaining at the moment, but indeed something we can de-privatize internally
pandas/core/groupby/ops.py
Outdated
dtype = maybe_cast_result_dtype(orig_values.dtype, how) | ||
if is_extension_array_dtype(dtype): | ||
cls = dtype.construct_array_type() | ||
return cls._from_sequence(res_values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to pass the dtype here as well (to ensure lower precision gets preserved)
@jreback I added this to the 1.2 milestone on purpose, as it's partly fixing a regression on master (which might not have been directly clear from the PR) |
("sum", "large_int"), | ||
# ("std", "always_float"), | ||
("var", "always_float"), | ||
# ("sem", "always_float"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, the reason those are commented out, it because of std
(and thus sem
) is taking a different code path (not fully sure why, though, it is taking a different path compard to eg var
)
And count
is not actually a cython function, so also not yet covered by this PR (so maybe we should move those tests out of the test_cython.py
file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the tests are passing now, let's get this in for 1.2.
We can improve on it in further PRs.
@jorisvandenbossche the entire point of an rc is that we do not need to keep waiting for the release thus this was not needed to do in such a hurry for the rc let's pls try to just release on time and less about trying to get every last PR in |
Addresses part of #37494. And xref https://github.com/pandas-dev/pandas/pull/38162/files#r536191377
cc @jorisvandenbossche