Skip to content

ENH: use correct dtype in groupby cython ops when it is known (without try/except) #38291

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Dec 8, 2020

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Dec 4, 2020

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good

Is there somewhere a list of all possible values for how in this _cython_operation?
(just wondering if all ops with possible dtype changes are covered)

@@ -357,12 +357,15 @@ def maybe_cast_result_dtype(dtype: DtypeObj, how: str) -> DtypeObj:
The desired dtype of the result.
"""
from pandas.core.arrays.boolean import BooleanDtype
from pandas.core.arrays.floating import Float64Dtype
from pandas.core.arrays.integer import Int64Dtype

if how in ["add", "cumsum", "sum"] and (dtype == np.dtype(bool)):
return np.dtype(np.int64)
elif how in ["add", "cumsum", "sum"] and isinstance(dtype, BooleanDtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
elif how in ["add", "cumsum", "sum"] and isinstance(dtype, BooleanDtype):
elif how in ["add", "cumsum", "sum"] and isinstance(dtype, (BooleanDtype, IntegerDtype)):

A bit more off-topic here, but: for int dtypes with lower precision, we actually want int64 for those (eg sum of int8 gives int64)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to the extent that we can separate fixing of these from the avoiding the try/except goal, I'd like to do that

@jorisvandenbossche jorisvandenbossche added this to the 1.2 milestone Dec 4, 2020
@jbrockmendel
Copy link
Member Author

Is there somewhere a list of all possible values for how in this _cython_operation?

groupby.ops L349

    _cython_functions = {
        "aggregate": {
            "add": "group_add",
            "prod": "group_prod",
            "min": "group_min",
            "max": "group_max",
            "mean": "group_mean",
            "median": "group_median",
            "var": "group_var",
            "first": "group_nth",
            "last": "group_last",
            "ohlc": "group_ohlc",
        },
        "transform": {
            "cumprod": "group_cumprod",
            "cumsum": "group_cumsum",
            "cummin": "group_cummin",
            "cummax": "group_cummax",
            "rank": "group_rank",
        },
    }

@jreback jreback added Groupby Refactor Internal refactoring of code labels Dec 4, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obviously can move things later. Do we have sufficient testing for the changes?

return cls._from_sequence(res_values)
return res_values

elif is_float_dtype(values.dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so would really like to move this entire wrapping to a method on EA / generic casting. We do this in multiple places (e.g. also on _reduce operatiosn), and this is likely leading to missing functionaility in various places.

@jbrockmendel
Copy link
Member Author

Do we have sufficient testing for the changes?

Yes.

so would really like to move this entire wrapping to a method on EA / generic casting. We do this in multiple places (e.g. also on _reduce operatiosn), and this is likely leading to missing functionaility in various places.

Agreed. While it may not look it, this is a big improvement over what we had a week ago, where casting was being done in like 15 different places, and not at all obvious what we were special-casing. Now it is down to two and very explicit what we are special-casing (i.e. what we need to improve)

@jreback
Copy link
Contributor

jreback commented Dec 4, 2020

Do we have sufficient testing for the changes?

Yes.

so would really like to move this entire wrapping to a method on EA / generic casting. We do this in multiple places (e.g. also on _reduce operatiosn), and this is likely leading to missing functionaility in various places.

Agreed. While it may not look it, this is a big improvement over what we had a week ago, where casting was being done in like 15 different places, and not at all obvious what we were special-casing. Now it is down to two and very explicit what we are special-casing (i.e. what we need to improve)

ok great. i am fine with merging this tehn.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 5, 2020

Thanks for the pointer to the list!

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 5, 2020

Do we have sufficient testing for the changes?

We clearly didn't have sufficient testing .. as this apparently broke some cases that worked before (now, not necessarily the fault of this PR! but so good that I added some tests)

So some of the tests are still failing, because before we were using maybe_cast_result for integer dtypes, but now you changed that to use a straight _from_sequence, which fails in some of the cases.
Another part of the failing tests is simply because of not yet assigning the correct dtype in maybe_cast_result_dtype, related to my inline comments above.

@jorisvandenbossche jorisvandenbossche changed the title REF: groupby op casting without try/except ENH: use correct dtype in groupby cython ops when it is known (without try/except) Dec 5, 2020
elif how in ["add", "cumsum", "sum"] and isinstance(dtype, BooleanDtype):
return Int64Dtype()
from pandas.core.arrays.floating import Float64Dtype
from pandas.core.arrays.integer import Int64Dtype, _IntegerDtype
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the linter is going to complain about _IntegerDtype. we can either find a non-private thing to import or add it to the whitelist in scripts._validate_unwanted_patterns

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the linter is going to complain about _IntegerDtype

Apparantly it's not complaining at the moment, but indeed something we can de-privatize internally

@jreback jreback removed this from the 1.2 milestone Dec 7, 2020
dtype = maybe_cast_result_dtype(orig_values.dtype, how)
if is_extension_array_dtype(dtype):
cls = dtype.construct_array_type()
return cls._from_sequence(res_values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to pass the dtype here as well (to ensure lower precision gets preserved)

@jorisvandenbossche jorisvandenbossche added this to the 1.2 milestone Dec 8, 2020
@jorisvandenbossche
Copy link
Member

@jreback I added this to the 1.2 milestone on purpose, as it's partly fixing a regression on master (which might not have been directly clear from the PR)

("sum", "large_int"),
# ("std", "always_float"),
("var", "always_float"),
# ("sem", "always_float"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the reason those are commented out, it because of std (and thus sem) is taking a different code path (not fully sure why, though, it is taking a different path compard to eg var)

And count is not actually a cython function, so also not yet covered by this PR (so maybe we should move those tests out of the test_cython.py file)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the tests are passing now, let's get this in for 1.2.
We can improve on it in further PRs.

@jorisvandenbossche jorisvandenbossche merged commit 5bfa653 into pandas-dev:master Dec 8, 2020
@jreback
Copy link
Contributor

jreback commented Dec 8, 2020

@jorisvandenbossche the entire point of an rc is that we do not need to keep waiting for the release

thus this was not needed to do in such a hurry for the rc

let's pls try to just release on time and less about trying to get every last PR in

@jbrockmendel jbrockmendel deleted the gb-cast branch December 8, 2020 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants