ENH: use correct dtype in groupby cython ops when it is known (without try/except) #38291

jbrockmendel · 2020-12-04T17:09:05Z

Addresses part of #37494. And xref https://github.com/pandas-dev/pandas/pull/38162/files#r536191377

jorisvandenbossche

Thanks, looks good

Is there somewhere a list of all possible values for how in this _cython_operation?
(just wondering if all ops with possible dtype changes are covered)

jorisvandenbossche · 2020-12-04T19:04:17Z

pandas/core/dtypes/cast.py

@@ -357,12 +357,15 @@ def maybe_cast_result_dtype(dtype: DtypeObj, how: str) -> DtypeObj:
        The desired dtype of the result.
    """
    from pandas.core.arrays.boolean import BooleanDtype
+    from pandas.core.arrays.floating import Float64Dtype
    from pandas.core.arrays.integer import Int64Dtype

    if how in ["add", "cumsum", "sum"] and (dtype == np.dtype(bool)):
        return np.dtype(np.int64)
    elif how in ["add", "cumsum", "sum"] and isinstance(dtype, BooleanDtype):


Suggested change

-    elif how in ["add", "cumsum", "sum"] and isinstance(dtype, BooleanDtype):
+    elif how in ["add", "cumsum", "sum"] and isinstance(dtype, (BooleanDtype, IntegerDtype)):

A bit more off-topic here, but: for int dtypes with lower precision, we actually want int64 for those (eg sum of int8 gives int64)

To the extent that we can separate fixing these from the goal of avoiding the try/except, I'd like to do that.
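
The rule being discussed can be modelled without pandas. This is a minimal sketch, assuming dtypes are represented by their string names. `sketch_result_dtype` and the two name sets are hypothetical and are not the actual `maybe_cast_result_dtype` implementation. It covers only the summing operations and signed nullable integers mentioned in this thread.

```python
# Hypothetical model of the dtype rule; dtypes are plain strings, not pandas objects.
SUM_LIKE = {"add", "cumsum", "sum"}
NULLABLE_INTS = {"Int8", "Int16", "Int32", "Int64"}  # signed nullable ints only


def sketch_result_dtype(dtype: str, how: str) -> str:
    """Return the result dtype name when it is known up front."""
    if how in SUM_LIKE:
        if dtype == "bool":
            # A numpy bool column sums to numpy int64.
            return "int64"
        if dtype == "boolean" or dtype in NULLABLE_INTS:
            # Nullable bool and lower-precision nullable ints widen to Int64.
            return "Int64"
    # No known change for this (dtype, how) pair: keep the input dtype.
    return dtype
```

With this model, `sketch_result_dtype("Int8", "sum")` gives `"Int64"`, while `sketch_result_dtype("Int8", "min")` leaves the dtype as `"Int8"`.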

pandas/core/dtypes/cast.py

jbrockmendel · 2020-12-04T20:56:39Z

> Is there somewhere a list of all possible values for how in this _cython_operation?

groupby.ops L349

    _cython_functions = {
        "aggregate": {
            "add": "group_add",
            "prod": "group_prod",
            "min": "group_min",
            "max": "group_max",
            "mean": "group_mean",
            "median": "group_median",
            "var": "group_var",
            "first": "group_nth",
            "last": "group_last",
            "ohlc": "group_ohlc",
        },
        "transform": {
            "cumprod": "group_cumprod",
            "cumsum": "group_cumsum",
            "cummin": "group_cummin",
            "cummax": "group_cummax",
            "rank": "group_rank",
        },
    }
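
One way to read the full list of `how` values off that table is to flatten it. This is a small sketch, assuming the dict is copied verbatim from the comment above. `valid_hows` and `kernel_name` are hypothetical helpers written for illustration, not pandas API.

```python
# Table copied from the comment above.
_cython_functions = {
    "aggregate": {
        "add": "group_add",
        "prod": "group_prod",
        "min": "group_min",
        "max": "group_max",
        "mean": "group_mean",
        "median": "group_median",
        "var": "group_var",
        "first": "group_nth",
        "last": "group_last",
        "ohlc": "group_ohlc",
    },
    "transform": {
        "cumprod": "group_cumprod",
        "cumsum": "group_cumsum",
        "cummin": "group_cummin",
        "cummax": "group_cummax",
        "rank": "group_rank",
    },
}


def valid_hows(kind=None):
    """Return the sorted `how` keys, optionally for one kind only."""
    kinds = [kind] if kind is not None else list(_cython_functions)
    return sorted(how for k in kinds for how in _cython_functions[k])


def kernel_name(how):
    """Return (kind, kernel name) for a `how` key; raise KeyError if unknown."""
    for kind, table in _cython_functions.items():
        if how in table:
            return kind, table[how]
    raise KeyError(f"no cython kernel registered for how={how!r}")
```

For example, `kernel_name("add")` returns `("aggregate", "group_add")`. Note that `sum`, `std`, `sem` and `count` are not keys in this table, so `kernel_name("sum")` raises `KeyError`.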

jreback

obviously can move things later. Do we have sufficient testing for the changes?

jreback · 2020-12-04T21:40:17Z

pandas/core/groupby/ops.py

+                return cls._from_sequence(res_values)
+            return res_values
+
+        elif is_float_dtype(values.dtype):


so would really like to move this entire wrapping to a method on EA / generic casting. We do this in multiple places (e.g. also on _reduce operations), and this is likely leading to missing functionality in various places.

jbrockmendel · 2020-12-04T22:31:57Z

> Do we have sufficient testing for the changes?

Yes.

> so would really like to move this entire wrapping to a method on EA / generic casting. We do this in multiple places (e.g. also on _reduce operations), and this is likely leading to missing functionality in various places.

Agreed. While it may not look like it, this is a big improvement over what we had a week ago, where casting was being done in like 15 different places and it was not at all obvious what we were special-casing. Now it is down to two places, and it is very explicit what we are special-casing (i.e. what we need to improve).

jreback · 2020-12-04T23:25:26Z

ok great. i am fine with merging this then.

jorisvandenbossche · 2020-12-05T15:32:09Z

Thanks for the pointer to the list!

jorisvandenbossche · 2020-12-05T16:36:19Z

> Do we have sufficient testing for the changes?

We clearly didn't have sufficient testing, as this apparently broke some cases that worked before (not necessarily the fault of this PR, but it is good that I added some tests).

So some of the tests are still failing, because before we were using maybe_cast_result for integer dtypes, but now you changed that to use a straight _from_sequence, which fails in some of the cases.
Another part of the failing tests is simply because of not yet assigning the correct dtype in maybe_cast_result_dtype, related to my inline comments above.

jbrockmendel · 2020-12-05T20:48:42Z

pandas/core/dtypes/cast.py

-    elif how in ["add", "cumsum", "sum"] and isinstance(dtype, BooleanDtype):
-        return Int64Dtype()
+    from pandas.core.arrays.floating import Float64Dtype
+    from pandas.core.arrays.integer import Int64Dtype, _IntegerDtype


i think the linter is going to complain about _IntegerDtype. we can either find a non-private thing to import or add it to the whitelist in scripts._validate_unwanted_patterns

> i think the linter is going to complain about _IntegerDtype

Apparently it's not complaining at the moment, but it is indeed something we can de-privatize internally.

jorisvandenbossche · 2020-12-07T20:28:07Z

pandas/core/groupby/ops.py

+            dtype = maybe_cast_result_dtype(orig_values.dtype, how)
+            if is_extension_array_dtype(dtype):
+                cls = dtype.construct_array_type()
+                return cls._from_sequence(res_values)


I think you need to pass the dtype here as well (to ensure lower precision gets preserved)
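
The effect of omitting the dtype can be seen with the public constructor `pd.array`, used here as a stand-in for the private `_from_sequence` (an assumption made for illustration, not the code in this PR). Without an explicit dtype, integer input is inferred as `Int64`. With one, the lower precision is kept.

```python
import pandas as pd

# Stand-in for the values returned by a cython groupby kernel.
res_values = [1, 2, 3]

# No dtype given: pandas infers the default nullable integer dtype.
inferred = pd.array(res_values)

# dtype passed explicitly: the lower-precision nullable dtype is retained.
retained = pd.array(res_values, dtype="Int8")

print(inferred.dtype, retained.dtype)
```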

jorisvandenbossche · 2020-12-08T08:26:26Z

@jreback I added this to the 1.2 milestone on purpose, as it's partly fixing a regression on master (which might not have been directly clear from the PR)

jorisvandenbossche · 2020-12-08T10:19:27Z

pandas/tests/groupby/aggregate/test_cython.py

+        ("sum", "large_int"),
+        # ("std", "always_float"),
+        ("var", "always_float"),
+        # ("sem", "always_float"),


BTW, the reason those are commented out is that std (and thus sem) takes a different code path (not fully sure why it takes a different path compared to e.g. var, though).

And count is not actually a cython function, so also not yet covered by this PR (so maybe we should move those tests out of the test_cython.py file)

jorisvandenbossche

Since the tests are passing now, let's get this in for 1.2.
We can improve on it in further PRs.

jreback · 2020-12-08T11:53:49Z

@jorisvandenbossche the entire point of an rc is that we do not need to keep waiting for the release

thus there was no need to do this in such a hurry for the rc

let's pls try to just release on time and worry less about trying to get every last PR in

jbrockmendel added 2 commits December 4, 2020 08:01

REF: groupby op casting without try/except

279c4d1

Float64

e6dc529

jorisvandenbossche reviewed Dec 4, 2020

jorisvandenbossche added this to the 1.2 milestone Dec 4, 2020

jreback added the Groupby and Refactor (Internal refactoring of code) labels Dec 4, 2020

jreback requested changes Dec 4, 2020

jreback approved these changes Dec 4, 2020

jorisvandenbossche added 2 commits December 5, 2020 17:27

add tests for expected dtype of cython agg ops with nullable dtypes

ea79027

fix casting to float numpy array for FloatingArray

97fcd22

jorisvandenbossche changed the title ~~REF: groupby op casting without try/except~~ ENH: use correct dtype in groupby cython ops when it is known (without try/except) Dec 5, 2020

jorisvandenbossche added 2 commits December 5, 2020 17:47

fix tests

b04d91f

update rules of known result dtypes

202bee8

jbrockmendel commented Dec 5, 2020

jreback removed this from the 1.2 milestone Dec 7, 2020

jorisvandenbossche reviewed Dec 7, 2020

jbrockmendel added 2 commits December 7, 2020 18:33

Merge branch 'master' of https://github.com/pandas-dev/pandas into gb…

e888b3e

…-cast

retain dtype

2566ec4

jorisvandenbossche added this to the 1.2 milestone Dec 8, 2020

jorisvandenbossche reviewed Dec 8, 2020

jorisvandenbossche approved these changes Dec 8, 2020

jorisvandenbossche merged commit 5bfa653 into pandas-dev:master Dec 8, 2020

jbrockmendel deleted the gb-cast branch December 8, 2020 16:07
