BUG: groupby().agg fails on categorical column #31470

charlesdong1991 · 2020-01-30T19:20:30Z

closes REGR: groupby().agg fails on categorical column in pandas 1.0.0 #31450
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pep8speaks · 2020-01-30T19:20:34Z

Hello @charlesdong1991! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-02-04 08:07:08 UTC

jbrockmendel · 2020-01-30T19:29:02Z

is this limited to "first"? i think the conclusion from previous thread is that we need to be more systematic about when we call _try_cast

charlesdong1991 · 2020-01-30T19:32:44Z

@jbrockmendel yeah, you're right. I was about to change something in or before _try_cast, but I hadn't seen the follow-up discussion yet. Because of the isinstance(result[0], dtype.type) check that was added to _try_cast, the result of first is no longer cast back, although it should be.
So I added this as an experiment to see whether it fixes the problem, and will prepare a more generic solution if it does.

WillAyd

Hmm not a fan of the approach here - I think we should try to avoid special-casing / adding more groupings to base.

This works for DataFrame selection but not Series, right? If so, I think we should dive deeper into why the former naturally supports this but the latter doesn't, to find a resolution.

TomAugspurger · 2020-01-31T12:10:40Z

Haven't had a chance to look closely yet, but I think my preferred approach is for the calling function to decide what the output dtype should be based on the input dtype. Not sure how best to put that to code either.
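
One way to read the suggestion above, as a minimal sketch rather than pandas code: the caller looks up the result dtype from the aggregation name and the input dtype alone, never from the computed values. The two tables and the helper name `result_dtype` are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: the result dtype is chosen from (aggregation, input dtype),
# never from the computed values. These names are not pandas internals.

# Aggregations that return one of the input values, so the input dtype is kept.
KEEPS_INPUT_DTYPE = frozenset(["first", "last", "min", "max"])

# Aggregations whose result is generally fractional for numeric input.
ALWAYS_FLOAT = frozenset(["mean", "median", "var"])


def result_dtype(how: str, input_dtype: str) -> str:
    """Return the dtype name the caller would request for this aggregation."""
    if how in KEEPS_INPUT_DTYPE:
        return input_dtype
    if how in ALWAYS_FLOAT:
        return "float64"
    raise ValueError(f"no dtype rule registered for {how!r}")


print(result_dtype("first", "category"))  # category
print(result_dtype("mean", "int64"))      # float64
```

With a rule like this, the categorical `first` case and the integer `mean` case are both decided before any values are computed.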

charlesdong1991 · 2020-01-31T13:37:40Z

@WillAyd @TomAugspurger I agree, and I am not a big fan of the current approach either, which is why I have [WIP] in the title. I tried to change when _try_cast is called depending on the situation and type, but that did not work and caused a lot of errors locally, so I am still trying to find a better, more robust fix.

Sorry about the noise in this PR; I hope to update it soon!

TomAugspurger · 2020-01-31T14:13:19Z

Thanks for working on this. It's good that we're exploring possible solutions. We should take a bit of time to try and get this right.


charlesdong1991 · 2020-02-01T15:53:38Z

I am still somewhat confused about what the correct behaviour should be. Based on #31359 (comment),
it seems that for IntegerArray, mean should return a float result because IntegerArray._reduce("mean") returns a float. However, in the case below, agg.mean on a regular integer array currently returns int64 output instead of float. Is that correct? If so, could this be considered an inconsistency between extension arrays and regular arrays that would confuse users? @TomAugspurger @jreback @jbrockmendel @WillAyd @jorisvandenbossche

def test_groupby_extension_agg(self, as_index, data_for_grouping):
        df = pd.DataFrame({"A": [1, 1, 2, 2, 3, 3, 1, 4], "B": data_for_grouping})
        result = df.groupby("B", as_index=as_index).A.mean()
        _, index = pd.factorize(data_for_grouping, sort=True)
    
        index = pd.Index(index, name="B")
        expected = pd.Series([3, 1, 4], index=index, name="A")
        if as_index:
>           self.assert_series_equal(result, expected)
E           AssertionError: Attributes of Series are different
E           
E           Attribute "dtype" are different
E           [left]:  float64
E           [right]: int64

TomAugspurger · 2020-02-01T16:47:29Z

IMO the result dtype should only depend on the dtype of the inputs, not the values. IIUC the result can only be int dtype in this case because the mean of the values happens to be equal to an integer? If so, I’d say that’s a bug. The result dtype should always be a float.
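
To make the point above concrete, here is a small pure-Python sketch of a value-dependent cast. The helper `downcast_if_integral` is hypothetical and only mimics the kind of coercion being described; it is not the pandas implementation. The first input reuses the group means from the failing test.

```python
def downcast_if_integral(values):
    """Value-dependent rule (the behaviour called a bug above):
    return ints only when every float happens to be a whole number."""
    if all(float(v).is_integer() for v in values):
        return [int(v) for v in values]
    return [float(v) for v in values]


# The group means from the failing test happen to be whole numbers ...
whole = downcast_if_integral([3.0, 1.0, 4.0])
# ... but changing a single value flips the element type of the whole result.
fractional = downcast_if_integral([3.0, 1.5, 4.0])

print(type(whole[0]).__name__)       # int
print(type(fractional[0]).__name__)  # float
```

The same operation on the same input type yields two different result types, which is why a rule based only on the input dtype is easier to reason about.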

charlesdong1991 · 2020-02-01T17:25:54Z

@TomAugspurger thanks for your quick reply. Indeed, in these cases the output has int dtype because the mean happens to be equal to an integer.

jreback · 2020-02-01T17:31:14Z

the int return in mean is a bug it should be float
likely just happened to coerce it before

charlesdong1991

I think this should pass CI now. Before heading to bed, I would like to describe the change a bit. Feel free to take a look; reviews are very welcome:

now .agg("mean") returns float64, and the same goes for var, median, etc., which is also aligned with numpy behaviour. For the rest (those defined in base), the result should not be converted and keeps the original type.
python_agg will always return the same type as the original object. This does not change current behaviour, since those results are already coerced today. I think this is correct, because we might want python_agg and cython_agg to return different types for cython functions, and this design works especially well for user-defined functions passed to aggregate. As a result, you can see in the tests that .agg("mean") now returns float64, while .agg(lambda x: np.mean(x)) still returns int64, as on master.

I expect this to be controversial, but all opinions are welcome. Reaching a consensus on the correct behaviour is important; after that I will focus on cleaning up the code. Thanks all. I will follow up with code changes tomorrow, and hopefully get this done before 1.0.1 is released. @TomAugspurger @jreback @jbrockmendel @jorisvandenbossche @WillAyd

charlesdong1991 · 2020-02-03T22:42:38Z

pandas/core/groupby/base.py

+cython_cast_cat_type_list = frozenset(["first", "last"])
+cython_cast_keep_type_list = cython_cast_cat_type_list | frozenset(
+    ["min", "max", "add", "prod", "ohlc"]
+)
+


this specifies the cython functions that should preserve the dtype

charlesdong1991 · 2020-02-03T22:43:19Z

pandas/core/groupby/generic.py

+                    if how in base.cython_cast_keep_type_list:
+                        result = maybe_downcast_numeric(result, block.dtype)


this needs to be specified for the case as_index=False; otherwise the result is coerced to int in cases where it should not be

charlesdong1991 · 2020-02-03T22:44:23Z

pandas/core/groupby/groupby.py

@@ -792,7 +792,7 @@ def _cumcount_array(self, ascending: bool = True):
        rev[sorter] = np.arange(count, dtype=np.intp)
        return out[rev].astype(np.int64, copy=False)

-    def _try_cast(self, result, obj, numeric_only: bool = False):
+    def _try_cast(self, result, obj, numeric_only: bool = False, is_python=False):


sorry, this is really ugly; the reason is to distinguish python_agg from cython_agg, since they cast in different situations

will think a bit more

charlesdong1991 · 2020-02-03T22:45:59Z

pandas/core/groupby/groupby.py

+                if (
+                    isinstance(result[notna(result)][0], dtype.type)
+                    and is_python
+                    or not is_python
+                ):


this is also ugly. It does two things: for cython_agg, we cast whenever the conditions above are satisfied, but for python_agg, we only cast if the non-null result has the same type as the original object. I think this is the correct behaviour.
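
Because `and` binds more tightly than `or` in Python, the condition in the diff above parses as `(same_type and is_python) or (not is_python)`. The sketch below, with `same_type` standing in for the `isinstance(...)` check, verifies over all inputs that this equals the shorter `same_type or not is_python`. Both function names are illustrative and do not appear in the PR.

```python
from itertools import product


def condition_as_written(same_type: bool, is_python: bool) -> bool:
    # Mirrors the shape of the condition in the diff; `and` is evaluated before `or`.
    return same_type and is_python or not is_python


def condition_simplified(same_type: bool, is_python: bool) -> bool:
    # Cython path: always cast. Python path: cast only when the types match.
    return (not is_python) or same_type


# Exhaustively compare both forms over the four possible inputs.
for same_type, is_python in product([True, False], repeat=2):
    assert condition_as_written(same_type, is_python) == condition_simplified(
        same_type, is_python
    )
```

The simplified form states the intent more directly: the type check only gates the Python aggregation path.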

charlesdong1991 · 2020-02-03T22:46:50Z

pandas/core/groupby/groupby.py

+                    if self._cython_aggregate_should_cast(how):
+                        result_column = self._try_cast(result_column, obj)
+                    output[key] = result_column


for cython_agg, we should only cast if the function is one of the defined cython funcs; otherwise we should not call _try_cast at all
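
As a rough sketch of the gating described above: the two set names and their contents are copied from the diff to pandas/core/groupby/base.py, while the body of `should_cast` is an assumption about what `_cython_aggregate_should_cast` checks, since the PR excerpt does not show it.

```python
# Set names and members are taken from the diff in this thread.
cython_cast_cat_type_list = frozenset(["first", "last"])
cython_cast_keep_type_list = cython_cast_cat_type_list | frozenset(
    ["min", "max", "add", "prod", "ohlc"]
)


def should_cast(how: str) -> bool:
    """Assumed check: cast the Cython result back to the original dtype
    only for aggregations listed as dtype-preserving."""
    return how in cython_cast_keep_type_list


print(should_cast("first"))  # True
print(should_cast("mean"))   # False
```

Under this reading, `mean`, `var` and `median` skip `_try_cast` and keep their float result.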

charlesdong1991 · 2020-02-03T22:50:23Z

pandas/tests/groupby/test_function.py

+
+    # there is some inconsistency issue in type based on different types, it happens
+    # on windows machine and linux_py36_32bit, skip it for now
+    if not observed:
+        tm.assert_frame_equal(result, expected)


somehow I encountered a dtype issue here: only on the Windows machine and linux_py36_32bit is the type not the same. I will look into it tomorrow, but I think the result is correct here.

TomAugspurger · 2020-02-04T21:59:32Z

Thanks for working on this @charlesdong1991. I hope to write up more later, but in general, I think having the calling function indicate the result dtype is a good idea.

For our immediate needs, I think we can use a more targeted fix to restore the 0.25.x behavior. I'll propose that as a separate PR.

TomAugspurger · 2020-02-04T22:19:46Z

#31668 for the narrowly scoped alternative to just fix the regression, but not the general issue.

jreback

thanks for working on this @charlesdong1991 but this is not very clean; i know this is a patch release. but this is making things worse.

jreback · 2020-02-05T00:24:17Z

pandas/core/groupby/groupby.py

-                #  if the type is compatible with the calling EA.
-                # datetime64tz is handled correctly in agg_series,
-                #  so is excluded here.
+                from pandas import notna


you can import at the top

jreback · 2020-02-05T00:32:32Z

merged in #31668, which is a backportable patch. thanks @charlesdong1991

charlesdong1991 · 2020-02-05T07:49:41Z

> #31668 for the narrowly scoped alternative to just fix the regression, but not the general issue.

this is much cleaner for fixing regression!! awesome!! @TomAugspurger 👍

charlesdong1991 added 9 commits December 3, 2018 17:43

remove \n from docstring

7e461a1

fix conflicts

1314059

Merge remote-tracking branch 'upstream/master'

8bcb313

Merge remote-tracking branch 'upstream/master'

24c3ede

fix issue 17038

dea38f2

revert change

cd9e7ac

revert change

e5e912b

Merge remote-tracking branch 'upstream/master' into issue_31450

97f266f

try fix

93ebadb

charlesdong1991 added 2 commits January 30, 2020 20:24

upload test

3520b95

linting

32cc744

charlesdong1991 changed the title ~~BUG: groupby().agg fails on categorical column when func is first~~ [WIP] BUG: groupby().agg fails on categorical column when func is first Jan 30, 2020

charlesdong1991 added 6 commits January 30, 2020 21:49

broader concept

9f936cc

fix up

946c49f

imports

73b01c6

keep experimenting

2fdb3f5

fixtup

9e52c70

add comment

a366b02

charlesdong1991 changed the title ~~[WIP] BUG: groupby().agg fails on categorical column when func is first~~ [WIP] BUG: groupby().agg fails on categorical column Jan 31, 2020

Merge remote-tracking branch 'upstream/master' into issue_31450

bdfcfab

WillAyd requested changes Jan 31, 2020

WillAyd added Categorical Categorical Data Type Groupby labels Jan 31, 2020

experiment

36184f6

charlesdong1991 changed the title ~~[WIP] BUG: groupby().agg fails on categorical column~~ [WIP NOT READY FOR REVIEW] BUG: groupby().agg fails on categorical column Feb 1, 2020

charlesdong1991 added 2 commits February 1, 2020 14:15

update

9d4e021

change base

c588204

jreback added this to the 1.0.1 milestone Feb 1, 2020

charlesdong1991 added 11 commits February 1, 2020 19:28

experiment

a11279d

experiment

bb3ff98

experiment

5d0bcfd

experiemnt

cc516c8

experiment

3c5c3aa

fixup

a63e65d

experiment

4ba67e8

experiment

849f96f

experiment

50a7242

experiment

6635d31

fixup and linting

b55b6b4

charlesdong1991 commented Feb 3, 2020

charlesdong1991 changed the title ~~[WIP NOT READY FOR REVIEW] BUG: groupby().agg fails on categorical column~~ BUG: groupby().agg fails on categorical column Feb 3, 2020

Merge remote-tracking branch 'upstream/master' into issue_31450

5dd9b38

jreback requested changes Feb 5, 2020

jreback closed this Feb 5, 2020
