Skip to content

BUG: groupby().agg fails on categorical column #31470

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 33 commits into from

Conversation

charlesdong1991
Copy link
Member

@pep8speaks
Copy link

pep8speaks commented Jan 30, 2020

Hello @charlesdong1991! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-02-04 08:07:08 UTC

@jbrockmendel
Copy link
Member

is this limited to "first"? i think the conclusion from previous thread is that we need to be more systematic about when we call _try_cast

@charlesdong1991
Copy link
Member Author

charlesdong1991 commented Jan 30, 2020

@jbrockmendel yeah, very right, i was about to change something in _try_cast or before _try_cast, but haven't seen the follow-up discussions yet, and due to the added change of isinstance(result[0], dtype.type) in _try_cast, the case of first won't be casted back which it should be.
so I just added as an experiment to see if the problem could be fixed, and prepare a more generic solution if so.

@charlesdong1991 charlesdong1991 changed the title BUG: groupby().agg fails on categorical column when func is first [WIP] BUG: groupby().agg fails on categorical column when func is first Jan 30, 2020
@charlesdong1991 charlesdong1991 changed the title [WIP] BUG: groupby().agg fails on categorical column when func is first [WIP] BUG: groupby().agg fails on categorical column Jan 31, 2020
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not a fan of the approach here - I think should try and avoid special-casing / adding more groupings to base.

This works for DataFrame selection but not Series right? If so I think should dive deeper into why the former can naturally support but the latter can't to find resolution

@WillAyd WillAyd added Categorical Categorical Data Type Groupby labels Jan 31, 2020
@TomAugspurger
Copy link
Contributor

Haven't had a chance to look closely yet, but I think my preferred approach is for the calling function to decide what the output dtype should be based on the input dtype. Not sure how best to put that to code either.

@charlesdong1991
Copy link
Member Author

@WillAyd @TomAugspurger I agree, and i am not a big fan of this current approach, this is the reason i have [WIP] in the title. I tried to fix when to call this _try_cast depending on the situation and type, but not working, and causes a lot of errors locally, so i am still trying to find a better/more robust fix.

Sorry about this noise in PRs, and i wish i could update this soon!

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 31, 2020 via email

@charlesdong1991 charlesdong1991 changed the title [WIP] BUG: groupby().agg fails on categorical column [WIP NOT READY FOR REVIEW] BUG: groupby().agg fails on categorical column Feb 1, 2020
@jreback jreback added this to the 1.0.1 milestone Feb 1, 2020
@charlesdong1991
Copy link
Member Author

charlesdong1991 commented Feb 1, 2020

I still have some confusion on what should be the correct behaviour, based on the #31359 (comment)
it seems for IntegerArray, mean should return a float type result because IntegerArray._reduce("mean") returns float. However, in the case below, currently doing agg.mean on a normal array with integer returns int64 output instead of float? is it correct? if so, could this be considered as inconsistency between extension array and normal array and then cause confusion to users? @TomAugspurger @jreback @jbrockmendel @WillAyd @jorisvandenbossche

def test_groupby_extension_agg(self, as_index, data_for_grouping):
        df = pd.DataFrame({"A": [1, 1, 2, 2, 3, 3, 1, 4], "B": data_for_grouping})
        result = df.groupby("B", as_index=as_index).A.mean()
        _, index = pd.factorize(data_for_grouping, sort=True)
    
        index = pd.Index(index, name="B")
        expected = pd.Series([3, 1, 4], index=index, name="A")
        if as_index:
>           self.assert_series_equal(result, expected)
E           AssertionError: Attributes of Series are different
E           
E           Attribute "dtype" are different
E           [left]:  float64
E           [right]: int64

@TomAugspurger
Copy link
Contributor

IMO the result dtype should only depend on the dtype of the inputs, not the values. IIUC the result can only be int dtype in this case because the mean of the values happens to be equal to an integer? If so, I’d say that’s a bug. The result dtype should always be a float.

@charlesdong1991
Copy link
Member Author

@TomAugspurger thanks for your quick reply. Indeed, in these cases the output has int dtype because it happens to be equal to integer.

@jreback
Copy link
Contributor

jreback commented Feb 1, 2020

the int return in mean is a bug it should be float
likely just happened to coerce it before

Copy link
Member Author

@charlesdong1991 charlesdong1991 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should pass the CI now, before heading to bed, I would like to describe it a bit, feel free to take a look and your reviews are very welcomed:

  1. now the .agg("mean") will return float64, the same for var, median etc, it is aligned with numpy behaviour also. And for the rest (defined in the base, they should NOT be casted) and keep the type
  2. and python_agg will always return the same type as the original object, and it is not changing the current behaviour, since right now those results are coerced. And I think this is correct, because we might want to return different types for python_agg and cython_agg on cython funcs, and this design is especially good for self-defined function used in aggregate. Therefore, you could see that in test, .agg("mean") will have float64 now while .agg(lambda x: np.mean(x)) will still return int64 as they are on master.

I believe this will have a lot of disputes, but all opinions are welcome, and having a consesus on the correct behaviour is important, afterwards I will focus on better coding later. Thanks all, I will do a follow-up code changes tomorrow, and hopefully get this done before 1.0.1 releases. @TomAugspurger @jreback @jbrockmendel @jorisvandenbossche @WillAyd

Comment on lines +95 to +99
cython_cast_cat_type_list = frozenset(["first", "last"])
cython_cast_keep_type_list = cython_cast_cat_type_list | frozenset(
["min", "max", "add", "prod", "ohlc"]
)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is to specify cython func that should reserve the type

Comment on lines +1074 to +1075
if how in base.cython_cast_keep_type_list:
result = maybe_downcast_numeric(result, block.dtype)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to specify for the case when as_index=False, otherwise, will be coerced to int for those cases which they should not

@@ -792,7 +792,7 @@ def _cumcount_array(self, ascending: bool = True):
rev[sorter] = np.arange(count, dtype=np.intp)
return out[rev].astype(np.int64, copy=False)

def _try_cast(self, result, obj, numeric_only: bool = False):
def _try_cast(self, result, obj, numeric_only: bool = False, is_python=False):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, this is really ugly, the reason is to distinguish the python_agg and cython_agg since they have different situations to cast

will think a bit more

Comment on lines +818 to +822
if (
isinstance(result[notna(result)][0], dtype.type)
and is_python
or not is_python
):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also ugly, it does two things: for cython_agg, if above satisfied, will cast, but for python_agg, we only cast if the not null result has the same type as original object, and I think this is the correct behaviour.

Comment on lines +908 to +910
if self._cython_aggregate_should_cast(how):
result_column = self._try_cast(result_column, obj)
output[key] = result_column
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for cython_agg, we should only cast if it is one of the defined cython func, otherwise, should not touch to _try_cast

Comment on lines +376 to +380

# there is some inconsistency issue in type based on different types, it happens
# on windows machine and linux_py36_32bit, skip it for now
if not observed:
tm.assert_frame_equal(result, expected)
Copy link
Member Author

@charlesdong1991 charlesdong1991 Feb 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somehow, i encoutered some issue with type here, only running on windows machine and linux_py36_32bit, this type is not the same, i will try a bit tomorrow, but i think the result is correct here.

@charlesdong1991 charlesdong1991 changed the title [WIP NOT READY FOR REVIEW] BUG: groupby().agg fails on categorical column BUG: groupby().agg fails on categorical column Feb 3, 2020
@TomAugspurger
Copy link
Contributor

Thanks for working on this @charlesdong1991. I hope to write up more later, but in general, I like the general idea of the calling function indicating the result dtype is a good idea.

For our immediate needs, I think we can use a more targeted fix to restore the 0.25.x behavior. I'll propose that as a separate PR.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Feb 4, 2020

#31668 for the narrowly scoped alternative to just fix the regression, but not the general issue.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for working on this @charlesdong1991 but this is not very clean; i know this is a patch release. but this is making things worse.

# if the type is compatible with the calling EA.
# datetime64tz is handled correctly in agg_series,
# so is excluded here.
from pandas import notna
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can import at the top

@jreback
Copy link
Contributor

jreback commented Feb 5, 2020

merged in #31668, which is a backportable patch. thanks @charlesdong1991

@jreback jreback closed this Feb 5, 2020
@charlesdong1991
Copy link
Member Author

charlesdong1991 commented Feb 5, 2020

#31668 for the narrowly scoped alternative to just fix the regression, but not the general issue.

this is much cleaner for fixing regression!! awesome!! @TomAugspurger 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REGR: groupby().agg fails on categorical column in pandas 1.0.0
6 participants