-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: groupby().agg fails on categorical column #31470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @charlesdong1991! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-02-04 08:07:08 UTC |
is this limited to "first"? i think the conclusion from previous thread is that we need to be more systematic about when we call _try_cast |
@jbrockmendel yeah, very right, i was about to change something in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not a fan of the approach here - I think should try and avoid special-casing / adding more groupings to base.
This works for DataFrame selection but not Series right? If so I think should dive deeper into why the former can naturally support but the latter can't to find resolution
Haven't had a chance to look closely yet, but I think my preferred approach is for the calling function to decide what the output dtype should be based on the input dtype. Not sure how best to put that to code either. |
@WillAyd @TomAugspurger I agree, and i am not a big fan of this current approach, this is the reason i have [WIP] in the title. I tried to fix when to call this Sorry about this noise in PRs, and i wish i could update this soon! |
Thanks for working on this. It's good that we're exploring possible
solutions. We should take a bit of time to try and get this right.
…On Fri, Jan 31, 2020 at 7:37 AM Kaiqi Dong ***@***.***> wrote:
@WillAyd <https://github.com/WillAyd> @TomAugspurger
<https://github.com/TomAugspurger> I agree, and i am not a big fan of
this current approach, this is the reason i have [WIP] in the title. I
tried to fix when to call this _try_cast depending on the situation and
type, but not working, and causes a lot of errors locally, so i am still
trying to find a better/more robust fix.
Sorry about this noise in PRs, and i wish i could update this soon!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#31470?email_source=notifications&email_token=AAKAOIQL5KBDBER4RG25CMDRAQSSLA5CNFSM4KN3O4JKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKOVKXI#issuecomment-580736349>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOISV7CVPVR2LFT33PPTRAQSSLANCNFSM4KN3O4JA>
.
|
I still have some confusion on what should be the correct behaviour, based on the #31359 (comment)
|
IMO the result dtype should only depend on the dtype of the inputs, not the values. IIUC the result can only be int dtype in this case because the mean of the values happens to be equal to an integer? If so, I’d say that’s a bug. The result dtype should always be a float. |
@TomAugspurger thanks for your quick reply. Indeed, in these cases the output has int dtype because it happens to be equal to integer. |
the int return in mean is a bug it should be float |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should pass the CI now, before heading to bed, I would like to describe it a bit, feel free to take a look and your reviews are very welcomed:
- now the
.agg("mean")
will returnfloat64
, the same forvar
,median
etc, it is aligned with numpy behaviour also. And for the rest (defined in the base, they should NOT be casted) and keep the type - and
python_agg
will always return the same type as the original object, and it is not changing the current behaviour, since right now those results are coerced. And I think this is correct, because we might want to return different types forpython_agg
andcython_agg
on cython funcs, and this design is especially good for self-defined function used in aggregate. Therefore, you could see that in test,.agg("mean")
will havefloat64
now while.agg(lambda x: np.mean(x))
will still returnint64
as they are on master.
I believe this will have a lot of disputes, but all opinions are welcome, and having a consesus on the correct behaviour is important, afterwards I will focus on better coding later. Thanks all, I will do a follow-up code changes tomorrow, and hopefully get this done before 1.0.1 releases. @TomAugspurger @jreback @jbrockmendel @jorisvandenbossche @WillAyd
cython_cast_cat_type_list = frozenset(["first", "last"]) | ||
cython_cast_keep_type_list = cython_cast_cat_type_list | frozenset( | ||
["min", "max", "add", "prod", "ohlc"] | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is to specify cython func that should reserve the type
if how in base.cython_cast_keep_type_list: | ||
result = maybe_downcast_numeric(result, block.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to specify for the case when as_index=False
, otherwise, will be coerced to int
for those cases which they should not
@@ -792,7 +792,7 @@ def _cumcount_array(self, ascending: bool = True): | |||
rev[sorter] = np.arange(count, dtype=np.intp) | |||
return out[rev].astype(np.int64, copy=False) | |||
|
|||
def _try_cast(self, result, obj, numeric_only: bool = False): | |||
def _try_cast(self, result, obj, numeric_only: bool = False, is_python=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, this is really ugly, the reason is to distinguish the python_agg
and cython_agg
since they have different situations to cast
will think a bit more
if ( | ||
isinstance(result[notna(result)][0], dtype.type) | ||
and is_python | ||
or not is_python | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also ugly, it does two things: for cython_agg
, if above satisfied, will cast, but for python_agg
, we only cast if the not null result has the same type as original object, and I think this is the correct behaviour.
if self._cython_aggregate_should_cast(how): | ||
result_column = self._try_cast(result_column, obj) | ||
output[key] = result_column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for cython_agg
, we should only cast if it is one of the defined cython func, otherwise, should not touch to _try_cast
|
||
# there is some inconsistency issue in type based on different types, it happens | ||
# on windows machine and linux_py36_32bit, skip it for now | ||
if not observed: | ||
tm.assert_frame_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
somehow, i encoutered some issue with type here, only running on windows machine
and linux_py36_32bit
, this type is not the same, i will try a bit tomorrow, but i think the result is correct here.
Thanks for working on this @charlesdong1991. I hope to write up more later, but in general, I like the general idea of the calling function indicating the result dtype is a good idea. For our immediate needs, I think we can use a more targeted fix to restore the 0.25.x behavior. I'll propose that as a separate PR. |
#31668 for the narrowly scoped alternative to just fix the regression, but not the general issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for working on this @charlesdong1991 but this is not very clean; i know this is a patch release. but this is making things worse.
# if the type is compatible with the calling EA. | ||
# datetime64tz is handled correctly in agg_series, | ||
# so is excluded here. | ||
from pandas import notna |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can import at the top
merged in #31668, which is a backportable patch. thanks @charlesdong1991 |
this is much cleaner for fixing regression!! awesome!! @TomAugspurger 👍 |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff