-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: transform with nunique should have dtype int64 #35152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not averse to this cleanup, except for the 'temporarily' setting of obvserved. why do you think we need to do this?
pandas/core/groupby/generic.py
Outdated
@@ -485,8 +485,13 @@ def transform(self, func, *args, engine="cython", engine_kwargs=None, **kwargs): | |||
# If func is a reduction, we need to broadcast the | |||
# result to the whole group. Compute func result | |||
# and deal with possible broadcasting below. | |||
# Temporarily set observed for dealing with | |||
# categoricals so we don't have to convert dtypes. | |||
observed = self.observed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woa? can we not simply handle this at a lower level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Computing with
observed=True
is much more performant thanobserved=False
when the category combinations are not saturated, and slightly less performant when they are. So it seems best to me to run withobserved=True
since transform will disregard any unobserved combinations. - Computing with
observed=True
makes it so that we don't have to do a type conversion float to int because ofnp.nan
for missing combinations. - The next step is to call
getattr(self, func)(*args, **kwargs)
. So I don't see how to handle this at a lower level.
That said, the code as-is is definitely not good - self.observed
should be modified using a context manager in case the reduction fails. Something like:
with temp_setattr(self, observed=True) as obj:
getattr(obj, func)(*args, **kwargs)
where temp_setattr
is a context manager like the one found here. Not sure if this is a more palatable change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what i mean is, why don't you just pass observed=True
directly in the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're suggesting doing something like getattr(self, func)(*args, **kwargs, observed=True)
. Assuming args and kwargs are empty, that is akin to df.groupby(keys).sum(observed=True)
; but sum
doesn't take observed as an argument, only groupby
does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, can you make a context manager for this, like _group_selection_context
conceptually.
pandas/core/groupby/generic.py
Outdated
@@ -485,8 +485,13 @@ def transform(self, func, *args, engine="cython", engine_kwargs=None, **kwargs): | |||
# If func is a reduction, we need to broadcast the | |||
# result to the whole group. Compute func result | |||
# and deal with possible broadcasting below. | |||
# Temporarily set observed for dealing with | |||
# categoricals so we don't have to convert dtypes. | |||
observed = self.observed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what i mean is, why don't you just pass observed=True
directly in the function?
lgtm ping on green. |
@jreback green |
thanks @rhshadrach |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
cc @WillAyd
Removes casting on transformations which go through
_transform_fast
. The result is a reduction that is broadcast to the original index, so casting isn't necessary once special care is taken for categoricals whenobserved=False
If this PR is accepted, I'll close #35130 which is for the same issue.