-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
WIP/BUG: Correct results for groupby(...).transform with null keys #45839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/groupby/groupby.py
Outdated
# return [self.indices.get(name, []) for name in names] | ||
# self.indices is a dict and doesn't handle looking up nulls in the groups | ||
from pandas import ( | ||
Index, | ||
Series, | ||
) | ||
|
||
index = Index(self.indices.keys(), tupleize_cols=False) | ||
indices = Series(self.indices.values(), index=index) | ||
result = [indices[name] for name in names] | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here self.indices
is a dict mapping the name to the indices it occurs at. Problem is when there are null values for the name (e.g. np.nan
, (1, np.nan)
, etc). Here the lookup fails, and in general one shouldn't have null values as keys in a dict. I know Series gets around this using codes, hence the solution above.
The need to separate out the Index carefully is to handle odd cases like where self.indices
is {(1,): np.array([0, 2]), (1, 2): np.array([1, 3])}
. This is the particular case for a single test.
cc @jbrockmendel @phofl if there are any ideas to solve this in some more direct way rather than using a Series.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the lookup fails, and in general one shouldn't have null values as keys in a dict
Not sure if this is relevant, but possibly related to #43943?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think in index.pyx we have some code for canonicalizing NA values, e.g. float("nan") -> np.nan, so that dict lookups are better-behaved. could be adapted/reused when defining indices
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - this is precisely what I was looking for. However, I found a simpler way to use the codes directly; I've put up #45953.
pandas/core/groupby/groupby.py
Outdated
Series, | ||
) | ||
|
||
index = Index(self.indices.keys(), tupleize_cols=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment about why this is two lines instead of just indices = Series(self.indices)
? i assume its to avoid a MultiIndex
transform
groupby
withNaN
s #17093 (Replace xxxx with the Github issue number)doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Fixes a lot of issues with
groupby(...).transform
when there are null values in the groups. Still needs tests for SeriesGroupBy. Happy to break this up if requested, but I'm stuck on one part; see comment below.Note this changes some tested behavior, but it changes it to conform to what is specified in the docs from DataFrameGroupBy.transform