Skip to content

WIP/BUG: Correct results for groupby(...).transform with null keys #45839

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

rhshadrach
Copy link
Member

Fixes a lot of issues with groupby(...).transform when there are null values in the groups. Still needs tests for SeriesGroupBy. Happy to break this up if requested, but I'm stuck on one part; see comment below.

Note this changes some tested behavior, but it changes it to conform to what is specified in the docs from DataFrameGroupBy.transform

Call function producing a like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values.

@rhshadrach rhshadrach added Bug Groupby Apply Apply, Aggregate, Transform, Map API - Consistency Internal Consistency of API/Behavior labels Feb 5, 2022
Comment on lines 668 to 678
# return [self.indices.get(name, []) for name in names]
# self.indices is a dict and doesn't handle looking up nulls in the groups
from pandas import (
Index,
Series,
)

index = Index(self.indices.keys(), tupleize_cols=False)
indices = Series(self.indices.values(), index=index)
result = [indices[name] for name in names]
return result
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here self.indices is a dict mapping the name to the indices it occurs at. Problem is when there are null values for the name (e.g. np.nan, (1, np.nan), etc). Here the lookup fails, and in general one shouldn't have null values as keys in a dict. I know Series gets around this using codes, hence the solution above.

The need to separate out the Index carefully is to handle odd cases like where self.indices is {(1,): np.array([0, 2]), (1, 2): np.array([1, 3])}. This is the particular case for a single test.

cc @jbrockmendel @phofl if there are any ideas to solve this in some more direct way rather than using a Series.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the lookup fails, and in general one shouldn't have null values as keys in a dict

Not sure if this is relevant, but possibly related to #43943?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think in index.pyx we have some code for canonicalizing NA values, e.g. float("nan") -> np.nan, so that dict lookups are better-behaved. could be adapted/reused when defining indices?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - this is precisely what I was looking for. However, I found a simpler way to use the codes directly; I've put up #45953.

Series,
)

index = Index(self.indices.keys(), tupleize_cols=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment about why this is two lines instead of just indices = Series(self.indices)? i assume its to avoid a MultiIndex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Apply Apply, Aggregate, Transform, Map Bug Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ambiguous behaviour when transform groupby with NaNs
2 participants