Skip to content

BUG: Fix some cases of groupby(...).transform with dropna=True #46209

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 7, 2022

Conversation

rhshadrach
Copy link
Member

Part of #17093

The added xfails will pass when other parts of #45839 are merged.

@rhshadrach rhshadrach added Bug Groupby Apply Apply, Aggregate, Transform, Map labels Mar 3, 2022
@jreback jreback added this to the 1.5 milestone Mar 3, 2022
@jreback
Copy link
Contributor

jreback commented Mar 5, 2022

lgtm @jbrockmendel comments?

@jbrockmendel
Copy link
Member

i think _take with convert_indices=False corresponds to BM.reindex_indexer? Series has _reindex_indexer while DataFrame doesn't. Could make DataFrame._reindex_indexer and might be more consistent? Or could add a "allow_fill" kwd to take to be consistent with other take methods?

@rhshadrach
Copy link
Member Author

rhshadrach commented Mar 7, 2022

@jbrockmendel

_reindex_indexer is slightly different - without supplying a new_index, the result will have a RangeIndex rather than taking from the index values. To supply a new_index, we'd have to first take from the index, then pass this to _reindex_indexer. This is precisely what BM.take / AM.take are doing.

convert_indices=True attempts to convert negative indices to positive, e.g. -1 becomes n-1 with n being the length of the axis. This is different from allow_fill, which replaces -1 (and only -1) with fill_value. So by skipping the convert_indices (leaving -1 alone) and then doing allow_fill=True in reindex_indexer, we get the desired operation of -1 becoming [NA] (with the appropriate type of null). In other words, adding allow_fill to take wouldn't be sufficient here - it's already True.

@rhshadrach
Copy link
Member Author

Out of curiosity, I tried running the tests with convert_indices in BM.take always set to False. Surprisingly few tests failed.

FAILED pandas/tests/generic/test_generic.py::TestNDFrame::test_take_frame - AssertionError: DataFrame.iloc[:, 0] (column name="A") are different
FAILED pandas/tests/frame/indexing/test_take.py::TestDataFrameTake::test_take - AssertionError: DataFrame.iloc[:, 0] (column name="A") are different
FAILED pandas/tests/frame/indexing/test_take.py::TestDataFrameTake::test_take_mixed_type
FAILED pandas/tests/groupby/test_apply.py::test_groupby_apply_none_first - AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="groups") are different
FAILED pandas/tests/groupby/test_nth.py::test_nth - AssertionError: DataFrame.iloc[:, 0] (column name="two") are different
FAILED pandas/tests/indexing/test_iloc.py::TestiLocBaseIndependent::test_iloc_getitem_neg_int_can_reach_first_index - AssertionError: DataFrame.iloc[:, 0] (column name="A") are different
FAILED pandas/tests/indexing/test_iloc.py::TestiLocBaseIndependent::test_iloc_getitem_frame - AssertionError: DataFrame.iloc[:, 0] (column name="6") are different
FAILED pandas/tests/indexing/test_iloc.py::TestiLocBaseIndependent::test_iloc_getitem_float_duplicates - AssertionError: DataFrame.iloc[:, 0] (column name="a") are different
FAILED pandas/tests/internals/test_internals.py::TestIndexing::test_take[mgr0] - AssertionError: numpy array are different
FAILED pandas/tests/internals/test_internals.py::TestIndexing::test_take[mgr1] - AssertionError: numpy array are different
FAILED pandas/tests/internals/test_internals.py::TestIndexing::test_take[mgr2] - IndexError: tuple index out of range
FAILED pandas/tests/internals/test_internals.py::TestIndexing::test_take[mgr3] - IndexError: tuple index out of range
FAILED pandas/tests/internals/test_internals.py::TestIndexing::test_take[mgr4] - IndexError: tuple index out of range
FAILED pandas/tests/internals/test_internals.py::TestIndexing::test_take[mgr5] - IndexError: tuple index out of range
FAILED pandas/tests/frame/methods/test_drop_duplicates.py::test_drop_duplicates

@jreback jreback merged commit 4d7a03a into pandas-dev:main Mar 7, 2022
@jreback
Copy link
Contributor

jreback commented Mar 7, 2022

thanks @rhshadrach

@jbrockmendel
Copy link
Member

Out of curiosity, I tried running the tests with convert_indices in BM.take always set to False. Surprisingly few tests failed.

If you can find a way to make arg(s) unnecessary that'd be great. We clearly have a lot of methods/args that do similar things.

@rhshadrach rhshadrach deleted the transform_dropna_2 branch March 8, 2022 04:11
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
@jbrockmendel
Copy link
Member

@rhshadrach trying to clean out some old branches got me looking at this convert_indices=False thing again.

In BM.take we do new_labels = self.axes[axis].take(indexer), but since that .take call doesn't explicitly pass allow_fill=True, and a fill_value, it behaves as if convert_indices were True. That seems like it should be a problem. Why isn't it?

Assuming it is correct as-is, I think instead of calling ._take with convert_indices=True we could do

new_ax = result.axes[axis].take(ids)
output = result._reindex_with_indexers(
    {axis: (new_ax, ids)}, allow_dups=True, copy=False
)

It's worth double-checking, but I'm pretty sure this is equivalent. It's a little more complicated, but would allow us to do non-trivial cleanup elsewhere.

@rhshadrach
Copy link
Member Author

That seems like it should be a problem. Why isn't it?

The indices, coming from ids, are guaranteed to be in the valid range ([0, len(result))) with the exception of where id == -1. In the id == -1 case, not converting gives back an NA value as desired.

The alternative you mentioned looks correct to me, but I haven't looked at the performance impact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants