Skip to content

BUG: Groupby.nth includes group key inconsistently #12839 #13316

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.18.2.txt
Original file line number Diff line number Diff line change
Expand Up @@ -313,7 +313,7 @@ Bug Fixes


- Bug in ``groupby`` where ``apply`` returns different result depending on whether first result is ``None`` or not (:issue:`12824`)

- Bug in ``groupby(..).nth()`` where the group key is included inconsistently (:issue:`12839`)



Expand Down
8 changes: 8 additions & 0 deletions pandas/core/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -457,6 +457,11 @@ def _selected_obj(self):
else:
return self.obj[self._selection]

def _reset_group_selection(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here (and to _set_selection) to what these are doing

if self._group_selection is not None:
self._group_selection = None
self._reset_cache('_selected_obj')

def _set_selection_from_grouper(self):
""" we may need create a selection if we have non-level groupers """
grp = self.grouper
Expand All @@ -468,6 +473,7 @@ def _set_selection_from_grouper(self):

if len(groupers):
self._group_selection = ax.difference(Index(groupers)).tolist()
self._reset_cache('_selected_obj')
Copy link
Contributor Author

@adneu adneu May 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree I think this is sufficient

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I think maybe move the _reset_cache higher up (e.g. outside the len(groupers) though not sure if this breaks anything (it just feels cleaner)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, I think _reset_cache is only necessary if _group_selection changes, this is why I ordered it in this way.


def _set_result_index_ordered(self, result):
# set the result index on the passed values object and
Expand Down Expand Up @@ -1402,6 +1408,7 @@ def head(self, n=5):
0 1 2
2 5 6
"""
self._reset_group_selection()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont' create a new method, just call _set_selection_from_grouper

Copy link
Contributor Author

@adneu adneu May 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the documentation for head and tail, they are supposed to return info for each group (regardless of the group selection). So I think _group_selection needs to be set to None rather than be set based on the grouper, i.e. with _set_selection_from_grouper.

Copy link
Contributor

@jreback jreback Jun 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that is fine. just doc the method _reset_group_selection() (and add a doc-string to _set_selection_from_grouper. Maybe rename these to be more consistent as well

_set_group_selection()
_reset_group_selection()

mask = self._cumcount_array() < n
return self._selected_obj[mask]

Expand All @@ -1428,6 +1435,7 @@ def tail(self, n=5):
0 a 1
2 b 1
"""
self._reset_group_selection()
mask = self._cumcount_array(ascending=False) < n
return self._selected_obj[mask]

Expand Down
35 changes: 35 additions & 0 deletions pandas/tests/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -354,6 +354,41 @@ def test_nth_multi_index_as_expected(self):
names=['A', 'B']))
assert_frame_equal(result, expected)

def test_group_selection_cache(self):
# GH 12839 nth, head, and tail should return same result consistently
df = DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
expected = df.iloc[[0, 2]].set_index('A')

g = df.groupby('A')
g.head()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check both .head() lines (and so on below)

result = g.nth(0)
assert_frame_equal(result, expected)

g = df.groupby('A')
g.tail()
result = g.nth(0)
assert_frame_equal(result, expected)

g = df.groupby('A')
g.nth(0)
result = g.head(n=2)
assert_frame_equal(result, df)

g = df.groupby('A')
g.nth(0)
result = g.tail(n=2)
assert_frame_equal(result, df)

g = df.groupby('A')
g.head()
result = g.head(n=2)
assert_frame_equal(result, df)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you meant with doing .head() twice in a row, is this what you are looking for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no I mean

g = df.groupby('A')
result = g.head()
assert_frame_.....
result = g.head()
assert_frame....

e.g. it was 2 of the same calls in a row that was triggering yes? want to have an exact replica of the original failure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare each result as well (e.g. don't just g.nth(0) compare it)

Copy link
Contributor Author

@adneu adneu May 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the issue was switching between two methods that either use a group selection (nth) or don't (head and tail). The bug is that once one of these methods are called, _selected_obj is placed into the cache either with or without the group selection (self.obj[self._group_selection] or self.obj), and further retrievals from the cache will retrieve the same object without paying attention to whether the group selection has changed or not. The proposed pull request clears the cache for _selected_obj whenever the group selection changes. This happens in _set_selection_from_grouper in nth, and it also needs to happen at the beginning of head and tail (after making sure _group_selection is None)

g = df.groupby('A')
g.tail()
result = g.tail(n=2)
assert_frame_equal(result, df)

def test_grouper_index_types(self):
# related GH5375
# groupby misbehaving when using a Floatlike index
Expand Down