Skip to content

BUG: Groupby.nth includes group key inconsistently #12839 #13316

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.18.2.txt
Original file line number Diff line number Diff line change
Expand Up @@ -313,7 +313,7 @@ Bug Fixes


- Bug in ``groupby`` where ``apply`` returns different result depending on whether first result is ``None`` or not (:issue:`12824`)

- Bug in ``groupby(..).nth()`` where the group key is included inconsistently (:issue:`12839`)



Expand Down
35 changes: 26 additions & 9 deletions pandas/core/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -95,7 +95,7 @@ def _groupby_function(name, alias, npfunc, numeric_only=True,
@Appender(_doc_template)
@Appender(_local_template)
def f(self):
self._set_selection_from_grouper()
self._set_group_selection()
try:
return self._cython_agg_general(alias, numeric_only=numeric_only)
except AssertionError as e:
Expand Down Expand Up @@ -457,8 +457,21 @@ def _selected_obj(self):
else:
return self.obj[self._selection]

def _set_selection_from_grouper(self):
""" we may need create a selection if we have non-level groupers """
def _clear_group_selection(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this _reset_group_selection. sorry to be a pain about the names, but naming is very important for consistency.

"""
Clear group based selection. Used for methods needing to return info on
each group regardless of whether a group selection was previously set.
"""
if self._group_selection is not None:
self._group_selection = None
# GH12839 clear cached selection too when changing group selection
self._reset_cache('_selected_obj')

def _set_group_selection(self):
"""
Create group based selection. Used when selection is not passed
directly but instead via a grouper.
"""
grp = self.grouper
if self.as_index and getattr(grp, 'groupings', None) is not None and \
self.obj.ndim > 1:
Expand All @@ -468,6 +481,8 @@ def _set_selection_from_grouper(self):

if len(groupers):
self._group_selection = ax.difference(Index(groupers)).tolist()
# GH12839 clear selected obj cache when group selection changes
self._reset_cache('_selected_obj')
Copy link
Contributor Author

@adneu adneu May 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree I think this is sufficient

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I think maybe move the _reset_cache higher up (e.g. outside the len(groupers) though not sure if this breaks anything (it just feels cleaner)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, I think _reset_cache is only necessary if _group_selection changes, this is why I ordered it in this way.


def _set_result_index_ordered(self, result):
# set the result index on the passed values object and
Expand Down Expand Up @@ -511,7 +526,7 @@ def _make_wrapper(self, name):

# need to setup the selection
# as are not passed directly but in the grouper
self._set_selection_from_grouper()
self._set_group_selection()

f = getattr(self._selected_obj, name)
if not isinstance(f, types.MethodType):
Expand Down Expand Up @@ -979,7 +994,7 @@ def mean(self, *args, **kwargs):
except GroupByError:
raise
except Exception: # pragma: no cover
self._set_selection_from_grouper()
self._set_group_selection()
f = lambda x: x.mean(axis=self.axis)
return self._python_agg_general(f)

Expand All @@ -997,7 +1012,7 @@ def median(self):
raise
except Exception: # pragma: no cover

self._set_selection_from_grouper()
self._set_group_selection()

def f(x):
if isinstance(x, np.ndarray):
Expand Down Expand Up @@ -1040,7 +1055,7 @@ def var(self, ddof=1, *args, **kwargs):
if ddof == 1:
return self._cython_agg_general('var')
else:
self._set_selection_from_grouper()
self._set_group_selection()
f = lambda x: x.var(ddof=ddof)
return self._python_agg_general(f)

Expand Down Expand Up @@ -1216,7 +1231,7 @@ def nth(self, n, dropna=None):
raise TypeError("n needs to be an int or a list/set/tuple of ints")

nth_values = np.array(nth_values, dtype=np.intp)
self._set_selection_from_grouper()
self._set_group_selection()

if not dropna:
mask = np.in1d(self._cumcount_array(), nth_values) | \
Expand Down Expand Up @@ -1324,7 +1339,7 @@ def cumcount(self, ascending=True):
dtype: int64
"""

self._set_selection_from_grouper()
self._set_group_selection()

index = self._selected_obj.index
cumcounts = self._cumcount_array(ascending=ascending)
Expand Down Expand Up @@ -1402,6 +1417,7 @@ def head(self, n=5):
0 1 2
2 5 6
"""
self._clear_group_selection()
mask = self._cumcount_array() < n
return self._selected_obj[mask]

Expand All @@ -1428,6 +1444,7 @@ def tail(self, n=5):
0 a 1
2 b 1
"""
self._clear_group_selection()
mask = self._cumcount_array(ascending=False) < n
return self._selected_obj[mask]

Expand Down
31 changes: 30 additions & 1 deletion pandas/tests/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -354,6 +354,35 @@ def test_nth_multi_index_as_expected(self):
names=['A', 'B']))
assert_frame_equal(result, expected)

def test_group_selection_cache(self):
# GH 12839 nth, head, and tail should return same result consistently
df = DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
expected = df.iloc[[0, 2]].set_index('A')

g = df.groupby('A')
result1 = g.head(n=2)
result2 = g.nth(0)
assert_frame_equal(result1, df)
assert_frame_equal(result2, expected)

g = df.groupby('A')
result1 = g.tail(n=2)
result2 = g.nth(0)
assert_frame_equal(result1, df)
assert_frame_equal(result2, expected)

g = df.groupby('A')
result1 = g.nth(0)
result2 = g.head(n=2)
assert_frame_equal(result1, expected)
assert_frame_equal(result2, df)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you meant with doing .head() twice in a row, is this what you are looking for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no I mean

g = df.groupby('A')
result = g.head()
assert_frame_.....
result = g.head()
assert_frame....

e.g. it was 2 of the same calls in a row that was triggering yes? want to have an exact replica of the original failure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare each result as well (e.g. don't just g.nth(0) compare it)

Copy link
Contributor Author

@adneu adneu May 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the issue was switching between two methods that either use a group selection (nth) or don't (head and tail). The bug is that once one of these methods are called, _selected_obj is placed into the cache either with or without the group selection (self.obj[self._group_selection] or self.obj), and further retrievals from the cache will retrieve the same object without paying attention to whether the group selection has changed or not. The proposed pull request clears the cache for _selected_obj whenever the group selection changes. This happens in _set_selection_from_grouper in nth, and it also needs to happen at the beginning of head and tail (after making sure _group_selection is None)

g = df.groupby('A')
result1 = g.nth(0)
result2 = g.tail(n=2)
assert_frame_equal(result1, expected)
assert_frame_equal(result2, df)

def test_grouper_index_types(self):
# related GH5375
# groupby misbehaving when using a Floatlike index
Expand Down Expand Up @@ -6107,7 +6136,7 @@ def test_cython_transform(self):
# bit a of hack to make sure the cythonized shift
# is equivalent to pre 0.17.1 behavior
if op == 'shift':
gb._set_selection_from_grouper()
gb._set_group_selection()

for (op, args), targop in ops:
if op != 'shift' and 'int' not in gb_target:
Expand Down