Skip to content

BUG: Groupby.nth includes group key inconsistently #12839 #13316

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

BUG: Groupby.nth includes group key inconsistently #12839 #13316

wants to merge 4 commits into from

Conversation

adneu
Copy link
Contributor

@adneu adneu commented May 29, 2016

When the group selection changes, the cache for _selected_obj needs to be reset so that nth, head, and tail can return consistent results.

@codecov-io
Copy link

codecov-io commented May 29, 2016

Current coverage is 84.34%

Merging #13316 into master will increase coverage by <.01%

@@             master     #13316   diff @@
==========================================
  Files           138        138          
  Lines         51126      51133     +7   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43122      43130     +8   
+ Misses         8004       8003     -1   
  Partials          0          0          

Powered by Codecov. Last updated by 2134b63...16f5cd3

df = DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
g = df.groupby('A')
g.head()
assert_frame_equal(g.nth(0), df.iloc[[0, 2]].set_index('A'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put blank lines between testing new things

g.head()
result = g.head(n=2)
assert_frame_equal(result, df)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you meant with doing .head() twice in a row, is this what you are looking for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no I mean

g = df.groupby('A')
result = g.head()
assert_frame_.....
result = g.head()
assert_frame....

e.g. it was 2 of the same calls in a row that was triggering yes? want to have an exact replica of the original failure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare each result as well (e.g. don't just g.nth(0) compare it)

Copy link
Contributor Author

@adneu adneu May 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the issue was switching between two methods that either use a group selection (nth) or don't (head and tail). The bug is that once one of these methods are called, _selected_obj is placed into the cache either with or without the group selection (self.obj[self._group_selection] or self.obj), and further retrievals from the cache will retrieve the same object without paying attention to whether the group selection has changed or not. The proposed pull request clears the cache for _selected_obj whenever the group selection changes. This happens in _set_selection_from_grouper in nth, and it also needs to happen at the beginning of head and tail (after making sure _group_selection is None)

@@ -457,6 +457,11 @@ def _selected_obj(self):
else:
return self.obj[self._selection]

def _reset_group_selection(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here (and to _set_selection) to what these are doing

@jreback jreback added this to the 0.18.2 milestone May 31, 2016
@jreback
Copy link
Contributor

jreback commented May 31, 2016

@adneu ok just some minor comments. ping when updated / green.

@jorisvandenbossche
Copy link
Member

@adneu Do you have some time to update this?

@jreback
Copy link
Contributor

jreback commented Jun 30, 2016

@adneu ok only minor changes. can you update

@adneu
Copy link
Contributor Author

adneu commented Jul 5, 2016

Sorry for the delay. I added some docstrings and tests based on your comments. Pls check out the latest commit and let me know if ok.

@@ -457,8 +457,21 @@ def _selected_obj(self):
else:
return self.obj[self._selection]

def _set_selection_from_grouper(self):
""" we may need create a selection if we have non-level groupers """
def _clear_group_selection(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this _reset_group_selection. sorry to be a pain about the names, but naming is very important for consistency.

@jreback
Copy link
Contributor

jreback commented Jul 5, 2016

naming change but otherwise lgtm. ping when green.

@jorisvandenbossche
Copy link
Member

@adneu Looks good!

@adneu
Copy link
Contributor Author

adneu commented Jul 6, 2016

@jreback good to go?

@jreback jreback closed this in cc0a188 Jul 6, 2016
@jreback
Copy link
Contributor

jreback commented Jul 6, 2016

thanks @adneu

@adneu adneu deleted the 12839 branch July 7, 2016 03:12
nateGeorge pushed a commit to nateGeorge/pandas that referenced this pull request Aug 15, 2016
closes pandas-dev#12839

Author: adneu <[email protected]>

Closes pandas-dev#13316 from adneu/12839 and squashes the following commits:

16f5cd3 [adneu] Name change
ac1851a [adneu] Added docstrings/comments, and new tests.
4d73cbf [adneu] Updated tests
9b75df4 [adneu] BUG: Groupby.nth includes group key inconsistently pandas-dev#12839
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GroupBy.nth includes group key inconsistently
4 participants