Skip to content

Bug: Grouping by index and column fails on DataFrame with single index (GH14327) #14333

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

jonmmease
Copy link
Contributor

Don't know if this is too late for 0.19.0 but I went ahead and added the whatsnew entry there for now.

Jon M. Mease added 3 commits October 1, 2016 21:14
Existing logic under "if level is not None:" assumed that index was a MultiIndex.
Now we check and also handle the case where an Index is passed in with a None grouper.
This resolves GH 14327
@jonmmease jonmmease changed the title Bug 14327 Bug: Grouping by index and column fails on DataFrame with single index (GH14327) Oct 2, 2016
@codecov-io
Copy link

codecov-io commented Oct 2, 2016

Current coverage is 85.26% (diff: 95.83%)

Merging #14333 into master will increase coverage by <.01%

@@             master     #14333   diff @@
==========================================
  Files           140        140          
  Lines         50630      50639     +9   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43169      43177     +8   
- Misses         7461       7462     +1   
  Partials          0          0          

Powered by Codecov. Last update 6dcc238...33eb725

@@ -1584,3 +1584,4 @@ Bug Fixes
- ``PeridIndex`` can now accept ``list`` and ``array`` which contains ``pd.NaT`` (:issue:`13430`)
- Bug in ``df.groupby`` where ``.median()`` returns arbitrary values if grouped dataframe contains empty bins (:issue:`13629`)
- Bug in ``Index.copy()`` where ``name`` parameter was ignored (:issue:`14302`)
- Bug in ``df.groupby`` causing an ``AttributeError`` when grouping a single index frame by a column and the index (:issue`14327`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.20.0

@jonmmease
Copy link
Contributor Author

whatsnew moved to 0.20.0 as @jreback requested

@jorisvandenbossche
Copy link
Member

Looks good to me

if self.name is None:
self.name = index.names[level]

# XXX complete hack
if isinstance(index, MultiIndex):
inds = index.labels[level]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this should be a private method in an Index instead (and overridden in MultiIndex)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean everything inside the if isinstance(index, MultiIndex): block? If so then it looks like the MultiIndex override would need to input grouper, index, and level and return a tuple of labels, level_index, and grouper. This seems a little messy to me since the parent Index method would have no use for the level parameter and would need to return None for the labels and level_index values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just take args and return a tuples of things
no state is kept

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for the clarification

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just a very minor comment. We might be asking to change the whatsnew to v0.19.1, but we need to sort that out first; will keep you posted.

@@ -432,6 +432,13 @@ def _update_inplace(self, result, **kwargs):
# guard when called from IndexOpsMixin
raise TypeError("Index can't be updated inplace")

def _get_grouper_for_level(self, grouper, level):
# return grouper if grouper is not None else self
Copy link
Contributor

@TomAugspurger TomAugspurger Oct 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment isn't all that helpful :) Could you maybe change it to explain what the tuple being returned is (the two Nones are labels and level_index) and maybe note that the MultiIndex version is what's useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good call :-)

self._labels = labels
self._group_index = level_index
self.grouper = level_index.take(labels)
self.grouper, self._labels, self._group_index = index._get_grouper_for_level(self.grouper, level)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the line causing the travis failure. You'll need to wrap it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that! Guess I forgot to run flake8

@@ -524,6 +524,39 @@ def _format_native_types(self, na_rep='nan', **kwargs):

return mi.values

def _get_grouper_for_level(self, grouper, level):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a docstring here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, same as above for styling.

@@ -432,6 +432,17 @@ def _update_inplace(self, result, **kwargs):
# guard when called from IndexOpsMixin
raise TypeError("Index can't be updated inplace")

def _get_grouper_for_level(self, grouper, level):
# Use self (Index) as grouper if None was passed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a Parameters section in the doc-string. and move the in-line comment to the Returns part.


# XXX complete hack

if grouper is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want to put a comment here explain what is going on would be great (for future readers).

Further you can just return if grouper is not None (and then don't use an else), I think makes the code read slightly better.

'B': ['one', 'one', 'two',
'two', 'one', 'one']},
index=idx)
result = df_multi.groupby(['B', pd.Grouper(level='inner')]).mean()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try with these reversed as well, e.g. [pd.Grouper(....), 'B'])

@jonmmease
Copy link
Contributor Author

Thanks for the feedback @jreback, @jorisvandenbossche, and @TomAugspurger. Writing the docstrings and working through the type signatures sure helped me better understand the logic I was refactoring!

@jonmmease
Copy link
Contributor Author

@TomAugspurger @jorisvandenbossche @jreback I've noticed that there's now a 0.19.1 whatsnew file. Should I move the whatsnew entry there?

@jorisvandenbossche
Copy link
Member

Yes, as it seems a straightforward bug fix, you can put it in 0.19.1

@jonmmease
Copy link
Contributor Author

@jorisvandenbossche Hmm, I rebased my branch on upstream/master so that I have a copy of the 0.19.1 whatsnew file to work with. After making the change I did a force push which successfully updated the branch in my fork (https://github.com/jmmease/pandas/tree/bug_14327), but it didn't update this pull request.

Should I have merged upstream/master instead of rebasing?

@jorisvandenbossche
Copy link
Member

Well, something is going wrong since we moved the repo from the pydata org to the pandas-dev org. PRs that were already open before seem to have this problem.

@wesm @TomAugspurger @jreback anybody experience with this problem, or an idea what could be going on?

----------
group_mapper: Group mapping function or None
Function mapping index values to groups
level : int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just make this level=None by default and assert it is None for Index. don't put the (Only used phrase)


Returns
-------
grouper : Index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it needs to always return a tuple (doesn't matter if they are used or not)

@@ -432,6 +432,35 @@ def _update_inplace(self, result, **kwargs):
# guard when called from IndexOpsMixin
raise TypeError("Index can't be updated inplace")

def _get_grouper_for_level(self, group_mapper, level):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this mapper

level_index : Index or None
Index of unique values for level
"""
inds = self.labels[level]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this indexer

Index of values to group on
labels : ndarray of int or None
Array of locations in level_index
level_index : Index or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this uniques (these are not actually used anywhere, but are descriptive). We use certain terms in the codebase. This will make it consistent.

@jreback
Copy link
Contributor

jreback commented Oct 14, 2016

@jmmease lgtm. just some doc consistency changes. ping on green.

@jonmmease
Copy link
Contributor Author

@jreback Thanks, I'll make these updates this evening. Do you have any guidance on how I should proceed since this PR didn't update with my rebase+force push from yesterday? (See comment from @jorisvandenbossche above) I could try again with a merge instead of rebase or I could open a new PR and reference this one.

@jorisvandenbossche
Copy link
Member

On the short term, opening as a new PR is probably the easiest (if you close this one, you can just create a new one from the same branch)

@jonmmease
Copy link
Contributor Author

PR reopened in #14428

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Grouping by index and column fails on DataFrame with single index
5 participants