-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Bug in groupby.get_group on categoricalindex #15163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, a few comments inline.
@@ -389,7 +389,7 @@ Bug Fixes | |||
|
|||
- Bug in compat for passing long integers to ``Timestamp.replace`` (:issue:`15030`) | |||
- Bug in ``.loc`` that would not return the correct dtype for scalar access for a DataFrame (:issue:`11617`) | |||
|
|||
- Bug in ``CategoricalIndex``, a missing method is causing ``groupby`` to fail (:issue:`15155`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you re-phrase this as "Bug in GroupBy.get_group
failing with a categorical grouper (:issue15155
)"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
@@ -48,8 +48,9 @@ def setUp(self): | |||
'D': np.array( | |||
np.random.randn(8), dtype='float32')}) | |||
|
|||
index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', | |||
'three']], | |||
index = MultiIndex(levels=[pd.CategoricalIndex( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing this could break other tests that use self.mframe
. Better to just change this in your tests.
def test_level_groupby_get_group(self):
df = self.mframe.copy()
index = ... # your new index w/ a Categorial
df.index = index
... # your test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well my test is the only one using self.mframe
- in fact quite a few tests could use self.mframe
now. I can change it back, but this feels more natural.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also simply contruct the MI you need (e.g. from the original issue) directly in the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, .mframe
is used by several tests, create a new df local to the test, this is very specific.
def test_level_groupby_get_group(self): | ||
# gh15155 | ||
testGroupBy = self.mframe.groupby(level=["first"]) | ||
assert_numpy_array_equal(testGroupBy.get_group("foo").values, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a huge deal since this is mostly a smoke test, but better to use assert_frame_equal(a, b)
so that we test the indexes and names line up too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use assert_numpy_array_equal
, rather directly test the get_group
and assert the results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't use assert_frame_equal(a,b)
since the two have different structures, at least not as it is:
# testGroupBy.get_group("foo")
A B C
first second
foo one -0.529506 -1.282295 0.521247
two 0.982640 0.428228 1.031026
three -0.753013 0.087009 1.082119
# self.mframe.loc["foo"]
A B C
second
one -0.529506 -1.282295 0.521247
two 0.982640 0.428228 1.031026
three -0.753013 0.087009 1.082119
I agree that testing frames would be nicer, but how?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [3]: mframe.loc[['foo']]
Out[3]:
A B C
first second
foo one -0.859892 0.873341 0.679335
two -1.751616 0.678056 0.118863
three 0.055330 0.550937 1.545533
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That still fails (now using the code from the example to generate the test data):
# assert_frame_equal(testGroupedBy.get_group("a"),test.loc[["a"]])
MultiIndex level [0] classes are not equivalent
[left]: CategoricalIndex([u'a', u'a', u'a', u'a', u'a'], categories=[u'a', u'b'], ordered=False, name=u'Index1', dtype='category')
[right]: Index([u'a', u'a', u'a', u'a', u'a'], dtype='object', name=u'Index1')
Is this another bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a separate issue. please post an issue with a simpe repro.
for now you can simply construct the expected frame directly for comparision, and put a note with a reference to the new issue.
6a99731
to
742d4a5
Compare
Current coverage is 85.54% (diff: 100%)
|
thanks @watercrossing if you are interested in working on #15166 would be great! |
closes pandas-dev#15155 Author: watercrossing <[email protected]> Closes pandas-dev#15163 from watercrossing/indexgroup and squashes the following commits: 742d4a5 [watercrossing] BUG: GroupBy.get_group failing with a categorical grouper (pandas-dev#15155)
git diff upstream/master | flake8 --diff