Skip to content

BUG: ngroups and len(groups) do not equal when grouping with a list of Grouper and column label (GH26326) #26374

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
May 20, 2019

Conversation

shantanu-gontia
Copy link
Contributor

@shantanu-gontia shantanu-gontia commented May 13, 2019

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Most comments are stylistic though I think the implementation needs to be more generalizable as well

to_groupby = zip(*(ping.grouper for ping in self.groupings))
to_groupby = zip(*(ping.grouper if not isinstance(ping.grouper,
self.__class__)
else ping.grouper.groupings[0].grouper for ping
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this only work if the grouper was the first item? Would need something more generalizable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering the same. Can Groupers be nested to more than one level?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would fail if your test was .groupby(['beta', pd.Grouper(level='alpha')]) instead of .groupby([pd.Grouper(level='alpha'), 'beta'])

Copy link
Contributor Author

@shantanu-gontia shantanu-gontia May 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is with the construction of the Groupings. The [0] indexes the sole BaseGrouper created inside the grouping due to the _get_grouper method. The order will be maintained. I have added another assertion to check this and it passed.



def test_groupby_groups_in_BaseGrouper():
# https://github.com/pandas-dev/pandas/issues/26326
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# https://github.com/pandas-dev/pandas/issues/26326
# GH 26326


def test_groupby_groups_in_BaseGrouper():
# https://github.com/pandas-dev/pandas/issues/26326
m_index = pd.MultiIndex.from_product([['A', 'B'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use mi instead of m_index

# https://github.com/pandas-dev/pandas/issues/26326
m_index = pd.MultiIndex.from_product([['A', 'B'],
['C', 'D']], names=['alpha', 'beta'])
df_sample = pd.DataFrame({'foo': [1, 2, 1, 2], 'bar': [1, 2, 3, 4]},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df instead of df_sample

['C', 'D']], names=['alpha', 'beta'])
df_sample = pd.DataFrame({'foo': [1, 2, 1, 2], 'bar': [1, 2, 3, 4]},
index=m_index)
dfGBY_BaseGrouper = df_sample.groupby([pd.Grouper(level='alpha'), 'beta'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just grp1 and grp2 for better readability here and line below

@@ -406,7 +406,7 @@ Groupby/Resample/Rolling
- Bug in :meth:`pandas.core.groupby.GroupBy.idxmax` and :meth:`pandas.core.groupby.GroupBy.idxmin` with datetime column would return incorrect dtype (:issue:`25444`, :issue:`15306`)
- Bug in :meth:`pandas.core.groupby.GroupBy.cumsum`, :meth:`pandas.core.groupby.GroupBy.cumprod`, :meth:`pandas.core.groupby.GroupBy.cummin` and :meth:`pandas.core.groupby.GroupBy.cummax` with categorical column having absent categories, would return incorrect result or segfault (:issue:`16771`)
- Bug in :meth:`pandas.core.groupby.GroupBy.nth` where NA values in the grouping would return incorrect results (:issue:`26011`)

- Bug in :meth:`pandas.core.groupby.ops.BaseGrouper.groups` in which a :class:`BaseGrouper` object with another :class:`BaseGrouper` as part of its :class:`Groupings` would return incorrect set of groups (:issue:`26326`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you construct a message that is more user facing? These items aren't part of the API

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User facing as in more user friendly?

@codecov
Copy link

codecov bot commented May 13, 2019

Codecov Report

Merging #26374 into master will decrease coverage by 50.49%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26374      +/-   ##
==========================================
- Coverage   91.68%   41.19%   -50.5%     
==========================================
  Files         174      174              
  Lines       50700    50700              
==========================================
- Hits        46486    20887   -25599     
- Misses       4214    29813   +25599
Flag Coverage Δ
#multiple ?
#single 41.19% <0%> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/ops.py 19.47% <0%> (-76.49%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
pandas/core/tools/numeric.py 10.44% <0%> (-89.56%) ⬇️
... and 130 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e26aa00...0dba40d. Read the comment docs.

@codecov
Copy link

codecov bot commented May 13, 2019

Codecov Report

Merging #26374 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26374      +/-   ##
==========================================
+ Coverage   91.73%   91.74%   +<.01%     
==========================================
  Files         174      174              
  Lines       50754    50752       -2     
==========================================
+ Hits        46560    46561       +1     
+ Misses       4194     4191       -3
Flag Coverage Δ
#multiple 90.25% <100%> (+0.01%) ⬆️
#single 41.69% <40%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/grouper.py 98.53% <100%> (ø) ⬆️
pandas/core/groupby/ops.py 96% <100%> (+0.03%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.02% <0%> (-0.12%) ⬇️
pandas/core/base.py 97.97% <0%> (-0.02%) ⬇️
pandas/core/series.py 93.67% <0%> (ø) ⬆️
pandas/core/internals/blocks.py 94.08% <0%> (ø) ⬆️
pandas/util/testing.py 90.7% <0%> (+0.1%) ⬆️
pandas/core/dtypes/dtypes.py 97.34% <0%> (+0.69%) ⬆️
pandas/core/computation/expr.py 97.52% <0%> (+0.82%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80bddaf...ea023de. Read the comment docs.

@shantanu-gontia
Copy link
Contributor Author

Added the changes you requested in the last commit

@shantanu-gontia
Copy link
Contributor Author

It would appear that this method would not work when subclasses of BaseGrouper are involved. Will patch with a different method in the next commit

@pep8speaks
Copy link

pep8speaks commented May 14, 2019

Hello @shantanu-gontia! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-19 22:20:05 UTC

@@ -408,7 +408,7 @@ Groupby/Resample/Rolling
- Bug in :meth:`pandas.core.groupby.GroupBy.idxmax` and :meth:`pandas.core.groupby.GroupBy.idxmin` with datetime column would return incorrect dtype (:issue:`25444`, :issue:`15306`)
- Bug in :meth:`pandas.core.groupby.GroupBy.cumsum`, :meth:`pandas.core.groupby.GroupBy.cumprod`, :meth:`pandas.core.groupby.GroupBy.cummin` and :meth:`pandas.core.groupby.GroupBy.cummax` with categorical column having absent categories, would return incorrect result or segfault (:issue:`16771`)
- Bug in :meth:`pandas.core.groupby.GroupBy.nth` where NA values in the grouping would return incorrect results (:issue:`26011`)

- Bug in :meth:`pandas.core.groupby.ops.BaseGrouper.groups` in which creating a :class:`GroupBy` object with a key of type :class:`Grouper` would result in producing incorrect groups (:issue:`26326`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not user facing, can you reword a bit

@@ -258,6 +267,8 @@ def groups(self):
if len(self.groupings) == 1:
return self.groupings[0].groups
else:
def is_basegrouper(self, obj):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not needed

['C', 'D']], names=['alpha', 'beta'])
df = pd.DataFrame({'foo': [1, 2, 1, 2], 'bar': [1, 2, 3, 4]},
index=mi)
grp1 = df.groupby([pd.Grouper(level='alpha'), 'beta'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you call this

result =

and the nbggrp1 -> expected

then do then consecutively, e.g. like

result = df.groupby(....)
expected = ....

assert result.groups == expected.groups

result = df......
expected = ....

assert

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, looks cleaner that way.



def test_groupby_groups_in_BaseGrouper():
# GH 26326
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give a 1-line description of what this is testing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. will do so in the next commit

@jreback jreback added this to the 0.25.0 milestone May 19, 2019
@@ -444,9 +444,9 @@ Groupby/Resample/Rolling
- Bug in :meth:`pandas.core.groupby.GroupBy.cumsum`, :meth:`pandas.core.groupby.GroupBy.cumprod`, :meth:`pandas.core.groupby.GroupBy.cummin` and :meth:`pandas.core.groupby.GroupBy.cummax` with categorical column having absent categories, would return incorrect result or segfault (:issue:`16771`)
- Bug in :meth:`pandas.core.groupby.GroupBy.nth` where NA values in the grouping would return incorrect results (:issue:`26011`)
- Bug in :meth:`pandas.core.groupby.SeriesGroupBy.transform` where transforming an empty group would raise error (:issue:`26208`)
- Bug in :meth:`pandas.core.frame.DataFrame.groupby` where passing a :class:`pandas.core.groupby.grouper.Grouper` would return incorrect groups (:issue:`26326`)
Copy link
Contributor

@jreback jreback May 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

incorrect groups -> incorrect groups when using .groups accessor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do so in the next commit.

@jreback jreback merged commit 279753c into pandas-dev:master May 20, 2019
@jreback
Copy link
Contributor

jreback commented May 20, 2019

thanks @shantanu-gontia

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: ngroups and len(groups) do not equal when grouping with a list of Grouper and column label
4 participants