Skip to content

BUG: Fix groupby observed=True when aggregating a column #24412

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Dec 31, 2018
Merged

BUG: Fix groupby observed=True when aggregating a column #24412

merged 16 commits into from
Dec 31, 2018

Conversation

Koustav-Samaddar
Copy link
Contributor

@Koustav-Samaddar Koustav-Samaddar commented Dec 24, 2018

@codecov
Copy link

codecov bot commented Dec 24, 2018

Codecov Report

Merging #24412 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #24412   +/-   ##
=======================================
  Coverage    92.3%    92.3%           
=======================================
  Files         163      163           
  Lines       51943    51943           
=======================================
  Hits        47946    47946           
  Misses       3997     3997
Flag Coverage Δ
#multiple 90.71% <ø> (ø) ⬆️
#single 43% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 87.15% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc7bc3f...3b50900. Read the comment docs.

4 similar comments
@codecov
Copy link

codecov bot commented Dec 24, 2018

Codecov Report

Merging #24412 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #24412   +/-   ##
=======================================
  Coverage    92.3%    92.3%           
=======================================
  Files         163      163           
  Lines       51943    51943           
=======================================
  Hits        47946    47946           
  Misses       3997     3997
Flag Coverage Δ
#multiple 90.71% <ø> (ø) ⬆️
#single 43% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 87.15% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc7bc3f...3b50900. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 24, 2018

Codecov Report

Merging #24412 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #24412   +/-   ##
=======================================
  Coverage    92.3%    92.3%           
=======================================
  Files         163      163           
  Lines       51943    51943           
=======================================
  Hits        47946    47946           
  Misses       3997     3997
Flag Coverage Δ
#multiple 90.71% <ø> (ø) ⬆️
#single 43% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 87.15% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc7bc3f...3b50900. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 24, 2018

Codecov Report

Merging #24412 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #24412   +/-   ##
=======================================
  Coverage    92.3%    92.3%           
=======================================
  Files         163      163           
  Lines       51943    51943           
=======================================
  Hits        47946    47946           
  Misses       3997     3997
Flag Coverage Δ
#multiple 90.71% <ø> (ø) ⬆️
#single 43% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 87.15% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc7bc3f...3b50900. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 24, 2018

Codecov Report

Merging #24412 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #24412   +/-   ##
=======================================
  Coverage    92.3%    92.3%           
=======================================
  Files         163      163           
  Lines       51943    51943           
=======================================
  Hits        47946    47946           
  Misses       3997     3997
Flag Coverage Δ
#multiple 90.71% <ø> (ø) ⬆️
#single 43% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 87.15% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc7bc3f...3b50900. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 24, 2018

Codecov Report

Merging #24412 into master will decrease coverage by 60.4%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24412       +/-   ##
===========================================
- Coverage    92.3%    31.9%   -60.41%     
===========================================
  Files         163      166        +3     
  Lines       51977    52421      +444     
===========================================
- Hits        47979    16723    -31256     
- Misses       3998    35698    +31700
Flag Coverage Δ
#multiple 30.29% <ø> (-60.43%) ⬇️
#single 31.9% <ø> (-11.1%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 13.71% <ø> (-73.45%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-98.65%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/core/reshape/reshape.py 8.06% <0%> (-91.51%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
... and 129 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef1bd69...3e9178a. Read the comment docs.

@mroeschke mroeschke changed the title Bug 23970 BUG: Fix groupby observed=True when aggregating a column Dec 24, 2018
'a': pd.Series([1, 1, 2], dtype='category'),
'b': [1, 2, 2], 'x': [1, 2, 3]})

expected = df
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use the observed fixture as well here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't quite understand. Could you elaborate?

@pep8speaks
Copy link

pep8speaks commented Dec 25, 2018

Hello @Koustav-Samaddar! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 28, 2018 at 03:23 Hours UTC

'b': [1, 2, 2], 'x': [1, 2, 3]})

result = expected.groupby(['a', 'b'],
as_index=False, observed=True)['x'].sum()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is as_index required here? I see the comment in the original issue though based off of code correction and surprised if that's really the culprit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as_index is required to demonstrate that the bug exists. The code block that requires the correct value of observed is only enetered if as_index=False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To further clarify, no as_index a.k.a. as_index=True would show no difference since the code that is executed in this case does not use observed.

'a': pd.Series([1, 1, 2], dtype='category'),
'b': [1, 2, 2], 'x': [1, 2, 3]})

result = expected.groupby(['a', 'b'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be reading this wrong but do we need the non-categorical column in the grouper to emulate this behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is because the bug generates a group that doesn't exist due to using some weird Cartesian product of the two columns used in the grouper.

Therefore, while the only valid groupings are (1, 1) (1, 2) (2, 2); the bug uses the CP of the two set(1, 2) * set(1, 2) which results in an additional non-existent (2, 1) grouping as well which is represented as NaN in the bugged output.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks for the responses. In that case I think you can improve the whatsnew to say that the observed keyword was essentially being ignored when selecting one column as part of the GroupBy. It’s currently too vague IMO

Copy link
Contributor Author

@Koustav-Samaddar Koustav-Samaddar Dec 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also used to be the case. But I changed it to be simpler on @mroeschke 's advice (in a previous resolved convo).

I can change it back if needed but would just like a confirmation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarity is always welcome. But generally whatsnew entries should refer to externally facing methods and implications.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea the previous entry was arguably too specific mentioning the exact internal method where the problem arose. The current entry is too vague to be useful.

If you can meet somewhere in the middle on that I think it again would be mentioning that the observed keyword was not being respected, but without mentioning the private internal methods where the implementation was actually changed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed it. Hopefully this is a better middle ground between the 2 previous approaches!


def test_groupby_agg_observed_true_single_column():
# GH-23970
expected = pd.DataFrame({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df and expected should be constructed as separate objects. I think generally this test case could be made more explicit depending on answer to other questions, but once we can clarify those pieces will want to split this out as two separate variables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can create a test case where df and expected are different but that'll cause df to be larger.

I originally had expected and df as separate variable names but on @mroeschke 's advice (in a previous resolved convo) I replaced df with expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize over as_index=True/False

Copy link
Contributor Author

@Koustav-Samaddar Koustav-Samaddar Dec 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was parameterising as_index, but I hit a roadblock - I'm having difficulty recreating the expected value for as_index=True and I'm stumped how to go about fixing it.
Any help is greatly appreciated.

import pandas as pd
from pandas.util import testing as tm

expected = pd.DataFrame({
    'a': pd.Series([1, 1, 2], dtype='category'),
    'b': [1, 2, 2],
    'x': [1, 2, 3]
})

result = expected.groupby(['a', 'b'], 
	as_index=True,
	observed=True)['x'].sum()

print(result)

"""
a  b
1  1    1
   2    2
2  2    3
Name: x, dtype: int64
"""

s_idx = pd.MultiIndex(levels=[[1, 2], [1, 2]],
           codes=[[0, 0, 1], [0, 1, 1]],
           names=['a', 'b'])

s_val = [ 1, 2, 3 ]

expected = pd.Series(index=s_idx, data=s_val, name='x')
print(expected)

"""
a  b
1  1    1
   2    2
2  2    3
Name: x, dtype: int64
"""

tm.assert_series_equal(result, expected)

"""
Traceback (most recent call last):
  File "myTest.py", line 41, in <module>
    tm.assert_series_equal(result, expected)
  File "/home/vagrant/pandas/pandas/util/testing.py", line 1373, in assert_series_equal
    obj='{obj}.index'.format(obj=obj))
  File "/home/vagrant/pandas/pandas/util/testing.py", line 942, in assert_index_equal
    check_exact=check_exact, obj=lobj)
  File "/home/vagrant/pandas/pandas/util/testing.py", line 915, in assert_index_equal
    _check_types(left, right, obj=obj)
  File "/home/vagrant/pandas/pandas/util/testing.py", line 891, in _check_types
    assert_class_equal(l, r, exact=exact, obj=obj)
  File "/home/vagrant/pandas/pandas/util/testing.py", line 996, in assert_class_equal
    repr_class(right))
  File "/home/vagrant/pandas/pandas/util/testing.py", line 1188, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: MultiIndex level [0] are different

MultiIndex level [0] classes are not equivalent
[left]:  CategoricalIndex([1, 1, 2], categories=[1, 2], ordered=False, name='a', dtype='category')
[right]: Int64Index([1, 1, 2], dtype='int64', name='a')
"""

I understand that the issue is with not using CategoricalIndex at MultiIndex[0] but I have no idea how to go about doing it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use MultiIndex.from_arrays and pass a list of the Index already of the correct types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

@WillAyd WillAyd added the Categorical Categorical Data Type label Dec 25, 2018
@@ -1560,6 +1560,7 @@ Groupby/Resample/Rolling
- Bug in :meth:`pandas.core.groupby.GroupBy.rank` with ``method='dense'`` and ``pct=True`` when a group has only one member would raise a ``ZeroDivisionError`` (:issue:`23666`).
- Calling :meth:`pandas.core.groupby.GroupBy.rank` with empty groups and ``pct=True`` was raising a ``ZeroDivisionError`` (:issue:`22519`)
- Bug in :meth:`DataFrame.resample` when resampling ``NaT`` in ``TimeDeltaIndex`` (:issue:`13223`).
- Bug in :meth:`DataFrame.groupby` did not work correctly with ``observed=True`` when aggregating a specified column (:issue:`23970`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is when its a categorical column; also this is only for as_index=False, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any operation over a selected column was effectively ignoring the observed parameter that was being passed to it. With as_index=True the observed value isn't used so the code behaves as expected.


def test_groupby_agg_observed_true_single_column():
# GH-23970
expected = pd.DataFrame({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize over as_index=True/False


def test_groupby_agg_observed_true_single_column():
# GH-23970
expected = pd.DataFrame({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use MultiIndex.from_arrays and pass a list of the Index already of the correct types

@topper-123
Copy link
Contributor

A small note: this does not close #21151.

@Koustav-Samaddar
Copy link
Contributor Author

A small note: this does not close #21151.

Definitely. This only affects bugs caused due to the observed argument being ignored. From my understanding #21151 is due to an error in the building of the result DataFrame and not due to the observed tag being ignored.

result = df.groupby(
['a', 'b'], as_index=as_index, observed=True)['x'].sum()

if isinstance(result, pd.DataFrame):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you an use assert_equal here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Didn't know about that one

@jreback jreback added this to the 0.24.0 milestone Dec 30, 2018
@jreback
Copy link
Contributor

jreback commented Dec 30, 2018

minor comment. ping on green. @WillAyd ?

@jreback jreback merged commit c98b782 into pandas-dev:master Dec 31, 2018
@jreback
Copy link
Contributor

jreback commented Dec 31, 2018

thanks!

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby observed=True not working for aggregating a column
6 participants