BUG: Fix groupby observed=True when aggregating a column #24412

Koustav-Samaddar · 2018-12-24T16:17:19Z

closes groupby observed=True not working for aggregating a column #23970
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Bug would have impacted any groupby function that relied on `observed` if it were `True`

codecov · 2018-12-24T16:54:45Z

Codecov Report

Merging #24412 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #24412   +/-   ##
=======================================
  Coverage    92.3%    92.3%           
=======================================
  Files         163      163           
  Lines       51943    51943           
=======================================
  Hits        47946    47946           
  Misses       3997     3997

Flag	Coverage Δ
#multiple	`90.71% <ø> (ø)`	⬆️
#single	`43% <ø> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/groupby/generic.py	`87.15% <ø> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc7bc3f...3b50900. Read the comment docs.

codecov · 2018-12-24T16:54:48Z

Codecov Report

Merging #24412 into master will decrease coverage by 60.4%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master   #24412       +/-   ##
===========================================
- Coverage    92.3%    31.9%   -60.41%     
===========================================
  Files         163      166        +3     
  Lines       51977    52421      +444     
===========================================
- Hits        47979    16723    -31256     
- Misses       3998    35698    +31700

Flag	Coverage Δ
#multiple	`30.29% <ø> (-60.43%)`	⬇️
#single	`31.9% <ø> (-11.1%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/generic.py	`13.71% <ø> (-73.45%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/core/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-98.65%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-95.46%)`	⬇️
pandas/core/reshape/reshape.py	`8.06% <0%> (-91.51%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.17%)`	⬇️
... and 129 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef1bd69...3e9178a. Read the comment docs.

jreback · 2018-12-24T19:17:25Z

pandas/tests/groupby/test_groupby.py

+        'a': pd.Series([1, 1, 2], dtype='category'),
+        'b': [1, 2, 2], 'x': [1, 2, 3]})
+
+    expected = df


you can use the observed fixture as well here

Sorry I didn't quite understand. Could you elaborate?

pep8speaks · 2018-12-25T00:37:17Z

Hello @Koustav-Samaddar! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 28, 2018 at 03:23 Hours UTC

WillAyd · 2018-12-25T00:48:39Z

pandas/tests/groupby/test_categorical.py

+        'b': [1, 2, 2], 'x': [1, 2, 3]})
+
+    result = expected.groupby(['a', 'b'],
+        as_index=False, observed=True)['x'].sum()


Is as_index required here? I see the comment in the original issue, but based on the code correction I'd be surprised if that's really the culprit.

as_index is required to demonstrate that the bug exists. The code block that requires the correct value of observed is only entered if as_index=False

To further clarify, no as_index a.k.a. as_index=True would show no difference since the code that is executed in this case does not use observed.
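A minimal sketch of the case described above, reusing the frame from the test in this PR. It assumes a pandas build that includes this fix; the row-count check at the end is an illustration added here, not part of the PR's test.

```python
import pandas as pd

# Same data as the test case discussed in this thread: 'a' is categorical,
# 'b' is a plain integer column, and only three (a, b) pairs occur.
df = pd.DataFrame({
    'a': pd.Series([1, 1, 2], dtype='category'),
    'b': [1, 2, 2],
    'x': [1, 2, 3],
})

# as_index=False together with a single selected column is the code path
# that ignored `observed` before the fix.
result = df.groupby(['a', 'b'], as_index=False, observed=True)['x'].sum()
print(result)

# With `observed` respected, only the three observed groups remain.
assert len(result) == 3
```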

WillAyd · 2018-12-25T00:52:13Z

pandas/tests/groupby/test_categorical.py

+        'a': pd.Series([1, 1, 2], dtype='category'),
+        'b': [1, 2, 2], 'x': [1, 2, 3]})
+
+    result = expected.groupby(['a', 'b'],


Might be reading this wrong but do we need the non-categorical column in the grouper to emulate this behavior?

Yes. This is because the bug generates a group that doesn't exist due to using some weird Cartesian product of the two columns used in the grouper.

So while the only valid groupings are (1, 1), (1, 2) and (2, 2), the bug uses the Cartesian product of the two key sets, {1, 2} × {1, 2}, which adds a non-existent (2, 1) grouping that shows up as NaN in the buggy output.
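The arithmetic behind that explanation can be sketched with the standard library alone. The variable names are illustrative and nothing here touches pandas internals.

```python
from itertools import product

# Key pairs that actually occur in the test frame.
observed_pairs = [(1, 1), (1, 2), (2, 2)]

# What a full Cartesian product of the two key sets would enumerate.
all_pairs = sorted(product({1, 2}, {1, 2}))

# Pairs present in the product but never observed in the data.
phantom_pairs = [pair for pair in all_pairs if pair not in observed_pairs]
print(phantom_pairs)  # [(2, 1)]
```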

Ok thanks for the responses. In that case I think you can improve the whatsnew to say that the observed keyword was essentially being ignored when selecting one column as part of the GroupBy. It’s currently too vague IMO

Also used to be the case. But I changed it to be simpler on @mroeschke 's advice (in a previous resolved convo).

I can change it back if needed but would just like a confirmation.

Clarity is always welcome. But generally whatsnew entries should refer to externally facing methods and implications.

Yea the previous entry was arguably too specific mentioning the exact internal method where the problem arose. The current entry is too vague to be useful.

If you can meet somewhere in the middle on that I think it again would be mentioning that the observed keyword was not being respected, but without mentioning the private internal methods where the implementation was actually changed

Pushed it. Hopefully this is a better middle ground between the 2 previous approaches!

WillAyd · 2018-12-25T00:53:05Z

pandas/tests/groupby/test_categorical.py

+
+def test_groupby_agg_observed_true_single_column():
+    # GH-23970
+    expected = pd.DataFrame({


df and expected should be constructed as separate objects. I think generally this test case could be made more explicit depending on answer to other questions, but once we can clarify those pieces will want to split this out as two separate variables

I can create a test case where df and expected are different but that'll cause df to be larger.

I originally had expected and df as separate variable names but on @mroeschke 's advice (in a previous resolved convo) I replaced df with expected.

can you parameterize over as_index=True/False

I was parameterising as_index, but I hit a roadblock - I'm having difficulty recreating the expected value for as_index=True and I'm stumped how to go about fixing it.
Any help is greatly appreciated.

    import pandas as pd
    from pandas.util import testing as tm

    expected = pd.DataFrame({
        'a': pd.Series([1, 1, 2], dtype='category'),
        'b': [1, 2, 2], 'x': [1, 2, 3]
    })

    result = expected.groupby(['a', 'b'], as_index=True, observed=True)['x'].sum()
    print(result)
    """
    a  b
    1  1    1
       2    2
    2  2    3
    Name: x, dtype: int64
    """

    s_idx = pd.MultiIndex(levels=[[1, 2], [1, 2]], codes=[[0, 0, 1], [0, 1, 1]], names=['a', 'b'])
    s_val = [1, 2, 3]
    expected = pd.Series(index=s_idx, data=s_val, name='x')
    print(expected)
    """
    a  b
    1  1    1
       2    2
    2  2    3
    Name: x, dtype: int64
    """

    tm.assert_series_equal(result, expected)
    """
    Traceback (most recent call last):
      File "myTest.py", line 41, in <module>
        tm.assert_series_equal(result, expected)
      File "/home/vagrant/pandas/pandas/util/testing.py", line 1373, in assert_series_equal
        obj='{obj}.index'.format(obj=obj))
      File "/home/vagrant/pandas/pandas/util/testing.py", line 942, in assert_index_equal
        check_exact=check_exact, obj=lobj)
      File "/home/vagrant/pandas/pandas/util/testing.py", line 915, in assert_index_equal
        _check_types(left, right, obj=obj)
      File "/home/vagrant/pandas/pandas/util/testing.py", line 891, in _check_types
        assert_class_equal(l, r, exact=exact, obj=obj)
      File "/home/vagrant/pandas/pandas/util/testing.py", line 996, in assert_class_equal
        repr_class(right))
      File "/home/vagrant/pandas/pandas/util/testing.py", line 1188, in raise_assert_detail
        raise AssertionError(msg)
    AssertionError: MultiIndex level [0] are different

    MultiIndex level [0] classes are not equivalent
    [left]:  CategoricalIndex([1, 1, 2], categories=[1, 2], ordered=False, name='a', dtype='category')
    [right]: Int64Index([1, 1, 2], dtype='int64', name='a')
    """

I understand that the issue is with not using CategoricalIndex at MultiIndex[0] but I have no idea how to go about doing it.

use MultiIndex.from_arrays and pass a list of the Index already of the correct types

Thanks a lot!
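For reference, one way the `MultiIndex.from_arrays` suggestion could look for the `as_index=True` case. This is a sketch assuming a pandas build that includes this fix and the public `pd.testing` helpers; it is not necessarily the exact code that landed in the test file.

```python
import pandas as pd

df = pd.DataFrame({
    'a': pd.Series([1, 1, 2], dtype='category'),
    'b': [1, 2, 2],
    'x': [1, 2, 3],
})

result = df.groupby(['a', 'b'], as_index=True, observed=True)['x'].sum()

# Build each level with the type it should have: a CategoricalIndex for 'a'
# and a plain integer Index for 'b'. from_arrays keeps those types, so
# level 0 no longer comes out as a plain integer index.
index = pd.MultiIndex.from_arrays(
    [pd.CategoricalIndex([1, 1, 2], categories=[1, 2]), pd.Index([1, 2, 2])],
    names=['a', 'b'],
)
expected = pd.Series([1, 2, 3], index=index, name='x')

pd.testing.assert_series_equal(result, expected)
```

Passing indexes that already have the right types avoids the class mismatch on level 0 shown in the traceback above.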

jreback · 2018-12-26T00:12:54Z

doc/source/whatsnew/v0.24.0.rst

@@ -1560,6 +1560,7 @@ Groupby/Resample/Rolling
 - Bug in :meth:`pandas.core.groupby.GroupBy.rank` with ``method='dense'`` and ``pct=True`` when a group has only one member would raise a ``ZeroDivisionError`` (:issue:`23666`).
 - Calling :meth:`pandas.core.groupby.GroupBy.rank` with empty groups and ``pct=True`` was raising a ``ZeroDivisionError`` (:issue:`22519`)
 - Bug in :meth:`DataFrame.resample` when resampling ``NaT`` in ``TimeDeltaIndex`` (:issue:`13223`).
+- Bug in :meth:`DataFrame.groupby` did not work correctly with ``observed=True`` when aggregating a specified column (:issue:`23970`)


this is when it's a categorical column; also this is only for as_index=False, right?

Any operation over a selected column was effectively ignoring the observed parameter that was being passed to it. With as_index=True the observed value isn't used so the code behaves as expected.

jreback · 2018-12-26T00:13:19Z

pandas/tests/groupby/test_categorical.py

+
+def test_groupby_agg_observed_true_single_column():
+    # GH-23970
+    expected = pd.DataFrame({


can you parameterize over as_index=True/False

jreback · 2018-12-26T18:43:47Z

pandas/tests/groupby/test_categorical.py

+
+def test_groupby_agg_observed_true_single_column():
+    # GH-23970
+    expected = pd.DataFrame({


use MultiIndex.from_arrays and pass a list of the Index already of the correct types

topper-123 · 2018-12-28T17:18:57Z

A small note: this does not close #21151.

Koustav-Samaddar · 2018-12-30T03:52:25Z

A small note: this does not close #21151.

Definitely. This only affects bugs caused due to the observed argument being ignored. From my understanding #21151 is due to an error in the building of the result DataFrame and not due to the observed tag being ignored.

jreback · 2018-12-30T20:15:21Z

pandas/tests/groupby/test_categorical.py

+    result = df.groupby(
+        ['a', 'b'], as_index=as_index, observed=True)['x'].sum()
+
+    if isinstance(result, pd.DataFrame):


you can use assert_equal here

Thanks! Didn't know about that one
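The idea behind replacing the `isinstance` branch can be sketched as a single type-dispatching assertion. `assert_equal_sketch` below is a hypothetical stand-in built only from public `pd.testing` functions; it is not the pandas test helper itself and makes no claim about how that helper is implemented.

```python
import pandas as pd


def assert_equal_sketch(left, right):
    """Hypothetical helper: pick the matching asserter from the type of `left`."""
    if isinstance(left, pd.DataFrame):
        pd.testing.assert_frame_equal(left, right)
    elif isinstance(left, pd.Series):
        pd.testing.assert_series_equal(left, right)
    elif isinstance(left, pd.Index):
        pd.testing.assert_index_equal(left, right)
    else:
        assert left == right


# The same call covers both result shapes a parameterised test can produce.
assert_equal_sketch(pd.Series([1, 2, 3], name='x'), pd.Series([1, 2, 3], name='x'))
assert_equal_sketch(pd.DataFrame({'x': [1, 2, 3]}), pd.DataFrame({'x': [1, 2, 3]}))
```

With a helper of this shape, a test parameterised over `as_index` needs only one assertion line regardless of whether the result is a Series or a DataFrame.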

jreback · 2018-12-30T20:16:01Z

minor comment. ping on green. @WillAyd ?

jreback · 2018-12-31T13:25:09Z

thanks!

Koustav-Samaddar added 3 commits December 24, 2018 16:02

GH23970 Added test-case that causes bug 23970

debb02a

GH 23970 Fixed source of the bug

f450dc2

Bug would have impacted any groupby function that relied on `observed` if it were `True`

GH23970 Added relevant docstring to whatsnew

3b50900

mroeschke changed the title ~~Bug 23970~~ BUG: Fix groupby observed=True when aggregating a column Dec 24, 2018

mroeschke reviewed Dec 24, 2018

doc/source/whatsnew/v0.24.0.rst (outdated, resolved)

mroeschke reviewed Dec 24, 2018

pandas/tests/groupby/test_groupby.py (outdated, resolved)

mroeschke reviewed Dec 24, 2018

pandas/tests/groupby/test_groupby.py (outdated, resolved)

mroeschke added Bug Groupby labels Dec 24, 2018

jreback requested changes Dec 24, 2018

pandas/tests/groupby/test_groupby.py (outdated, resolved)

jreback requested changes Dec 24, 2018

Koustav-Samaddar added 3 commits December 24, 2018 23:08

Made requested change to docstring

e6e9894

Made requested changes to tests

5181639

Fixed typo from incomplete find/replace

4c68d41

Fixed pep8 violation

77cf3c6

WillAyd requested changes Dec 25, 2018

WillAyd added the Categorical Categorical Data Type label Dec 25, 2018

Koustav-Samaddar added 2 commits December 25, 2018 02:07

Fixed pep's E128 violation

572d9c3

Fixed another possible pep violation

8793995

jreback requested changes Dec 26, 2018

Koustav-Samaddar added 2 commits December 26, 2018 03:42

Fixed whatsnew entryto meet in the middle

b75bc7b

Merge branch 'master' into bug-23970

ef48de6

jreback reviewed Dec 26, 2018

Koustav-Samaddar added 4 commits December 28, 2018 03:16

Test has been modified to have parameterised as_index

d87e159

Fixed pep8 issues in test file

b01d809

Made recommended change to whatsnew

7a6898f

Merged for git pull

d584327

jreback reviewed Dec 30, 2018

jreback added this to the 0.24.0 milestone Dec 30, 2018

Changed test to use assert_equal

3e9178a

jreback approved these changes Dec 31, 2018

jreback merged commit c98b782 into pandas-dev:master Dec 31, 2018

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

BUG: Fix groupby observed=True when aggregating a column (pandas-dev#…

5ba4dd1

…24412)

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

BUG: Fix groupby observed=True when aggregating a column (pandas-dev#…

8012c07

…24412)
