BUG: Fix #10355, std() groupby calculation #26229

alexcwatt · 2019-04-27T22:37:59Z

closes BUG: STD modifies groupby target column when as_index=False #10355
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
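
For context, here is a minimal pure-Python sketch of the kind of failure the linked issue title describes. It assumes (this is not confirmed in this thread) that the standard deviation was obtained by taking a square root over every column of a variance result, so that with `as_index=False` the grouping column, which is then an ordinary column, was transformed too. The row layout and the helper names `group_var`, `naive_std`, and `fixed_std` are hypothetical.

```python
import math
from collections import defaultdict
from statistics import variance


def group_var(rows, key, value):
    """Sample variance (ddof=1) of `value` per `key`, one output row per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    # Mimic as_index=False: the grouping key is kept as a regular column.
    return [{key: k, value: variance(v)} for k, v in sorted(groups.items())]


def naive_std(var_rows):
    """Buggy: takes the square root of every column, including the key."""
    return [{col: math.sqrt(x) for col, x in row.items()} for row in var_rows]


def fixed_std(var_rows, key):
    """Takes the square root of the value columns only; the key is untouched."""
    return [
        {col: (x if col == key else math.sqrt(x)) for col, x in row.items()}
        for row in var_rows
    ]


rows = [
    {"a": 4, "b": 1.0}, {"a": 4, "b": 3.0},
    {"a": 9, "b": 2.0}, {"a": 9, "b": 6.0},
]
var_rows = group_var(rows, key="a", value="b")
print(naive_std(var_rows))       # key column comes back as 2.0 and 3.0
print(fixed_std(var_rows, "a"))  # key column stays 4 and 9
```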

codecov · 2019-04-28T00:49:14Z

Codecov Report

Merging #26229 into master will decrease coverage by 51.27%.
The diff coverage is 0%.

@@             Coverage Diff             @@
##           master   #26229       +/-   ##
===========================================
- Coverage   91.97%    40.7%   -51.28%     
===========================================
  Files         175      175               
  Lines       52379    52380        +1     
===========================================
- Hits        48178    21321    -26857     
- Misses       4201    31059    +26858

Flag	Coverage Δ
#multiple	`?`
#single	`40.7% <0%> (-0.15%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/groupby.py	`24.45% <0%> (-72.78%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.37%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.16%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.1%)`	⬇️
pandas/core/tools/numeric.py	`10.44% <0%> (-89.56%)`	⬇️
... and 131 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 64104ec...493901e.

codecov · 2019-04-28T00:49:15Z

Codecov Report

Merging #26229 into master will decrease coverage by 50.05%.
The diff coverage is 0%.

@@             Coverage Diff             @@
##           master   #26229       +/-   ##
===========================================
- Coverage   91.73%   41.68%   -50.06%     
===========================================
  Files         174      174               
  Lines       50741    50746        +5     
===========================================
- Hits        46548    21154    -25394     
- Misses       4193    29592    +25399

Flag	Coverage Δ
#multiple	`?`
#single	`41.68% <0%> (-0.11%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/ops.py	`19.47% <ø> (-76.49%)`	⬇️
pandas/core/groupby/groupby.py	`24.24% <0%> (-73%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.37%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.16%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.1%)`	⬇️
... and 131 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1263e1a...83847b6.

jreback · 2019-04-28T15:35:47Z

pandas/core/groupby/groupby.py

@@ -867,7 +867,7 @@ def _python_agg_general(self, func, *args, **kwargs):
            try:
                result, counts = self.grouper.agg_series(obj, f)
                output[name] = self._try_cast(result, obj, numeric_only=True)
-            except TypeError:
+            except (IndexError, TypeError):


really? how does this happen?

Some of the test cases fail when using _python_agg_general on, I believe, an empty series. I was quite confused at first as to why my code was causing this to fail when tests were passing just fine on var. Then I noticed that if I removed the _cython_agg_general call (which var tries to use, and only falls back on _python_agg_general in case of an exception), additional tests would begin failing.

is this still needed? what exactly is raising IndexError?

So it comes from groupby/ops.py's _aggregate_series_pure_python (called by agg_series when _aggregate_series_fast doesn't work). In particular if pandas/tests/resample/test_base.py:test_resample_empty_series is running and I don't have the cython version and I instead run the python version, the test fails due to an IndexError from pandas\_libs\reduction.pyx:242 where get_result checks self.ngroups > 0 and then accesses self.bins[0] ... I guess maybe I should update that code to check for self.bins > 0.

I can also just not handle IndexError and forget about this bug; if it comes up for std(), it would also be likely to come up for other functions in the same way... only when Cython fails and some aggregation is being done on an empty series, if I understand correctly.

yes that is better

we need to be very clear in where exceptions are caught

It seems that merely checking self.bins.size > 0 exposes another issue that still causes a failure; I think it's best not to worry about this case at this time.
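
The failure mode described in this thread can be sketched in plain Python. This is not the code in `reduction.pyx`; `first_bin_unguarded` and `first_bin_guarded` are hypothetical stand-ins that only show why guarding on the group count does not protect an index into an empty `bins` sequence, while guarding on its length does.

```python
def first_bin_unguarded(bins, ngroups):
    """Guard on the group count only, as described in the thread."""
    if ngroups > 0:
        return bins[0]  # raises IndexError when bins is empty
    return None


def first_bin_guarded(bins, ngroups):
    """Also require that bins is non-empty before indexing into it."""
    if ngroups > 0 and len(bins) > 0:
        return bins[0]
    return None


try:
    first_bin_unguarded([], ngroups=1)
except IndexError as exc:
    print("unguarded:", type(exc).__name__)

print("guarded:", first_bin_guarded([], ngroups=1))
```

As noted in the thread, the size check alone still left another failure, so this sketch covers only the indexing step.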

jreback · 2019-04-28T15:40:13Z

did the patch i put up in the original issue not solve this?

pep8speaks · 2019-04-28T17:57:00Z

Hello @alexcwatt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-17 13:55:48 UTC

jreback · 2019-04-28T18:49:17Z

pandas/core/groupby/groupby.py

@@ -867,7 +867,7 @@ def _python_agg_general(self, func, *args, **kwargs):
            try:
                result, counts = self.grouper.agg_series(obj, f)
                output[name] = self._try_cast(result, obj, numeric_only=True)
-            except TypeError:
+            except (IndexError, TypeError):


is this still needed? what exactly is raising IndexError?

pandas/tests/groupby/test_function.py

pandas/tests/groupby/test_whitelist.py

jreback

looks pretty good. comments on the tests. ping on green.

jreback · 2019-05-07T01:40:25Z

pandas/tests/groupby/test_whitelist.py

    if axis == 0:
        frame = raw_frame
    else:
        frame = raw_frame.T

+    groupby_kwargs = {'level': level, 'axis': axis, 'sort': sort,


I don't think you actually need to define this kwargs, just inline it when calling frame.groupby(...) no?

Ah, good point

jreback · 2019-05-07T01:41:29Z

pandas/tests/groupby/test_whitelist.py

+    groupby_kwargs = {'level': level, 'axis': axis, 'sort': sort,
+        'as_index': as_index}
+    group_op_kwargs = {}
+    frame_op_kwargs = {'level': level, 'axis': axis}
    if op in AGG_FUNCTIONS_WITH_SKIPNA:


I think this would be more clear with

    # comment explaining why this is needed
    if op in AGG_FUNCTION.S....:
        group_op_kwargs = .....
        frame_op_kwargs = ....
    else:
        group......
        frame_op....

So remove the idea of 'base arguments'? I thought it might be good to pull out the common arguments but maybe it's more readable the other way. The previous test code just seemed pretty repetitive to me.

yeah just trying to make this simpler to read

Sounds good, I will clean these tests up a bit and ping when everything is green!
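
The structure suggested in the review, with each branch spelling out its own keyword arguments, might look like the sketch below. The contents of `AGG_FUNCTIONS_WITH_SKIPNA`, the `skipna` keyword in the first branch, and the helper name `build_op_kwargs` are assumptions made for illustration; the actual test code is not shown in this thread.

```python
# Hypothetical subset; the real constant lives in the test module.
AGG_FUNCTIONS_WITH_SKIPNA = {"skew", "mad"}


def build_op_kwargs(op, level, axis, skipna):
    """Return (group_op_kwargs, frame_op_kwargs) for one parametrized case."""
    # Ops in AGG_FUNCTIONS_WITH_SKIPNA are assumed to accept `skipna` on both
    # the groupby call and the plain frame call, so both dicts carry it.
    if op in AGG_FUNCTIONS_WITH_SKIPNA:
        group_op_kwargs = {"skipna": skipna}
        frame_op_kwargs = {"level": level, "axis": axis, "skipna": skipna}
    else:
        group_op_kwargs = {}
        frame_op_kwargs = {"level": level, "axis": axis}
    return group_op_kwargs, frame_op_kwargs


print(build_op_kwargs("skew", level=0, axis=0, skipna=True))
print(build_op_kwargs("sum", level=0, axis=0, skipna=True))
```

Each branch is longer than with shared base arguments, but a reader can see the full set of arguments for a case without tracing earlier assignments.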

Estebansalazar · 2019-05-11T22:22:24Z

Thanks!

jreback

pls also merge master

jreback · 2019-05-12T14:38:26Z

doc/source/whatsnew/v0.25.0.rst

@@ -399,6 +399,7 @@ Groupby/Resample/Rolling
 - Bug in :meth:`pandas.core.groupby.GroupBy.idxmax` and :meth:`pandas.core.groupby.GroupBy.idxmin` with datetime column would return incorrect dtype (:issue:`25444`, :issue:`15306`)
 - Bug in :meth:`pandas.core.groupby.GroupBy.cumsum`, :meth:`pandas.core.groupby.GroupBy.cumprod`, :meth:`pandas.core.groupby.GroupBy.cummin` and :meth:`pandas.core.groupby.GroupBy.cummax` with categorical column having absent categories, would return incorrect result or segfault (:issue:`16771`)
 - Bug in :meth:`pandas.core.groupby.GroupBy.nth` where NA values in the grouping would return incorrect results (:issue:`26011`)
+- Bug in :meth:`pandas.core.groupby.GroupBy.std` that computed standard deviation without respecting groupby context when `as_index=False` (:issue:`10355`)


use double back ticks around as_index=False

jreback · 2019-05-12T14:40:08Z

pandas/tests/groupby/test_whitelist.py

@@ -164,33 +164,43 @@ def raw_frame():
 @pytest.mark.parametrize('axis', [0, 1])


remove this and it will just use the axis fixture; note that axis will take on [0, 1, 'index', 'columns'], so you may have to adjust some of the test conditions
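
Since the fixture yields both integer and string spellings of the axis, conditions written as `axis == 0` would miss `'index'`. One way to adjust them, sketched with a hypothetical helper `axis_number` (not a pandas API), is to normalize the value first.

```python
_AXIS_NUMBERS = {0: 0, "index": 0, 1: 1, "columns": 1}


def axis_number(axis):
    """Map any of 0, 1, 'index', 'columns' to the integer axis number."""
    try:
        return _AXIS_NUMBERS[axis]
    except KeyError:
        raise ValueError(f"unsupported axis: {axis!r}") from None


for axis in [0, 1, "index", "columns"]:
    # A condition written as `axis == 0` would be False for 'index'.
    print(repr(axis), "->", axis_number(axis), "| axis == 0:", axis == 0)
```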

jreback · 2019-05-16T01:51:11Z

can you merge master and update

alexcwatt · 2019-05-17T04:40:15Z

Merged master and made some updates to the unit tests. I am now looking into some failing test cases (pytest pandas\tests\resample\ -k test_resample_empty_series), will try to figure out this weekend but wanted to push the other minor updates.

alexcwatt · 2019-05-17T13:58:30Z

Still have a bit more to do with this one corner case. I think the reason it hasn't been an issue too much in the past is that some methods like groupby's mean just catch all exceptions coming from the cython_agg_general and fall back on Python. I can do that if I have to, and revert my changes to reduction.pyx, but I am trying to grok what's going on in there so we don't need to catch a general Exception and fall back on Python.

If anyone can shed some light please feel free, otherwise again I am going to try to understand and resolve it this weekend.
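
The fallback pattern described above, and the narrower alternative the reviewers ask for, can be contrasted in a short sketch. `fast_path`, `slow_path`, and `UnsupportedInput` are hypothetical placeholders, not pandas internals.

```python
class UnsupportedInput(Exception):
    """Hypothetical signal that the fast path cannot handle the input."""


def fast_path(values):
    if not values:
        raise UnsupportedInput("fast path needs at least one value")
    return sum(values) / len(values)


def slow_path(values):
    return sum(values) / len(values) if values else float("nan")


def mean_broad(values):
    # Broad fallback: any failure in the fast path is hidden, bugs included.
    try:
        return fast_path(values)
    except Exception:
        return slow_path(values)


def mean_narrow(values):
    # Narrow fallback: only the documented "unsupported" case is caught,
    # so an unexpected error such as a TypeError still surfaces.
    try:
        return fast_path(values)
    except UnsupportedInput:
        return slow_path(values)


print(mean_narrow([1.0, 2.0, 3.0]))  # fast path handles this
print(mean_narrow([]))               # falls back to the slow path
```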

jreback · 2019-05-19T18:48:39Z

thanks for working on this @alexcwatt groupby is a bit tricky :-D

alexcwatt · 2019-05-20T03:27:40Z

@jreback Can you please confirm that the test that is failing does specify the desired behavior? Currently the expectation is that it should return an empty dataframe, with no columns. The actual result is now that it returns an empty dataframe, with a column '0'. The column '0' is seen in other tests like test_resample_with_nat that actually return data but the test test_resample_with_only_nat is written to specify that we expect an empty dataframe with no columns...

WillAyd · 2019-05-20T16:40:56Z

@alexcwatt I wasn't able to reproduce that issue off of this locally; restarted CI to see

WillAyd · 2019-05-20T16:46:00Z

Ignore previous comment I was able to reproduce issue after building extensions. So yea @alexcwatt it looks like these changes are implicitly adding a blank column to that test which we wouldn't want

alexcwatt · 2019-05-20T20:11:02Z

Okay, thanks for the confirmation, will look into it when I can

jreback · 2019-06-08T22:18:18Z

can you merge master and see if you can get passing

jreback · 2019-06-27T03:41:54Z

can you merge master.

alexcwatt · 2019-06-27T20:46:32Z

I'll take another look tonight. Unfortunately what I've found is that there is an exception being raised by the Cython code that I don't quite understand, and the way some of the other groupby operations work is that they catch exceptions from Cython and fall back to the Python version that doesn't have the bug/issue. I think this is what I need to do for std() as well.

Edit: When I tried fixing this in the Cython code, it led to that one unit test failing, and I haven't been able to figure out how to prevent that purely on the Cython side. I might open a bug about the Cython code to be fixed later, with notes on the direction I was going.

TomAugspurger · 2019-07-02T20:22:23Z

Pushing to 0.25.1. @alex if you merge master sometime later today there will be a 0.25.1.rst for the release note.

WillAyd · 2019-08-28T16:13:38Z

Nice PR but I think this has gone stale so closing. ping if you'd like to pick back up

jreback requested changes Apr 28, 2019

jreback added Bug Groupby labels Apr 28, 2019

jreback requested changes Apr 28, 2019

jreback requested changes Apr 28, 2019

jreback requested changes Apr 28, 2019

WillAyd requested changes Apr 29, 2019

gfyoung reviewed May 2, 2019

alexcwatt added 13 commits May 6, 2019 21:32

BUG: Fix pandas-dev#10355, std() groupby calculation

14eb325

Add whatsnew note

281ae55

Pass ddof to std, remove axis

0b457ef

Add back validation

17b5aaf

Handle IndexError in _python_agg_general

85b1639

Make requested changes to tests

a0c3c3e

Add cython version of groupby std

575a9ca

PEP8 fix

ef89d64

Refactor std and var to reduce complexity

62ad090

Don't cactch IndexError

12b49c4

Begin updating test_regression_whitelist_methods

8c52aba

Update test_regression_whitelist_methods, remove specialized test

34382db

Resolve PEP8 issue

42ad605

alexcwatt force-pushed the fix-10355 branch from 3194ef2 to 42ad605 Compare May 7, 2019 01:36

jreback requested changes May 7, 2019

Make test code clearer

71e1fb8

jreback requested changes May 12, 2019

alexcwatt added 2 commits May 16, 2019 23:37

Merge branch 'master' into fix-10355

663c564

Switch to fixture for axis, add appropriate pytest.skips

ee5ba00

Handle case of no bins in reduction.pyx

83847b6

TomAugspurger added this to the 0.25.1 milestone Jul 2, 2019

jreback removed this from the 0.25.1 milestone Jul 24, 2019

WillAyd closed this Aug 28, 2019
