-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix #10355, std() groupby calculation #26229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26229 +/- ##
===========================================
- Coverage 91.97% 40.7% -51.28%
===========================================
Files 175 175
Lines 52379 52380 +1
===========================================
- Hits 48178 21321 -26857
- Misses 4201 31059 +26858
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26229 +/- ##
===========================================
- Coverage 91.73% 41.68% -50.06%
===========================================
Files 174 174
Lines 50741 50746 +5
===========================================
- Hits 46548 21154 -25394
- Misses 4193 29592 +25399
Continue to review full report at Codecov.
|
pandas/core/groupby/groupby.py
Outdated
@@ -867,7 +867,7 @@ def _python_agg_general(self, func, *args, **kwargs): | |||
try: | |||
result, counts = self.grouper.agg_series(obj, f) | |||
output[name] = self._try_cast(result, obj, numeric_only=True) | |||
except TypeError: | |||
except (IndexError, TypeError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really? how does this happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the test cases fail when using _python_agg_general
on, I believe, an empty series. I was quite confused at first as to why my code was causing this to fail when tests were passing just fine on var
. Then I noticed that if I removed the _cython_agg_general
call (which var tries to use, and only falls back on _python_agg_general
in case of an exception), additional tests would begin failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this still needed? what exactly is raising IndexErrror?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it comes from groupby/ops.py's _aggregate_series_pure_python
(called by agg_series
when _aggregate_series_fast
doesn't work). In particular if pandas/tests/resample/test_base.py:test_resample_empty_series
is running and I don't have the cython version and I instead run the python version, the test fails due to an IndexError
from pandas\_libs\reduction.pyx:242
where get_result
checks self.ngroups > 0
and then accesses self.bins[0]
... I guess maybe I should update that code to check for self.bins > 0
.
I can also just not handle IndexError
and forget about this bug; if it comes up for std(), it would also be likely to come up for other functions in the same way... only when Cython fails and some aggregation is being done on an empty series, if I understand correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that is better
we need to be very clear in where exceptions are caught
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that there is another issue if I merely check self.bins.size > 0
that still causes a failure; I think it's best not to worry about this case at the time.
did the patch i put up in the original issue not solve this? |
Hello @alexcwatt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-05-17 13:55:48 UTC |
pandas/core/groupby/groupby.py
Outdated
@@ -867,7 +867,7 @@ def _python_agg_general(self, func, *args, **kwargs): | |||
try: | |||
result, counts = self.grouper.agg_series(obj, f) | |||
output[name] = self._try_cast(result, obj, numeric_only=True) | |||
except TypeError: | |||
except (IndexError, TypeError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this still needed? what exactly is raising IndexErrror?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good. comments on the tests. ping on green.
if axis == 0: | ||
frame = raw_frame | ||
else: | ||
frame = raw_frame.T | ||
|
||
groupby_kwargs = {'level': level, 'axis': axis, 'sort': sort, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you actually need to define this kwargs, just inline it when calling frame.groupby(...) no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good point
groupby_kwargs = {'level': level, 'axis': axis, 'sort': sort, | ||
'as_index': as_index} | ||
group_op_kwargs = {} | ||
frame_op_kwargs = {'level': level, 'axis': axis} | ||
if op in AGG_FUNCTIONS_WITH_SKIPNA: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be more clear with
# comment explaining why this is needed
if op in AGG_FUNCTION.S....:
group_op_kwargs = .....
frame_op_kwargs = ....
else:
group......
frame_op....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So remove the idea of 'base arguments'? I thought it might be good to pull out the common arguments but maybe it's more readable the other way. The previous test code just seemed pretty repetitive to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah just trying to make this simpler to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I will clean these tests up a bit and ping when everything is green!
Thaks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls also merge master
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -399,6 +399,7 @@ Groupby/Resample/Rolling | |||
- Bug in :meth:`pandas.core.groupby.GroupBy.idxmax` and :meth:`pandas.core.groupby.GroupBy.idxmin` with datetime column would return incorrect dtype (:issue:`25444`, :issue:`15306`) | |||
- Bug in :meth:`pandas.core.groupby.GroupBy.cumsum`, :meth:`pandas.core.groupby.GroupBy.cumprod`, :meth:`pandas.core.groupby.GroupBy.cummin` and :meth:`pandas.core.groupby.GroupBy.cummax` with categorical column having absent categories, would return incorrect result or segfault (:issue:`16771`) | |||
- Bug in :meth:`pandas.core.groupby.GroupBy.nth` where NA values in the grouping would return incorrect results (:issue:`26011`) | |||
- Bug in :meth:`pandas.core.groupby.GroupBy.std` that computed standard deviation without respecting groupby context when `as_index=False` (:issue:`10355`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use double back ticks around as_index=False
@@ -164,33 +164,43 @@ def raw_frame(): | |||
@pytest.mark.parametrize('axis', [0, 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this and it will just use the axis fixture; note that axis will take on [0, 1, 'index', 'columns'], so you may have to adjust some of the test conditions
can you merge master and update |
Merged master and made some updates to the unit tests. I am now looking into some failing test cases ( |
Still have a bit more to do with this one corner case. I think the reason it hasn't been an issue too much in the past is that some methods like groupby's mean just catch all exceptions coming from the cython_agg_general and fall back on Python. I can do that if I have to, and revert my changes to reduction.pyx, but I am trying to grok what's going on in there so we don't need to catch a general Exception and fall back on Python. If anyone can shed some light please feel free, otherwise again I am going to try to understand and resolve it this weekend. |
thanks for working on this @alexcwatt groupby is a bit tricky :-D |
@jreback Can you please confirm that the test that is failing does specify the desired behavior? Currently the expectation is that it should return an empty dataframe, with no columns. The actual result is now that it returns an empty dataframe, with a column '0'. The column '0' is seen in other tests like |
@alexcwatt I wasn't able to reproduce that issue off of this locally; restarted CI to see |
Ignore previous comment I was able to reproduce issue after building extensions. So yea @alexcwatt it looks like these changes are implicitly adding a blank column to that test which we wouldn't want |
Okay, thanks for the confirmation, will look into it when I can
…On Mon, May 20, 2019, at 12:46 PM, William Ayd wrote:
Ignore previous comment I was able to reproduce issue after building extensions. So yea @alexcwatt <https://github.com/alexcwatt> it looks like these changes are implicitly adding a blank column to that test which we wouldn't want
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#26229?email_source=notifications&email_token=AADYU6NIYF2SQ24WKQFLB6TPWLIXDA5CNFSM4HI5FVL2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVZNNOQ#issuecomment-494065338>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AADYU6JE2DS3LA3D2LYPKYTPWLIXDANCNFSM4HI5FVLQ>.
|
can you merge master and see if you can get passing |
can you merge master. |
I'll take another look tonight. Unfortunately what I've found is that there is an exception being raised by the Cython code that I don't quite understand, and the way some of the other groupby operations work is that they catch exceptions from Cython and fall back to the Python version that doesn't have the bug/issue. I think this is what I need to do for std() as well. Edit: When I tried fixing this in the Cython it lead to that one unit test failing, and I haven't been able to figure out how to prevent that purely Cython side. Might open a bug about the Cython code to be fixed later, with notes on the direction I was going. |
Pushing to 0.25.1. @alex if you merge master sometime later today there will be a 0.25.1.rst for the release note. |
Nice PR but I think this has gone stale so closing. ping if you'd like to pick back up |
git diff upstream/master -u -- "*.py" | flake8 --diff