Skip to content

BUG: Fix #10355, std() groupby calculation #26229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 17 commits into from

Conversation

alexcwatt
Copy link
Contributor

@codecov
Copy link

codecov bot commented Apr 28, 2019

Codecov Report

Merging #26229 into master will decrease coverage by 51.27%.
The diff coverage is 0%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26229       +/-   ##
===========================================
- Coverage   91.97%    40.7%   -51.28%     
===========================================
  Files         175      175               
  Lines       52379    52380        +1     
===========================================
- Hits        48178    21321    -26857     
- Misses       4201    31059    +26858
Flag Coverage Δ
#multiple ?
#single 40.7% <0%> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/groupby.py 24.45% <0%> (-72.78%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
pandas/core/tools/numeric.py 10.44% <0%> (-89.56%) ⬇️
... and 131 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64104ec...493901e. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 28, 2019

Codecov Report

Merging #26229 into master will decrease coverage by 50.05%.
The diff coverage is 0%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26229       +/-   ##
===========================================
- Coverage   91.73%   41.68%   -50.06%     
===========================================
  Files         174      174               
  Lines       50741    50746        +5     
===========================================
- Hits        46548    21154    -25394     
- Misses       4193    29592    +25399
Flag Coverage Δ
#multiple ?
#single 41.68% <0%> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/ops.py 19.47% <ø> (-76.49%) ⬇️
pandas/core/groupby/groupby.py 24.24% <0%> (-73%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
... and 131 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1263e1a...83847b6. Read the comment docs.

@@ -867,7 +867,7 @@ def _python_agg_general(self, func, *args, **kwargs):
try:
result, counts = self.grouper.agg_series(obj, f)
output[name] = self._try_cast(result, obj, numeric_only=True)
except TypeError:
except (IndexError, TypeError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really? how does this happen?

Copy link
Contributor Author

@alexcwatt alexcwatt Apr 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the test cases fail when using _python_agg_general on, I believe, an empty series. I was quite confused at first as to why my code was causing this to fail when tests were passing just fine on var. Then I noticed that if I removed the _cython_agg_general call (which var tries to use, and only falls back on _python_agg_general in case of an exception), additional tests would begin failing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still needed? what exactly is raising IndexErrror?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it comes from groupby/ops.py's _aggregate_series_pure_python (called by agg_series when _aggregate_series_fast doesn't work). In particular if pandas/tests/resample/test_base.py:test_resample_empty_series is running and I don't have the cython version and I instead run the python version, the test fails due to an IndexError from pandas\_libs\reduction.pyx:242 where get_result checks self.ngroups > 0 and then accesses self.bins[0] ... I guess maybe I should update that code to check for self.bins > 0.

I can also just not handle IndexError and forget about this bug; if it comes up for std(), it would also be likely to come up for other functions in the same way... only when Cython fails and some aggregation is being done on an empty series, if I understand correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that is better

we need to be very clear in where exceptions are caught

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that there is another issue if I merely check self.bins.size > 0 that still causes a failure; I think it's best not to worry about this case at the time.

@jreback
Copy link
Contributor

jreback commented Apr 28, 2019

did the patch i put up in the original issue not solve this?

@pep8speaks
Copy link

pep8speaks commented Apr 28, 2019

Hello @alexcwatt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-17 13:55:48 UTC

@@ -867,7 +867,7 @@ def _python_agg_general(self, func, *args, **kwargs):
try:
result, counts = self.grouper.agg_series(obj, f)
output[name] = self._try_cast(result, obj, numeric_only=True)
except TypeError:
except (IndexError, TypeError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still needed? what exactly is raising IndexErrror?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. comments on the tests. ping on green.

if axis == 0:
frame = raw_frame
else:
frame = raw_frame.T

groupby_kwargs = {'level': level, 'axis': axis, 'sort': sort,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you actually need to define this kwargs, just inline it when calling frame.groupby(...) no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point

groupby_kwargs = {'level': level, 'axis': axis, 'sort': sort,
'as_index': as_index}
group_op_kwargs = {}
frame_op_kwargs = {'level': level, 'axis': axis}
if op in AGG_FUNCTIONS_WITH_SKIPNA:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be more clear with

# comment explaining why this is needed
if op in AGG_FUNCTION.S....:
     group_op_kwargs = .....
     frame_op_kwargs = ....
else:
    group......
    frame_op....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So remove the idea of 'base arguments'? I thought it might be good to pull out the common arguments but maybe it's more readable the other way. The previous test code just seemed pretty repetitive to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah just trying to make this simpler to read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I will clean these tests up a bit and ping when everything is green!

@Estebansalazar
Copy link

Thaks!

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls also merge master

@@ -399,6 +399,7 @@ Groupby/Resample/Rolling
- Bug in :meth:`pandas.core.groupby.GroupBy.idxmax` and :meth:`pandas.core.groupby.GroupBy.idxmin` with datetime column would return incorrect dtype (:issue:`25444`, :issue:`15306`)
- Bug in :meth:`pandas.core.groupby.GroupBy.cumsum`, :meth:`pandas.core.groupby.GroupBy.cumprod`, :meth:`pandas.core.groupby.GroupBy.cummin` and :meth:`pandas.core.groupby.GroupBy.cummax` with categorical column having absent categories, would return incorrect result or segfault (:issue:`16771`)
- Bug in :meth:`pandas.core.groupby.GroupBy.nth` where NA values in the grouping would return incorrect results (:issue:`26011`)
- Bug in :meth:`pandas.core.groupby.GroupBy.std` that computed standard deviation without respecting groupby context when `as_index=False` (:issue:`10355`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double back ticks around as_index=False

@@ -164,33 +164,43 @@ def raw_frame():
@pytest.mark.parametrize('axis', [0, 1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this and it will just use the axis fixture; note that axis will take on [0, 1, 'index', 'columns'], so you may have to adjust some of the test conditions

@jreback
Copy link
Contributor

jreback commented May 16, 2019

can you merge master and update

@alexcwatt
Copy link
Contributor Author

Merged master and made some updates to the unit tests. I am now looking into some failing test cases (pytest pandas\tests\resample\ -k test_resample_empty_series), will try to figure out this weekend but wanted to push the other minor updates.

@alexcwatt
Copy link
Contributor Author

Still have a bit more to do with this one corner case. I think the reason it hasn't been an issue too much in the past is that some methods like groupby's mean just catch all exceptions coming from the cython_agg_general and fall back on Python. I can do that if I have to, and revert my changes to reduction.pyx, but I am trying to grok what's going on in there so we don't need to catch a general Exception and fall back on Python.

If anyone can shed some light please feel free, otherwise again I am going to try to understand and resolve it this weekend.

@jreback
Copy link
Contributor

jreback commented May 19, 2019

thanks for working on this @alexcwatt groupby is a bit tricky :-D

@alexcwatt
Copy link
Contributor Author

@jreback Can you please confirm that the test that is failing does specify the desired behavior? Currently the expectation is that it should return an empty dataframe, with no columns. The actual result is now that it returns an empty dataframe, with a column '0'. The column '0' is seen in other tests like test_resample_with_nat that actually return data but the test test_resample_with_only_nat is written to specify that we expect an empty dataframe with no columns...

@WillAyd
Copy link
Member

WillAyd commented May 20, 2019

@alexcwatt I wasn't able to reproduce that issue off of this locally; restarted CI to see

@WillAyd
Copy link
Member

WillAyd commented May 20, 2019

Ignore previous comment I was able to reproduce issue after building extensions. So yea @alexcwatt it looks like these changes are implicitly adding a blank column to that test which we wouldn't want

@alexcwatt
Copy link
Contributor Author

alexcwatt commented May 20, 2019 via email

@jreback
Copy link
Contributor

jreback commented Jun 8, 2019

can you merge master and see if you can get passing

@jreback
Copy link
Contributor

jreback commented Jun 27, 2019

can you merge master.

@alexcwatt
Copy link
Contributor Author

alexcwatt commented Jun 27, 2019

I'll take another look tonight. Unfortunately what I've found is that there is an exception being raised by the Cython code that I don't quite understand, and the way some of the other groupby operations work is that they catch exceptions from Cython and fall back to the Python version that doesn't have the bug/issue. I think this is what I need to do for std() as well.

Edit: When I tried fixing this in the Cython it lead to that one unit test failing, and I haven't been able to figure out how to prevent that purely Cython side. Might open a bug about the Cython code to be fixed later, with notes on the direction I was going.

@TomAugspurger
Copy link
Contributor

Pushing to 0.25.1. @alex if you merge master sometime later today there will be a 0.25.1.rst for the release note.

@TomAugspurger TomAugspurger added this to the 0.25.1 milestone Jul 2, 2019
@jreback jreback removed this from the 0.25.1 milestone Jul 24, 2019
@WillAyd
Copy link
Member

WillAyd commented Aug 28, 2019

Nice PR but I think this has gone stale so closing. ping if you'd like to pick back up

@WillAyd WillAyd closed this Aug 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: STD modifies groupby target column when as_index=False
7 participants