-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: groupby sub-selection ignored with some methods #5264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
the cython routines use I would say this is a bug / not implemented behavior |
I'm taking a look at this today (the groupby code is a tad intimidating). Would this be considered a bug as well? That the column being grouped on, In [18]: t
Out[18]:
0 1 2 3
0 1 -0.138264 0.647689 1.523030
1 1 -0.234137 1.579213 0.767435
2 1 0.542560 -0.463418 -0.465730
3 1 -1.913280 -1.724918 -0.562288
4 1 0.314247 -0.908024 -1.412304
5 0 -0.225776 0.067528 -1.424748
6 0 0.110923 -1.150994 0.375698
7 0 -0.291694 -0.601707 1.852278
8 0 -1.057711 0.822545 -1.220844
9 0 -1.959670 -1.328186 0.196861
In [19]: t.groupby(0).apply(lambda x: x + 1)
Out[19]:
0 1 2 3
0 2 0.861736 1.647689 2.523030
1 2 0.765863 2.579213 1.767435
2 2 1.542560 0.536582 0.534270
3 2 -0.913280 -0.724918 0.437712
4 2 1.314247 0.091976 -0.412304
5 1 0.774224 1.067528 -0.424748
6 1 1.110923 -0.150994 1.375698
7 1 0.708306 0.398293 2.852278
8 1 -0.057711 1.822545 -0.220844
9 1 -0.959670 -0.328186 1.196861 If And just to make sure I follow this, |
OK, wow that was a lot of digging for a (roughly) one line change. All it comes down to is in I said roughly because that would mess up the Could I get a ruling on whether the column being grouped on should be included as part of the output? My fix on the selection bug will break that. I can hack around it, but I'm not sure including the grouping column is desired. For example, from index = MultiIndex.from_arrays([[0, 0, 0, 1, 1, 1],
[1, 2, 3, 1, 2, 3]])
df = DataFrame({'d': [1., 1., 1., 2., 2., 2.],
'c': np.tile(['a', 'b', 'c'], 2),
'v': np.arange(1., 7.)}, index=index)
def f(group):
v = group['v']
group['v2'] = (v - v.min()) / (v.max() - v.min())
return group
result = df.groupby('d').apply(f)
expected = df.copy()
expected['v2'] = np.tile([0., 0.5, 1], 2)
assert_frame_equal(result, expected) So when I run this with my fix I get:
|
can u put a link to the branch? |
https://github.com/TomAugspurger/pandas/tree/groupby-sugar EDIT: forgot to push my changes. They're up now. |
I guess for |
For what it's worth, it look like > library("plyr", lib.loc="/usr/local/Cellar/r/3.0.2/R.framework/Versions/3.0/Resources/library")
> f <- function(x) {return(x + 1)}
> dt <- data.frame(age=rchisq(20,10),group=sample(1:2,20,rep=T))
> dt
age group
1 11.137641 1
2 8.796264 2
3 13.530428 1
4 13.647518 2
5 5.647735 2
6 7.277148 1
7 6.479024 2
8 9.289560 1
9 5.211502 1
10 12.575817 2
11 11.291723 2
12 11.295667 1
13 13.752943 2
14 9.731945 1
15 4.490958 2
16 16.340906 2
17 10.620146 2
18 11.168694 1
19 3.015077 2
20 13.511891 2
> ddply(dt, ~group, f)
age group
1 12.137641 2
2 14.530428 2
3 8.277148 2
4 10.289560 2
5 6.211502 2
6 12.295667 2
7 10.731945 2
8 12.168694 2
9 9.796264 3
10 14.647518 3
11 6.647735 3
12 7.479024 3
13 13.575817 3
14 12.291723 3
15 14.752943 3
16 5.490958 3
17 17.340906 3
18 11.620146 3
19 4.015077 3
20 14.511891 3 |
See this post on SO. I think this user was confused about the grouper having the function applied to it. My answer there goes through what I think happened to them, and it's not at all obvious. Can anyone provide a reason why the |
@TomAugspurger ok. I believe this is a bug. Here is what happens. If the grouper (e.g. 'year' in this case) would have been a string, then then the aggregation usually will fall to the except part as the functions cannot aggregate strings, so item-by-item will work properly. In this can since year is a number the function works and thus the grouping column IS included. So I think that you could exclude the I think to nail this down you need a simple example and basically a bunch of cases which need to be checked: the product of these combinatons:
so 4 x 3 x 3 = 36 combinations! need to find a nice easy way to check these (this may be overkill though) make sense? |
Agreed completely. I'll make a new issue. |
u have a PR for this issue (the sub selection) I believe? (or is it just the private branch?) go ahead and make a PR and can resolve this one |
Do you want the "apply to grouper" issue in a separate PR? Or one PR with two commits? The PR for this issue (the sub selection being ignored) will probably depend on the PR for the apply to grouper. |
maybe best to close at the same time if not possible then can do separately whatever works |
@TomAugspurger I think all of these merely need to replace |
Great, that's the approach I was taking before getting muddled up with bigger API issues. You said you have a fix in the other issue right? I'd be happy look that over (especially the tests). |
trivial fix... (but a couple of other tests will break)... |
apparently self._obj_with_exclusion isn't quite the right object to use (it excludes the grouped by column, regardless of as_index) Eeep. I think good opportunity to change head and tail... see PR :S I created new method |
add to the list corr; @TomAugspurger maybe make a check list or something at the top? I guess need to verify all of the tmethod by a smoke tests (these are methods on the whitelist). perhaps change is really easy if its done with the whitelist generating function then.... |
I put this into a checklist and ticked off head and tail (which I fixed yest) added in a few but haven't properly gone through it. Like I say, I implemented a |
@jreback yea re whitelist way to do it, makes sense |
pls tick the boxes when you have a chance (and put the ref PR #) |
added issue number, will tick off once merged, looks like my PR fixes most of these, if not all. needs some tests for a few of the others... may need a swathe of tests for these as I think coverage is low... |
I added a load to this list. Their is some weird behaviour in agg functions, where if you select a col not in the DataFrame it doesn't care:
feature/edge case here is that A isn't displayed unless you select it |
I agree this should raise KeyError |
@TomAugspurger fix for a lot of this is in master now. Would you mind taking a look at what you think about the rest, is the behaviour correct? (esp. the aggs?) |
Thanks for taking care of this. I may have actually lost the branch that I started to fix this, so I guess there wasn't really duplicated effort :) I ticked the boxes for the aggs (since #6578 closes it the unrelated KeyError issue). I haven't tested the For a groupby EDIT: I'm not very smart today. Of course the groupby |
I can take care of filter, it's a one line change (was just waiting for the above to merge). I think ohlc already passes I just needs to be added to the first list in my PR. @TomAugspurger Do you think the aggs is all good once that 6578 is fixed? Does the behaviour look right? If so, we should add tests for them. (I unchecked them 'til we add tests) |
actually I'm confused if ohlc was meant to be a method, see #6594... |
Any objections to closing this? The remaining issues (ohlc and not raising a KeyError on nonexistent columns) aren't related to the original subselection being ignored and have their own issues. |
no...let's close (if necessary move anything remaining to a new issue) |
related #6512, #6524, #6346
Update from @hayd I think these should reference
_selected_obj
rather thanobj
.Looking through some others, looks these also ignore the selection
Aggregation functions like (they already kind of do, but they allow bad selections ie column names not in columns, may be sep issue?):
(these "work" with the described bug)
Atm selecting a column not in df doesn't raise:
what about
iloc/loc/ix
(current all disabled)?The column selection on a groupby object is being ignored when
.quantile()
is called. So it computes the quantile on all the (numeric) columns and returns the full DataFrame.Seeing all these, I'm wondering if this is a bug or just how some of the methods are implementer. The docs don't mention anything about only supporting some methods though.
version: '0.12.0-883-g988d4be'
The text was updated successfully, but these errors were encountered: