-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
EHN: Add filter methods to SeriesGroupBy, DataFrameGroupBy GH919 #3680
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
this looks ok. I don't think you need to cythonize at all. If you could add for DataFrame and some more tests would be great |
@wesm ? |
To avoid confusion with the DataFrame method .filter, should we called this method Then again, for users uninitiated in SQL, it might not be as self-evident what |
That last PR came with a test for a single-column DataFrame and a tests for a multi-column DataFrame, where the filtering condition considers both columns. I will add more tests for thoroughness, but this looks OK so far. |
I'm confused why the new Travis build failed. The error looks wholly unrelated to my changes, something about the wrong version of PyTables.... |
rebase on master; was a small issue with pytables versioning when 3.0.0 came out |
Just chiming in to say that I'm very excited to see this merged! |
@danielballan looks like you have some merge commits in here.... using I do this to avoid that....
you can get rid of them by (after git pull in master)
then remove the offending commits (just delete them) |
@danielballan also I would add some tests where you filter out some groups entirely (and thereby return nan for those groups, I think; could also drop them...but nan prob more consistent) |
@jreback [Edit: Wrong Jeff, sorry.] The point of this PR is to filter out groups entirely or not at all, based on a criterion applied the each group as a unit. It does not filter within groups. As for returning NaNs vs. dropping, most use cases I have seen call for dropping. I could add a drop keyword to provide a choice. |
@danielballan I like have |
Perhaps |
In this case, drop means "Drop the filtered groups," not exactly "Drop missing values." So I like |
@danielballan yes, brain damanged today.... @phobson +1 on |
@danielballan also....pls add some tests where you have a mixed frame, and case where the filter condition raises |
All suggestions implemented. That last commit (TST...) implements |
a lot of the logic in transform (the slow/fast path stuff) is shared with filter also pls add some tests that aggregate afterwards (test a priori filtering versus your filtering) |
I'll add a test that compares to Wes's workaround. Can you explain more what you mean b y "tests that aggregate afterwards"? I have some tests that compare results to data that I filtered "by hand." |
when I wrote that for some reason was thinking that
but that is really equivalent to
but 2nd one is almost certainly faster and more intuitve |
@wesm comments on this? |
@danielballan I think this is ok for 0.11.1.... |
Great. I'm finishing one more batch of tests. Will then work on documenting after I've pushed them. |
also an example of usage in the groupby.rst as well (could be same example in whatsnew if that works)..I usually write the main docs first then just copy them (unless they are 2 long) |
I'm new to writing pandas documentation, but it looks like I got it right. I wrote up three common use cases and showed it working on a Series, on a DataFrame, and with |
great.....looks like you need to rebase then resubmit.... |
top-section |
@danielballan fyi...even after 0.11.1 is released can still update docs, and wes only cutting a release candidate I think |
OK. As far as I can tell this is good to merge. |
thanks you very much! |
Little question on this, current behaviour is the ordering of the DataFrame now based on the groups, would it be better to retain original ordering? Example:
You can see index is no longer in order. |
Yes, I think it would better. Is it worth sorting by default? Making it a keyword option? In a case where the user don't care about the order, nonsorted results will return faster. But maybe the improved speed is not worth the potential confusion. |
you should just sort the indexers after you concatenate, you could do a keyword if you want (we do support a |
I think a keyword is probably about right, will it always be possible to get the correct order back? Also worth mentioning, filter (and transform) seem to sulk when non-uniquely indexed things. But I think these operations still make sense in these cases. |
@hayd the dup cases can be handled by using an |
The example I had used was just setting all the index to 0 (from above):
|
@hayd can you make that a separate issue? |
Separated into two issues. |
Considering an excerpt from SeriesGroupBy.filter,
I don't think sorting the index is the best default behavior. I think we should restore whatever ordering the original index had, e.g.,
although that specific approach is no good if the result has NaNs. But, moreover, I'm not sure why the above loses the ordering to begin with. Is it in |
I think you can just add:
the indexers are not necessarily sorted (you can do this if |
I did not think this would work when the index is non-sequential, but tests show that it does. Fair enough. #5222 should close. |
closes #919
I have been using Wes' workaround (see #919) for filtering groups. Finally, for brevity's sake, I wrote a real filter method. In the one simple case I checked, it performs faster than the workaround.
On a small (~10) Series:
On a large (1000000) Series:
This PR only handles Series, and I included one simple test. If I am on the right track, I'll write one for DataFrames also and write additional tests. If this is a job for Cython, I'm out of my depth, but I think numpy is sufficient. Does this look OK?