ENH: Add filter methods to SeriesGroupBy, DataFrameGroupBy GH919 #3680


Merged: 1 commit into pandas-dev:master, Jun 6, 2013

Conversation

@danielballan (Contributor)

closes #919

I have been using Wes' workaround (see #919) for filtering groups. Finally, for brevity's sake, I wrote a real filter method. In the one simple case I checked, it performs faster than the workaround.
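
The setup for these timings isn't shown; a minimal reproduction along these lines is assumed (the data and grouping key below are invented for illustration):

    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.randn(10))          # or 1,000,000 for the large case
    grouped = s.groupby(np.arange(len(s)) % 3)  # assumed grouping key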

On a small (~10) Series:

In [7]: %timeit grouped.filter(lambda x: x.mean() > 10)  # my method
1000 loops, best of 3: 346 us per loop

In [8]: %timeit grouped.obj[grouped.transform(lambda x: x.mean() > 10)]  # workaround
1000 loops, best of 3: 462 us per loop

On a large (1000000) Series:

In [18]: %timeit grouped.filter(lambda x: x.mean() > 0) # my method
1 loops, best of 3: 213 ms per loop

In [19]: %timeit grouped.obj[grouped.transform(lambda x: x.mean() > 0)]
1 loops, best of 3: 696 ms per loop

This PR only handles Series, and I included one simple test. If I am on the right track, I'll write one for DataFrames also and write additional tests. If this is a job for Cython, I'm out of my depth, but I think numpy is sufficient. Does this look OK?

@jreback (Contributor) commented May 30, 2013

This looks OK. I don't think you need to cythonize at all. If you could add support for DataFrame and some more tests, that would be great.

@jreback (Contributor) commented May 31, 2013

@wesm ?

@danielballan (Contributor, Author)

To avoid confusion with the DataFrame method .filter, should we call this method .having, in a nod to the SQL usage, GROUP BY ... HAVING?

Then again, for users uninitiated in SQL, it might not be as self-evident what .having is meant for. Thoughts?
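
For readers unfamiliar with the SQL idiom, the analogy is roughly this (table, key, and column names invented for illustration):

    # SQL:    SELECT ... FROM tbl GROUP BY key HAVING AVG(val) > 10
    # pandas: keep only the groups whose mean of 'val' exceeds 10
    df.groupby('key').filter(lambda g: g['val'].mean() > 10)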

@jreback (Contributor) commented Jun 3, 2013

this is a groupby method, so there isn't a conflict with DataFrame per se, and I kind of like the nomenclature; it says, hey, filter out these objects before you group....

(and I am fundamentally biased against SQL in any event :))

@wesm @y-p ?

@danielballan (Contributor, Author)

That last push came with a test for a single-column DataFrame and a test for a multi-column DataFrame, where the filtering condition considers both columns. I will add more tests for thoroughness, but this looks OK so far.

@danielballan (Contributor, Author)

I'm confused why the new Travis build failed. The error looks wholly unrelated to my changes, something about the wrong version of PyTables....

@jreback (Contributor) commented Jun 3, 2013

Rebase on master; there was a small issue with PyTables versioning when 3.0.0 came out.

@phobson commented Jun 4, 2013

Just chiming in to say that I'm very excited to see this merged!
+1

@jreback (Contributor) commented Jun 4, 2013

@danielballan looks like you have some merge commits in here....

I do this to avoid that:

    git pull                                 # in master
    git br my-branch --set-upstream master
    git checkout my-branch
    git pull --rebase

You can get rid of them by (after git pull in master):

    git checkout my-branch
    git rebase -i <starting-commit>

then removing the offending commits (just delete them).

@jreback (Contributor) commented Jun 4, 2013

@danielballan also, I would add some tests where you filter out some groups entirely (and thereby return NaN for those groups, I think; you could also drop them... but NaN is probably more consistent).

@danielballan (Contributor, Author)

@jreback [Edit: Wrong Jeff, sorry.] The point of this PR is to filter out groups entirely or not at all, based on a criterion applied to each group as a unit. It does not filter within groups.

As for returning NaNs vs. dropping, most use cases I have seen call for dropping. I could add a drop keyword to provide a choice.

@jreback (Contributor) commented Jun 4, 2013

@danielballan I like having drop=True as a keyword to filter, meaning don't return the invalid groups; if False, then return NaN for those groups.

@phobson commented Jun 4, 2013

Perhaps dropna to be consistent with DataFrame.stack?

@danielballan (Contributor, Author)

In this case, drop means "Drop the filtered groups," not exactly "Drop missing values." So I like drop a little better, but I'm game to be persuaded or overruled.

@jreback (Contributor) commented Jun 4, 2013

@danielballan yes, brain damaged today.... @phobson +1 on dropna
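
A minimal sketch of these semantics (toy data assumed; per the discussion above, dropna=True is the default):

    import pandas as pd

    s = pd.Series([1, 1, 2, 20, 30])
    grouped = s.groupby([0, 0, 0, 1, 1])

    grouped.filter(lambda x: x.mean() > 10)
    # 3    20
    # 4    30       <- failing group dropped entirely (dropna=True)

    grouped.filter(lambda x: x.mean() > 10, dropna=False)
    # 0     NaN
    # 1     NaN
    # 2     NaN
    # 3    20.0
    # 4    30.0     <- failing group filled with NaN; original shape kept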

@jreback (Contributor) commented Jun 4, 2013

@danielballan also.... please add some tests where you have a mixed frame, and a case where the filter condition raises.

@danielballan (Contributor, Author)

All suggestions implemented. That last commit (TST...) implements dropna as well. Perhaps it would be best if I rebase this into one ENH commit when it's all ready to go.

@jreback (Contributor) commented Jun 4, 2013

@danielballan

A lot of the logic in transform (the slow/fast path stuff) is shared with filter;
can you refactor that out into another function to avoid duplicated code?

Also, please add some tests that aggregate afterwards (test a priori filtering versus your filtering).
Also, I didn't look closely enough, but please add a test that is essentially your top-level example (e.g., using Wes's workaround).

@danielballan (Contributor, Author)

I'll add a test that compares to Wes's workaround.

Can you explain more what you mean by "tests that aggregate afterwards"? I have some tests that compare results to data that I filtered "by hand."

@jreback (Contributor) commented Jun 5, 2013

When I wrote that, for some reason I was thinking that filter would be part of a pipeline, e.g.,

df.groupby('A').filter(lambda x: x.mean()>0).apply('mean')

but that is really equivalent to

res = df.groupby('A').apply('mean')
res[res>0]

but the 2nd one is almost certainly faster and more intuitive;
the filter operation is, by definition, a transform.
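
Spelled out as runnable code (column names assumed; note that filter returns a plain DataFrame, so aggregating afterwards means grouping again):

    import pandas as pd

    df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'], 'B': [1, 3, -2, -4]})

    # filter-then-aggregate pipeline
    df.groupby('A').filter(lambda g: g['B'].mean() > 0).groupby('A').mean()

    # aggregate first, then select -- usually faster and more direct
    res = df.groupby('A')['B'].mean()
    res[res > 0]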

@jreback (Contributor) commented Jun 6, 2013

@wesm comments on this?

@jreback (Contributor) commented Jun 6, 2013

@danielballan I think this is ok for 0.11.1....
Can you add an example in the v0.11.1 whatsnew (enhancements section), and a mention in the release notes?

@danielballan (Contributor, Author)

Great. I'm finishing one more batch of tests. Will then work on documenting after I've pushed them.

@jreback (Contributor) commented Jun 6, 2013

Also add an example of usage in groupby.rst as well (could be the same example as in the whatsnew, if that works)... I usually write the main docs first, then just copy them (unless they are too long).

@danielballan (Contributor, Author)

I'm new to writing pandas documentation, but it looks like I got it right. I wrote up three common use cases and showed it working on a Series, on a DataFrame, and with dropna=False.

@jreback (Contributor) commented Jun 6, 2013

Great... looks like you need to rebase, then resubmit.

@jreback (Contributor) commented Jun 6, 2013

In the top section: "Filration" -> "Filtration"

@jreback (Contributor) commented Jun 6, 2013

@danielballan FYI, even after 0.11.1 is released you can still update the docs, and Wes is only cutting a release candidate, I think.

@danielballan (Contributor, Author)

OK. As far as I can tell this is good to merge.

jreback merged commit 2a2cfb8 into pandas-dev:master on Jun 6, 2013

@jreback (Contributor) commented Jun 6, 2013

thank you very much!

@hayd (Contributor) commented Aug 21, 2013

Little question on this: the current behaviour is that the ordering of the returned DataFrame is based on the groups; would it be better to retain the original ordering?

Example:

In [1]: data = pandas.DataFrame(
    {'pid' : [1,1,1,2,2,3,3,3],
     'tag' : [23,45,62,24,45,34,25,62],
     })

In [2]: g = data.groupby('tag')

In [3]: g.filter(lambda x: len(x) > 1)
Out[3]: 
   pid  tag
1    1   45
4    2   45
2    1   62
7    3   62

You can see the index is no longer in order.

danielballan deleted the filter branch on August 21, 2013

@danielballan (Contributor, Author)

Yes, I think it would be better. Is it worth sorting by default? Making it a keyword option? In a case where the user doesn't care about the order, unsorted results will return faster. But maybe the improved speed is not worth the potential confusion.

@jreback (Contributor) commented Aug 21, 2013

You should just sort the indexers after you concatenate. You could add a keyword if you want (we do support sort=True on general groupbys, I think). There might be a way to iterate in the order of the groups as well, to have the indexers come out in the same order... not sure.

@hayd (Contributor) commented Aug 21, 2013

I think a keyword is probably about right; will it always be possible to get the correct order back?

Also worth mentioning: filter (and transform) seem to sulk when given non-uniquely indexed objects. But I think these operations still make sense in those cases.

@jreback (Contributor) commented Aug 21, 2013

@hayd the duplicate cases can be handled by using an Index of the indexers rather than raw numpy; can you see if you can put an example of these up as a test case?

@hayd (Contributor) commented Aug 21, 2013

The example I had used was just setting the whole index to 0 (using the data from above):

data.index = 8 * [0]
g = data.groupby('tag')
g.filter(lambda x: len(x) > 1)
Exception: Reindexing only valid with uniquely valued Index objects

@jreback (Contributor) commented Aug 21, 2013

@hayd can you make that a separate issue?

@hayd (Contributor) commented Aug 21, 2013

Separated into two issues.

@danielballan (Contributor, Author)

@hayd, @jreback

Considering an excerpt from SeriesGroupBy.filter,

    indexers = [self.obj.index.get_indexer(group.index)
                if wrapper(group) else [] for _, group in self]

    if len(indexers) == 0:
        filtered = self.obj.take([]) # because np.concatenate would fail
    else:
        filtered = self.obj.take(np.concatenate(indexers))
    if dropna:
        return filtered
    else:
        return filtered.reindex(self.obj.index) # Fill with NaNs.

I don't think sorting the index is the best default behavior. I think we should restore whatever ordering the original index had, e.g.,

return filtered.reindex(self.obj.index).dropna()

although that specific approach is no good if the result has NaNs.

But, more to the point, I'm not sure why the above loses the ordering to begin with. Is it the iteration in for _, group in self? I will keep exploring, but I would happily accept your two cents if you understand what's going on here.
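
A plausible explanation, using hayd's data from above: iterating a GroupBy yields the groups in (sorted) key order, so concatenating the per-group indexers orders rows by group, not by original position:

    import pandas as pd

    data = pd.DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3],
                         'tag': [23, 45, 62, 24, 45, 34, 25, 62]})
    for key, group in data.groupby('tag'):
        print(key, list(group.index))
    # 23 [0]
    # 24 [3]
    # 25 [6]
    # 34 [5]
    # 45 [1, 4]
    # 62 [2, 7]
    # The surviving groups 45 and 62 concatenate to [1, 4, 2, 7] -- the
    # out-of-order index seen in the example above.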

@jreback (Contributor) commented Oct 3, 2013

I think you can just add:

    if len(indexers) == 0:
        filtered = self.obj.take([]) # because np.concatenate would fail
    else:
        filtered = self.obj.take(np.sort(np.concatenate(indexers)))
    if dropna:
        return filtered
    else:
        return filtered.reindex(self.obj.index) # Fill with NaNs.

the indexers are not necessarily sorted (you can do this if self.sort == True, which is the default).

@danielballan (Contributor, Author)

I did not think this would work when the index is non-sequential, but tests show that it does. Fair enough. #5222 should close.
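
The reason this works even for a non-sequential index: get_indexer returns integer positions, not labels, so sorting the positions restores the original row order. A small illustration (toy data assumed):

    import numpy as np
    import pandas as pd

    s = pd.Series([10, 20, 30], index=[5, 2, 9])
    pos = s.index.get_indexer(pd.Index([9, 5]))  # array([2, 0]): positions, not labels
    s.take(np.sort(pos))                         # rows for index 5 then 9: original order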
