ENH: Add filter methods to SeriesGroupBy, DataFrameGroupBy GH919 #3680


Merged: 1 commit into pandas-dev:master, Jun 6, 2013

Conversation

@danielballan (Contributor)

closes #919

I have been using Wes' workaround (see #919) for filtering groups. Finally, for brevity's sake, I wrote a real filter method. In the one simple case I checked, it performs faster than the workaround.
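
The setup for these timings isn't shown; a minimal reproduction along these lines is assumed (the data and grouping key below are invented for illustration):

    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.randn(10))          # or 1,000,000 for the large case
    grouped = s.groupby(np.arange(len(s)) % 3)  # assumed grouping key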

On a small (~10) Series:

In [7]: %timeit grouped.filter(lambda x: x.mean() > 10)  # my method
1000 loops, best of 3: 346 us per loop

In [8]: %timeit grouped.obj[grouped.transform(lambda x: x.mean() > 10)]  # workaround
1000 loops, best of 3: 462 us per loop

On a large (1000000) Series:

In [18]: %timeit grouped.filter(lambda x: x.mean() > 0) # my method
1 loops, best of 3: 213 ms per loop

In [19]: %timeit grouped.obj[grouped.transform(lambda x: x.mean() > 0)]
1 loops, best of 3: 696 ms per loop

This PR only handles Series, and I included one simple test. If I am on the right track, I'll write one for DataFrames also and write additional tests. If this is a job for Cython, I'm out of my depth, but I think numpy is sufficient. Does this look OK?

@jreback (Contributor) commented May 30, 2013

This looks OK. I don't think you need to cythonize at all. If you could add support for DataFrame and some more tests, that would be great.

@jreback (Contributor) commented May 31, 2013

@wesm ?

@danielballan (Contributor, Author)

To avoid confusion with the DataFrame method .filter, should we call this method .having, in a nod to the SQL usage, GROUP BY ... HAVING?

Then again, for users uninitiated in SQL, it might not be as self-evident what .having is meant for. Thoughts?
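
For readers unfamiliar with the SQL idiom, the analogy is roughly this (table, key, and column names invented for illustration):

    # SQL:    SELECT ... FROM tbl GROUP BY key HAVING AVG(val) > 10
    # pandas: keep only the groups whose mean of 'val' exceeds 10
    df.groupby('key').filter(lambda g: g['val'].mean() > 10)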

@jreback (Contributor) commented Jun 3, 2013

this is a groupby method, so there isn't a conflict with DataFrame per se, and I kind of like the nomenclature; it says, hey, filter out these objects before you group....

(and I am fundamentally biased against SQL in any event :))

@wesm @y-p ?

@danielballan (Contributor, Author)

That last push came with a test for a single-column DataFrame and a test for a multi-column DataFrame, where the filtering condition considers both columns. I will add more tests for thoroughness, but this looks OK so far.

@danielballan (Contributor, Author)

I'm confused why the new Travis build failed. The error looks wholly unrelated to my changes, something about the wrong version of PyTables....

@jreback (Contributor) commented Jun 3, 2013

Rebase on master; there was a small issue with PyTables versioning when 3.0.0 came out.

@phobson commented Jun 4, 2013

Just chiming in to say that I'm very excited to see this merged!
+1

@jreback (Contributor) commented Jun 4, 2013

@danielballan looks like you have some merge commits in here....

I do this to avoid that:

    git pull                                 # in master
    git br my-branch --set-upstream master
    git checkout my-branch
    git pull --rebase

You can get rid of them by (after git pull in master):

    git checkout my-branch
    git rebase -i <starting-commit>

then removing the offending commits (just delete them).

@jreback (Contributor) commented Jun 4, 2013

@danielballan also, I would add some tests where you filter out some groups entirely (and thereby return NaN for those groups, I think; you could also drop them... but NaN is probably more consistent).

@danielballan (Contributor, Author)

@jreback [Edit: Wrong Jeff, sorry.] The point of this PR is to filter out groups entirely or not at all, based on a criterion applied to each group as a unit. It does not filter within groups.

As for returning NaNs vs. dropping, most use cases I have seen call for dropping. I could add a drop keyword to provide a choice.

@jreback (Contributor) commented Jun 4, 2013

@danielballan I like having drop=True as a keyword to filter, meaning don't return the invalid groups; if False, then return NaN for those groups.

@phobson commented Jun 4, 2013

Perhaps dropna to be consistent with DataFrame.stack?

@danielballan (Contributor, Author)

In this case, drop means "Drop the filtered groups," not exactly "Drop missing values." So I like drop a little better, but I'm game to be persuaded or overruled.

@jreback (Contributor) commented Jun 4, 2013

@danielballan yes, brain damaged today.... @phobson +1 on dropna
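
A minimal sketch of these semantics (toy data assumed; per the discussion above, dropna=True is the default):

    import pandas as pd

    s = pd.Series([1, 1, 2, 20, 30])
    grouped = s.groupby([0, 0, 0, 1, 1])

    grouped.filter(lambda x: x.mean() > 10)
    # 3    20
    # 4    30       <- failing group dropped entirely (dropna=True)

    grouped.filter(lambda x: x.mean() > 10, dropna=False)
    # 0     NaN
    # 1     NaN
    # 2     NaN
    # 3    20.0
    # 4    30.0     <- failing group filled with NaN; original shape kept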

@jreback (Contributor) commented Jun 4, 2013

@danielballan also.... please add some tests where you have a mixed frame, and a case where the filter condition raises.

@danielballan (Contributor, Author)

All suggestions implemented. That last commit (TST...) implements dropna as well. Perhaps it would be best if I rebase this into one ENH commit when it's all ready to go.

@jreback (Contributor) commented Jun 4, 2013

@danielballan

A lot of the logic in transform (the slow/fast path stuff) is shared with filter;
can you refactor that out into another function to avoid duplicated code?

Also, please add some tests that aggregate afterwards (test a priori filtering versus your filtering).
Also, I didn't look closely enough, but please add a test that is essentially your top-level example (e.g., using Wes's workaround).

@danielballan (Contributor, Author)

I'll add a test that compares to Wes's workaround.

Can you explain more what you mean by "tests that aggregate afterwards"? I have some tests that compare results to data that I filtered "by hand."

@jreback (Contributor) commented Jun 5, 2013

When I wrote that, for some reason I was thinking that filter would be part of a pipeline, e.g.,

df.groupby('A').filter(lambda x: x.mean()>0).apply('mean')

but that is really equivalent to

res = df.groupby('A').apply('mean')
res[res>0]

but the 2nd one is almost certainly faster and more intuitive;
the filter operation is, by definition, a transform.
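
Spelled out as runnable code (column names assumed; note that filter returns a plain DataFrame, so aggregating afterwards means grouping again):

    import pandas as pd

    df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'], 'B': [1, 3, -2, -4]})

    # filter-then-aggregate pipeline
    df.groupby('A').filter(lambda g: g['B'].mean() > 0).groupby('A').mean()

    # aggregate first, then select -- usually faster and more direct
    res = df.groupby('A')['B'].mean()
    res[res > 0]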

@jreback (Contributor) commented Jun 6, 2013

@wesm comments on this?

@jreback (Contributor) commented Jun 6, 2013

@danielballan I think this is ok for 0.11.1....
Can you add an example in the v0.11.1 whatsnew (enhancements section), and a mention in the release notes?

@danielballan (Contributor, Author)

Great. I'm finishing one more batch of tests. Will then work on documenting after I've pushed them.

@jreback (Contributor) commented Jun 6, 2013

Also add an example of usage in groupby.rst as well (could be the same example as in the whatsnew, if that works)... I usually write the main docs first, then just copy them (unless they are too long).

@danielballan (Contributor, Author)

I'm new to writing pandas documentation, but it looks like I got it right. I wrote up three common use cases and showed it working on a Series, on a DataFrame, and with dropna=False.

@jreback (Contributor) commented Jun 6, 2013

Great... looks like you need to rebase, then resubmit.

@jreback (Contributor) commented Jun 6, 2013

In the top section: "Filration" -> "Filtration"

@jreback (Contributor) commented Jun 6, 2013

@danielballan FYI, even after 0.11.1 is released you can still update the docs, and Wes is only cutting a release candidate, I think.

@danielballan (Contributor, Author)

OK. As far as I can tell this is good to merge.

jreback merged commit 2a2cfb8 into pandas-dev:master on Jun 6, 2013

@jreback (Contributor) commented Jun 6, 2013

thank you very much!

@hayd (Contributor) commented Aug 21, 2013

Little question on this: the current behaviour is that the ordering of the returned DataFrame is based on the groups; would it be better to retain the original ordering?

Example:

In [1]: data = pandas.DataFrame(
    {'pid' : [1,1,1,2,2,3,3,3],
     'tag' : [23,45,62,24,45,34,25,62],
     })

In [2]: g = data.groupby('tag')

In [3]: g.filter(lambda x: len(x) > 1)
Out[3]: 
   pid  tag
1    1   45
4    2   45
2    1   62
7    3   62

You can see the index is no longer in order.

danielballan deleted the filter branch on August 21, 2013

@danielballan (Contributor, Author)

Yes, I think it would be better. Is it worth sorting by default? Making it a keyword option? In a case where the user doesn't care about the order, unsorted results will return faster. But maybe the improved speed is not worth the potential confusion.

@jreback (Contributor) commented Aug 21, 2013

You should just sort the indexers after you concatenate. You could add a keyword if you want (we do support sort=True on general groupbys, I think). There might be a way to iterate in the order of the groups as well, to have the indexers come out in the same order... not sure.

@hayd (Contributor) commented Aug 21, 2013

I think a keyword is probably about right; will it always be possible to get the correct order back?

Also worth mentioning: filter (and transform) seem to sulk when given non-uniquely indexed objects. But I think these operations still make sense in those cases.

@jreback (Contributor) commented Aug 21, 2013

@hayd the duplicate cases can be handled by using an Index of the indexers rather than raw numpy; can you see if you can put an example of these up as a test case?

@hayd (Contributor) commented Aug 21, 2013

The example I had used was just setting the whole index to 0 (using the data from above):

data.index = 8 * [0]
g = data.groupby('tag')
g.filter(lambda x: len(x) > 1)
Exception: Reindexing only valid with uniquely valued Index objects

@jreback (Contributor) commented Aug 21, 2013

@hayd can you make that a separate issue?

@hayd (Contributor) commented Aug 21, 2013

Separated into two issues.

@danielballan (Contributor, Author)

@hayd, @jreback

Considering an excerpt from SeriesGroupBy.filter,

    indexers = [self.obj.index.get_indexer(group.index)
                if wrapper(group) else [] for _, group in self]

    if len(indexers) == 0:
        filtered = self.obj.take([]) # because np.concatenate would fail
    else:
        filtered = self.obj.take(np.concatenate(indexers))
    if dropna:
        return filtered
    else:
        return filtered.reindex(self.obj.index) # Fill with NaNs.

I don't think sorting the index is the best default behavior. I think we should restore whatever ordering the original index had, e.g.,

return filtered.reindex(self.obj.index).dropna()

although that specific approach is no good if the result has NaNs.

But, more to the point, I'm not sure why the above loses the ordering to begin with. Is it the iteration in for _, group in self? I will keep exploring, but I would happily accept your two cents if you understand what's going on here.
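
A plausible explanation, using hayd's data from above: iterating a GroupBy yields the groups in (sorted) key order, so concatenating the per-group indexers orders rows by group, not by original position:

    import pandas as pd

    data = pd.DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3],
                         'tag': [23, 45, 62, 24, 45, 34, 25, 62]})
    for key, group in data.groupby('tag'):
        print(key, list(group.index))
    # 23 [0]
    # 24 [3]
    # 25 [6]
    # 34 [5]
    # 45 [1, 4]
    # 62 [2, 7]
    # The surviving groups 45 and 62 concatenate to [1, 4, 2, 7] -- the
    # out-of-order index seen in the example above.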

@jreback (Contributor) commented Oct 3, 2013

I think you can just add:

    if len(indexers) == 0:
        filtered = self.obj.take([]) # because np.concatenate would fail
    else:
        filtered = self.obj.take(np.sort(np.concatenate(indexers)))
    if dropna:
        return filtered
    else:
        return filtered.reindex(self.obj.index) # Fill with NaNs.

the indexers are not necessarily sorted (you can do this if self.sort == True, which is the default).

@danielballan (Contributor, Author)

I did not think this would work when the index is non-sequential, but tests show that it does. Fair enough. #5222 should close.
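
The reason this works even for a non-sequential index: get_indexer returns integer positions, not labels, so sorting the positions restores the original row order. A small illustration (toy data assumed):

    import numpy as np
    import pandas as pd

    s = pd.Series([10, 20, 30], index=[5, 2, 9])
    pos = s.index.get_indexer(pd.Index([9, 5]))  # array([2, 0]): positions, not labels
    s.take(np.sort(pos))                         # rows for index 5 then 9: original order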
