filtration chain for DataFrames #11875

danfrankj · 2015-12-21T02:59:44Z

Inspired by the R package dplyr, I'd like to be able to chain together my data manipulation, but I can't find an elegant way to do a filter.

For example, if I have a group -> aggregate -> filter sequence, I'd do something like this:


In [13]: df
Out[13]: 
     A      B      C      D
0  foo    one -0.193 -0.900
1  bar    one  1.505  0.223
2  foo    two  0.646 -0.025
3  bar  three -0.072  1.227
4  foo    two -1.367 -0.873
5  bar    two  1.176  2.317
6  foo    one  0.424  0.858
7  foo  three  2.129  0.038

In [14]: aggd = df.groupby('A').sum()

In [15]: final = aggd[aggd['C'] > 2]

In [16]: final
Out[16]: 
         C      D
A                
bar  2.608  3.767

What I'd like to be able to do is:

In [21]: df.groupby('A').sum().apply_filter(lambda row: row['C'] > 2, axis=1)

Out[21]: 
         C      D
A                
bar  2.608  3.767

This should be a pretty simple addition, basically a call to DataFrame.apply. I'm happy to open a PR for it, but wanted to see what you guys think first.


max-sixty · 2015-12-21T05:09:15Z

How about .query:

In [81]: df
Out[81]: 
          a         b         c
0  0.778730  0.784767  0.798046
1  0.182564  0.686324  0.431897
2  0.149061  0.290067  0.397787
3  0.749971  0.050980  0.995215
4  0.144524  0.863902  0.973320
5  0.480789  0.492512  0.834956
6  0.251052  0.619787  0.237869
7  0.488043  0.793807  0.314146
8  0.816102  0.615878  0.900229
9  0.111648  0.431056  0.392364

In [82]: df.query('b>.5')
Out[82]: 
          a         b         c
0  0.778730  0.784767  0.798046
1  0.182564  0.686324  0.431897
4  0.144524  0.863902  0.973320
6  0.251052  0.619787  0.237869
7  0.488043  0.793807  0.314146
8  0.816102  0.615878  0.900229
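Applied to the group -> aggregate -> filter case from the first post, this might look like the sketch below. The frame is rebuilt from the values printed there; selecting C and D before summing is an assumption added here so that the string column B stays out of the aggregate.

```python
import pandas as pd

# Data copied from the example at the top of the thread.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": [-0.193, 1.505, 0.646, -0.072, -1.367, 1.176, 0.424, 2.129],
    "D": [-0.900, 0.223, -0.025, 1.227, -0.873, 2.317, 0.858, 0.038],
})

# group -> aggregate -> filter in a single chain.
# query() evaluates the string against the columns of the aggregated frame.
final = df.groupby("A")[["C", "D"]].sum().query("C > 2")
print(final)
```

Only the bar group survives the filter, since its summed C is the only one above 2.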

TomAugspurger · 2015-12-21T12:35:05Z

Good call @MaximilianR. @danfrankj let us know if query isn't exactly what you were looking for. It might be a bit limited compared to dplyr since it doesn't take callables, but I think there's an open issue to allow this.

danfrankj · 2015-12-23T14:49:34Z

Agreed, query should work, though it's clunky at the moment. I assume the issue you're referring to is this one: #3393

danfrankj · 2016-04-07T01:11:59Z

Hey folks, I keep coming back to this because:

- writing strings in DataFrame.query is weird, both from a syntax perspective and from a "knowing what's going on" perspective
- it doesn't seem like query is going to accept functions anytime soon
- we might want to filter based on another axis

What do y'all think of this example implementation?

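def filter_values(self, func, axis=0):
    # func returns one boolean per column (axis=0) or per row (axis=1)
    ind = self.apply(func, axis=axis)
    if axis == 0:
        # keep the columns for which func returned True
        return self.loc[:, ind]
    else:
        # keep the rows for which func returned True
        return self.loc[ind, :]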
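To try this without patching DataFrame, the same logic can be written as a free function and chained with DataFrame.pipe. This is only a sketch: the standalone form, the `frame` and `mask` names, and the small aggregated frame (rounded group sums of the data from the first post) are assumptions made for illustration.

```python
import pandas as pd


def filter_values(frame, func, axis=0):
    """Keep the columns (axis=0) or rows (axis=1) for which func returns True."""
    mask = frame.apply(func, axis=axis)
    if axis == 0:
        return frame.loc[:, mask]
    return frame.loc[mask, :]


# Stand-in for df.groupby('A').sum(), using rounded sums from the first post.
agg = pd.DataFrame(
    {"C": [2.609, 1.639], "D": [3.767, -0.902]},
    index=pd.Index(["bar", "foo"], name="A"),
)

# Row filter: keep groups whose summed C exceeds 2.
rows = agg.pipe(filter_values, lambda row: row["C"] > 2, axis=1)

# Column filter: keep columns whose values are all positive.
cols = agg.pipe(filter_values, lambda col: (col > 0).all(), axis=0)

print(rows)
print(cols)
```

Note that with axis=1 the function is called once per row through apply, so on large frames a vectorized boolean mask would likely be faster.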

@TomAugspurger
@MaximilianR

TomAugspurger · 2016-04-07T01:13:47Z

Take a look at #12539

I think it implements what you're thinking.
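Assuming the linked change is about letting indexers accept callables (an assumption, since the thread does not spell it out), a chained filter in a pandas version that supports callable indexers might look like this sketch. The small frame is illustrative, not taken from the PR.

```python
import pandas as pd

# Illustrative aggregated frame, indexed by group label.
agg = pd.DataFrame(
    {"C": [2.609, 1.639], "D": [3.767, -0.902]},
    index=pd.Index(["bar", "foo"], name="A"),
)

# The callable receives the frame being indexed, so no intermediate
# variable is needed in the middle of a chain.
by_row = agg.loc[lambda d: d["C"] > 2]

# The same idea on the other axis: keep columns whose values are all positive.
by_col = agg.loc[:, lambda d: (d > 0).all()]

print(by_row)
print(by_col)
```

Because the callable returns a vectorized boolean mask, this avoids both the query string and the per-row apply.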

jreback · 2016-04-07T01:14:31Z

Actually, this will be merged shortly. I see @TomAugspurger pointed at the issue already! Thanks!

danfrankj · 2016-04-07T01:19:51Z

awesome, one step ahead of me. Great stuff guys, thanks!

TomAugspurger closed this as completed Dec 21, 2015
