filtration chain for DataFrames #11875

Closed
danfrankj opened this issue Dec 21, 2015 · 7 comments

@danfrankj
Contributor

Inspired by the R package dplyr, I'd like to be able to chain my data manipulations together, but I can't find an elegant way to do a filter.

For example, if I have a group -> aggregate -> filter sequence, I'd do something like this:


In [13]: df
Out[13]: 
     A      B      C      D
0  foo    one -0.193 -0.900
1  bar    one  1.505  0.223
2  foo    two  0.646 -0.025
3  bar  three -0.072  1.227
4  foo    two -1.367 -0.873
5  bar    two  1.176  2.317
6  foo    one  0.424  0.858
7  foo  three  2.129  0.038

In [14]: aggd = df.groupby('A').sum()

In [15]: final = aggd[aggd['C'] > 2]

In [16]: final
Out[16]: 
         C      D
A                
bar  2.608  3.767

What I'd like to be able to do is:

In [21]: df.groupby('A').sum().apply_filter(lambda row: row['C'] > 2, axis=1)

Out[21]: 
         C      D
A                
bar  2.608  3.767

This should be a pretty simple addition, basically calling DataFrame.apply. I'm happy to open a PR for it, but wanted to see what you guys think first.

@max-sixty
Contributor

How about .query:

In [81]: df
Out[81]: 
          a         b         c
0  0.778730  0.784767  0.798046
1  0.182564  0.686324  0.431897
2  0.149061  0.290067  0.397787
3  0.749971  0.050980  0.995215
4  0.144524  0.863902  0.973320
5  0.480789  0.492512  0.834956
6  0.251052  0.619787  0.237869
7  0.488043  0.793807  0.314146
8  0.816102  0.615878  0.900229
9  0.111648  0.431056  0.392364

In [82]: df.query('b>.5')
Out[82]: 
          a         b         c
0  0.778730  0.784767  0.798046
1  0.182564  0.686324  0.431897
4  0.144524  0.863902  0.973320
6  0.251052  0.619787  0.237869
7  0.488043  0.793807  0.314146
8  0.816102  0.615878  0.900229
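
Applied to the group -> aggregate -> filter case from the opening post, the same idea can be chained directly. This is a minimal sketch: the frame is rebuilt from the values shown above, and the `[["C", "D"]]` selection is an addition here so that the string column B is not summed.

```python
import pandas as pd

# Rebuild the example frame from the opening post.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": [-0.193, 1.505, 0.646, -0.072, -1.367, 1.176, 0.424, 2.129],
    "D": [-0.900, 0.223, -0.025, 1.227, -0.873, 2.317, 0.858, 0.038],
})

# group -> aggregate -> filter in one chain; query takes the condition
# as a string evaluated against the aggregated frame's columns.
result = df.groupby("A")[["C", "D"]].sum().query("C > 2")

# Only the 'bar' group has a C total above 2.
print(result)
```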

@TomAugspurger
Contributor

Good call @MaximilianR. @danfrankj let us know if query isn't exactly what you were looking for. It might be a bit limited compared to dplyr since it doesn't take callables, but I think there's an open issue to allow this.

@danfrankj
Contributor Author

Agreed, query should work, though it's clunky at the moment. I assume the issue you're referring to is this one: #3393

@danfrankj
Contributor Author

Hey folks, I keep coming back to this because:

  1. writing strings in DataFrame.query is weird, both from a syntax perspective and from a "knowing what's going on" perspective
  2. it doesn't seem like query is going to accept functions anytime soon?
  3. we might want to filter based on another axis.

What do y'all think of this example implementation?

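def filter_values(self, func, axis=0):
    # func must return one boolean per column (axis=0) or per row (axis=1)
    ind = self.apply(func, axis=axis)
    if axis == 0:
        # apply ran over columns, so keep the columns flagged True
        return self.loc[:, ind]
    else:
        # apply ran over rows, so keep the rows flagged True
        return self.loc[ind, :]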

@TomAugspurger
@MaximilianR
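
The proposed function can be tried without changing pandas by passing it through `DataFrame.pipe`, which hands the frame in as the first argument. The sketch below is only an illustration of the proposal: `filter_values` is not a pandas method, the parameter is renamed from `self` to `frame` because it is a plain function here, and the example frame is rebuilt from the opening post.

```python
import pandas as pd

def filter_values(frame, func, axis=0):
    """Keep the columns (axis=0) or rows (axis=1) for which func returns True.

    Hypothetical helper from this thread, not part of pandas.
    """
    mask = frame.apply(func, axis=axis)
    if axis == 0:
        return frame.loc[:, mask]
    return frame.loc[mask, :]

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": [-0.193, 1.505, 0.646, -0.072, -1.367, 1.176, 0.424, 2.129],
    "D": [-0.900, 0.223, -0.025, 1.227, -0.873, 2.317, 0.858, 0.038],
})

# Numeric columns are selected explicitly so that B is not aggregated.
agg = df.groupby("A")[["C", "D"]].sum()

# Row filter: keep groups whose C total exceeds 2.
rows = agg.pipe(filter_values, lambda row: row["C"] > 2, axis=1)

# Column filter: keep columns whose largest group total exceeds 3.
cols = agg.pipe(filter_values, lambda col: col.max() > 3, axis=0)
```

Calling `func` once per row through `apply(axis=1)` is simple but slow on large frames; a vectorised condition avoids that cost.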

@TomAugspurger
Contributor

Take a look at #12539

I think it implements what you're thinking.
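
This thread does not say what #12539 contains, so the following is an assumption rather than a description of that change: pandas 0.18.1 and later accept a callable inside indexers such as `.loc`, which allows the chained filter requested above without a string expression. A minimal sketch, with the frame rebuilt from the opening post:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": [-0.193, 1.505, 0.646, -0.072, -1.367, 1.176, 0.424, 2.129],
    "D": [-0.900, 0.223, -0.025, 1.227, -0.873, 2.317, 0.858, 0.038],
})

# The callable receives the intermediate (aggregated) frame and returns
# a boolean mask, so no temporary variable is needed in the chain.
result = (
    df.groupby("A")[["C", "D"]]
      .sum()
      .loc[lambda agg: agg["C"] > 2]
)
```

Unlike the row-wise `apply` approach, the mask here is computed on the whole column at once.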

@jreback
Contributor

jreback commented Apr 7, 2016

Actually, this will be merged shortly. I see @TomAugspurger pointed at the issue already! Thanks!

@danfrankj
Contributor Author

awesome, one step ahead of me. Great stuff guys, thanks!
