Adding lambda support inside of __getitem__ for DataFrame, Series, .. etc. #2560


Closed
spearsem opened this issue Dec 18, 2012 · 4 comments · Fixed by #4164

@spearsem

To avoid the verbose syntax currently needed to select across many columns of a data frame, here's a suggestion. Inside DataFrame's __getitem__ method, add special-case logic for when a lambda function is passed in: apply the lambda to the DataFrame itself and select items based on the lambda's result.

Here's an example of what I mean. Suppose that I create a data frame named "dfrm" and it has columns A, B, C, D, and E. Then currently, the following syntax will work to sub-select across conditions on the A and B columns:

dfrm[(lambda x: (x.A < 0) & (x.B > 0))(dfrm)]

By adding the extra handling to __getitem__, you can remove the need for the last set of parentheses, where dfrm itself is passed as the argument to the lambda. __getitem__ can check for a callable and always pass the DataFrame itself to the callable, so that the syntax would look like this:

dfrm[(lambda x: (x.A < 0) & (x.B > 0))]
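
A minimal sketch of the dispatch described above, written against a hypothetical stand-in rather than pandas itself. The names Frame, Column, and Mask are illustrative assumptions, not pandas API; the only point is the callable check inside __getitem__.

```python
class Mask:
    """Hypothetical boolean row mask (stand-in for a boolean Series)."""

    def __init__(self, flags):
        self.flags = [bool(f) for f in flags]

    def __and__(self, other):
        # Element-wise AND of two masks.
        return Mask(a and b for a, b in zip(self.flags, other.flags))


class Column:
    """Hypothetical column with element-wise comparisons."""

    def __init__(self, values):
        self.values = list(values)

    def __lt__(self, other):
        return Mask(v < other for v in self.values)

    def __gt__(self, other):
        return Mask(v > other for v in self.values)


class Frame:
    """Hypothetical stand-in for a DataFrame."""

    def __init__(self, columns):
        self.columns = {name: Column(vals) for name, vals in columns.items()}

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails: expose columns as attributes.
        try:
            return self.__dict__["columns"][name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, key):
        # The proposed special case: a callable is applied to the frame first,
        # and its result is then used as the real key.
        if callable(key):
            key = key(self)
        if isinstance(key, Mask):
            return Frame({
                name: [v for v, keep in zip(col.values, key.flags) if keep]
                for name, col in self.columns.items()
            })
        return self.columns[key]


dfrm = Frame({"A": [-1, 2, -3], "B": [5, 6, -7]})
explicit = dfrm[(dfrm.A < 0) & (dfrm.B > 0)]    # current, verbose form
short = dfrm[lambda x: (x.A < 0) & (x.B > 0)]   # proposed form
```

Both selections keep the same rows; the second just lets __getitem__ supply the frame as the lambda's argument.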

@ghost

ghost commented Dec 18, 2012

I'm surprised this doesn't work:

dfrm.ix[(dfrm.A < 0) and (dfrm.B > 0)]

because and'ing two arrays together is not currently a vector operation;
perhaps it should be in this case.

@jreback
Contributor

jreback commented Dec 18, 2012

Does the following do what you want?

In [3]: df = pd.DataFrame(np.random.randn(20,3),columns=['A','B','C'])

In [4]: df
Out[4]: 
           A         B         C
0   0.334712 -0.557606  0.344016
1   0.549630 -0.264684 -0.916011
2   1.655768 -0.908992 -0.063336
3  -0.142142  0.259900 -0.260913
4   2.160908  0.239873  0.321448
5   1.650202  0.077349 -0.068250
6   0.354457  0.530161 -1.758845
7  -0.803534  0.015683  0.424979
8  -1.436670 -1.168130  0.222747
9   1.525383  0.363306 -0.192263
10  0.069851  0.850365  1.741803
11 -0.515722  1.348962 -0.375264
12 -0.204887  1.114886 -0.928263
13  0.612595  1.547913  0.336282
14 -0.780298  0.926265 -0.006614
15  1.213962  1.618504  0.133741
16  0.870338  0.146988  2.189953
17 -2.041328  1.338305 -0.129272
18  0.014687  1.048986 -1.525997
19 -1.147067  0.379734 -1.331019

In [5]: df[(df.A < 0) & (df.B > 0)]
Out[5]: 
           A         B         C
3  -0.142142  0.259900 -0.260913
7  -0.803534  0.015683  0.424979
11 -0.515722  1.348962 -0.375264
12 -0.204887  1.114886 -0.928263
14 -0.780298  0.926265 -0.006614
17 -2.041328  1.338305 -0.129272
19 -1.147067  0.379734 -1.331019

@spearsem
Author

That does it, but my goal with the lambdas is specifically to avoid needing to verbosely type out the dataframe's name -dot- attr_name for all the columns involved in the selection. The lambda at least lets me reduce it just to "x".

As for "and" not working where & works, this is a known limitation of Python: the built-in logical operators cannot be overloaded to work element-wise on arrays. It gives the classic "truth value of an array is ambiguous" error.
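
To illustrate the point about "and" versus &, here is a small sketch with a hypothetical BoolVector class (not a NumPy or pandas type). A class can define & through __and__, but "and" always asks the left operand for a single truth value first, which an array-like object cannot sensibly give.

```python
class BoolVector:
    """Hypothetical array-like of booleans, used only for illustration."""

    def __init__(self, flags):
        self.flags = list(flags)

    def __and__(self, other):
        # `&` is overloadable, so it can work element by element.
        return BoolVector(a and b for a, b in zip(self.flags, other.flags))

    def __bool__(self):
        # `and` calls this on the left operand; refuse, as array libraries do.
        raise ValueError("truth value of a multi-element vector is ambiguous")


left = BoolVector([True, False, True])
right = BoolVector([True, True, False])

combined = left & right  # element-wise result

try:
    left and right       # Python evaluates bool(left) before anything else
    and_error = None
except ValueError as exc:
    and_error = str(exc)
```

Here combined holds the element-wise result, while the "and" expression raises before the right operand is even looked at.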


@cpcloud
Member

cpcloud commented Jul 29, 2013

This is addressed by #4164.
