-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: DataFrame isin #4237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: DataFrame isin #4237
Conversation
would an |
you should remove that merge commit # if you haven't added a pandas master remote
git remote add upstream git://github.com/pydata/pandas.git
# fetch latest changes
git fetch upstream
export EDITOR='gvim -f' # or emacs or whatever you want
git rebase -i upstream/master after that you can hit the |
what i like to do when i'm close to finishing a PR and before I add release notes is git fetch upstream
git rebase upstream/master in fact i have this puts your changes right on top of the latest in check out the pandas wiki for some more git and dev tips. |
I really think that the default behaviour should apply across every element (i.e. not be row-wise), like naive
Can always then apply all/any and think it's probably easier to read:
Saying that, if it can be more efficient to do the row-wise behaviour, having that as an option is a good thing. Though not entirely sold on 'and' and 'any' as argument values (but perhaps these are used elsewhere in codebase)... |
but +1 for isin as method, good addition |
i could also see this taking a dict of columns => sequences |
nice |
Also, does this work for DataFrames with repeated cols/index (think you may get bit here)? I'm afraid the naive implementation may be the easiest (and make most sense here)... |
@TomAugspurger lib has an ismember cython function (which is what Series.isin uses), so this way is pretty fast):
could later generalise ismember to take a 2d array if you wanted. |
|
Thinking more about the axis argument. I've been thinking about row-wise mostly (axis=0). This will return a Series of bools where the index is the same as the original DataFrame. Presumably axis=1 will return a Series of bools whose index is the same as the columns of the Dataframe I think generalizing ismember will be worthwhile down the road. But maybe we should get a python implementation first, and I can tackle the Cython later. |
Uhmm, I think I'm losing my mind a bit. In [9]: df
Out[9]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
3 f f 4 then we should get In [15]: df.isin({'ids': ['a', 'b'], 'vals': [1, 3]}, how='any', axis=1)
Out[15]:
0 True
1 True
2 True
3 False
dtype: bool and In [16]: df.isin({'ids': ['a', 'b'], 'vals': [1, 3]}, how='any', axis=0)
Out[16]:
vals True
ids True
dtype: bool Or do I have that backwards? Edit: I'm pretty sure this is backwards. |
@TomAugspurger Just to step back a little, what is the benefit of using how (and axis) rather than just doing .any() or .all() after ? I think for passing a dict behaviour should be something like
Edit this isn't quite the way to do it, for one thing the df[[k]] term is pretty hairy, probably better to use loc/iloc (maybe even have an iloc argument) :s |
docs. to be rebased ENH: Add isin method to DataFrame Basic tests. Added method and fixed tests. ENH: Add ordered argument to df.isin() Expects a sequence of arrays. Updated release notes for df.isin() CLN: cleanup going to remove ordered argument. Using a dict for ordered matching. Docs BUG: fixed subselection length check issues. Updated release notes for df.isin() remove merge conflict note
Okay, I went with Andy's suggestion and simplified things. Two cases
Two questions before the PR is ready:
Hopefully I've got it to a clean merge. I accidentally included one of @cpcloud's commits in there at one point and had to reset and force push. |
I think we don't need {row:[values]} so can just lose the axis argument (they can always transpose, and I think that's probably an unusual use case) that way we can drop it entirely. Also (I think) your implementation breaks with duplicate column names :s I have added a couple of commits to this (and changed some of the behaviour... will add as alternative pr). Let me know what you think. |
Thanks. Agreed on the axis argument. It makes much more sense column--wise since the types will be homogenous down columns, but (probably) heterogenous across columns. |
this is how you should iterate to potentially deal with dups (this is from core/frame.py)
|
@TomAugspurger close this in favor of #4258 ? |
@TomAugspurger sorry...I realized that's on top of yours..never mind :) |
WIP for now.
See #4211 for more info.
A few things to check before this is ready to merge:
This is basically a convenience method. I use the
Series
method for each column passed, and aggregate the methods sensibly.A few questions before merge though:
any
/all
vs.and
/or
? Seems likeany/all
would be more consistent with other methods likedf.dropna(how='all')
{colname: array_of_possible_values}
if the set of possible values to match against differs by columns (most likely case I think). I need the extra nested if statement here:since the two ways of calling will be a bit different. If you're passing a dict it will probably just be df.isin(values=dict). If you don't care which column matches what, the call will be df[subset_of_columns].isin(values). It's messy, but it works.
False
s. Passing an empty list as values, it would actually would work.Series.isin([])
will return all Falses. But it would fail on an empty dict, so I just check for it ahead of time and raise an error.git pull
instead ofgit pull --rebase
at some point. Will that mess up the history?Happy to move this to .13 also.