ENH: DataFrame isin #4237

TomAugspurger · 2013-07-14T14:21:47Z

WIP for now.

See #4211 for more info.

A few things to check before this is ready to merge:

This is basically a convenience method. I use the Series method for each column passed, and aggregate the methods sensibly.

A few questions before merge though:

Do we have a preference for any/all vs. and/or? Seems like any/all would be more consistent with other methods like df.dropna(how='all')
I'm not thrilled about all the nested if statements. The problem is accepting both dicts and just flat arrays as values. The idea is to use dicts like {colname: array_of_possible_values} if the set of possible values to match against differs by columns (most likely case I think). I need the extra nested if statement here:

        if how == 'and':
            if isinstance(values, dict):
                cond_n = len(values)
            else:
                cond_n = len(self.columns)  # Flat matching.
        elif how == 'or':
            cond_n = 1

since the two ways of calling will be a bit different. If you're passing a dict it will probably just be df.isin(values=dict). If you don't care which column matches what, the call will be df[subset_of_columns].isin(values). It's messy, but it works.

I check for empty values, and raise an error instead of just returning all Falses. Passing an empty list as values, it would actually would work. Series.isin([]) will return all Falses. But it would fail on an empty dict, so I just check for it ahead of time and raise an error.
Also, I may have messed up the history. I thought I rebased and squashed everything into one commit, but apparently not. I had some merge conflicts that I had to fix (just picking HEAD over the old stuff everywhere). I'll take a closer look at this. Also I may have done git pull instead of git pull --rebase at some point. Will that mess up the history?

Happy to move this to .13 also.

cpcloud · 2013-07-14T14:40:05Z

would an axis parameter be useful? i think it would be more consistent with the rest of pandas

cpcloud · 2013-07-14T15:05:35Z

you should remove that merge commit

# if you haven't added a pandas master remote
git remote add upstream git://github.com/pydata/pandas.git

# fetch latest changes
git fetch upstream
export EDITOR='gvim -f' # or emacs or whatever you want
git rebase -i upstream/master

after that you can hit the f key (fixup) or s key (squash) when the cursor is over the commit you'd like to fixup or squash

cpcloud · 2013-07-14T17:30:03Z

what i like to do when i'm close to finishing a PR and before I add release notes is

git fetch upstream
git rebase upstream/master

in fact i have alias gfrb='git fetch upstream && git rebase upstream/master' in my .zshrc the mnemonic is "git fetch rebase", it's not super general, but i pretty much name all of my upstreams upstream so it works in most cases and if it doesn't then there's always git reflog

this puts your changes right on top of the latest in master and generally makes things easier when you want to merge. you can add release notes before doing this but it most likely will cause a merge conflict (not a big deal, but still annoying)

check out the pandas wiki for some more git and dev tips.

hayd · 2013-07-14T20:05:20Z

I really think that the default behaviour should apply across every element (i.e. not be row-wise), like naive ~~(incorrect)~~ implementation:

def isin(self, items):
    return self.applymap(items.__contains__)

Can always then apply all/any and think it's probably easier to read:

df.isin(blah).all() # possibly setting axis

Saying that, if it can be more efficient to do the row-wise behaviour, having that as an option is a good thing. Though not entirely sold on 'and' and 'any' as argument values (but perhaps these are used elsewhere in codebase)...

hayd · 2013-07-14T20:06:10Z

but +1 for isin as method, good addition

cpcloud · 2013-07-14T20:18:26Z

i could also see this taking a dict of columns => sequences

cpcloud · 2013-07-14T20:19:29Z

nice dict is already there

hayd · 2013-07-14T21:09:13Z

Also, does this work for DataFrames with repeated cols/index (think you may get bit here)?

I'm afraid the naive implementation may be the easiest (and make most sense here)...

hayd · 2013-07-14T21:32:51Z

@TomAugspurger lib has an ismember cython function (which is what Series.isin uses), so this way is pretty fast):

def isin(self, items):
    return pd.DataFrame(lib.ismember(df.values.ravel(), set(items)).reshape(df.shape), df.index, df.columns)

could later generalise ismember to take a 2d array if you wanted.

cpcloud · 2013-07-14T21:33:30Z

flatten makes a copy use ravel instead

TomAugspurger · 2013-07-15T02:25:58Z

Thinking more about the axis argument. I've been thinking about row-wise mostly (axis=0). This will return a Series of bools where the index is the same as the original DataFrame. Presumably axis=1 will return a Series of bools whose index is the same as the columns of the Dataframe isin is called on. I'd also like to use @hayd's naiive implementation. Should this go in the axis argument also? Do we have something like axis=None to do things element-wise?

I think generalizing ismember will be worthwhile down the road. But maybe we should get a python implementation first, and I can tackle the Cython later.

TomAugspurger · 2013-07-15T03:36:52Z

Uhmm, I think I'm losing my mind a bit. axis=1 operates column-wise so with

In [9]: df
Out[9]: 
  ids ids2  vals
0   a    e     1
1   b    f     2
2   f    c     3
3   f    f     4

then we should get

In [15]: df.isin({'ids': ['a', 'b'], 'vals': [1, 3]}, how='any', axis=1)
Out[15]: 
0     True
1     True
2     True
3    False
dtype: bool

and axis=0 returns

In [16]: df.isin({'ids': ['a', 'b'], 'vals': [1, 3]}, how='any', axis=0)
Out[16]: 
vals    True
ids     True
dtype: bool

Or do I have that backwards?

Edit: I'm pretty sure this is backwards.

hayd · 2013-07-15T07:49:09Z

@TomAugspurger Just to step back a little, what is the benefit of using how (and axis) rather than just doing .any() or .all() after ?

I think for passing a dict behaviour should be something like

if isinstance(items, dict):
    if axis == 1:
       df = df.T
    return pd.concat(df[[k]].isin(v) for k, v in items.items())

Edit this isn't quite the way to do it, for one thing the df[[k]] term is pretty hairy, probably better to use loc/iloc (maybe even have an iloc argument) :s

docs. to be rebased ENH: Add isin method to DataFrame Basic tests. Added method and fixed tests. ENH: Add ordered argument to df.isin() Expects a sequence of arrays. Updated release notes for df.isin() CLN: cleanup going to remove ordered argument. Using a dict for ordered matching. Docs BUG: fixed subselection length check issues. Updated release notes for df.isin() remove merge conflict note

TomAugspurger · 2013-07-16T02:24:19Z

Okay, I went with Andy's suggestion and simplified things. Two cases

values is a flat array: will return a dataframe shaped identical to the first one. Plays nicely with .any() or .all() afterward
values is a dict of either {columns:[values]} or {rows:[values]} returns a subset of the original dataframe with just those columns/rows in the dict. This requires you to specify axis=1 for columns or axis=0 for rows.

Two questions before the PR is ready:

Is there a better way to handle the axis argument? I ignore it if values is flat, and require it if values is a dict.
When values is a dict, the result of df.isin() might not be sorted the same way as the original dataframe. I should probably sort it before returning, right? What's the best way of doing that?

Hopefully I've got it to a clean merge. I accidentally included one of @cpcloud's commits in there at one point and had to reset and force push.

hayd · 2013-07-16T13:27:55Z

I think we don't need {row:[values]} so can just lose the axis argument (they can always transpose, and I think that's probably an unusual use case) that way we can drop it entirely. Also (I think) your implementation breaks with duplicate column names :s

I have added a couple of commits to this (and changed some of the behaviour... will add as alternative pr). Let me know what you think.

TomAugspurger · 2013-07-16T13:32:12Z

Thanks. Agreed on the axis argument. It makes much more sense column--wise since the types will be homogenous down columns, but (probably) heterogenous across columns.

jreback · 2013-07-16T13:33:24Z

this is how you should iterate to potentially deal with dups (this is from core/frame.py)

 def itertuples(self, index=True):
        """
        Iterate over rows of DataFrame as tuples, with index value
        as first element of the tuple
        """
        arrays = []
        if index:
            arrays.append(self.index)

        # use integer indexing because of possible duplicate column names
        arrays.extend(self.iloc[:, k] for k in xrange(len(self.columns)))
        return izip(*arrays)

jreback · 2013-07-20T15:36:01Z

@TomAugspurger close this in favor of #4258 ?

jreback · 2013-07-20T15:36:36Z

@TomAugspurger sorry...I realized that's on top of yours..never mind :)

hayd mentioned this pull request Jul 16, 2013

ENH: Dataframe isin2 #4258

Merged

hayd merged commit 0309899 into pandas-dev:master Jul 24, 2013

TomAugspurger deleted the df_isin branch July 25, 2013 04:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: DataFrame isin #4237

ENH: DataFrame isin #4237

TomAugspurger commented Jul 14, 2013

cpcloud commented Jul 14, 2013

cpcloud commented Jul 14, 2013

cpcloud commented Jul 14, 2013

hayd commented Jul 14, 2013

hayd commented Jul 14, 2013

cpcloud commented Jul 14, 2013

cpcloud commented Jul 14, 2013

hayd commented Jul 14, 2013

hayd commented Jul 14, 2013

cpcloud commented Jul 14, 2013

TomAugspurger commented Jul 15, 2013

TomAugspurger commented Jul 15, 2013

hayd commented Jul 15, 2013

TomAugspurger commented Jul 16, 2013

hayd commented Jul 16, 2013

TomAugspurger commented Jul 16, 2013

jreback commented Jul 16, 2013

jreback commented Jul 20, 2013

jreback commented Jul 20, 2013

ENH: DataFrame isin #4237

ENH: DataFrame isin #4237

Conversation

TomAugspurger commented Jul 14, 2013

cpcloud commented Jul 14, 2013

cpcloud commented Jul 14, 2013

cpcloud commented Jul 14, 2013

hayd commented Jul 14, 2013

hayd commented Jul 14, 2013

cpcloud commented Jul 14, 2013

cpcloud commented Jul 14, 2013

hayd commented Jul 14, 2013

hayd commented Jul 14, 2013

cpcloud commented Jul 14, 2013

TomAugspurger commented Jul 15, 2013

TomAugspurger commented Jul 15, 2013

hayd commented Jul 15, 2013

TomAugspurger commented Jul 16, 2013

hayd commented Jul 16, 2013

TomAugspurger commented Jul 16, 2013

jreback commented Jul 16, 2013

jreback commented Jul 20, 2013

jreback commented Jul 20, 2013