Skip to content

ENH: DataFrame isin #4237

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 24, 2013
Merged

ENH: DataFrame isin #4237

merged 1 commit into from
Jul 24, 2013

Conversation

TomAugspurger
Copy link
Contributor

WIP for now.

See #4211 for more info.

A few things to check before this is ready to merge:

This is basically a convenience method. I use the Series method for each column passed, and aggregate the methods sensibly.

A few questions before merge though:

  • Do we have a preference for any/all vs. and/or? Seems like any/all would be more consistent with other methods like df.dropna(how='all')
  • I'm not thrilled about all the nested if statements. The problem is accepting both dicts and just flat arrays as values. The idea is to use dicts like {colname: array_of_possible_values} if the set of possible values to match against differs by columns (most likely case I think). I need the extra nested if statement here:
        if how == 'and':
            if isinstance(values, dict):
                cond_n = len(values)
            else:
                cond_n = len(self.columns)  # Flat matching.
        elif how == 'or':
            cond_n = 1

since the two ways of calling will be a bit different. If you're passing a dict it will probably just be df.isin(values=dict). If you don't care which column matches what, the call will be df[subset_of_columns].isin(values). It's messy, but it works.

  • I check for empty values, and raise an error instead of just returning all Falses. Passing an empty list as values, it would actually would work. Series.isin([]) will return all Falses. But it would fail on an empty dict, so I just check for it ahead of time and raise an error.
  • Also, I may have messed up the history. I thought I rebased and squashed everything into one commit, but apparently not. I had some merge conflicts that I had to fix (just picking HEAD over the old stuff everywhere). I'll take a closer look at this. Also I may have done git pull instead of git pull --rebase at some point. Will that mess up the history?

Happy to move this to .13 also.

@cpcloud
Copy link
Member

cpcloud commented Jul 14, 2013

would an axis parameter be useful? i think it would be more consistent with the rest of pandas

@cpcloud
Copy link
Member

cpcloud commented Jul 14, 2013

you should remove that merge commit

# if you haven't added a pandas master remote
git remote add upstream git://github.com/pydata/pandas.git

# fetch latest changes
git fetch upstream
export EDITOR='gvim -f' # or emacs or whatever you want
git rebase -i upstream/master

after that you can hit the f key (fixup) or s key (squash) when the cursor is over the commit you'd like to fixup or squash

@cpcloud
Copy link
Member

cpcloud commented Jul 14, 2013

what i like to do when i'm close to finishing a PR and before I add release notes is

git fetch upstream
git rebase upstream/master

in fact i have alias gfrb='git fetch upstream && git rebase upstream/master' in my .zshrc the mnemonic is "git fetch rebase", it's not super general, but i pretty much name all of my upstreams upstream so it works in most cases and if it doesn't then there's always git reflog

this puts your changes right on top of the latest in master and generally makes things easier when you want to merge. you can add release notes before doing this but it most likely will cause a merge conflict (not a big deal, but still annoying)

check out the pandas wiki for some more git and dev tips.

@hayd
Copy link
Contributor

hayd commented Jul 14, 2013

I really think that the default behaviour should apply across every element (i.e. not be row-wise), like naive (incorrect) implementation:

def isin(self, items):
    return self.applymap(items.__contains__)

Can always then apply all/any and think it's probably easier to read:

df.isin(blah).all() # possibly setting axis

Saying that, if it can be more efficient to do the row-wise behaviour, having that as an option is a good thing. Though not entirely sold on 'and' and 'any' as argument values (but perhaps these are used elsewhere in codebase)...

@hayd
Copy link
Contributor

hayd commented Jul 14, 2013

but +1 for isin as method, good addition

@cpcloud
Copy link
Member

cpcloud commented Jul 14, 2013

i could also see this taking a dict of columns => sequences

@cpcloud
Copy link
Member

cpcloud commented Jul 14, 2013

nice dict is already there

@hayd
Copy link
Contributor

hayd commented Jul 14, 2013

Also, does this work for DataFrames with repeated cols/index (think you may get bit here)?

I'm afraid the naive implementation may be the easiest (and make most sense here)...

@hayd
Copy link
Contributor

hayd commented Jul 14, 2013

@TomAugspurger lib has an ismember cython function (which is what Series.isin uses), so this way is pretty fast):

def isin(self, items):
    return pd.DataFrame(lib.ismember(df.values.ravel(), set(items)).reshape(df.shape), df.index, df.columns)

could later generalise ismember to take a 2d array if you wanted.

@cpcloud
Copy link
Member

cpcloud commented Jul 14, 2013

flatten makes a copy use ravel instead

@TomAugspurger
Copy link
Contributor Author

Thinking more about the axis argument. I've been thinking about row-wise mostly (axis=0). This will return a Series of bools where the index is the same as the original DataFrame. Presumably axis=1 will return a Series of bools whose index is the same as the columns of the Dataframe isin is called on. I'd also like to use @hayd's naiive implementation. Should this go in the axis argument also? Do we have something like axis=None to do things element-wise?

I think generalizing ismember will be worthwhile down the road. But maybe we should get a python implementation first, and I can tackle the Cython later.

@TomAugspurger
Copy link
Contributor Author

Uhmm, I think I'm losing my mind a bit. axis=1 operates column-wise so with

In [9]: df
Out[9]: 
  ids ids2  vals
0   a    e     1
1   b    f     2
2   f    c     3
3   f    f     4

then we should get

In [15]: df.isin({'ids': ['a', 'b'], 'vals': [1, 3]}, how='any', axis=1)
Out[15]: 
0     True
1     True
2     True
3    False
dtype: bool

and axis=0 returns

In [16]: df.isin({'ids': ['a', 'b'], 'vals': [1, 3]}, how='any', axis=0)
Out[16]: 
vals    True
ids     True
dtype: bool

Or do I have that backwards?

Edit: I'm pretty sure this is backwards.

@hayd
Copy link
Contributor

hayd commented Jul 15, 2013

@TomAugspurger Just to step back a little, what is the benefit of using how (and axis) rather than just doing .any() or .all() after ?

I think for passing a dict behaviour should be something like

if isinstance(items, dict):
    if axis == 1:
       df = df.T
    return pd.concat(df[[k]].isin(v) for k, v in items.items())

Edit this isn't quite the way to do it, for one thing the df[[k]] term is pretty hairy, probably better to use loc/iloc (maybe even have an iloc argument) :s

docs. to be rebased

ENH: Add isin method to DataFrame

Basic tests.

Added method and fixed tests.

ENH: Add ordered argument to df.isin()

Expects a sequence of arrays.

Updated release notes for df.isin()

CLN: cleanup

going to remove ordered argument.

Using a dict for ordered matching. Docs

BUG: fixed subselection length check issues.

Updated release notes for df.isin()

remove merge conflict note
@TomAugspurger
Copy link
Contributor Author

Okay, I went with Andy's suggestion and simplified things. Two cases

  • values is a flat array: will return a dataframe shaped identical to the first one. Plays nicely with .any() or .all() afterward
  • values is a dict of either {columns:[values]} or {rows:[values]} returns a subset of the original dataframe with just those columns/rows in the dict. This requires you to specify axis=1 for columns or axis=0 for rows.

Two questions before the PR is ready:

  • Is there a better way to handle the axis argument? I ignore it if values is flat, and require it if values is a dict.
  • When values is a dict, the result of df.isin() might not be sorted the same way as the original dataframe. I should probably sort it before returning, right? What's the best way of doing that?

Hopefully I've got it to a clean merge. I accidentally included one of @cpcloud's commits in there at one point and had to reset and force push.

@hayd
Copy link
Contributor

hayd commented Jul 16, 2013

I think we don't need {row:[values]} so can just lose the axis argument (they can always transpose, and I think that's probably an unusual use case) that way we can drop it entirely. Also (I think) your implementation breaks with duplicate column names :s

I have added a couple of commits to this (and changed some of the behaviour... will add as alternative pr). Let me know what you think.

@TomAugspurger
Copy link
Contributor Author

Thanks. Agreed on the axis argument. It makes much more sense column--wise since the types will be homogenous down columns, but (probably) heterogenous across columns.

@hayd hayd mentioned this pull request Jul 16, 2013
@jreback
Copy link
Contributor

jreback commented Jul 16, 2013

this is how you should iterate to potentially deal with dups (this is from core/frame.py)

 def itertuples(self, index=True):
        """
        Iterate over rows of DataFrame as tuples, with index value
        as first element of the tuple
        """
        arrays = []
        if index:
            arrays.append(self.index)

        # use integer indexing because of possible duplicate column names
        arrays.extend(self.iloc[:, k] for k in xrange(len(self.columns)))
        return izip(*arrays)

@jreback
Copy link
Contributor

jreback commented Jul 20, 2013

@TomAugspurger close this in favor of #4258 ?

@jreback
Copy link
Contributor

jreback commented Jul 20, 2013

@TomAugspurger sorry...I realized that's on top of yours..never mind :)

@hayd hayd merged commit 0309899 into pandas-dev:master Jul 24, 2013
@TomAugspurger TomAugspurger deleted the df_isin branch July 25, 2013 04:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants