Skip to content

ENH/API: DataFrame's isin should accept DataFrames #4421

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue Jul 31, 2013 · 14 comments · Fixed by #5199
Closed

ENH/API: DataFrame's isin should accept DataFrames #4421

TomAugspurger opened this issue Jul 31, 2013 · 14 comments · Fixed by #5199
Milestone

Comments

@TomAugspurger
Copy link
Contributor

From this SO question.

The user had two dataframes and wanted to use the second as the values argument to isin.

We'd need to think about how to handle the other index. In this case the user only cared about the columns, not the index labels.

Previous issues/PRs: #4258 and #4211

@hayd
Copy link
Contributor

hayd commented Aug 1, 2013

So, as you say, at the moment this works by passing in:

df1.isin(df2.to_dict(outtype='list'))

Is this an efficient operation? Should we just call that if passed a DataFrame?

@jreback
Copy link
Contributor

jreback commented Aug 1, 2013

FYI I believe this will fail with non-unique columns (as to_dict will fail); look at itertuples on a way to deal with this

@jreback
Copy link
Contributor

jreback commented Aug 12, 2013

Should isin accept something like a set operator mode?

maybe mode=diff for set differences, or mode=and for inclusion?

http://stackoverflow.com/questions/18180763/set-difference-for-pandas/18187648#18187648

@hayd
Copy link
Contributor

hayd commented Aug 21, 2013

see #4617 from same OP.

I'm not sure I'm 100% with how this was going to work, because the column approach no longer makes sense once we're talking about DataFrames, or at least is more ambiguous (aside from repeated columns):

In [2]: df1 = pd.DataFrame([[1, 2], [2, 3]])

In [3]: df2 = pd.DataFrame([[1, 3]])

In [4]: df2.isin(df1.to_dict(outtype='list'))
Out[4]: 
      0     1
0  True  True

which probably isn't what the user was expecting.

Which may be something like (possibly user doesn't care about the index):

In [11]: pd.Series(map(set(df1.itertuples()).__contains__, df2.itertuples()), df2.index)
Out[11]: 
0    False
dtype: bool

Ahem... Not sure what would be an efficient way to do that?

@jreback imo set intersection stuff can relatively easily be done after the fact, e.g. (all/any and boolean indexing, really don't think we need to add loads of kwargs.

@jreback
Copy link
Contributor

jreback commented Sep 30, 2013

@hayd @TomAugspurger for 0.13?

@hayd
Copy link
Contributor

hayd commented Sep 30, 2013

Definitely.

@jreback
Copy link
Contributor

jreback commented Oct 11, 2013

@hayd @TomAugspurger what is left on this?

@hayd
Copy link
Contributor

hayd commented Oct 11, 2013

To actually write this up... I think it's kind of important for 0.13 as otherwise isin is a little confusing (and will break code if we change the API later)....

Thinking about it again, do we actually want to eq (ag. there a problem with dupes here...):

In [32]: df1 = pd.DataFrame([[1, 2], [2, 3]], columns=['a', 'b'])

In [33]: df2 = pd.DataFrame([[1, 3]], columns=['a', 'b'])

In [34]: df2.isin(df1.to_dict(outtype='list'))
Out[34]: 
      a     b
0  True  True

In [35]: df2.eq(df1.reindex_like(df2))  # is this isin for DataFrames?
Out[35]: 
      a      b
0  True  False

Worst case we should NotImplement it with a mention of eq...

@TomAugspurger
Copy link
Contributor Author

@hayd Are you working on this one? I can give it a shot tonight / tomorrow if you're wanting to focus on the SQLAlchemy stuff.


Let's collect some thoughts on what should happen when another DataFrame is passed as values.

  1. Identical indicies, no dupes
In [20]: df1
Out[20]: 
   A   B
0  1   2
1  2 NaN
2  3   4
3  4   4

In [21]: df2
Out[21]: 
    A   B
0   0   2
1   2 NaN
2  12   4
3   4   5

# Expected: df1.isin(df2.to_dict(outtype='list')) does not work
# df1.eq(df2.reindex_like(df1)) does work
Out[23]: 
       A      B
0  False   True
1   True  False
2  False   True
3   True   False
  1. Dissimilar indices, still no dupes. Should be intersection of indices, check isin on those (optional parameter to ignore index?).
# df1 same as before
In [35]: df2 = pd.DataFrame({'A': [0, 2, 12, 4], 'C': [2, np.nan, 4, 5]}, index=[0, 1, 3, 4])

In [36]: df2
Out[36]: 
    A   C
0   0   2
1   2 NaN
3  12   4
4   4   5

# Intersection of columns is A, intersection of indices is [0, 1, 3]
# df1.isin(df2.to_dict(outtype='list')) does *NOT* work since it matches (A, 3), it ignores the index.
# Andy's df1.eq(df2.reindex_like(df1)) does work here.
In [39]: df1.isin(df2.to_dict(outtype='list'))
Out[39]: 
       A      B
0  False  False
1   True  False
2  False  False
3  False  False
  1. Duplicate columns on values (the argument). The two options are to raise or return True wherever there's a match.
In [74]: df1
Out[74]: 
   A   B
0  1   2
1  2 NaN
2  3   4
3  4   4

In [75]: df2
Out[75]: 
    A   A
0   0   1
1   2   4
3   2 NaN
4   4   5

# Expected if we return True wherever there's a True

       A      B
0   True  False   # from the second A column
1   True  False   # from the first A column
2  False  False
3  False  False

The df1.eq(df2.reindex_like(df1)) does not work here, this hits the maximum recursion depth.
I had my dfs backwards; that's actually a ValueError: cannot reindex from a duplicate axis.

@TomAugspurger
Copy link
Contributor Author

Oh we definitely need to get this in for .13. I think DataFrames and Series should behave similarly with respect to labels. Currently, if values is a Series, the index is ignored:

In [150]: df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, np.nan, 4, 4]}, index=['a','b','c','d'])

In [151]: s = pd.Series([1, 3, 11, 12], index=['a','b','c','d'])

In [152]: df.isin(s)
Out[152]: 
       A      B
a   True  False
b  False  False
c   True  False
d  False  False

So (A, c) matches on the 3 from the Series, despite the index of that 3 being b, not c.

What's everyone's thoughts on respecting index/column labels? I'd say that (A, c) should be False here since the labels don't match.

@jreback
Copy link
Contributor

jreback commented Oct 12, 2013

@TomAugspurger there was a bug in comparing duplicate frames (so your above comparison will work as you had it, or raise if you don't reindex_like), but won't infinity recurse (which generally is a bad thing :)

will merge in a few

In [1]: df1 = DataFrame([[1,2],[2,np.nan],[3,4],[4,4]],columns=['A','B'])

In [2]: df2 = DataFrame([[0,1],[2,4],[2,np.nan],[4,5]],columns=['A','A'])

In [3]: df1
Out[3]: 
   A   B
0  1   2
1  2 NaN
2  3   4
3  4   4

In [4]: df2
Out[4]: 
   A   A
0  0   1
1  2   4
2  2 NaN
3  4   5

In [5]: df1.reindex_like(df2) == df2
Out[5]: 
       A      A
0  False   True
1   True  False
2  False  False
3   True  False

@jreback
Copy link
Contributor

jreback commented Oct 12, 2013

@TomAugspurger if you need to test if the the columns are duplicated, use df.columns.is_unique

@jreback
Copy link
Contributor

jreback commented Oct 12, 2013

@TomAugspurger that bug fix merged in....

@hayd
Copy link
Contributor

hayd commented Oct 14, 2013

One unfortunate part of this the NaN handling... In your first example:

In [15]: df1
Out[15]: 
   A   B
0  1   2
1  2 NaN
2  3   4
3  4   4

In [16]: df2
Out[16]: 
    A   B
0   0   2
1   2 NaN
2  12   4
3   4   5

In [17]: df1.eq(df2.reindex_like(df1))
Out[17]: 
       A      B
0  False   True
1   True  False
2  False   True
3   True  False

But actually that's ok (or perhaps a separate issue) since this is the behavior of Series isin...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants