ENH/API: DataFrame's isin should accept DataFrames #4421
Comments
So, as you say, at the moment this works by passing in:
Is this an efficient operation? Should we just call that if passed a DataFrame? |
FYI I believe this will fail with non-unique columns (as to_dict will fail); look at itertuples for a way to deal with this |
Should maybe http://stackoverflow.com/questions/18180763/set-difference-for-pandas/18187648#18187648 |
see #4617 from same OP. I'm not sure I'm 100% with how this was going to work, because the column approach no longer makes sense once we're talking about DataFrames, or at least is more ambiguous (aside from repeated columns):
which probably isn't what the user was expecting. Which may be something like (possibly user doesn't care about the index):
Ahem... Not sure what would be an efficient way to do that? @jreback imo set intersection stuff can relatively easily be done after the fact, e.g. with all/any and boolean indexing. I really don't think we need to add loads of kwargs. |
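As a rough illustration of the point above (not code from the thread), row-level "set" style selections can be built from the boolean frame that isin returns, using all/any along the row axis plus boolean indexing. The frame and the `values` dict below are made up for the example.

```python
import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 5, 4, 4]})
values = {"A": [2, 4], "B": [4, 5]}  # per-column candidate values

# Boolean frame: True where a cell's value is in the list for its column.
mask = df.isin(values)

rows_all = df[mask.all(axis=1)]     # rows where every column matched
rows_any = df[mask.any(axis=1)]     # rows where at least one column matched
rows_none = df[~mask.any(axis=1)]   # rows with no match at all
```

This keeps isin itself free of extra keyword arguments; the caller picks the reduction afterwards.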
@hayd @TomAugspurger for 0.13? |
Definitely. |
@hayd @TomAugspurger what is left on this? |
To actually write this up... I think it's kind of important for 0.13, as otherwise isin is a little confusing (and will break code if we change the API later).... Thinking about it again, do we actually want to use eq (again, there's a problem with dupes here...):
Worst case we should NotImplement it with a mention of eq... |
@hayd Are you working on this one? I can give it a shot tonight / tomorrow if you're wanting to focus on the SQLAlchemy stuff. Let's collect some thoughts on what should happen when another DataFrame is passed as values.
In [20]: df1
Out[20]:
A B
0 1 2
1 2 NaN
2 3 4
3 4 4
In [21]: df2
Out[21]:
A B
0 0 2
1 2 NaN
2 12 4
3 4 5
# Expected: df1.isin(df2.to_dict(outtype='list')) does not work
# df1.eq(df2.reindex_like(df1)) does work
Out[23]:
A B
0 False True
1 True False
2 False True
3 True False
# df1 same as before
In [35]: df2 = pd.DataFrame({'A': [0, 2, 12, 4], 'C': [2, np.nan, 4, 5]}, index=[0, 1, 3, 4])
In [36]: df2
Out[36]:
A C
0 0 2
1 2 NaN
3 12 4
4 4 5
# Intersection of columns is A, intersection of indices is [0, 1, 3]
# df1.isin(df2.to_dict(outtype='list')) does *NOT* work since it matches (A, 3), it ignores the index.
# Andy's df1.eq(df2.reindex_like(df1)) does work here.
In [39]: df1.isin(df2.to_dict(outtype='list'))
Out[39]:
A B
0 False False
1 True False
2 False False
3 False False
In [74]: df1
Out[74]:
A B
0 1 2
1 2 NaN
2 3 4
3 4 4
In [75]: df2
Out[75]:
A A
0 0 1
1 2 4
3 2 NaN
4 4 5
# Expected if we return True wherever there's a True
A B
0 True False # from the second A column
1 True False # from the first A column
2 False False
3 False False
|
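A minimal sketch of the second case above (df2 with columns A and C and index [0, 1, 3, 4]), contrasting the value-only dict approach with the label-aware eq/reindex_like approach. It uses `orient="list"`, which is the current keyword name for what the thread spells `outtype='list'`; the variable names `by_value` and `by_label` are just for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, np.nan, 4, 4]})
df2 = pd.DataFrame(
    {"A": [0, 2, 12, 4], "C": [2, np.nan, 4, 5]},
    index=[0, 1, 3, 4],
)

# Value-only comparison per column: df2's index labels are ignored,
# so (A, 3) matches the 4 that df2 holds under index label 4.
by_value = df1.isin(df2.to_dict(orient="list"))

# Label-aware comparison: align df2 to df1's index and columns first,
# then compare element-wise. Cells missing from df2 become NaN and
# therefore compare as False.
by_label = df1.eq(df2.reindex_like(df1))
```

Here `by_value["A"]` is True at rows 1 and 3, while `by_label["A"]` is True only at row 1, which is the difference the comments above are describing.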
Oh we definitely need to get this in for .13. I think DataFrames and Series should behave similarly with respect to labels. Currently, if
In [150]: df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, np.nan, 4, 4]}, index=['a','b','c','d'])
In [151]: s = pd.Series([1, 3, 11, 12], index=['a','b','c','d'])
In [152]: df.isin(s)
Out[152]:
A B
a True False
b False False
c True False
d False False
So (A, c) matches on the 3 from the Series, despite the index of that 3 being What's everyone's thoughts on respecting index/column labels? I'd say that (A, c) should be False here since the labels don't match. |
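To make the two possible semantics concrete, here is a small sketch using the same df and s. Passing a plain list forces a value-only match, while `eq` with `axis=0` aligns the Series on the frame's index. The names `ignore_labels` and `respect_labels` are only for this example, and this is one way to express each behavior rather than the agreed design.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [2, np.nan, 4, 4]},
    index=["a", "b", "c", "d"],
)
s = pd.Series([1, 3, 11, 12], index=["a", "b", "c", "d"])

# Value-only: a cell is True if its value appears anywhere in s.
ignore_labels = df.isin(s.tolist())

# Label-aware: each row of df is compared with the element of s
# that carries the same index label.
respect_labels = df.eq(s, axis=0)
```

With value-only matching (A, c) is True; with label-aware matching it is False, because the element of s at label c is 11.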
@TomAugspurger there was a bug in comparing duplicate frames (so your above comparison will work as you had it, or raise if you don't reindex_like), but it won't infinitely recurse (which generally is a bad thing :). Will merge in a few.
|
@TomAugspurger if you need to test if the columns are duplicated, use |
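The snippet this comment pointed to is not preserved here. As an assumption about what such a check can look like (not necessarily what the commenter suggested), pandas indexes expose `is_unique` and `duplicated()`:

```python
import pandas as pd

# Frame with a repeated column label, similar to the A/A example above.
df = pd.DataFrame([[0, 1], [2, 4]], columns=["A", "A"])

# True when at least one column label occurs more than once.
has_dupes = not df.columns.is_unique

# The labels that are repeated, each listed once.
dupe_labels = df.columns[df.columns.duplicated()].unique().tolist()
```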
@TomAugspurger that bug fix merged in.... |
One unfortunate part of this is the NaN handling... In your first example:
But actually that's ok (or perhaps a separate issue) since this is the behavior of Series isin... |
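The example this comment referred to is not preserved. As a general illustration of how NaN behaves under the two approaches discussed in this thread (not necessarily the exact case the commenter had in mind), Series.isin treats NaN as a matchable value, whereas element-wise eq never reports NaN as equal:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, np.nan, 4, 4])

# isin: a NaN in the values list matches NaN entries in the Series.
via_isin = s.isin([np.nan])

# eq: NaN compared with NaN is False, following IEEE float semantics.
via_eq = s.eq(np.nan)
```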
From this SO question.
The user had two dataframes and wanted to use the second as the values argument to isin. We'd need to think about how to handle the other index. In this case the user only cared about the columns, not the index labels.
Previous issues/PRs: #4258 and #4211