-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH/API: Accept DataFrame for isin #5199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
" a duplicate axis.") | ||
return self.eq(values.reindex_like(self), axis='index') | ||
elif isinstance(values, DataFrame): | ||
if not values.columns.is_unique and values.index.is_unique: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should prob disallow dups on both self and values....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going back and forth on allowing dupes on self
.
This seems well defined:
# Dupes on just self's index.
In [21]: df = pd.DataFrame({'A': [1, 1, 0, 0], 'B': [2, np.nan, 4, 4]}, index=['a','a','c','d'])
In [22]: df
Out[22]:
A B
a 1 2
a 1 NaN
c 0 4
d 0 4
In [26]: other = pd.Series([1, 0, -1, -1], index=['a','b','c','d'])
In [27]: df.isin(other) # Just the 1 matches. df.ix['a'] is True for both a's.
Out[27]:
A B
a True False
a True False
c False False
d False False
What wouldn't be well defined if other
's index was instead ['a', 'a', 'c', 'd']
. Then a
would be getting True from other
's first a
and False from the second. But this case will already raise a ValueError since values has dupes.
maybe add/modify the example in v0.13.0 and/or docs? (to show using df), which IMHO will prob be more of a usecase than a dict |
looks good....@hayd ? |
return self.eq(values.reindex_like(self), axis='index') | ||
elif isinstance(values, DataFrame): | ||
if not (values.columns.is_unique and values.index.is_unique): | ||
raise ValueError("ValueError: cannot compute isin from with" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a NotImplementedError ? (since I think it makes sense with duplicates) :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean NotImplementedError for duplicates on self
and ValueError for duplicates on values
? From an older diff: #5199 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my last comment was wrong. You're obviously talking about values
for both. What do you think about
In [7]: df
Out[7]:
A B
0 1 1
1 0 1
2 1 0
3 0 0
In [8]: other
Out[8]:
A A
0 1 1
1 1 0
2 0 0
in [9]: df.isin(other)
(A, 0)
is clearly True and (A, 2)
clearly False, but I don't see any way to make sense of (A, 1)
.
Aside from that little thing, this looks good! tbh I think ignore_index may be useful kwarg for some, but we could push it back to 0.14. |
expected = DataFrame(False, index=df1.index, columns=df1.columns) | ||
expected['B'].loc[[0, 2]] = True | ||
result = df1.isin(df2) | ||
assert_frame_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you test this with a MultiIndex that's not monotonically sorted (or make an error on that (property is is_monotonic
)? (and, preferably) has more than 2 levels, and then test comparison between MI and regular Index? (if that's an error, then great).
@hayd sorry, my last push squashed the discussion about isin with duplicate indicies/columns on
since the first Once we get this figured out I'll rebase / squash and it will be good to go. |
@TomAugspurger ah ha, you're right that makes no sense, I was thinking the other direction:
which would (in this case) make sense, since df has unique column names... hence the NotImplemented, as potentially be fleshed out for this behaviour in the future/someday (!) |
match. If values is a Series, that's the index. If values is a | ||
dictionary of column names, that's the columm names. If values | ||
is a DataFrame, that's both. | ||
iloc : boolean, if passing other as a dict as values, describe columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put the explanation on a second indented line to follow the numpy docstring standard?
Is this a known problem or did I break something in the build script? |
just have it rebuild again
travis hiccup |
@hayd I think you already had that one sorted out: In [8]: df
Out[8]:
A B
0 1 1
1 0 1
2 1 0
3 0 0
In [9]: other
Out[9]:
A A
0 1 1
1 1 0
2 0 0
In [10]: other.isin(df)
Out[10]:
A A
0 True True
1 False True
2 False False Good to go on my end: https://travis-ci.org/TomAugspurger/pandas/builds/12617771 |
Ah ha, excellent! Then this looks great :), is there/can we add a test for that? I guess @jorisvandenbossche's point was that the arg description of values is confusing (mainly as there are several cases for it)! To address this one option is to add a few examples into the docstring (copy some of the ones already written in the docs :) ), comparing to list, series/dict, dataframe. |
@hayd No, it was actually only from an aesthetical point :-) But after reading it, you are right that is is not that clear. Maybe the options can be listed in bullets. |
@hayd added examples in doctoring and a test for dupes on self. |
which must match. If `values` is a DataFrame, | ||
then both the index and column labels must match. | ||
iloc : booleans | ||
if passing other as a dict as values, describe columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the purpose if this? why wouldn't you always do it this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is for if the keys of the values dict are column locations instead of labels:
In [46]: df = DataFrame({'A': ['a', 'b', 'c'], 'B': ['a', 'e', 'f']})
In [47]: d = {0: ['a']}
In [48]: df.isin(d, iloc=True)
Out[48]:
A B
0 True False
1 False False
2 False False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't do this anywhere else, this is confusing, and IMHO is not very useful, I think you should take it out @hayd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put it in, so I'm biased. The difference like loc and iloc, rather than ix (if integer col names could be ambiguous, so just force user to be explicit). I still think the behaviour is useful, but if there is a better argname...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you can now pass a frame (and determine unambiguously whether you NEED to use iloc's (e.g. a non-unique self or other index), in what situation would you actually need / require this? (I think it would have to by definition be a dict with integer keys that you want to interpret as locations rather than labels).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in case of frame may not make sense to use iloc=True, but like you say I think it still does with a dict (and possibly Series)... default is to use labels.
@TomAugspurger Short, clear but illustrative examples. Nice! |
2 True True | ||
|
||
>>> df = DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']}) | ||
>>> # values is a dataframe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and maybe you can also repeat here what is said in parameter explanation (then it is clear on what the user should pay attention), something like
When ``values`` is a dataframe, values and index and column labels must match:
@hayd ok just wanted to be sure (and since the default is the use labels this is sort of innocuous) |
OK so we're keeping the API the same? The |
@hayd I reviewed this again |
@jreback there is a test for it: |
You're right it shouldn't be called iloc, as it's not clear it's to do with column location (rather than index). I don't see how this is incompatible with Series isin and thought it could be useful.... |
|
||
Examples | ||
-------- | ||
When `values` is a list: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small remark, if you want the values
to show as 'code', you have to use double backquotes. With single backquotes as here it will appear as 'italic'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these rules still apply in the examples section? https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#common-rest-concepts
Use italics, bold, and monospace if needed in any explanations (but not for variable names and doctest code or multi-line code). Variable, module, function, and class names should be written between single back-ticks (
numpy
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, you are right that they advise to use single back-ticks to reference variables. However, they will still appear as italic and not as monospaced code. But of course, then it is a choice which of the two you want. And it seems that the numpy docstring standard has chosen to use variable
for variables and `int`
for eg to specify the python type.
Although I think I would prefer the variable as code, if we follow the numpy docstring standard, you can leave it like this :-)
@hayd so in the future, I think an ignore index argument would achieve the same thing? Something like |
@hayd @TomAugspurger I like prob not going to cut RC for a few days and you can change even after that...so lmk |
API: Series should respect index labels. DOC: Update what's new to use a DataFrame API: Remove iloc argument to isin.
i thought that ignore index is slightly different to iloc:
Whereas iloc is the position of column:
I think iloc does not work with DataFrames or with Series, so if True and one of those is passed it should raise. Could cut for now I guess... :( |
In [14]: df = pd.DataFrame([['a', 'b'], ['c', 'a']], columns=['A', 'B'])
In [17]: df.isin(pd.Series(['a', 'c'], index=[0, 2]))
Out[17]:
A B
0 True False
1 False False
In [18]: df.isin(pd.Series(['a', 'c'], index=[0, 2]), ignore_index=True)
Out[18]:
A B
0 True False
1 True False # ignore that 1 != 2
In [33]: other = pd.DataFrame({'A': ['a', 'c'], 'C': ['b', 'd']}, index=[0, 2])
In [34]: other
Out[34]:
A C
0 a b
2 c d
In [35]: df.isin(other)
Out[35]:
A B
0 True False
1 False False
In [35]: df.isin(other, ignore_index=0) # ignore the index
Out[35]:
A B
0 True False # b still false since B != C
1 True False # now A,2 is true
In [35]: df.isin(other, ignore_index=1) # ignore cols; like ignoring keys for dict?
Out[35]:
A B
0 True True
1 False False
In [35]: df.isin(other, ignore_index='both') # ignore both;
Out[35]:
A B
0 True True
1 True False I think conceptually the idea of matching based on location rather than label is similar enough between dicts and indexes/columns that they can share the kwarg. |
I still don't see a nice use case for Unless I am missing something.... |
I think Andy put it in seconds after .12 was released :) I just pushed a version without @jorisvandenbossche I went with the double backticks. Looks better in the documentation. |
ok....this looks good.... @hayd ok? |
ENH/API: Accept DataFrame for isin
sulk! fine... Thanks for putting this all together, merged! :) |
Note: there is still a use of the
|
API: Series should respect index labels. This would be incompatible, but there hasn't been any official release since
isin
was merged.Will close #4421. Still a WIP for now. Needs more testing.
Thoughts on adding an argument to ignore the index? i.e.
I would say there's no need for an
ignore_index
since you could just dodf.isin(s.values)
.