ENH/API: Accept DataFrame for isin #5199

TomAugspurger · 2013-10-12T22:26:09Z

API: Series should respect index labels. This would be incompatible, but there hasn't been any official release since isin was merged.

Will close #4421. Still a WIP for now. Needs more testing.

Thoughts on adding an argument to ignore the index? i.e.

In [150]: df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, np.nan, 4, 4]}, index=['a','b','c','d'])

In [151]: s = pd.Series([1, 3, 11, 12], index=['a','b','c','d'])

In [152]: df.isin(s)
Out[152]: 
       A      B
a   True  False
b  False  False
c  False  False
d  False  False

In [152]: df.isin(s, ignore_index=True)
Out[152]: 
       A      B
a   True  False
b  False  False
c   True  False
d  False  False

I would say there's no need for an ignore_index since you could just do df.isin(s.values).

jreback · 2013-10-12T22:50:27Z

pandas/core/frame.py

+                                 " a duplicate axis.")
+            return self.eq(values.reindex_like(self), axis='index')
+        elif isinstance(values, DataFrame):
+            if not values.columns.is_unique and values.index.is_unique:


you should prob disallow dups on both self and values....

I'm going back and forth on allowing dupes on self.

This seems well defined:

# Dupes on just self's index. In [21]: df = pd.DataFrame({'A': [1, 1, 0, 0], 'B': [2, np.nan, 4, 4]}, index=['a','a','c','d']) In [22]: df Out[22]: A B a 1 2 a 1 NaN c 0 4 d 0 4 In [26]: other = pd.Series([1, 0, -1, -1], index=['a','b','c','d']) In [27]: df.isin(other) # Just the 1 matches. df.ix['a'] is True for both a's. Out[27]: A B a True False a True False c False False d False False

What wouldn't be well defined if other's index was instead ['a', 'a', 'c', 'd']. Then a would be getting True from other's first a and False from the second. But this case will already raise a ValueError since values has dupes.

jreback · 2013-10-12T22:51:12Z

maybe add/modify the example in v0.13.0 and/or docs? (to show using df), which IMHO will prob be more of a usecase than a dict

jreback · 2013-10-12T23:55:10Z

looks good....@hayd ?

hayd · 2013-10-14T21:20:53Z

pandas/core/frame.py

+            return self.eq(values.reindex_like(self), axis='index')
+        elif isinstance(values, DataFrame):
+            if not (values.columns.is_unique and values.index.is_unique):
+                raise ValueError("ValueError: cannot compute isin from with"


I think this should be a NotImplementedError ? (since I think it makes sense with duplicates) :)

Do you mean NotImplementedError for duplicates on self and ValueError for duplicates on values? From an older diff: #5199 (comment)

Sorry, my last comment was wrong. You're obviously talking about values for both. What do you think about

In [7]: df Out[7]: A B 0 1 1 1 0 1 2 1 0 3 0 0 In [8]: other Out[8]: A A 0 1 1 1 1 0 2 0 0 in [9]: df.isin(other)

(A, 0) is clearly True and (A, 2) clearly False, but I don't see any way to make sense of (A, 1).

hayd · 2013-10-14T22:00:34Z

Aside from that little thing, this looks good!

tbh I think ignore_index may be useful kwarg for some, but we could push it back to 0.14.

jtratner · 2013-10-14T22:03:19Z

pandas/tests/test_frame.py

+        expected = DataFrame(False, index=df1.index, columns=df1.columns)
+        expected['B'].loc[[0, 2]] = True
+        result = df1.isin(df2)
+        assert_frame_equal(result, expected)


can you test this with a MultiIndex that's not monotonically sorted (or make an error on that (property is is_monotonic)? (and, preferably) has more than 2 levels, and then test comparison between MI and regular Index? (if that's an error, then great).

TomAugspurger · 2013-10-15T21:18:14Z

@hayd sorry, my last push squashed the discussion about isin with duplicate indicies/columns on values here: #5199 (comment)

Sorry, my last comment was wrong. You're obviously talking about values for both. What do you think about
In [7]: df
Out[7]: 
   A  B
0  1  1
1  0  1
2  1  0
3  0  0

In [8]: other
Out[8]: 
   A  A
0  1  1
1  1  0
2  0  0

in [9]: df.isin(other)
(A, 0) is clearly True and (A, 2) clearly False, but I don't see any way to make sense of (A, 1).

since the first A in other (1) says False, but the second A in other (0) says False

Once we get this figured out I'll rebase / squash and it will be good to go.

hayd · 2013-10-16T05:23:32Z

@TomAugspurger ah ha, you're right that makes no sense, I was thinking the other direction:

other.isin(df)

which would (in this case) make sense, since df has unique column names... hence the NotImplemented, as potentially be fleshed out for this behaviour in the future/someday (!)

jorisvandenbossche · 2013-10-16T12:15:11Z

pandas/core/frame.py

+            match. If values is a Series, that's the index. If values is a
+            dictionary of column names, that's the columm names. If values
+            is a DataFrame, that's both.
+        iloc : boolean, if passing other as a dict as values, describe columns


Can you put the explanation on a second indented line to follow the numpy docstring standard?

TomAugspurger · 2013-10-16T13:20:17Z

Failing on 2.6

cannot import name hashtable
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pandas/__init__.py", line 9, in <module>
    from . import hashtable, tslib, lib
ImportError: cannot import name hash table

Is this a known problem or did I break something in the build script?

jreback · 2013-10-16T13:22:51Z

just have it rebuild again

git commit -C HEAD --amend and force push....

travis hiccup

TomAugspurger · 2013-10-16T14:02:00Z

@hayd I think you already had that one sorted out:

In [8]: df
Out[8]: 
   A  B
0  1  1
1  0  1
2  1  0
3  0  0

In [9]: other
Out[9]: 
   A  A
0  1  1
1  1  0
2  0  0

In [10]: other.isin(df)
Out[10]: 
       A      A
0   True   True
1  False   True
2  False  False

Good to go on my end: https://travis-ci.org/TomAugspurger/pandas/builds/12617771

hayd · 2013-10-16T17:11:47Z

Ah ha, excellent! Then this looks great :), is there/can we add a test for that?

I guess @jorisvandenbossche's point was that the arg description of values is confusing (mainly as there are several cases for it)! To address this one option is to add a few examples into the docstring (copy some of the ones already written in the docs :) ), comparing to list, series/dict, dataframe.

jorisvandenbossche · 2013-10-16T17:40:27Z

@hayd No, it was actually only from an aesthetical point :-) But after reading it, you are right that is is not that clear. Maybe the options can be listed in bullets.
And some examples are always good! I am much in favor of more examples in docstrings

TomAugspurger · 2013-10-16T17:40:45Z

@hayd added examples in doctoring and a test for dupes on self.

jreback · 2013-10-16T17:41:04Z

pandas/core/frame.py

+            which must match. If `values` is a DataFrame,
+            then both the index and column labels must match.
+        iloc : booleans
+            if passing other as a dict as values, describe columns


what is the purpose if this? why wouldn't you always do it this way?

I think this is for if the keys of the values dict are column locations instead of labels:

In [46]: df = DataFrame({'A': ['a', 'b', 'c'], 'B': ['a', 'e', 'f']}) In [47]: d = {0: ['a']} In [48]: df.isin(d, iloc=True) Out[48]: A B 0 True False 1 False False 2 False False

we don't do this anywhere else, this is confusing, and IMHO is not very useful, I think you should take it out @hayd?

I put it in, so I'm biased. The difference like loc and iloc, rather than ix (if integer col names could be ambiguous, so just force user to be explicit). I still think the behaviour is useful, but if there is a better argname...

since you can now pass a frame (and determine unambiguously whether you NEED to use iloc's (e.g. a non-unique self or other index), in what situation would you actually need / require this? (I think it would have to by definition be a dict with integer keys that you want to interpret as locations rather than labels).

Yes, in case of frame may not make sense to use iloc=True, but like you say I think it still does with a dict (and possibly Series)... default is to use labels.

jorisvandenbossche · 2013-10-16T20:51:36Z

@TomAugspurger Short, clear but illustrative examples. Nice!

jorisvandenbossche · 2013-10-16T20:59:41Z

pandas/core/frame.py

+        2   True   True
+
+        >>> df = DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
+        >>> # values is a dataframe


and maybe you can also repeat here what is said in parameter explanation (then it is clear on what the user should pay attention), something like

When ``values`` is a dataframe, values and index and column labels must match:

jreback · 2013-10-16T21:52:25Z

@hayd ok just wanted to be sure (and since the default is the use labels this is sort of innocuous)

TomAugspurger · 2013-10-17T00:51:50Z

OK so we're keeping the API the same? The iloc kwarg stays in?
@jorisvandenbossche I've fixed those typos/formatting/suggestions. Thanks.

jreback · 2013-10-17T01:06:47Z

@hayd I reviewed this again
this iloc arg needs to go; don't have any tests for it in any event
and it makes isin incompatible with series.isin
I don't see a compelling case for it at all

hayd · 2013-10-17T05:35:05Z

@jreback there is a test for it: test_isin_dict

hayd · 2013-10-17T05:40:32Z

You're right it shouldn't be called iloc, as it's not clear it's to do with column location (rather than index). I don't see how this is incompatible with Series isin and thought it could be useful....

jorisvandenbossche · 2013-10-17T07:15:22Z

pandas/core/frame.py

+
+        Examples
+        --------
+        When `values` is a list:


Small remark, if you want the values to show as 'code', you have to use double backquotes. With single backquotes as here it will appear as 'italic'.

Do these rules still apply in the examples section? https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#common-rest-concepts

Use italics, bold, and monospace if needed in any explanations (but not for variable names and doctest code or multi-line code). Variable, module, function, and class names should be written between single back-ticks (numpy).

Ah, yes, you are right that they advise to use single back-ticks to reference variables. However, they will still appear as italic and not as monospaced code. But of course, then it is a choice which of the two you want. And it seems that the numpy docstring standard has chosen to use variable for variables and `int` for eg to specify the python type.
Although I think I would prefer the variable as code, if we follow the numpy docstring standard, you can leave it like this :-)

TomAugspurger · 2013-10-17T12:17:16Z

@hayd so in the future, I think an ignore index argument would achieve the same thing? Something like by_loc Would be useful for dicts, frames, and series. Ignore labels just match on location. I wouldn't be able to get to that this week, so maybe remove the iloc arg for now, to be reimplemented in q.14.

jreback · 2013-10-17T13:10:40Z

@hayd @TomAugspurger I like ignore_index rather than iloc....

prob not going to cut RC for a few days and you can change even after that...so lmk

API: Series should respect index labels. DOC: Update what's new to use a DataFrame API: Remove iloc argument to isin.

hayd · 2013-10-17T19:59:53Z

i thought that ignore index is slightly different to iloc:

In [1]: df = pd.DataFrame([['a', 'b'], ['c', 'a']], columns=['A', 'B'])

In [2]: df.isin(pd.Series(['a']))
Out[2]: 
       A      B
0   True  False
1  False  False

In [3]: df.isin(['a']) # this would be ignore_index=True of [2]
Out[3]: 
       A      B
0   True  False
1  False   True

Whereas iloc is the position of column:

In [4]: df.isin({0: ['a']})
Out[4]: 
       A      B
0  False  False
1  False  False

In [5]: df.isin({0: ['a']}, iloc=True)  # 0th columns
Out[5]: 
       A      B
0   True  False
1  False  False

I think iloc does not work with DataFrames or with Series, so if True and one of those is passed it should raise. Could cut for now I guess... :(

TomAugspurger · 2013-10-17T20:14:39Z

ignore_index would need an axis argument for which axis (or axes for DataFrame) you'd like to ignore. This is kinda-sorta like ignoring the keys of the dict and going with the location (but still slightly-different than iloc like you say).

In [14]: df = pd.DataFrame([['a', 'b'], ['c', 'a']], columns=['A', 'B'])

In [17]: df.isin(pd.Series(['a', 'c'], index=[0, 2]))
Out[17]: 
       A      B
0   True  False
1  False  False

In [18]: df.isin(pd.Series(['a', 'c'], index=[0, 2]), ignore_index=True)
Out[18]: 
       A      B
0  True  False
1  True  False  #  ignore that 1 != 2

In [33]: other = pd.DataFrame({'A': ['a', 'c'], 'C': ['b', 'd']}, index=[0, 2])


In [34]: other
Out[34]: 
   A  C
0  a  b
2  c  d

In [35]: df.isin(other)
Out[35]: 
       A      B
0   True  False
1  False  False

In [35]: df.isin(other, ignore_index=0)  # ignore the index
Out[35]: 
       A      B
0   True  False  # b still false since B != C
1   True  False  # now A,2 is true

In [35]: df.isin(other, ignore_index=1)  # ignore cols; like ignoring keys for dict?
Out[35]: 
       A      B
0   True  True
1  False  False

In [35]: df.isin(other, ignore_index='both')  # ignore both; 
Out[35]: 
       A      B
0   True  True
1   True  False

I think conceptually the idea of matching based on location rather than label is similar enough between dicts and indexes/columns that they can share the kwarg.

jreback · 2013-10-17T20:23:03Z

I still don't see a nice use case for iloc. Since its much harder to take out a parameter after its release, and if IIRC DataFrame.isin hasn't been released yet (was put in right after 0.13 started out?), let's drop iloc.

Unless I am missing something....

TomAugspurger · 2013-10-17T20:29:43Z

I think Andy put it in seconds after .12 was released :)

I just pushed a version without iloc and with the iloc tests removed. I'd be more comfortable going with this for now, and adding it in later. (But my opinion may not count for much since I still don't quite understand how the iloc code works in the first place :) )

@jorisvandenbossche I went with the double backticks. Looks better in the documentation.

jreback · 2013-10-17T20:43:23Z

ok....this looks good....

@hayd ok?

ENH/API: Accept DataFrame for isin

hayd · 2013-10-17T20:50:10Z

sulk! fine... Thanks for putting this all together, merged! :)

jorisvandenbossche · 2013-10-22T08:53:24Z

Note: there is still a use of the iloc keyword in the docs, which now causes errors. From indexing.rst:

You can also describe columns using integer location:

.. ipython:: python

   values = {0: ['a', 'b']}

   df.isin(values, iloc=True)

jreback reviewed Oct 12, 2013
View reviewed changes

hayd reviewed Oct 14, 2013
View reviewed changes

jtratner reviewed Oct 14, 2013
View reviewed changes

jorisvandenbossche reviewed Oct 16, 2013
View reviewed changes

jreback reviewed Oct 16, 2013
View reviewed changes

jorisvandenbossche reviewed Oct 16, 2013
View reviewed changes

jorisvandenbossche reviewed Oct 17, 2013
View reviewed changes

ENH/API: Accept DataFrame for isin

38b8fca

API: Series should respect index labels. DOC: Update what's new to use a DataFrame API: Remove iloc argument to isin.

hayd added a commit that referenced this pull request Oct 17, 2013

Merge pull request #5199 from TomAugspurger/isin_dfs

b538892

ENH/API: Accept DataFrame for isin

hayd merged commit b538892 into pandas-dev:master Oct 17, 2013

TomAugspurger mentioned this pull request Oct 22, 2013

DOC: Remove iloc example. #5296

Merged

TomAugspurger mentioned this pull request May 18, 2014

How should DataFrame.isin() handle DataFrame input? #7158

Closed

TomAugspurger deleted the isin_dfs branch April 5, 2017 02:05

Uh oh!

ENH/API: Accept DataFrame for isin #5199

ENH/API: Accept DataFrame for isin #5199

Uh oh!

Conversation

TomAugspurger commented Oct 12, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Oct 12, 2013

Uh oh!

jreback commented Oct 12, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hayd commented Oct 14, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomAugspurger commented Oct 15, 2013

Uh oh!

hayd commented Oct 16, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomAugspurger commented Oct 16, 2013

Uh oh!

jreback commented Oct 16, 2013

Uh oh!

TomAugspurger commented Oct 16, 2013

Uh oh!

hayd commented Oct 16, 2013

Uh oh!

jorisvandenbossche commented Oct 16, 2013

Uh oh!

TomAugspurger commented Oct 16, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche commented Oct 16, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Oct 16, 2013

Uh oh!

TomAugspurger commented Oct 17, 2013

Uh oh!

jreback commented Oct 17, 2013

Uh oh!

hayd commented Oct 17, 2013

Uh oh!

hayd commented Oct 17, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomAugspurger commented Oct 17, 2013

Uh oh!

jreback commented Oct 17, 2013

Uh oh!

hayd commented Oct 17, 2013