Skip to content

ENH/API: Accept DataFrame for isin #5199

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 17, 2013
Merged

Conversation

TomAugspurger
Copy link
Contributor

API: Series should respect index labels. This would be incompatible, but there hasn't been any official release since isin was merged.

Will close #4421. Still a WIP for now. Needs more testing.

Thoughts on adding an argument to ignore the index? i.e.

In [150]: df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, np.nan, 4, 4]}, index=['a','b','c','d'])

In [151]: s = pd.Series([1, 3, 11, 12], index=['a','b','c','d'])

In [152]: df.isin(s)
Out[152]: 
       A      B
a   True  False
b  False  False
c  False  False
d  False  False

In [152]: df.isin(s, ignore_index=True)
Out[152]: 
       A      B
a   True  False
b  False  False
c   True  False
d  False  False

I would say there's no need for an ignore_index since you could just do df.isin(s.values).

" a duplicate axis.")
return self.eq(values.reindex_like(self), axis='index')
elif isinstance(values, DataFrame):
if not values.columns.is_unique and values.index.is_unique:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should prob disallow dups on both self and values....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going back and forth on allowing dupes on self.

This seems well defined:

# Dupes on just self's index.
In [21]: df = pd.DataFrame({'A': [1, 1, 0, 0], 'B': [2, np.nan, 4, 4]}, index=['a','a','c','d'])

In [22]: df
Out[22]: 
   A   B
a  1   2
a  1 NaN
c  0   4
d  0   4

In [26]: other = pd.Series([1, 0, -1, -1], index=['a','b','c','d'])

In [27]: df.isin(other)  # Just the 1 matches. df.ix['a'] is True for both a's.
Out[27]: 
       A      B
a   True  False
a   True  False
c  False  False
d  False  False

What wouldn't be well defined if other's index was instead ['a', 'a', 'c', 'd']. Then a would be getting True from other's first a and False from the second. But this case will already raise a ValueError since values has dupes.

@jreback
Copy link
Contributor

jreback commented Oct 12, 2013

maybe add/modify the example in v0.13.0 and/or docs? (to show using df), which IMHO will prob be more of a usecase than a dict

@jreback
Copy link
Contributor

jreback commented Oct 12, 2013

looks good....@hayd ?

return self.eq(values.reindex_like(self), axis='index')
elif isinstance(values, DataFrame):
if not (values.columns.is_unique and values.index.is_unique):
raise ValueError("ValueError: cannot compute isin from with"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a NotImplementedError ? (since I think it makes sense with duplicates) :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean NotImplementedError for duplicates on self and ValueError for duplicates on values? From an older diff: #5199 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my last comment was wrong. You're obviously talking about values for both. What do you think about

In [7]: df
Out[7]: 
   A  B
0  1  1
1  0  1
2  1  0
3  0  0

In [8]: other
Out[8]: 
   A  A
0  1  1
1  1  0
2  0  0

in [9]: df.isin(other)

(A, 0) is clearly True and (A, 2) clearly False, but I don't see any way to make sense of (A, 1).

@hayd
Copy link
Contributor

hayd commented Oct 14, 2013

Aside from that little thing, this looks good!

tbh I think ignore_index may be useful kwarg for some, but we could push it back to 0.14.

expected = DataFrame(False, index=df1.index, columns=df1.columns)
expected['B'].loc[[0, 2]] = True
result = df1.isin(df2)
assert_frame_equal(result, expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you test this with a MultiIndex that's not monotonically sorted (or make an error on that (property is is_monotonic)? (and, preferably) has more than 2 levels, and then test comparison between MI and regular Index? (if that's an error, then great).

@TomAugspurger
Copy link
Contributor Author

@hayd sorry, my last push squashed the discussion about isin with duplicate indicies/columns on values here: #5199 (comment)

Sorry, my last comment was wrong. You're obviously talking about values for both. What do you think about

In [7]: df
Out[7]: 
   A  B
0  1  1
1  0  1
2  1  0
3  0  0

In [8]: other
Out[8]: 
   A  A
0  1  1
1  1  0
2  0  0

in [9]: df.isin(other)

(A, 0) is clearly True and (A, 2) clearly False, but I don't see any way to make sense of (A, 1).

since the first A in other (1) says False, but the second A in other (0) says False

Once we get this figured out I'll rebase / squash and it will be good to go.

@hayd
Copy link
Contributor

hayd commented Oct 16, 2013

@TomAugspurger ah ha, you're right that makes no sense, I was thinking the other direction:

other.isin(df)

which would (in this case) make sense, since df has unique column names... hence the NotImplemented, as potentially be fleshed out for this behaviour in the future/someday (!)

match. If values is a Series, that's the index. If values is a
dictionary of column names, that's the columm names. If values
is a DataFrame, that's both.
iloc : boolean, if passing other as a dict as values, describe columns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put the explanation on a second indented line to follow the numpy docstring standard?

@TomAugspurger
Copy link
Contributor Author

Failing on 2.6

cannot import name hashtable
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pandas/__init__.py", line 9, in <module>
    from . import hashtable, tslib, lib
ImportError: cannot import name hash table

Is this a known problem or did I break something in the build script?

@jreback
Copy link
Contributor

jreback commented Oct 16, 2013

just have it rebuild again

git commit -C HEAD --amend and force push....

travis hiccup

@TomAugspurger
Copy link
Contributor Author

@hayd I think you already had that one sorted out:

In [8]: df
Out[8]: 
   A  B
0  1  1
1  0  1
2  1  0
3  0  0

In [9]: other
Out[9]: 
   A  A
0  1  1
1  1  0
2  0  0

In [10]: other.isin(df)
Out[10]: 
       A      A
0   True   True
1  False   True
2  False  False

Good to go on my end: https://travis-ci.org/TomAugspurger/pandas/builds/12617771

@hayd
Copy link
Contributor

hayd commented Oct 16, 2013

Ah ha, excellent! Then this looks great :), is there/can we add a test for that?

I guess @jorisvandenbossche's point was that the arg description of values is confusing (mainly as there are several cases for it)! To address this one option is to add a few examples into the docstring (copy some of the ones already written in the docs :) ), comparing to list, series/dict, dataframe.

@jorisvandenbossche
Copy link
Member

@hayd No, it was actually only from an aesthetical point :-) But after reading it, you are right that is is not that clear. Maybe the options can be listed in bullets.
And some examples are always good! I am much in favor of more examples in docstrings

@TomAugspurger
Copy link
Contributor Author

@hayd added examples in doctoring and a test for dupes on self.

which must match. If `values` is a DataFrame,
then both the index and column labels must match.
iloc : booleans
if passing other as a dict as values, describe columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the purpose if this? why wouldn't you always do it this way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is for if the keys of the values dict are column locations instead of labels:

In [46]: df = DataFrame({'A': ['a', 'b', 'c'], 'B': ['a', 'e', 'f']})

In [47]: d = {0: ['a']}

In [48]: df.isin(d, iloc=True)
Out[48]: 
       A      B
0   True  False
1  False  False
2  False  False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't do this anywhere else, this is confusing, and IMHO is not very useful, I think you should take it out @hayd?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it in, so I'm biased. The difference like loc and iloc, rather than ix (if integer col names could be ambiguous, so just force user to be explicit). I still think the behaviour is useful, but if there is a better argname...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you can now pass a frame (and determine unambiguously whether you NEED to use iloc's (e.g. a non-unique self or other index), in what situation would you actually need / require this? (I think it would have to by definition be a dict with integer keys that you want to interpret as locations rather than labels).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in case of frame may not make sense to use iloc=True, but like you say I think it still does with a dict (and possibly Series)... default is to use labels.

@jorisvandenbossche
Copy link
Member

@TomAugspurger Short, clear but illustrative examples. Nice!

2 True True

>>> df = DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
>>> # values is a dataframe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and maybe you can also repeat here what is said in parameter explanation (then it is clear on what the user should pay attention), something like

When ``values`` is a dataframe, values and index and column labels must match:

@jreback
Copy link
Contributor

jreback commented Oct 16, 2013

@hayd ok just wanted to be sure (and since the default is the use labels this is sort of innocuous)

@TomAugspurger
Copy link
Contributor Author

OK so we're keeping the API the same? The iloc kwarg stays in?
@jorisvandenbossche I've fixed those typos/formatting/suggestions. Thanks.

@jreback
Copy link
Contributor

jreback commented Oct 17, 2013

@hayd I reviewed this again
this iloc arg needs to go; don't have any tests for it in any event
and it makes isin incompatible with series.isin
I don't see a compelling case for it at all

@hayd
Copy link
Contributor

hayd commented Oct 17, 2013

@jreback there is a test for it: test_isin_dict

@hayd
Copy link
Contributor

hayd commented Oct 17, 2013

You're right it shouldn't be called iloc, as it's not clear it's to do with column location (rather than index). I don't see how this is incompatible with Series isin and thought it could be useful....


Examples
--------
When `values` is a list:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small remark, if you want the values to show as 'code', you have to use double backquotes. With single backquotes as here it will appear as 'italic'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these rules still apply in the examples section? https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#common-rest-concepts

Use italics, bold, and monospace if needed in any explanations (but not for variable names and doctest code or multi-line code). Variable, module, function, and class names should be written between single back-ticks (numpy).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, you are right that they advise to use single back-ticks to reference variables. However, they will still appear as italic and not as monospaced code. But of course, then it is a choice which of the two you want. And it seems that the numpy docstring standard has chosen to use variable for variables and `int` for eg to specify the python type.
Although I think I would prefer the variable as code, if we follow the numpy docstring standard, you can leave it like this :-)

@TomAugspurger
Copy link
Contributor Author

@hayd so in the future, I think an ignore index argument would achieve the same thing? Something like by_loc Would be useful for dicts, frames, and series. Ignore labels just match on location. I wouldn't be able to get to that this week, so maybe remove the iloc arg for now, to be reimplemented in q.14.

@jreback
Copy link
Contributor

jreback commented Oct 17, 2013

@hayd @TomAugspurger I like ignore_index rather than iloc....

prob not going to cut RC for a few days and you can change even after that...so lmk

API: Series should respect index labels.

DOC: Update what's new to use a DataFrame

API: Remove iloc argument to isin.
@hayd
Copy link
Contributor

hayd commented Oct 17, 2013

i thought that ignore index is slightly different to iloc:

In [1]: df = pd.DataFrame([['a', 'b'], ['c', 'a']], columns=['A', 'B'])

In [2]: df.isin(pd.Series(['a']))
Out[2]: 
       A      B
0   True  False
1  False  False

In [3]: df.isin(['a']) # this would be ignore_index=True of [2]
Out[3]: 
       A      B
0   True  False
1  False   True

Whereas iloc is the position of column:

In [4]: df.isin({0: ['a']})
Out[4]: 
       A      B
0  False  False
1  False  False

In [5]: df.isin({0: ['a']}, iloc=True)  # 0th columns
Out[5]: 
       A      B
0   True  False
1  False  False

I think iloc does not work with DataFrames or with Series, so if True and one of those is passed it should raise. Could cut for now I guess... :(

@TomAugspurger
Copy link
Contributor Author

ignore_index would need an axis argument for which axis (or axes for DataFrame) you'd like to ignore. This is kinda-sorta like ignoring the keys of the dict and going with the location (but still slightly-different than iloc like you say).

In [14]: df = pd.DataFrame([['a', 'b'], ['c', 'a']], columns=['A', 'B'])

In [17]: df.isin(pd.Series(['a', 'c'], index=[0, 2]))
Out[17]: 
       A      B
0   True  False
1  False  False

In [18]: df.isin(pd.Series(['a', 'c'], index=[0, 2]), ignore_index=True)
Out[18]: 
       A      B
0  True  False
1  True  False  #  ignore that 1 != 2

In [33]: other = pd.DataFrame({'A': ['a', 'c'], 'C': ['b', 'd']}, index=[0, 2])


In [34]: other
Out[34]: 
   A  C
0  a  b
2  c  d

In [35]: df.isin(other)
Out[35]: 
       A      B
0   True  False
1  False  False

In [35]: df.isin(other, ignore_index=0)  # ignore the index
Out[35]: 
       A      B
0   True  False  # b still false since B != C
1   True  False  # now A,2 is true

In [35]: df.isin(other, ignore_index=1)  # ignore cols; like ignoring keys for dict?
Out[35]: 
       A      B
0   True  True
1  False  False

In [35]: df.isin(other, ignore_index='both')  # ignore both; 
Out[35]: 
       A      B
0   True  True
1   True  False

I think conceptually the idea of matching based on location rather than label is similar enough between dicts and indexes/columns that they can share the kwarg.

@jreback
Copy link
Contributor

jreback commented Oct 17, 2013

I still don't see a nice use case for iloc. Since its much harder to take out a parameter after its release, and if IIRC DataFrame.isin hasn't been released yet (was put in right after 0.13 started out?), let's drop iloc.

Unless I am missing something....

@TomAugspurger
Copy link
Contributor Author

I think Andy put it in seconds after .12 was released :)

I just pushed a version without iloc and with the iloc tests removed. I'd be more comfortable going with this for now, and adding it in later. (But my opinion may not count for much since I still don't quite understand how the iloc code works in the first place :) )

@jorisvandenbossche I went with the double backticks. Looks better in the documentation.

@jreback
Copy link
Contributor

jreback commented Oct 17, 2013

ok....this looks good....

@hayd ok?

hayd added a commit that referenced this pull request Oct 17, 2013
ENH/API: Accept DataFrame for isin
@hayd hayd merged commit b538892 into pandas-dev:master Oct 17, 2013
@hayd
Copy link
Contributor

hayd commented Oct 17, 2013

sulk! fine... Thanks for putting this all together, merged! :)

@jorisvandenbossche
Copy link
Member

Note: there is still a use of the iloc keyword in the docs, which now causes errors. From indexing.rst:

You can also describe columns using integer location:

.. ipython:: python

   values = {0: ['a', 'b']}

   df.isin(values, iloc=True)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH/API: DataFrame's isin should accept DataFrames
6 participants