Skip to content

Now boolean operators work with NDFrames? #4633

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
cpcloud opened this issue Aug 22, 2013 · 22 comments · Fixed by #4657
Closed

Now boolean operators work with NDFrames? #4633

cpcloud opened this issue Aug 22, 2013 · 22 comments · Fixed by #4657
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Milestone

Comments

@cpcloud
Copy link
Member

cpcloud commented Aug 22, 2013

Awkward side effect of the new and improved Series probably, I typed something wrong and was shocked to not get the usual numpy ValueError barf:

In [57]: a, b = Series(rand(10) > 0.5), Series(rand(10) > 0.5)

In [58]: a
Out[58]:
0     True
1     True
2    False
3    False
4     True
5    False
6    False
7     True
8     True
9     True
dtype: bool

In [59]: b
Out[59]:
0     True
1     True
2     True
3     True
4    False
5     True
6    False
7     True
8     True
9     True
dtype: bool

In [60]: a and b # what?! no way!!
Out[60]:
0     True
1     True
2     True
3     True
4    False
5     True
6    False
7     True
8     True
9     True
dtype: bool

In [61]: a & b # whew
Out[61]:
0     True
1     True
2    False
3    False
4    False
5    False
6    False
7     True
8     True
9     True
dtype: bool
@jreback
Copy link
Contributor

jreback commented Aug 22, 2013

In [11]: dfa, dfb = DataFrame(randn(10,2)),DataFrame(randn(10,2))

In [13]: dfa
Out[13]: 
          0         1
0  1.032736 -0.981451
1  0.154600  1.458225
2  0.688022  0.793732
3 -1.171007 -0.643027
4 -0.559731 -0.019199
5  1.079594 -0.549702
6 -1.803250  1.319622
7  0.602654  0.550604
8 -0.908063 -1.304829
9  0.103192  1.542427

In [14]: dfb
Out[14]: 
          0         1
0 -0.445965 -1.055636
1  0.929273  0.160086
2  0.302718 -0.653348
3 -1.227953 -1.627101
4  1.832395 -0.036561
5  0.806457  1.730636
6  1.805333 -0.299774
7  0.283562 -0.263815
8  0.749068  0.589955
9  0.523333 -2.085325

In [15]: dfa and dfb
Out[15]: 
          0         1
0 -0.445965 -1.055636
1  0.929273  0.160086
2  0.302718 -0.653348
3 -1.227953 -1.627101
4  1.832395 -0.036561
5  0.806457  1.730636
6  1.805333 -0.299774
7  0.283562 -0.263815
8  0.749068  0.589955
9  0.523333 -2.085325

@hayd
Copy link
Contributor

hayd commented Aug 22, 2013

This seems like a good thing. Now truthy-ness of Series/DataFrames is consistent, empty Series are Falsey.

In [6]: s1 = pd.Series([1])

In [7]: s_ = pd.Series()

In [8]: s_ or s
Out[8]: 
0    1
dtype: int64

Not sure I can use this anywhere, but I'll try.

@miketkelly
Copy link

Consistency aside, this seems likely to lead to lots of subtle bugs when people use df[a or b] when they mean df[a | b].

@jtratner
Copy link
Contributor

Along the lines of @mtkni's comment : is it a good thing that Series behaves more like a Python container vs a numpy container?

@hayd
Copy link
Contributor

hayd commented Aug 23, 2013

Ew:

In [1]: bool(np.array([]))
Out[1]: False

In [2]: bool(np.array([1]))
Out[2]: True

In [3]: bool(np.array([1, 2]))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-5ba97928842c> in <module>()
----> 1 bool(np.array([1, 2]))

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

That said, you're probably right that and/or should raise. :(

@kjordahl
Copy link

You also get fun things like this:

>>> a = Series(rand(10))
>>> bool(a == 0)
True
>>> bool(a == 1)
True

I think it should raise as well. It's doing numpy-like comparisons and broadcasting, not throwing an exception can allow subtle bugs to go undetected.

Possibly related: #1069 (old but apparently reverted), #4576, #4537

@jreback
Copy link
Contributor

jreback commented Aug 23, 2013

this like this I think are easily caught

I had to add a couple of coercion method which can handle these cases

eg bool , int etc

@kjordahl
Copy link

What's the use case for allowing bool(s) to return True for any series with nonzero length? It has never been the case in previous releases, since Series inherited from numpy.ndarray. This is new behavior, and IMO it does not add anything useful, only a source of bugs.

@jreback
Copy link
Contributor

jreback commented Aug 23, 2013

oh I agree

however we have a couple of tests that would fail otherwise so need to fix those

iIIRC there was a couple of places this is relied upon (which is just wrong)

@cpcloud pls mark this and we can take a look

@kjordahl
Copy link

OK, thanks.

It looks like raising ValueError for truth tests on DataFrames (which had been introduced in PR #1073) was reverted by PR #3696. I would argue that should be restored as well.

@jreback
Copy link
Contributor

jreback commented Aug 23, 2013

yeh these should all be consistent
I'll take alook tomorrow

@jreback
Copy link
Contributor

jreback commented Aug 23, 2013

In [17]: s = Series(randn(4))

In [18]: df = DataFrame(randn(10,2))

In [19]: s_empty = Series()

In [20]: df_empty = DataFrame()

This is from 0.12

In [21]: bool(s)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [22]: bool(s_empty)
Out[22]: False

In [23]: bool(df)
Out[23]: True

In [24]: bool(df_empty)
Out[24]: False

master

In [5]: bool(s)
Out[5]: True

In [6]: bool(s_empty)
Out[6]: False

In [7]: bool(df)
Out[7]: True

In [8]: bool(df_empty)
Out[8]: False

These are the same in 0.12 and master, e.g. a 1-element Series is special cased
to evaluate to return the boolean of a single value

In [9]: bool(Series([True]))
Out[9]: True

In [10]: bool(Series([False]))
Out[10]: False

I am going to argue that 0.12 is inconsistent and master is correct

However, one could have a point of view that ALL of these should raise

any thoughts?

@jreback
Copy link
Contributor

jreback commented Aug 23, 2013

@wesm ?

@kjordahl
Copy link

pandas 0.11.0:

>>> bool(s)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

>>> bool(s_empty)
False

>>> bool(df)
ValueError: Cannot call bool() on DataFrame.

>>> bool(df_empty)
ValueError: Cannot call bool() on DataFrame.

To me, it looks like this is the right behavior (with the possible exception of bool(df_empty)).

@hayd
Copy link
Contributor

hayd commented Aug 23, 2013

I'm strongly of the opinion it should be all or nothing. The numpy behaviour is stupid.

@jreback
Copy link
Contributor

jreback commented Aug 23, 2013

Changing to this (for all NDFrame, and eliminating the special case from Series)

    def __nonzero__(self):
        raise ValueError("cannot use __nonzero__ in NDframe")
    __bool__ = __nonzero__

causes very few failures, (aside from the ones I am purposely testing), really just a couple of test cases that are not formed well

e.g. things like these, which are simply incorrect usages....will change the PR and eveyone can look

s1 = Series.....
s2 = Series....
assert s1 == s2
if s1:
    .....

@jreback
Copy link
Contributor

jreback commented Aug 23, 2013

this is a revert to #1073/#1069

now a call to nonzero raises TypeError ALWAYS

The following is the behavior

In [17]: s = Series(randn(4))

In [18]: df = DataFrame(randn(10,2))

In [19]: s_empty = Series()

In [20]: df_empty = DataFrame()

In [5]: bool(s)
Out[5]: TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

In [6]: bool(s_empty)
Out[6]: TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

In [7]: bool(df)
Out[7]: TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

In [8]: bool(df_empty)
Out[8]: TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

And prevents these fun ones (same for Series/Panel)

In [4]: df1 = DataFrame(np.ones((4,4)))

In [5]: df2 = DataFrame(np.zeros((4,4)))

In [6]: df1 and df2
TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

In [7]: df1 or df2
TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

In [8]: not df1
TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

In [9]: def f():
   ...:     if df1:
   ...:         print("this is cool")
   ...:         

In [10]: f()
TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

@jreback
Copy link
Contributor

jreback commented Aug 23, 2013

anyone think this should be a ValueError (like numpy)?

@jtratner
Copy link
Contributor

I think you should change the error message - the numpy error message is
clearer. (and most users aren't going to understand why bool(...) or if
... ends up with something about nonzero) Whether it should be
TypeError or ValueError: I think it makes sense to follow numpy's lead
here, if only for compatibility - no strong preference for me.

@jreback
Copy link
Contributor

jreback commented Aug 23, 2013

@jtratner
do you want to suggest something?

numpy

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

make the same? I also want to mention empty?

@jreback
Copy link
Contributor

jreback commented Aug 23, 2013

revised to be the same as numpy

@jreback
Copy link
Contributor

jreback commented Aug 24, 2013

see revised PR for updates on the error messages

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants