API: allow single element boolean Series to improve on numpy behavior( related GH4657) #4738

jreback · 2013-09-03T22:01:37Z

related #4657

jreback · 2013-09-03T22:15:20Z

jtratner · 2013-09-03T22:58:46Z

Looks good to me...

hayd · 2013-09-03T22:59:10Z

Can we ping out google groups to see what people think?

I still don't see the logic to this (other than numpy doing it), it seems to me there is always a non-ambiguous way to write things...

jseabold · 2013-09-03T23:16:51Z

Practicality beats purity for me here. It saves typing and seems quite logical to me. At least, I don't see any ambiguity, especially if the empty Series/DataFrame case is explicitly handled to raise. Incidentally, the behavior for an empty objects differs from numpy (and Python) but I agree with the change.

hayd · 2013-09-03T23:23:28Z

To me this seems impractical and impure! :(

I'm yet to see an example where it saves typing (and which can't be done without item), and I don't see why the one item case should be special (if we're breaking numpy behaviour anyway....).

jseabold · 2013-09-03T23:28:44Z

Every example I showed involved more typing?

hayd · 2013-09-03T23:32:53Z

I thought there was always a way it could be written to not be, e.g. the example with pd.isnull(df.ix[]) could be tweaked so the ix returned a value (not a Series)...

jseabold · 2013-09-03T23:36:35Z

ix will only return a value if I don't use a boolean index, but this involves more code to get the single index value -- either a (horribly ugly) call to np.where, some pandas-fu that I didn't try yet, or setting an index on the frame (which isn't always possible).

jseabold · 2013-09-04T00:44:08Z

Please do bring it up on the ML though. Input from others would be helpful.

jtratner · 2013-09-04T00:48:19Z

I didn't realize how this handles empty Series. Should definitely be the same as Python/numpy:

>>> bool(Series([]))
False
>>> bool(np.array([])
False
>>> bool(np.array([[]])
False
>>> bool(DataFrame([[]]))
False

jseabold · 2013-09-04T00:58:08Z

Ha, now there's some ambiguity if it's true there wasn't before. I thought this was an improvement over numpy and the warning is clear that maybe you should use 'empty'.

Anway, my vote is

Empty series raises. Maybe you screwed up your index? What is the 'correct' output of this?
```
if pd.isnull(pd.DataFrame([])):
    print 'this dataframe has no missing values?'
```
Ambiguous?
1 element is fine. You know what you're doing, carry on. Also .all() == .any() in this case, so it's not ambiguous.
Length > 1 raises . This is ambiguous. Ask for all, any, or empty. Maybe you screwed up your index?

You can read old numpy threads about this if no one wants to discuss on the ML, but I think these issues may be of general interest.

jreback · 2013-09-04T01:04:09Z

current behavior (with this PR) is to raise on empty Series/DataFrame (note that this was not true in 0.12,
but actually WAS true IIRC before 0.12)

I agree that we should absolutely raise on an empty object

if not df:
   print 'ok'

should NEVER work (and if we allowed it to work for an empty frame it would)

correct is

if not df.empty:
     print 'ok'

jreback · 2013-09-04T01:04:49Z

@jseabold proposal is what this PR does

jseabold · 2013-09-04T01:10:18Z

Yeah. I'm +1.

jtratner · 2013-09-04T01:13:09Z

hmm...I'm mixed on only following some but not all of the numpy behavior
here...seems slightly confusing to me, but this is not an area I use much
with pandas.

hayd · 2013-09-04T01:13:22Z

Obviously I'm still massive -1, I think we should be raising on all __nonzero__ (i.e. not this pr, but current master).

1 element is fine. You know what you're doing

Totally disagree. I would also argue that it is ambiguous.

In [1]: bool([False])
Out[1]: True

In [2]: bool(np.array([np.nan]))
Out[2]: True

In [3]: bool(np.array([False]))
Out[3]: False

pd.isnull(pd.DataFrame([])) is an empty DataFrame (not ambiguous)...

jseabold · 2013-09-04T01:29:01Z

On Tue, Sep 3, 2013 at 9:13 PM, Andy Hayden [email protected]:

Obviously I'm still massive -1, I think we should be raising on all
nonzero (i.e. not this pr, but current master).

1 element is fine. You know what you're doing

Totally disagree. I would also argue that it is ambiguous.

In [1]: bool([False])
Out[1]: True

Hmm, ok, this is a bit confusing, but I'd argue not a show stopper. I
can't speak for everyone, but I don't find this difficult to differentiate.
Most of Numpy/pandas operates element-wise. So element-wise it's not
ambiguous. If there's one element, we can return an answer.

In [2]: bool(np.array([np.nan]))
Out[2]: True

NumPy's nan handling is inconsistent and definitely one of the areas where
pandas should/exists to improve. It and operations on empty arrays are two
issues that are pointed to as warts (by some) in the numpy threads about
similar. That said, I'm not sure what bool(pd.Series([np.nan])) should do.
I'd probably say that it should raise an error.

In [3]: bool(np.array([False]))
Out[3]: False

pd.isnull(pd.DataFrame([])) is an empty DataFrame (not ambiguous)...

If the null DataFrame is the result of an indexing operation, then it is
ambiguous I'd say. If your question you want to answer is, given this
index, is this value True? The answer is, we can't evaluate this question
because your indexing operation didn't return any information.

jreback · 2013-09-04T01:33:24Z

we would have to special case Series([np.nan]) as I think this would evaluate to bool(np.nan) which is True

hayd · 2013-09-04T01:41:19Z

I think this special casing is going to confuse users (one rule for python, one (especially confusing) for numpy, one for pandas)... I think it's consistency to say no to __nonzero__. I think this is an improvement.

re the empty DataFrame it's just like doing a function over emptyset is empty:

map(pd.isnull, []) == []

the current behaviour is well-defined (and imo correct). However bool of it is ambiguous (if that's what you were saying then we agree :) ).

Some more random stuff we wanted to stop (but work in numpy):

In [4]: not np.array([False])
Out[4]: True

In [5]: np.array([False]) or np.array([np.nan])
Out[5]: array([ nan])

Special casing this will be a rabbit hole.

jseabold · 2013-09-04T01:50:04Z

On Tue, Sep 3, 2013 at 9:41 PM, Andy Hayden [email protected]:

I think this special casing is going to confuse users (one rule for
python, one (especially confusing) for numpy, one for pandas)... I think
it's consistency to say no to nonzero. I think this is an improvement.

re the empty DataFrame it's just like doing a function over emptyset is
empty:

map(pd.isnull, []) == []

the current behaviour is well-defined (and imo correct). However bool of
it is ambiguous (if that's what you were saying then we agree :) ).

Yes, we agree.

Some more random stuff we wanted to stop (but work in numpy):

In [4]: not np.array([False])
Out[4]: True

No problem here in current usage. It maps.

In [5]: np.array([False]) or np.array([np.nan])
Out[5]: array([ nan])

Either the RHS raises or it short circuits if the first element is True.
Special casing this will be a rabbit hole.

There are 5 (?) eventualities I can think of -- empty (raise), 1 element
castable to float (return a boolean), more than one element (raise), 1 nan
only (raise).

The other one I can think of is a single-element object array. I'd argue we
should raise in this case too. What else?

jreback · 2013-09-04T01:57:46Z

actually most of these are solved by limiting the single element resolving to only a boolean array
by definition then a single nan will raise, object array will raise, cast able to bool will raise (I have tests for these)

ONLY a single element series of True/False will pass thru the bool of the element

jreback · 2013-09-06T01:18:24Z

the only remaining point is do we allow in bool(Series([True])) == True and bool(Series([False])) == False

numpy does this, but it is inconsistent with all other bool operations on a PandasObject which will raise

@hayd -1
@jseabold +1

any more strong opionions? @wesm @y-p ?

jtratner · 2013-09-06T01:25:16Z

FWIW, I'm +1 so we're consistent.

jreback · 2013-09-06T01:27:35Z

so @jtratner vote I think is -1 (so with @hayd for consistency) (the +- are for this PR which implements the above)

jtratner · 2013-09-06T01:28:18Z

@jreback no sorry I wasn't clear, I vote for being consistent with numpy.

hayd · 2013-09-06T08:23:16Z

Just to have a last comment on the subject (I think arguments are clear in this and the previous PR):

You're not voting for consistency with numpy, since we'd still raise on empty Series, but some kind of weird middle ground.

jseabold · 2013-09-06T14:57:09Z

Weird...or sane... depending on how you come down I guess.

hayd · 2013-09-06T16:31:34Z

(slighter saner... ;) )

hayd · 2013-09-10T14:24:57Z

Should I put this on mailing list? I've tried to sound impartial.

changing bool behaviour of pandas objects

There's discussion on github to change the behaviour of __nonzero__ for pandas objects.

Bool behaviour in pandas (and numpy) often trips up and surprises new (and experienced) users, for one thing because it differs from many python objects.

For empty arrays it's Falsey
For length one arrays it's bool of item (Note: bool(nan) is True)
Otherwise it raises a ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

One option (currently implemented in master) is to turn off bool always:

raise ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

An alternative proposal being discussed is (4738):

For length one arrays it's bool of item (Perhaps raising on bool(Series([nan])).)
Otherwise raise a ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Note: bool of empty objets would be disallowed.

In both cases:

not/and/or would be specifically disallowed.

jreback · 2013-09-10T14:31:15Z

sure...let's put it out there (numpy list too?)

jseabold · 2013-09-22T15:44:40Z

If item goes, you're going to have to use something like if (one_item_series.all() and one_item_series.any()) to ensure you have one item. I still think the current behavior (or what I've proposed, if there are corner cases lurking still) is the best possible unambiguous but also convenient solution.

jreback · 2013-09-22T15:46:34Z

.item() is equivalent to .iloc[0] which will raise on a 0-len

jseabold · 2013-09-22T15:50:46Z

My original use case is that I do an operation that I think returns a single-element pandas object. Then I can do

if obj > x:
   do something

If I didn't get a single-element object then there is a built-in check for me that raises because it's ambiguous.

Now I'll have to do

if obj.iloc[0] > x and len(obj) == 1:
    do something

Yuck.

Ok, I have to get back to work.

jreback · 2013-09-22T15:53:30Z

ok....IIRC, item was added for compatiblity (I think you use it quite a bit in statsmodels 0.5.0), so can't/won't remove at this time.

@hayd @jtratner PR as is? (with allowing special cased len-1 bool inference with a deprecation message)

hayd · 2013-09-23T02:07:12Z

@jseabold Could we create a helper function for this? (with an appropriate name... which I can't think of) Something like item, but which only works for boolean?

jseabold · 2013-09-23T02:19:42Z

item is intended for scalar-like objects and since y'all don't want anything to be scalar-like I don't see how this could work. pd.try_to_treat_like_scalar() doesn't have much appeal to me. I'll likely just avoid doing this with pandas objects altogether and work with .values and go back to checking for NaNs.

hayd · 2013-09-23T03:07:37Z

What about .bool ?

jtratner · 2013-09-23T03:12:03Z

I disagree with having a warning: we should either break it now and see if more people come forward or allow it for now into the foreseeable future. If this is an issue for people, we should find out about it as soon as the decision is made. This is a really subtle change that could be coming up multiple times without the user realizing it (which would cause a performance hit too for all the warnings generated), plus it breaks all code that relies upon it and, as @jseabold points out, may not be trivial to fix. On the flipside, it's weird to me that bool(np.array([np.nan])) == True [but so is Series([np.nan]).all()

I tried to look around for background on this. There are a number of places where numpy devs explain this decision as avoiding 'subtle errors' that cropped up because of the ambiguity: numpy/numpy#1595 , and on the mailing list - 1, 2, 3.

The one wart that this introduces is that you can't be sure if NDFrame.all() actually will return something you can use for boolean comparison as opposed to generating a ValueError. I think it might make sense to add a keyword argument (maybe recursive?) for both all and any that forces them to return a scalar, boolean value.

jtratner · 2013-09-23T03:35:16Z

other option: have it raise by default and allow a config option that causes it to emit a warning instead. Then we deprecate the option some time in the future.

hayd · 2013-09-23T03:37:34Z

I quite like the current thing where this works with only bool len 1 Series, but think in future this would be better as a short snappy method... => much more explicit.

I think warning is sensible, and any perf hit is negligible, no need to config it to raise. You can monkey patch it to raise easily (indeed, you can monkey patch it to do whatever you want!).

jtratner · 2013-09-23T04:10:18Z

Again, if you're going to deprecate, it should happen now. Otherwise, the warning should drop the deprecated part and just be giving a heads up that you're doing something that could be weird with anything other than a len 1 Series. [Why restrict it to bool? - seems arbitrary]

jreback · 2013-09-23T10:17:03Z

@jtratner restriction to bool is easy, what should bool(Series([np.nan])) yield?

hayd · 2013-09-23T14:01:53Z

Would it be weird to disallow _bool__ but have a special .bool method to do the special len 1 bool thing (coud also work with size one DataFrames too).

jreback · 2013-09-23T14:18:26Z

In [1]: Series([True]).bool()
Out[1]: True

In [2]: Series([False]).bool()
Out[2]: False

In [3]: DataFrame([[False]]).bool()
Out[3]: False

In [4]: DataFrame([[True]]).bool()
Out[4]: True

In [5]: DataFrame([[True,False]]).bool()
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.item(), a.any() or a.all().

Of course then this comes back up.....(but could limit to only boolean arrays) and not just single object arrrays

In [1]: Series([np.nan]).bool()
Out[1]: True

hayd · 2013-09-23T14:31:22Z

@jreback not if this was to only allowed for len 1 boolean (like in this pr), so Series([nan]).bool() would raise.

hayd · 2013-09-23T14:42:39Z

This says (explicitly) "I know I'm going to have a boolean thing of size 1, and I want this interpreted as that boolean thing".

"Interpret a length 1 boolean Series (or size 1 DataFrame/Panel?) as a boolean."

imo Series([0]).bool() and Series([1]).bool() should also raise, I think they do in this pr...

jreback · 2013-09-23T14:47:30Z

I will raise a slightly different error for a resulting scalar that is not-boolean

In [1]: Series([0]).bool()
ValueError: bool cannot act on a non-boolean single element Series

hayd · 2013-09-23T15:08:44Z

pandas/core/generic.py

@@ -558,10 +558,20 @@ def empty(self):
        return not all(len(self._get_axis(a)) > 0 for a in self._AXIS_ORDERS)

    def __nonzero__(self):
-        raise ValueError("The truth value of an array is ambiguous. Use a.empty, a.item(), a.any() or a.all().")
+        raise ValueError("The truth value of a {0} is ambiguous. "
+                         "Use a.empty, a.item(), a.any() or a.all().".format(self.__class__.__name__))


add in bool?

jreback · 2013-09-23T15:13:41Z

  def bool(self):
        """ Return the bool of a single element PandasObject
            This must be a boolean scalar value, either True or False

            Raise a ValueError if the PandasObject does not have exactly
            1 element, or that element is not boolean """
        v = self.squeeze()
        if isinstance(v, (bool,np.bool_)):
            return bool(v)
        elif np.isscalar(v):
            raise ValueError("bool cannot act on a non-boolean single element {0}".format(self.__class__.__name__))

        self.__nonzero__()

hayd · 2013-09-23T15:21:46Z

@jreback Thanks, this looks good.

@jseabold I hope this appeases somewhat, and is something you could use!

jreback · 2013-09-27T13:07:36Z

@hayd @jseabold unless any objections, I think should merge this to at least get some experience before releasing.

again to summarize:

restores bool(pandas_object) to raise ValueError in all cases
adds method pandas_object.bool() to return the value in the case of a single element bool dtyped pandas NDFrame object, otherwise raising (different but consistent ValueError)
docs updated to same (in gotchas and basics, with an expanded boolean reductions section, e.g. how to use bool(),empty,any(),all())

…ated GH4657) TST: test for single element with null-like DOC: deprecation message for using bool(Series([True]))

jreback · 2013-09-30T23:15:38Z

@wesm ?

wesm · 2013-09-30T23:19:24Z

I'm fine with these changes. I don't love the NumPy behavior but perhaps it's POLS in this case.

API: allow single element boolean Series to improve on numpy behavior( related GH4657)

eyadsibai · 2014-02-15T22:44:05Z

What if I have an object it could be None or a DataFrame. So I cannot write:
if obj:
return 1
I cannot do that anymore and at the same time I dunno if it is a dataframe or None so I cannot write obj.empty
in that sitaution I have to write it as:
if obj is None:
return 1
I think this should be changed...

jreback · 2014-02-15T22:59:25Z

@eyadsibai

the correct pythonism is

if obj is not None:
    # do something with the object
else:
    # it is None

see the docs for more information: http://pandas-docs.github.io/pandas-docs-travis/gotchas.html#using-if-truth-statements-with-pandas

hayd reviewed Sep 23, 2013
View reviewed changes

jreback added 2 commits September 30, 2013 16:41

API: allow single element boolean Series to mimic numpy behavior( rel…

5afd4b4

…ated GH4657) TST: test for single element with null-like DOC: deprecation message for using bool(Series([True]))

CLN: add .bool() method for single length evaluation

fe97e1a

jreback added a commit that referenced this pull request Oct 1, 2013

Merge pull request #4738 from jreback/bool_single

dce05cf

API: allow single element boolean Series to improve on numpy behavior( related GH4657)

jreback merged commit dce05cf into pandas-dev:master Oct 1, 2013

API: allow single element boolean Series to improve on numpy behavior( related GH4657) #4738

API: allow single element boolean Series to improve on numpy behavior( related GH4657) #4738

Conversation

jreback commented Sep 3, 2013

jreback commented Sep 3, 2013

jtratner commented Sep 3, 2013

hayd commented Sep 3, 2013

jseabold commented Sep 3, 2013

hayd commented Sep 3, 2013

jseabold commented Sep 3, 2013

hayd commented Sep 3, 2013

jseabold commented Sep 3, 2013

jseabold commented Sep 4, 2013

jtratner commented Sep 4, 2013

jseabold commented Sep 4, 2013

jreback commented Sep 4, 2013

jreback commented Sep 4, 2013

jseabold commented Sep 4, 2013

jtratner commented Sep 4, 2013

hayd commented Sep 4, 2013

jseabold commented Sep 4, 2013

jreback commented Sep 4, 2013

hayd commented Sep 4, 2013

jseabold commented Sep 4, 2013

jreback commented Sep 4, 2013

jreback commented Sep 6, 2013

jtratner commented Sep 6, 2013

jreback commented Sep 6, 2013

jtratner commented Sep 6, 2013

hayd commented Sep 6, 2013

jseabold commented Sep 6, 2013

hayd commented Sep 6, 2013

hayd commented Sep 10, 2013

changing bool behaviour of pandas objects

jreback commented Sep 10, 2013

jseabold commented Sep 22, 2013

jreback commented Sep 22, 2013

jseabold commented Sep 22, 2013

jreback commented Sep 22, 2013

hayd commented Sep 23, 2013

jseabold commented Sep 23, 2013

hayd commented Sep 23, 2013

jtratner commented Sep 23, 2013

jtratner commented Sep 23, 2013

hayd commented Sep 23, 2013

jtratner commented Sep 23, 2013

jreback commented Sep 23, 2013

hayd commented Sep 23, 2013

jreback commented Sep 23, 2013

hayd commented Sep 23, 2013

hayd commented Sep 23, 2013

jreback commented Sep 23, 2013

hayd Sep 23, 2013

Choose a reason for hiding this comment

jreback Sep 23, 2013

Choose a reason for hiding this comment

jreback commented Sep 23, 2013

hayd commented Sep 23, 2013

jreback commented Sep 27, 2013

jreback commented Sep 30, 2013

wesm commented Sep 30, 2013

eyadsibai commented Feb 15, 2014

jreback commented Feb 15, 2014