Skip to content

API: allow single element boolean Series to improve on numpy behavior( related GH4657) #4738

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 1, 2013

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Sep 3, 2013

related #4657

@jreback
Copy link
Contributor Author

jreback commented Sep 3, 2013

@jseabold @jtratner @hayd

@jtratner
Copy link
Contributor

jtratner commented Sep 3, 2013

Looks good to me...

@hayd
Copy link
Contributor

hayd commented Sep 3, 2013

Can we ping out google groups to see what people think?

I still don't see the logic to this (other than numpy doing it), it seems to me there is always a non-ambiguous way to write things...

@jseabold
Copy link
Contributor

jseabold commented Sep 3, 2013

Practicality beats purity for me here. It saves typing and seems quite logical to me. At least, I don't see any ambiguity, especially if the empty Series/DataFrame case is explicitly handled to raise. Incidentally, the behavior for an empty objects differs from numpy (and Python) but I agree with the change.

@hayd
Copy link
Contributor

hayd commented Sep 3, 2013

To me this seems impractical and impure! :(

I'm yet to see an example where it saves typing (and which can't be done without item), and I don't see why the one item case should be special (if we're breaking numpy behaviour anyway....).

@jseabold
Copy link
Contributor

jseabold commented Sep 3, 2013

Every example I showed involved more typing?

@hayd
Copy link
Contributor

hayd commented Sep 3, 2013

I thought there was always a way it could be written to not be, e.g. the example with pd.isnull(df.ix[]) could be tweaked so the ix returned a value (not a Series)...

@jseabold
Copy link
Contributor

jseabold commented Sep 3, 2013

ix will only return a value if I don't use a boolean index, but this involves more code to get the single index value -- either a (horribly ugly) call to np.where, some pandas-fu that I didn't try yet, or setting an index on the frame (which isn't always possible).

@jseabold
Copy link
Contributor

jseabold commented Sep 4, 2013

Please do bring it up on the ML though. Input from others would be helpful.

@jtratner
Copy link
Contributor

jtratner commented Sep 4, 2013

I didn't realize how this handles empty Series. Should definitely be the same as Python/numpy:

>>> bool(Series([]))
False
>>> bool(np.array([])
False
>>> bool(np.array([[]])
False
>>> bool(DataFrame([[]]))
False

@jseabold
Copy link
Contributor

jseabold commented Sep 4, 2013

Ha, now there's some ambiguity if it's true there wasn't before. I thought this was an improvement over numpy and the warning is clear that maybe you should use 'empty'.

Anway, my vote is

  1. Empty series raises. Maybe you screwed up your index? What is the 'correct' output of this?

    if pd.isnull(pd.DataFrame([])):
        print 'this dataframe has no missing values?'
    

    Ambiguous?

  2. 1 element is fine. You know what you're doing, carry on. Also .all() == .any() in this case, so it's not ambiguous.

  3. Length > 1 raises . This is ambiguous. Ask for all, any, or empty. Maybe you screwed up your index?

You can read old numpy threads about this if no one wants to discuss on the ML, but I think these issues may be of general interest.

@jreback
Copy link
Contributor Author

jreback commented Sep 4, 2013

current behavior (with this PR) is to raise on empty Series/DataFrame (note that this was not true in 0.12,
but actually WAS true IIRC before 0.12)

I agree that we should absolutely raise on an empty object

if not df:
   print 'ok'

should NEVER work (and if we allowed it to work for an empty frame it would)

correct is

if not df.empty:
     print 'ok'

@jreback
Copy link
Contributor Author

jreback commented Sep 4, 2013

@jseabold proposal is what this PR does

@jseabold
Copy link
Contributor

jseabold commented Sep 4, 2013

Yeah. I'm +1.

@jtratner
Copy link
Contributor

jtratner commented Sep 4, 2013

hmm...I'm mixed on only following some but not all of the numpy behavior
here...seems slightly confusing to me, but this is not an area I use much
with pandas.

@hayd
Copy link
Contributor

hayd commented Sep 4, 2013

Obviously I'm still massive -1, I think we should be raising on all __nonzero__ (i.e. not this pr, but current master).

1 element is fine. You know what you're doing

Totally disagree. I would also argue that it is ambiguous.

In [1]: bool([False])
Out[1]: True

In [2]: bool(np.array([np.nan]))
Out[2]: True

In [3]: bool(np.array([False]))
Out[3]: False

pd.isnull(pd.DataFrame([])) is an empty DataFrame (not ambiguous)...

@jseabold
Copy link
Contributor

jseabold commented Sep 4, 2013

On Tue, Sep 3, 2013 at 9:13 PM, Andy Hayden [email protected]:

Obviously I'm still massive -1, I think we should be raising on all
nonzero (i.e. not this pr, but current master).

1 element is fine. You know what you're doing

Totally disagree. I would also argue that it is ambiguous.

In [1]: bool([False])
Out[1]: True

Hmm, ok, this is a bit confusing, but I'd argue not a show stopper. I
can't speak for everyone, but I don't find this difficult to differentiate.
Most of Numpy/pandas operates element-wise. So element-wise it's not
ambiguous. If there's one element, we can return an answer.

In [2]: bool(np.array([np.nan]))
Out[2]: True

NumPy's nan handling is inconsistent and definitely one of the areas where
pandas should/exists to improve. It and operations on empty arrays are two
issues that are pointed to as warts (by some) in the numpy threads about
similar. That said, I'm not sure what bool(pd.Series([np.nan])) should do.
I'd probably say that it should raise an error.

In [3]: bool(np.array([False]))
Out[3]: False

pd.isnull(pd.DataFrame([])) is an empty DataFrame (not ambiguous)...

If the null DataFrame is the result of an indexing operation, then it is
ambiguous I'd say. If your question you want to answer is, given this
index, is this value True? The answer is, we can't evaluate this question
because your indexing operation didn't return any information.

@jreback
Copy link
Contributor Author

jreback commented Sep 4, 2013

we would have to special case Series([np.nan]) as I think this would evaluate to bool(np.nan) which is True

@hayd
Copy link
Contributor

hayd commented Sep 4, 2013

I think this special casing is going to confuse users (one rule for python, one (especially confusing) for numpy, one for pandas)... I think it's consistency to say no to __nonzero__. I think this is an improvement.

re the empty DataFrame it's just like doing a function over emptyset is empty:

map(pd.isnull, []) == []

the current behaviour is well-defined (and imo correct). However bool of it is ambiguous (if that's what you were saying then we agree :) ).

Some more random stuff we wanted to stop (but work in numpy):

In [4]: not np.array([False])
Out[4]: True

In [5]: np.array([False]) or np.array([np.nan])
Out[5]: array([ nan])

Special casing this will be a rabbit hole.

@jseabold
Copy link
Contributor

jseabold commented Sep 4, 2013

On Tue, Sep 3, 2013 at 9:41 PM, Andy Hayden [email protected]:

I think this special casing is going to confuse users (one rule for
python, one (especially confusing) for numpy, one for pandas)... I think
it's consistency to say no to nonzero. I think this is an improvement.

re the empty DataFrame it's just like doing a function over emptyset is
empty:

map(pd.isnull, []) == []

the current behaviour is well-defined (and imo correct). However bool of
it is ambiguous (if that's what you were saying then we agree :) ).

Yes, we agree.

Some more random stuff we wanted to stop (but work in numpy):

In [4]: not np.array([False])
Out[4]: True

No problem here in current usage. It maps.

In [5]: np.array([False]) or np.array([np.nan])
Out[5]: array([ nan])

Either the RHS raises or it short circuits if the first element is True.
Special casing this will be a rabbit hole.

There are 5 (?) eventualities I can think of -- empty (raise), 1 element
castable to float (return a boolean), more than one element (raise), 1 nan
only (raise).

The other one I can think of is a single-element object array. I'd argue we
should raise in this case too. What else?

@jreback
Copy link
Contributor Author

jreback commented Sep 4, 2013

actually most of these are solved by limiting the single element resolving to only a boolean array
by definition then a single nan will raise, object array will raise, cast able to bool will raise (I have tests for these)

ONLY a single element series of True/False will pass thru the bool of the element

@jreback
Copy link
Contributor Author

jreback commented Sep 6, 2013

the only remaining point is do we allow in bool(Series([True])) == True and bool(Series([False])) == False

numpy does this, but it is inconsistent with all other bool operations on a PandasObject which will raise

@hayd -1
@jseabold +1

any more strong opionions? @wesm @y-p ?

@jtratner
Copy link
Contributor

jtratner commented Sep 6, 2013

FWIW, I'm +1 so we're consistent.

@jreback
Copy link
Contributor Author

jreback commented Sep 6, 2013

so @jtratner vote I think is -1 (so with @hayd for consistency) (the +- are for this PR which implements the above)

@jtratner
Copy link
Contributor

jtratner commented Sep 6, 2013

@jreback no sorry I wasn't clear, I vote for being consistent with numpy.

@hayd
Copy link
Contributor

hayd commented Sep 6, 2013

Just to have a last comment on the subject (I think arguments are clear in this and the previous PR):

You're not voting for consistency with numpy, since we'd still raise on empty Series, but some kind of weird middle ground.

@jseabold
Copy link
Contributor

jseabold commented Sep 6, 2013

Weird...or sane... depending on how you come down I guess.

@hayd
Copy link
Contributor

hayd commented Sep 6, 2013

(slighter saner... ;) )

@hayd
Copy link
Contributor

hayd commented Sep 10, 2013

Should I put this on mailing list? I've tried to sound impartial.

changing bool behaviour of pandas objects

There's discussion on github to change the behaviour of __nonzero__ for pandas objects.

Bool behaviour in pandas (and numpy) often trips up and surprises new (and experienced) users, for one thing because it differs from many python objects.

  • For empty arrays it's Falsey
  • For length one arrays it's bool of item (Note: bool(nan) is True)
  • Otherwise it raises a ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

One option (currently implemented in master) is to turn off bool always:

  • raise ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

An alternative proposal being discussed is (4738):

  • For length one arrays it's bool of item (Perhaps raising on bool(Series([nan])).)
  • Otherwise raise a ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Note: bool of empty objets would be disallowed.

In both cases:

  • not/and/or would be specifically disallowed.

@jreback
Copy link
Contributor Author

jreback commented Sep 10, 2013

sure...let's put it out there (numpy list too?)

@jseabold
Copy link
Contributor

If item goes, you're going to have to use something like if (one_item_series.all() and one_item_series.any()) to ensure you have one item. I still think the current behavior (or what I've proposed, if there are corner cases lurking still) is the best possible unambiguous but also convenient solution.

@jreback
Copy link
Contributor Author

jreback commented Sep 22, 2013

.item() is equivalent to .iloc[0] which will raise on a 0-len

@jseabold
Copy link
Contributor

My original use case is that I do an operation that I think returns a single-element pandas object. Then I can do

if obj > x:
   do something

If I didn't get a single-element object then there is a built-in check for me that raises because it's ambiguous.

Now I'll have to do

if obj.iloc[0] > x and len(obj) == 1:
    do something

Yuck.

Ok, I have to get back to work.

@jreback
Copy link
Contributor Author

jreback commented Sep 22, 2013

ok....IIRC, item was added for compatiblity (I think you use it quite a bit in statsmodels 0.5.0), so can't/won't remove at this time.

@hayd @jtratner PR as is? (with allowing special cased len-1 bool inference with a deprecation message)

@hayd
Copy link
Contributor

hayd commented Sep 23, 2013

@jseabold Could we create a helper function for this? (with an appropriate name... which I can't think of) Something like item, but which only works for boolean?

@jseabold
Copy link
Contributor

item is intended for scalar-like objects and since y'all don't want anything to be scalar-like I don't see how this could work. pd.try_to_treat_like_scalar() doesn't have much appeal to me. I'll likely just avoid doing this with pandas objects altogether and work with .values and go back to checking for NaNs.

@hayd
Copy link
Contributor

hayd commented Sep 23, 2013

What about .bool ?

@jtratner
Copy link
Contributor

I disagree with having a warning: we should either break it now and see if more people come forward or allow it for now into the foreseeable future. If this is an issue for people, we should find out about it as soon as the decision is made. This is a really subtle change that could be coming up multiple times without the user realizing it (which would cause a performance hit too for all the warnings generated), plus it breaks all code that relies upon it and, as @jseabold points out, may not be trivial to fix. On the flipside, it's weird to me that bool(np.array([np.nan])) == True [but so is Series([np.nan]).all()

I tried to look around for background on this. There are a number of places where numpy devs explain this decision as avoiding 'subtle errors' that cropped up because of the ambiguity: numpy/numpy#1595 , and on the mailing list - 1, 2, 3.

The one wart that this introduces is that you can't be sure if NDFrame.all() actually will return something you can use for boolean comparison as opposed to generating a ValueError. I think it might make sense to add a keyword argument (maybe recursive?) for both all and any that forces them to return a scalar, boolean value.

@jtratner
Copy link
Contributor

other option: have it raise by default and allow a config option that causes it to emit a warning instead. Then we deprecate the option some time in the future.

@hayd
Copy link
Contributor

hayd commented Sep 23, 2013

I quite like the current thing where this works with only bool len 1 Series, but think in future this would be better as a short snappy method... => much more explicit.

I think warning is sensible, and any perf hit is negligible, no need to config it to raise. You can monkey patch it to raise easily (indeed, you can monkey patch it to do whatever you want!).

@jtratner
Copy link
Contributor

Again, if you're going to deprecate, it should happen now. Otherwise, the warning should drop the deprecated part and just be giving a heads up that you're doing something that could be weird with anything other than a len 1 Series. [Why restrict it to bool? - seems arbitrary]

@jreback
Copy link
Contributor Author

jreback commented Sep 23, 2013

@jtratner restriction to bool is easy, what should bool(Series([np.nan])) yield?

@hayd
Copy link
Contributor

hayd commented Sep 23, 2013

Would it be weird to disallow _bool__ but have a special .bool method to do the special len 1 bool thing (coud also work with size one DataFrames too).

@jreback
Copy link
Contributor Author

jreback commented Sep 23, 2013

In [1]: Series([True]).bool()
Out[1]: True

In [2]: Series([False]).bool()
Out[2]: False

In [3]: DataFrame([[False]]).bool()
Out[3]: False

In [4]: DataFrame([[True]]).bool()
Out[4]: True

In [5]: DataFrame([[True,False]]).bool()
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.item(), a.any() or a.all().

Of course then this comes back up.....(but could limit to only boolean arrays) and not just single object arrrays

In [1]: Series([np.nan]).bool()
Out[1]: True

@hayd
Copy link
Contributor

hayd commented Sep 23, 2013

@jreback not if this was to only allowed for len 1 boolean (like in this pr), so Series([nan]).bool() would raise.

@hayd
Copy link
Contributor

hayd commented Sep 23, 2013

This says (explicitly) "I know I'm going to have a boolean thing of size 1, and I want this interpreted as that boolean thing".

"Interpret a length 1 boolean Series (or size 1 DataFrame/Panel?) as a boolean."

imo Series([0]).bool() and Series([1]).bool() should also raise, I think they do in this pr...

@jreback
Copy link
Contributor Author

jreback commented Sep 23, 2013

I will raise a slightly different error for a resulting scalar that is not-boolean

In [1]: Series([0]).bool()
ValueError: bool cannot act on a non-boolean single element Series

@@ -558,10 +558,20 @@ def empty(self):
return not all(len(self._get_axis(a)) > 0 for a in self._AXIS_ORDERS)

def __nonzero__(self):
raise ValueError("The truth value of an array is ambiguous. Use a.empty, a.item(), a.any() or a.all().")
raise ValueError("The truth value of a {0} is ambiguous. "
"Use a.empty, a.item(), a.any() or a.all().".format(self.__class__.__name__))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add in bool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jreback
Copy link
Contributor Author

jreback commented Sep 23, 2013

  def bool(self):
        """ Return the bool of a single element PandasObject
            This must be a boolean scalar value, either True or False

            Raise a ValueError if the PandasObject does not have exactly
            1 element, or that element is not boolean """
        v = self.squeeze()
        if isinstance(v, (bool,np.bool_)):
            return bool(v)
        elif np.isscalar(v):
            raise ValueError("bool cannot act on a non-boolean single element {0}".format(self.__class__.__name__))

        self.__nonzero__()

@hayd
Copy link
Contributor

hayd commented Sep 23, 2013

@jreback Thanks, this looks good.

@jseabold I hope this appeases somewhat, and is something you could use!

@jreback
Copy link
Contributor Author

jreback commented Sep 27, 2013

@hayd @jseabold unless any objections, I think should merge this to at least get some experience before releasing.

again to summarize:

  • restores bool(pandas_object) to raise ValueError in all cases
  • adds method pandas_object.bool() to return the value in the case of a single element bool dtyped pandas NDFrame object, otherwise raising (different but consistent ValueError)
  • docs updated to same (in gotchas and basics, with an expanded boolean reductions section, e.g. how to use bool(),empty,any(),all())

…ated GH4657)

TST: test for single element with null-like

DOC: deprecation message for using bool(Series([True]))
@jreback
Copy link
Contributor Author

jreback commented Sep 30, 2013

@wesm ?

@wesm
Copy link
Member

wesm commented Sep 30, 2013

I'm fine with these changes. I don't love the NumPy behavior but perhaps it's POLS in this case.

jreback added a commit that referenced this pull request Oct 1, 2013
API: allow single element boolean Series to improve on numpy behavior( related GH4657)
@jreback jreback merged commit dce05cf into pandas-dev:master Oct 1, 2013
@eyadsibai
Copy link

What if I have an object it could be None or a DataFrame. So I cannot write:
if obj:
return 1
I cannot do that anymore and at the same time I dunno if it is a dataframe or None so I cannot write obj.empty
in that sitaution I have to write it as:
if obj is None:
return 1
I think this should be changed...

@jreback
Copy link
Contributor Author

jreback commented Feb 15, 2014

@eyadsibai

the correct pythonism is

if obj is not None:
    # do something with the object
else:
    # it is None

see the docs for more information: http://pandas-docs.github.io/pandas-docs-travis/gotchas.html#using-if-truth-statements-with-pandas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants