-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: allow single element boolean Series to improve on numpy behavior( related GH4657) #4738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Looks good to me... |
Can we ping out google groups to see what people think? I still don't see the logic to this (other than numpy doing it), it seems to me there is always a non-ambiguous way to write things... |
Practicality beats purity for me here. It saves typing and seems quite logical to me. At least, I don't see any ambiguity, especially if the empty Series/DataFrame case is explicitly handled to raise. Incidentally, the behavior for an empty objects differs from numpy (and Python) but I agree with the change. |
To me this seems impractical and impure! :( I'm yet to see an example where it saves typing (and which can't be done without item), and I don't see why the one item case should be special (if we're breaking numpy behaviour anyway....). |
Every example I showed involved more typing? |
I thought there was always a way it could be written to not be, e.g. the example with pd.isnull(df.ix[]) could be tweaked so the ix returned a value (not a Series)... |
ix will only return a value if I don't use a boolean index, but this involves more code to get the single index value -- either a (horribly ugly) call to np.where, some pandas-fu that I didn't try yet, or setting an index on the frame (which isn't always possible). |
Please do bring it up on the ML though. Input from others would be helpful. |
I didn't realize how this handles empty Series. Should definitely be the same as Python/numpy: >>> bool(Series([]))
False
>>> bool(np.array([])
False
>>> bool(np.array([[]])
False
>>> bool(DataFrame([[]]))
False |
Ha, now there's some ambiguity if it's true there wasn't before. I thought this was an improvement over numpy and the warning is clear that maybe you should use 'empty'. Anway, my vote is
You can read old numpy threads about this if no one wants to discuss on the ML, but I think these issues may be of general interest. |
current behavior (with this PR) is to raise on empty Series/DataFrame (note that this was not true in 0.12, I agree that we should absolutely raise on an empty object
should NEVER work (and if we allowed it to work for an empty frame it would) correct is
|
@jseabold proposal is what this PR does |
Yeah. I'm +1. |
hmm...I'm mixed on only following some but not all of the numpy behavior |
Obviously I'm still massive -1, I think we should be raising on all
Totally disagree. I would also argue that it is ambiguous.
|
On Tue, Sep 3, 2013 at 9:13 PM, Andy Hayden [email protected]:
Hmm, ok, this is a bit confusing, but I'd argue not a show stopper. I
NumPy's nan handling is inconsistent and definitely one of the areas where
If the null DataFrame is the result of an indexing operation, then it is |
we would have to special case Series([np.nan]) as I think this would evaluate to bool(np.nan) which is True |
I think this special casing is going to confuse users (one rule for python, one (especially confusing) for numpy, one for pandas)... I think it's consistency to say no to re the empty DataFrame it's just like doing a function over emptyset is empty:
the current behaviour is well-defined (and imo correct). However bool of it is ambiguous (if that's what you were saying then we agree :) ). Some more random stuff we wanted to stop (but work in numpy):
Special casing this will be a rabbit hole. |
On Tue, Sep 3, 2013 at 9:41 PM, Andy Hayden [email protected]:
Yes, we agree.
No problem here in current usage. It maps.
There are 5 (?) eventualities I can think of -- empty (raise), 1 element The other one I can think of is a single-element object array. I'd argue we |
actually most of these are solved by limiting the single element resolving to only a boolean array ONLY a single element series of True/False will pass thru the bool of the element |
FWIW, I'm +1 so we're consistent. |
@jreback no sorry I wasn't clear, I vote for being consistent with numpy. |
Just to have a last comment on the subject (I think arguments are clear in this and the previous PR): You're not voting for consistency with numpy, since we'd still raise on empty Series, but some kind of weird middle ground. |
Weird...or sane... depending on how you come down I guess. |
(slighter saner... ;) ) |
Should I put this on mailing list? I've tried to sound impartial. changing bool behaviour of pandas objectsThere's discussion on github to change the behaviour of Bool behaviour in pandas (and numpy) often trips up and surprises new (and experienced) users, for one thing because it differs from many python objects.
One option (currently implemented in master) is to turn off bool always:
An alternative proposal being discussed is (4738):
Note: bool of empty objets would be disallowed. In both cases:
|
sure...let's put it out there (numpy list too?) |
If item goes, you're going to have to use something like |
|
My original use case is that I do an operation that I think returns a single-element pandas object. Then I can do
If I didn't get a single-element object then there is a built-in check for me that raises because it's ambiguous. Now I'll have to do
Yuck. Ok, I have to get back to work. |
@jseabold Could we create a helper function for this? (with an appropriate name... which I can't think of) Something like item, but which only works for boolean? |
|
What about |
I disagree with having a warning: we should either break it now and see if more people come forward or allow it for now into the foreseeable future. If this is an issue for people, we should find out about it as soon as the decision is made. This is a really subtle change that could be coming up multiple times without the user realizing it (which would cause a performance hit too for all the warnings generated), plus it breaks all code that relies upon it and, as @jseabold points out, may not be trivial to fix. On the flipside, it's weird to me that I tried to look around for background on this. There are a number of places where numpy devs explain this decision as avoiding 'subtle errors' that cropped up because of the ambiguity: numpy/numpy#1595 , and on the mailing list - 1, 2, 3. The one wart that this introduces is that you can't be sure if |
other option: have it raise by default and allow a config option that causes it to emit a warning instead. Then we deprecate the option some time in the future. |
I quite like the current thing where this works with only bool len 1 Series, but think in future this would be better as a short snappy method... => much more explicit. I think warning is sensible, and any perf hit is negligible, no need to config it to raise. You can monkey patch it to raise easily (indeed, you can monkey patch it to do whatever you want!). |
Again, if you're going to deprecate, it should happen now. Otherwise, the warning should drop the deprecated part and just be giving a heads up that you're doing something that could be weird with anything other than a len 1 Series. [Why restrict it to bool? - seems arbitrary] |
@jtratner restriction to bool is easy, what should |
Would it be weird to disallow |
Of course then this comes back up.....(but could limit to only boolean arrays) and not just single object arrrays
|
@jreback not if this was to only allowed for len 1 boolean (like in this pr), so Series([nan]).bool() would raise. |
This says (explicitly) "I know I'm going to have a boolean thing of size 1, and I want this interpreted as that boolean thing". "Interpret a length 1 boolean Series (or size 1 DataFrame/Panel?) as a boolean." imo |
I will raise a slightly different error for a resulting scalar that is not-boolean
|
@@ -558,10 +558,20 @@ def empty(self): | |||
return not all(len(self._get_axis(a)) > 0 for a in self._AXIS_ORDERS) | |||
|
|||
def __nonzero__(self): | |||
raise ValueError("The truth value of an array is ambiguous. Use a.empty, a.item(), a.any() or a.all().") | |||
raise ValueError("The truth value of a {0} is ambiguous. " | |||
"Use a.empty, a.item(), a.any() or a.all().".format(self.__class__.__name__)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add in bool?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
@hayd @jseabold unless any objections, I think should merge this to at least get some experience before releasing. again to summarize:
|
…ated GH4657) TST: test for single element with null-like DOC: deprecation message for using bool(Series([True]))
@wesm ? |
I'm fine with these changes. I don't love the NumPy behavior but perhaps it's POLS in this case. |
API: allow single element boolean Series to improve on numpy behavior( related GH4657)
What if I have an object it could be None or a DataFrame. So I cannot write: |
the correct pythonism is
see the docs for more information: http://pandas-docs.github.io/pandas-docs-travis/gotchas.html#using-if-truth-statements-with-pandas |
related #4657