BUG: Make sure series-series boolean comparisons are label based (GH4947) #4953

jreback · 2013-09-23T18:15:50Z

You ask for a boolean comparison, that is what you get

In [1]: a = Series([True, False, True], list('bca'))

In [2]: b = Series([False, True, False], list('abc'))

In [19]: a
Out[19]: 
b     True
c    False
a     True
dtype: bool

In [20]: b
Out[20]: 
a    False
b     True
c    False
dtype: bool

In [21]: a & b
Out[21]: 
b     True
c    False
a    False
dtype: bool

In [22]: a | b
Out[22]: 
b     True
c    False
a     True
dtype: bool

In [23]: a ^ b
Out[23]: 
b    False
c    False
a     True
dtype: bool

In [24]: a & Series([])
Out[24]: 
b    False
c    False
a    False
dtype: bool

In [25]: a | Series([])
Out[25]: 
b     True
c    False
a     True
dtype: bool

these ok?

In [9]: Series([np.nan,np.nan])&Series([np.nan,np.nan])
Out[9]: 
0    False
1    False
dtype: bool

In [10]: Series([np.nan,np.nan])|Series([np.nan,np.nan])
Out[10]: 
0    False
1    False
dtype: bool

scalars


In [30]: a | True
Out[30]: 
b    True
c    True
a    True
dtype: bool

In [31]: a | False
Out[31]: 
b     True
c    False
a     True
dtype: bool

In [32]: a & True
Out[32]: 
b     True
c    False
a     True
dtype: bool

In [33]: a & False
Out[33]: 
b    False
c    False
a    False
dtype: bool

Invalid scalars

In [28]: a | 'foo'
TypeError: cannot compare a dtyped [bool] array with a scalar of type [bool]

In [29]: a | np.nan
TypeError: cannot compare a dtyped [bool] array with a scalar of type [float]

jreback · 2013-09-23T18:17:31Z

@jtratner I think you might just want to incorporate this in your ops changes (or I can merge directly first), u lmk

@hayd, @jtratner what do you think about non-matching cases. I am basically only matching on the lhs labels,
so that's what you get back. If there are non-matches, then we should prob fill with False I think? (so you are always returned a boolean array)
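
A minimal pure-Python sketch of the lhs-label matching described here, with dicts standing in for Series. The helper name `logical_lhs` and its `fill` parameter are hypothetical and not pandas API; the sketch only illustrates looking up each lhs label in the rhs and filling non-matches with False so the result is always boolean.

```python
import operator

def logical_lhs(lhs, rhs, op, fill=False):
    """Apply op label by label, keeping only the lhs labels (hypothetical helper)."""
    # Labels missing from rhs are treated as `fill`, so every result is a bool.
    return {label: bool(op(value, rhs.get(label, fill)))
            for label, value in lhs.items()}

# Same data as `a` and `b` at the top of the PR.
a = {'b': True, 'c': False, 'a': True}
b = {'a': False, 'b': True, 'c': False}

and_result = logical_lhs(a, b, operator.and_)
or_result = logical_lhs(a, b, operator.or_)
xor_result = logical_lhs(a, b, operator.xor)
empty_or = logical_lhs(a, {}, operator.or_)
empty_and = logical_lhs(a, {}, operator.and_)
```

Under these assumptions the values agree with the `a & b`, `a | b`, `a ^ b`, `a & Series([])` and `a | Series([])` outputs listed at the top of the PR.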

jtratner · 2013-09-23T21:16:39Z

I'd prefer you merge this first (if you can do it tonight/tomorrow). I need to restart the arithmetic refactor soon, otherwise it's going to cover waay too much, but it'd be helpful to have this change in already with tests, so I don't break it or abstract too much.

jreback · 2013-09-23T21:21:10Z

do u think I should be filling these with False
or True if it's a not operation? (to be consistent and always return a bool type)?

jreback · 2013-09-23T22:53:15Z

@jtratner @hayd pls look at the top of the PR for op results thxs

hayd · 2013-09-24T04:45:02Z

Not sure about:

In [8]: a | Series([])
Out[8]: 
b    True
c    True
a    True
dtype: bool

shouldn't this be a ? Or is this a nan being truthy thing... :s

hayd · 2013-09-24T04:47:41Z

Realised you were saying exactly the same above: I also think this should be filled with False.

jreback · 2013-09-24T11:39:45Z

updated...default is now False even for |

jreback · 2013-09-24T14:37:09Z

@hayd @jtratner all ok with this now?

hayd · 2013-09-24T15:22:36Z

Not quite, I think a | Series() should equal a... ?

Cripes, it's pretty tricky to be consistent here cos of bool(nan)...
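
For reference, the `bool(nan)` behaviour mentioned here can be seen in plain Python. This is only a small illustration; the helper `falsey_nan` is a hypothetical name, not something from the PR.

```python
import math

nan = float('nan')

# NaN is truthy in Python: among floats only zero is falsey.
nan_is_truthy = bool(nan)

# So a naive bool() cast turns a missing value into True ...
naive_or = bool(nan) or False

def falsey_nan(value):
    """Treat NaN as False, everything else by its usual truth value (hypothetical helper)."""
    if isinstance(value, float) and math.isnan(value):
        return False
    return bool(value)

# ... unless NaN is explicitly mapped to False first.
filled_or = falsey_nan(nan) or False
```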

jreback · 2013-09-24T15:27:01Z

This is why I think default for | should be True

currently (default is False)

In [1]: a = Series([True, False, True], list('bca'))

In [2]: a | Series([])
Out[2]: 
b    False
c    False
a    False
dtype: bool

and this is odd

In [3]: a[a | Series([])]
Out[3]: Series([], dtype: bool)

THIS should return a (and will if | defaults nan to True)

jreback · 2013-09-25T12:48:38Z

@hayd ?

hayd · 2013-09-25T15:31:22Z

meh... so two things:

These operations should commute, atm: (a & e) != (e & a)

I think the fillna should happen before the constructor with False (currently it's too coarse), so something like:

    index = self.index | other.index
    new_self = self.reindex(index).fillna(False).astype(bool)
    new_other = other.reindex(index).fillna(False).astype(bool)
    return self._constructor(na_op(new_self.values, new_other.values),
                             index=index, name=name)

I suspect this is too much overhead (too slow?).... :S but that would give what I would expect.

In [2]: a = pd.Series([True, False, True], list('bca'))

In [3]: e = pd.Series()

In [4]: e & a
Out[4]: 
b    False
c    False
a    False
dtype: bool

In [5]: e | a
Out[5]: 
b     True
c    False
a     True
dtype: bool

I guess I should just accept that NaN is True. (...never!!)
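
The reindex-then-fill idea above can be modelled without pandas. This is a rough sketch with dicts standing in for Series; `logical_union` and `_false_if_missing` are hypothetical names, and the sorted label order is just a choice made for the sketch.

```python
import math
import operator

def _false_if_missing(value):
    """Map an absent label (None) or a NaN to False, anything else to its truth value."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return False
    return bool(value)

def logical_union(lhs, rhs, op):
    """Align on the union of labels, fill gaps with False, then apply op."""
    labels = sorted(set(lhs) | set(rhs))
    return {label: bool(op(_false_if_missing(lhs.get(label)),
                           _false_if_missing(rhs.get(label))))
            for label in labels}

a = {'b': True, 'c': False, 'a': True}
e = {}

e_or_a = logical_union(e, a, operator.or_)
a_or_e = logical_union(a, e, operator.or_)
e_and_a = logical_union(e, a, operator.and_)
a_and_e = logical_union(a, e, operator.and_)
```

Because both operands are filled before the op is applied, the result does not depend on operand order in this model.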

jreback · 2013-09-25T16:00:11Z

@hayd these are not comparing bools (except maybe in a special case). They are comparing values (possibly nan), so the issue is that a nan compared with another nan gives a nan (I think in all cases), which you then should fill

hayd · 2013-09-25T16:13:22Z

@jreback but then afterwards it's being forced to bool rather than values? (I think that confuses things...!)

jreback · 2013-09-25T16:22:25Z

no....its a comparison operation between 2 objects, you should get a bool dtype

hayd · 2013-09-25T16:39:52Z

What do you think about the reindex thing?

I think we agree that when both are NaN it should be False, however the weird behaviour imo is when only one is NaN - I think then or should be the non-NaN truey value i.e. True (and not what we have above), whilst and should be False (as we already have).

This is also a bit sketch as True | np.nan == True whilst NaN | True == NaN, when we fillna (and I think this may be the crux of the issue - grrr).

Should it commute? I think yes (because we're talking bool):

'a' | 'b' # doesn't commute
bool('a'|'b') # does

Apologies if I'm not making any sense here...
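
Assuming the `|` in the lines above is shorthand for Python's `or` (on plain Python strings `|` itself raises a TypeError), the asymmetry can be reproduced directly: `or` returns its first truthy operand, and NaN is truthy.

```python
import math

nan = float('nan')

# `or` returns the first truthy operand, so the result depends on operand order.
left_true = True or nan      # evaluates to True
left_nan = nan or True       # evaluates to nan, because bool(nan) is True

# The same holds for strings: the raw values do not commute ...
ab = 'a' or 'b'
ba = 'b' or 'a'

# ... but their truth values do.
commutes_as_bool = bool(ab) == bool(ba)
```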

jreback · 2013-09-25T16:45:00Z

I originally was fillna for & with False, logic being that they both would have to be valid if you don't want a nan
while True with | as only 1 needs to be valid (ignore the case where both are nan for a second), then these give

True | np.nan == True and NaN | True == True and True & np.nan == False and np.nan & True == False

'special' case of: np.nan & np.nan == False and np.nan | np.nan == False
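
The rules listed in this comment can be written out as a small elementwise function. This is a sketch of the stated rules only, not the PR's implementation; `nan_logical`, `nan_or` and `nan_and` are hypothetical names.

```python
import math
import operator

def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)

def nan_logical(x, y, op, fill):
    """Both NaN gives False; otherwise a lone NaN is replaced by `fill` before op."""
    if _is_nan(x) and _is_nan(y):
        return False                      # the 'special' case
    x = fill if _is_nan(x) else bool(x)
    y = fill if _is_nan(y) else bool(y)
    return bool(op(x, y))

def nan_or(x, y):
    # `|` fills a lone NaN with True: only one side needs to be valid.
    return nan_logical(x, y, operator.or_, fill=True)

def nan_and(x, y):
    # `&` fills a lone NaN with False: both sides need to be valid.
    return nan_logical(x, y, operator.and_, fill=False)
```

Note that under this rule `nan_or(False, float('nan'))` is also True, since the lone NaN is filled with True.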

jreback · 2013-09-26T20:29:56Z

@hayd what do you think? default 'or' to True? (as I show above)

hayd · 2013-09-26T23:03:58Z

@jreback Does that mean that in pseudo: (e | a) == a == (a | e) ?

What do you think about the index thing (should we | the indexes result)?

hayd · 2013-09-26T23:10:28Z

To separate the issues here:

1. For any Series a and empty Series e, does (a | e) == a ?
2. For a and b Series, atm the index of (a | b) is a.index, should it be (a.index | b.index) ?

I think the rest of the logic follows neatly from these...

jreback · 2013-09-27T12:52:56Z

@hayd

I think the answer is yes to both, as I think the identity s[s|e] == s should be true for all e (whether empty or not)

so for 2) that would essentially be an align operation (and not what you are asking for here)
for 1) same thing
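
The identity s[s|e] == s can be checked in a small pure-Python model, with dicts standing in for Series and an empty e. The helpers `select` and `or_with_empty` are hypothetical. In this model the identity holds only when the mask `s | e` is True everywhere, i.e. when missing values default to True for `|`; with a False default the mask equals `s` and the False entries drop out of the selection.

```python
def select(series, mask):
    """Boolean-mask selection on a dict standing in for a Series."""
    return {label: value for label, value in series.items() if mask[label]}

def or_with_empty(series, fill):
    """`series | e` for an empty e, with every missing rhs value replaced by `fill`."""
    return {label: bool(value) or fill for label, value in series.items()}

s = {'b': True, 'c': False, 'a': True}

kept_with_true = select(s, or_with_empty(s, fill=True))
kept_with_false = select(s, or_with_empty(s, fill=False))
```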

hayd · 2013-09-27T14:48:49Z

Currently this test seems confusing to me then:

result = a | Series([])
expected = Series([True, True, True], list('bca'))
assert_series_equal(result,expected)

Shouldn't this be assert_series_equal(result, a) ?

jreback · 2013-09-27T15:11:20Z

nope. this is a boolean evaluation, saying basically, return me the 'truth' value element-wise with self 'or' other. The input series dtype is completely irrelevant; actually the only relevant characteristic of the input is the index.

In [3]: index = list('bca')

In [4]: Series([True,False,True],index=index) | Series([])
Out[4]: 
b    True
c    True
a    True
dtype: bool

In [5]: Series([1,2,3],index=index) | Series([])
Out[5]: 
b    True
c    True
a    True
dtype: bool

In [6]: Series([np.nan,np.nan,np.nan],index=index) | Series([])
Out[6]: 
b    True
c    True
a    True
dtype: bool

jreback · 2013-09-27T15:11:44Z

fyi...just found a bug when doing this with a scalar on the rhs

jreback · 2013-09-27T15:48:41Z

These were blowing up before

In [1]: Series([True,False,True]) | np.nan
Out[1]: 
0    True
1    True
2    True
dtype: bool

In [2]: Series([True,False,True]) & np.nan
Out[2]: 
0    False
1    False
2    False
dtype: bool

These are really not 'valid' comparisons as they are effectively bitwise but numpy just gives an answer
should we do something about it?

In [3]: Series([True,False,True]) & 1
Out[3]: 
0     True
1    False
2     True
dtype: bool

In [4]: Series([True,False,True]) & 2
Out[4]: 
0    False
1    False
2    False
dtype: bool

In [5]: Series([True,False,True]) & 3
Out[5]: 
0     True
1    False
2     True
dtype: bool
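
The three integer results above follow from plain bitwise arithmetic, since bool is a subclass of int in Python. A short illustration (the helper name `bitwise_and_scalar` is made up for this sketch):

```python
values = [True, False, True]

def bitwise_and_scalar(values, scalar):
    """Bitwise `&` of each element with an integer scalar, cast back to bool."""
    # True behaves as 1, so True & 2 is 1 & 2 == 0, while True & 3 is 1 & 3 == 1.
    return [bool(v & scalar) for v in values]

and_1 = bitwise_and_scalar(values, 1)
and_2 = bitwise_and_scalar(values, 2)
and_3 = bitwise_and_scalar(values, 3)
```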

hayd · 2013-09-27T16:16:44Z

Ok, I think the thing I'm upset with is that NaN should be overridden to be falsey before applying this, otherwise the results don't make sense (I don't see when you would ever use "or" to mean "or, or NaN"). This logic is overridden at other times e.g. when masking with NaNs, and I think it should be here too.

As it stands, I disagree with... :(. I think we need to fillna left and right before op.

To me the above seem valid (imo & is a relational operator just like ==... I get the feeling we are thinking about this differently...). The commutative/alignment thing raises its head again here with 1 & s... (but I guess we can't control that).

jreback · 2013-09-27T18:30:03Z

@hayd you are talking about the scalar rhs (which I have to admit is an odd thing to do in any event), or the Series that has nans?

filling is very difficult as what do you fill with?

hayd · 2013-09-27T21:14:42Z

I still don't see why this wouldn't work ? (We special case the NaN)

new_self = self.fillna(False)
new_other = other.reindex(index).fillna(False)
return self._constructor(na_op(new_self.values, new_other.values),
                         index=index, name=name)

jreback · 2013-09-30T23:22:09Z

why are you filling with False? (that's ONLY ok for a boolean series, but self and other at this point could be anything). I mean, I CAN fill with a non-true value but I think it's much easier to mask each array and then define what to do (at that point all you have to do is solve these cases)

NaN op NaN
value op NaN

so then easy to define what to do.
which would be False always? (for both the above cases)?
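
The masking approach proposed here could look roughly like the following pure-Python sketch, assuming the answer to the question is "False always" for both cases. `masked_logical` is a hypothetical name and lists stand in for the underlying arrays.

```python
import math
import operator

def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)

def masked_logical(xs, ys, op):
    """Apply op only where both values are valid; any NaN yields False."""
    result = []
    for x, y in zip(xs, ys):
        if _is_nan(x) or _is_nan(y):
            result.append(False)          # covers NaN op NaN and value op NaN
        else:
            result.append(bool(op(bool(x), bool(y))))
    return result

nan = float('nan')
left = [True, False, nan, nan]
right = [nan, True, True, nan]

or_result = masked_logical(left, right, operator.or_)
and_result = masked_logical(left, right, operator.and_)
```

Only the second pair is fully valid here, so it is the only position where the op itself decides the result.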

hayd · 2013-09-30T23:59:59Z

Maybe I'm confused, my thinking was that this result is bool so why does it matter that self/other could be anything (non-bool)?

jreback · 2013-10-01T00:03:30Z

because you are comparing self and other for | or the operation &

I guess you DO have a good point, what is the expected behavior for non-bool dtype series that are passed in?

I guess I have been assuming you can pass arbitrary stuff.....maybe just filling NaN with False is enough then....
(because it IS conceivable that you would have an object series that is bool if you fill it)

ok..let me try what you suggested above

jtratner · 2013-10-01T00:06:04Z

I'd expect it to refill with nan with a mask, right?

jtratner · 2013-10-01T00:09:42Z

Nvm, that's not existing behavior on anything else. Whoops.

jreback · 2013-10-01T00:38:32Z

@hayd @jtratner ok..updated the top of the PR, much nicer now

any other cases?

hayd · 2013-10-01T03:54:58Z

pandas/tests/test_frame.py

@@ -4523,8 +4523,10 @@ def f():
    def test_logical_with_nas(self):
        d = DataFrame({'a': [np.nan, False], 'b': [True, True]})

+        # GH4947
+        # bool comparisons should return bool


Maybe I've been confused by this statement (that bool comparisons always return bool?) This isn't #4947.... :s

hayd · 2013-10-01T03:55:54Z

I think this looks good.

cpcloud mentioned this pull request Sep 25, 2013

BUG: allow Timestamp comparisons on the left #4983

Merged

hayd reviewed Oct 1, 2013

jreback added 2 commits October 1, 2013 09:13

BUG: Make sure series-series boolean comparions are label based (GH4947)

fb2bb58

ENH: Series lhs, scalar rhs bool comparison support

0de0459

jreback added a commit that referenced this pull request Oct 1, 2013

Merge pull request #4953 from jreback/bool_intersect

a653af4

BUG: Make sure series-series boolean comparions are label based (GH4947)

jreback merged commit a653af4 into pandas-dev:master Oct 1, 2013

hayd mentioned this pull request Mar 3, 2014

NaN values impact binary or operations asymmetrically #6528

Closed
