Skip to content

BUG: Make sure series-series boolean comparions are label based (GH4947) #4953

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 1, 2013

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Sep 23, 2013

closes #4947

You ask for a boolean comparsion, that is what you get

In [1]: a = Series([True, False, True], list('bca'))

In [2]: b = Series([False, True, False], list('abc'))

In [19]: a
Out[19]: 
b     True
c    False
a     True
dtype: bool

In [20]: b
Out[20]: 
a    False
b     True
c    False
dtype: bool

In [21]: a & b
Out[21]: 
b     True
c    False
a    False
dtype: bool

In [22]: a | b
Out[22]: 
b     True
c    False
a     True
dtype: bool

In [23]: a ^ b
Out[23]: 
b    False
c    False
a     True
dtype: bool

In [24]: a & Series([])
Out[24]: 
b    False
c    False
a    False
dtype: bool

In [25]: a | Series([])
Out[25]: 
b     True
c    False
a     True
dtype: bool

these ok?

In [9]: Series([np.nan,np.nan])&Series([np.nan,np.nan])
Out[9]: 
0    False
1    False
dtype: bool

In [10]: Series([np.nan,np.nan])|Series([np.nan,np.nan])
Out[10]: 
0    False
1    False
dtype: bool

scalars


In [30]: a | True
Out[30]: 
b    True
c    True
a    True
dtype: bool

In [31]: a | False
Out[31]: 
b     True
c    False
a     True
dtype: bool

In [32]: a & True
Out[32]: 
b     True
c    False
a     True
dtype: bool

In [33]: a & False
Out[33]: 
b    False
c    False
a    False
dtype: bool

Invalid scalars

In [28]: a | 'foo'
TypeError: cannot compare a dtyped [bool] array with a scalar of type [bool]

In [29]: a | np.nan
TypeError: cannot compare a dtyped [bool] array with a scalar of type [float]

@jreback
Copy link
Contributor Author

jreback commented Sep 23, 2013

@jtratner I think you might just want to incorporate this in your ops changes (or I can merge directly first), u lmk

@hayd, @jtratner what do you think about non-matching cases. I am basically only matching on the lhs labels,
so that's what you get back. If their are non-matches, then should prob fill with False I think? (so you always are returned a boolean array)

@jtratner
Copy link
Contributor

I'd prefer you merge this first (if you can do it tonight/tomorrow). I need to restart the arithmetic refactor soon, otherwise it's going to cover waay too much, but it'd be helpful to have this change in already with tests, so I don't break it or abstract too much.

@jreback
Copy link
Contributor Author

jreback commented Sep 23, 2013

do u think I should be filling these with False
or True if its a not operation ? (to be consistent and alway return a bool type)?

@jreback
Copy link
Contributor Author

jreback commented Sep 23, 2013

@jtratner @hayd pls look at the top of the PR for op results thxs

@hayd
Copy link
Contributor

hayd commented Sep 24, 2013

Not sure about:

In [8]: a | Series([])
Out[8]: 
b    True
c    True
a    True
dtype: bool

shouldn't this be a ? Or is this a nan being truthy thing... :s

@hayd
Copy link
Contributor

hayd commented Sep 24, 2013

Realised you were saying exactly the same above: I also think this should be filled with False.

@jreback
Copy link
Contributor Author

jreback commented Sep 24, 2013

updated...default is now False even for |

@jreback
Copy link
Contributor Author

jreback commented Sep 24, 2013

@hayd @jtratner all ok with this now?

@hayd
Copy link
Contributor

hayd commented Sep 24, 2013

Not quite, I think a | Series() should equal a... ?

Cripes, it's pretty tricky to be consistent here cos of bool(nan)...

@jreback
Copy link
Contributor Author

jreback commented Sep 24, 2013

This is why I think default for | should be True

currently (default is False)

In [1]: a = Series([True, False, True], list('bca'))

In [2]: a | Series([])
Out[2]: 
b    False
c    False
a    False
dtype: bool

and this is odd

In [3]: a[a | Series([])]
Out[3]: Series([], dtype: bool)

THIS should return a (and will if | defaults nan to True)

@jreback
Copy link
Contributor Author

jreback commented Sep 25, 2013

@hayd ?

@hayd
Copy link
Contributor

hayd commented Sep 25, 2013

meh... so two things:

  1. These operations should commute, atm: (a & e) != (e & a)

  2. I think the fillna should happen before the constructor with False (currently it's too coarse), so something like:

        index = self.index | other.index
        new_self = self.reindex(index).fillna(False).astype(bool)
        new_other = other.reindex(index).fillna(False).astype(bool)
        return self._constructor(na_op(new_self.values, new_other.values),
                                 index=index, name=name)
    

I suspect this is too much overhead (too slow?).... :S but that would give what I would expect.

In [2]: a = pd.Series([True, False, True], list('bca'))

In [3]: e = pd.Series()

In [4]: e & a
Out[4]: 
b    False
c    False
a    False
dtype: bool

In [5]: e | a
Out[5]: 
b     True
c    False
a     True
dtype: bool

I guess I should just accept that NaN is True. (...never!!)

@jreback
Copy link
Contributor Author

jreback commented Sep 25, 2013

@hayd these are not comparing bools (except maybe in a special case). They are comparing values (possilby) nan, so the issue is that a nan compare with another nan gives a nan (I think in all cases). which you then should fill

@hayd
Copy link
Contributor

hayd commented Sep 25, 2013

@jreback but then afterwards it's being forced to bool rather than values? (I think that confuses things...!)

@jreback
Copy link
Contributor Author

jreback commented Sep 25, 2013

no....its a comparison operation between 2 objects, you should get a bool dtype

@hayd
Copy link
Contributor

hayd commented Sep 25, 2013

What do you think about the reindex thing?

I think we agree that when both are NaN it should be False, however the weird behaviour imo is when only one is NaN - I think then or should be the non-NaN truey value i.e. True (and not what we have above), whilst and should be False (as we already have).

This is also a bit sketch as True | np.nan == True whist NaN | True == NaN, when we fillna (and I think this may be crux of the issue - grrr).

Should it be commute? I think yes (because we're talking bool):

'a' | 'b' # doesn't commute
`bool('a'|'b') # does

Apologies if I'm not making any sense here...

@jreback
Copy link
Contributor Author

jreback commented Sep 25, 2013

I originally was fillna for & with False, logic being that they both would have to be valid if you don't want a nan
while True with | as only 1 needs to be valid (ignore both are nan for a second``), then these give

True | np.nan == True and NaN | True == True and True & np.nan == False and np.nan == False

'special' case of: np.nan & np.nan == False and np.nan | np.nan == False

@jreback
Copy link
Contributor Author

jreback commented Sep 26, 2013

@hayd what do you think? default 'or' to True? (as I show above)

@hayd
Copy link
Contributor

hayd commented Sep 26, 2013

@jreback Does that mean that in psuedo: (e | a) == a == (a | e) ?

What do you think about the index thing (should we | the indexes result)?

@hayd
Copy link
Contributor

hayd commented Sep 26, 2013

To separate the issues here:

  1. For any Series a and empty Series e, does (a | e) == a ?
  2. For a and b Series, atm the index of (a | b) is a.index, should it be (a.index | b.index) ?

I think the rest of the logic follows neatly from these...

@jreback
Copy link
Contributor Author

jreback commented Sep 27, 2013

@hayd

I think the answer is yes to both. as I think the identity s[s|e] == s should be true for all e (whether empty or not)

so for 2) that would essentially be an align operation (and not what you are asking for here)
for 1) same thing

@hayd
Copy link
Contributor

hayd commented Sep 27, 2013

  1. Currently this test seems confusing to me then:

    result = a | Series([])
    expected = Series([True, True, True], list('bca'))
    assert_series_equal(result,expected) 
    

Shouldn't this be assert_series_equal(result, a) ?

@jreback
Copy link
Contributor Author

jreback commented Sep 27, 2013

nope. this is a boolean evaluation, say basically, return me the 'truth' value element-wise with self 'or' other. The input series dtype is completely irrelevant; actually the only relevant characteristic of the input is the index.

In [3]: index = list('bca')

In [4]: Series([True,False,True],index=index) | Series([])
Out[4]: 
b    True
c    True
a    True
dtype: bool

In [5]: Series([1,2,3],index=index) | Series([])
Out[5]: 
b    True
c    True
a    True
dtype: bool

In [6]: Series([np.nan,np.nan,np.nan],index=index) | Series([])
Out[6]: 
b    True
c    True
a    True
dtype: bool

@jreback
Copy link
Contributor Author

jreback commented Sep 27, 2013

fyi...just found a bug when doing this with a scalar on the rhs

@jreback
Copy link
Contributor Author

jreback commented Sep 27, 2013

These were blowing up before

In [1]: Series([True,False,True]) | np.nan
Out[1]: 
0    True
1    True
2    True
dtype: bool

In [2]: Series([True,False,True]) & np.nan
Out[2]: 
0    False
1    False
2    False
dtype: bool

These are really not 'valid' comparsions as they are effectiviely bitwise but numpy just gives an answer
should we do something about it?

In [3]: Series([True,False,True]) & 1
Out[3]: 
0     True
1    False
2     True
dtype: bool

In [4]: Series([True,False,True]) & 2
Out[4]: 
0    False
1    False
2    False
dtype: bool

In [5]: Series([True,False,True]) & 3
Out[5]: 
0     True
1    False
2     True
dtype: bool

@hayd
Copy link
Contributor

hayd commented Sep 27, 2013

Ok, I think the thing I'm upset with is that NaN should be overridden to be falsey before applying this, otherwise the results don't make sense (I don't see when you would ever use "or" to mean "or, or NaN"). This logic is overridden at other times e.g. when masking with NaNs, and I think it should be here too.

As it stands, I disagree with... :(. I think we need to fillna left and right before op.

To me the above seem valid (imo & is a relational operator just like ==... I get the feeling we are thinking about this differently...). The commutative/alignment thing raises it's head again here with 1 & s... (but I guess we can't control that).

@jreback
Copy link
Contributor Author

jreback commented Sep 27, 2013

@hayd you are talking about the scalar rhs (which have to admit is an odd thing to do in any event). or the Series that has nans?

filling is very difficult as what do you fill with?

@hayd
Copy link
Contributor

hayd commented Sep 27, 2013

I still don't see why this wouldn't work ? (We special case the NaN)

new_self = self.fillna(False)
new_other = other.reindex(index).fillna(False)
return self._constructor(na_op(new_self.values, new_other.values),
                         index=index, name=name)

@jreback
Copy link
Contributor Author

jreback commented Sep 30, 2013

why are you filling with False? (that's ONLY ok for a boolean series, but self and other at this point could be anything). I mean, I CAN fill with a non-true value but I think much easier to mask each arrray and then define what to do ( at that point all you have to do is solve these cases)

NaN op NaN
value op NaN

so then easy to define what to do.
which would be False always? (for both the above cases)?

@hayd
Copy link
Contributor

hayd commented Sep 30, 2013

Maybe I'm confused, my thinking was that this result is bool so why does it matter that self/other could be anything (non-bool)?

@jreback
Copy link
Contributor Author

jreback commented Oct 1, 2013

because you are comparing self and other for | or the operation &

I guess you DO have a good point, what is the expected behavior for non-bool dtype series that are passed in?

I guess I have been assuming you can pass arbitrary stuff.....maybe just filling NaN with False is enough then....
(because it IS conceivable that you would have an object series that is bool if you fill it)

ok..let me try what you suggested above

@jtratner
Copy link
Contributor

jtratner commented Oct 1, 2013

I'd expect it to refill with nan with a mask, right?

@jtratner
Copy link
Contributor

jtratner commented Oct 1, 2013

Nvm, that's not existing behavior on anything else. Whoops.

@jreback
Copy link
Contributor Author

jreback commented Oct 1, 2013

@hayd @jtratner ok..updated the top of the PR, much nicer now

any other cases?

@@ -4523,8 +4523,10 @@ def f():
def test_logical_with_nas(self):
d = DataFrame({'a': [np.nan, False], 'b': [True, True]})

# GH4947
# bool comparisons should return bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I've been confused by this statement (that bool comparisons always return bool?) This isn't #4947.... :s

@hayd
Copy link
Contributor

hayd commented Oct 1, 2013

I think this looks good.

jreback added a commit that referenced this pull request Oct 1, 2013
BUG: Make sure series-series boolean comparions are label based (GH4947)
@jreback jreback merged commit a653af4 into pandas-dev:master Oct 1, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Series boolean intersection not label based
3 participants