-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Comparisons with non-comparable objects in DataFrame return True #4537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
what do we so for the reverse case? |
@jreback this is easily fixed actually, fi you agree with this, I have a PR brewing. |
@jreback it actually fails pretty spectacularly with integers: >>> df > 2
Traceback
...
AttributeError: 'int' object has no attribute 'view' |
more weird behavior: In [3]: df = DataFrame(dict(A = date_range('20130102',periods=5),
B = date_range('20130104',periods=5),
C = np.random.randn(5)))
In [4]: df
Out[4]:
A B C
0 2013-01-02 00:00:00 2013-01-04 00:00:00 1.029470
1 2013-01-03 00:00:00 2013-01-05 00:00:00 -0.589214
2 2013-01-04 00:00:00 2013-01-06 00:00:00 -1.979968
3 2013-01-05 00:00:00 2013-01-07 00:00:00 0.196116
4 2013-01-06 00:00:00 2013-01-08 00:00:00 -0.190880
In [5]: df > np.array([2])
Out[5]:
A B C
0 True True False
1 True True False
2 True True False
3 True True False
4 True True False
In [6]: df == np.array([2])
Out[6]:
A B C
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
In [7]: df != np.array([2])
Out[7]:
A B C
0 True True True
1 True True True
2 True True True
3 True True True
4 True True True
In [8]: df < np.array([2])
Out[8]:
A B C
0 False False True
1 False False True
2 False False True
3 False False True
4 False False True |
apparently this is not in the language spec and changes behavior in Python 3 as well - http://stackoverflow.com/questions/3270680/how-does-python-compare-string-and-int |
in your example though they are ultimately comparing integers since that's how |
@cpcloud ah, good to know. |
@cpcloud but example mostly stays the same for things like string dataframes compared to integers, etc... it's kind of an edge case that shouldn't be relied upon. |
i think we should leave comparison (with datetimes) as is...though i agree that's it's weird to compare a what were you thinking of doing instead of just letting python's/numpy's comparison behavior do its thing? if two types are not equal then their values are not equal? maybe just for object arrays? seems hard since pandas is really just sort of "numeric", "object", and "datetimes" but lets you retain dtypes if you want. so comparing int to float should be valid (like it is now), but comparing object to not object should never return |
@cpcloud honestly, I think we should just leave it, but have its fill value in comparisons be |
@cpcloud or maybe no change at all. Just weird that the fill value is Here's what it is now: def _comp_method(func, name, str_rep):
@Appender('Wrapper for comparison method %s' % name)
def f(self, other):
if isinstance(other, DataFrame): # Another DataFrame
return self._compare_frame(other, func, str_rep)
elif isinstance(other, Series):
return self._combine_series_infer(other, func)
else:
# straight boolean comparisions we want to allow all columns
# (regardless of dtype to pass thru)
return self._combine_const(other, func, raise_on_error = False).fillna(True).astype(bool)
f.__name__ = name
return f I think it should be: def _comp_method(func, name, str_rep):
@Appender('Wrapper for comparison method %s' % name)
def f(self, other):
if isinstance(other, DataFrame): # Another DataFrame
return self._compare_frame(other, func, str_rep)
elif isinstance(other, Series):
return self._combine_series_infer(other, func)
else:
if func == operator.ne:
masker = True
else:
masker = False
# straight boolean comparisions we want to allow all columns
# (regardless of dtype to pass thru)
return self._combine_const(other, func, raise_on_error = False).fillna(masker).astype(bool)
f.__name__ = name
return f Similar to how @jreback set it up in the Series refactor |
here's an idea generally numeric operations in a frame are done using the numeric columns only (using _get_numeric_data) I think for comparisons you could do something similar, only allow a compatible dtype comparator on the rhs (eg say its a datetime then exclude non-datetimes on the lhs), this prevents the integer/datetime comparison issues; booleans prob need some special consideration here (there is a _get_boolean_data); easy to create other of these so for frames this makes perfect sense |
@jreback would this still work: df = DataFrame({"A": [0, 1.2, np.nan], "B": [datetime.now(), datetime.now(), datetime.now()]})
df[ df > 0] |
this is why the comparisons return True by default (when the comparisons fail, eg fillna(True)), so it passes thru on the selection so will pass thru except where the A value = 0; I don't think this should change |
okay, then I guess this particularly thing should be left as-is? (since that's already the behavior that's happening). A little weird for this to happen In [1]: from pandas import *
In [2]: %paste
row = [datetime(1, 1, 1), datetime(2, 2, 2), datetime(3, 3, 3)]
df = DataFrame({"A": row, "B": row})
df > 3
## -- End pasted text --
Out[2]:
A B
0 True True
1 True True
2 True True |
but I guess you could chain the comparisons...which probably makes sense, like: (df > 3) & (df < datetime(2, 2, 3)) |
@jtratner I think we are not changing this, correct, close? |
I can't tell if this is expected or not (taken from part of
test_where_datetime
in tests/test_frame). The change in #3311 (which the test case came from) didn't specify what should happen with non-datetime comparisons. Right now they return True, which doesn't really make sense. They should probably return False instead.Column C probably should be False for every case.
The text was updated successfully, but these errors were encountered: