Commit b8a3ba3

CLN: correct in and not in
Also added tests for NaN with 'in' and 'not in', and disallowed ops like pd.eval('1 or 2'), since those should be performed in regular Python
1 parent 0469fe4 commit b8a3ba3

6 files changed, +234 -102 lines changed

doc/source/enhancingperf.rst

+19-5
@@ -384,6 +384,14 @@ Now let's do the same thing but with comparisons:
 
    %timeit pd.eval('df1 + df2 + df3 + df4 + s')
 
+.. note::
+
+   Operations such as ``1 and 2`` should be performed in Python. An exception
+   will be raised if you try to performed any boolean or bitwise operations
+   with scalar operands that are not of type ``bool`` or ``np.bool_``. *This
+   includes bitwise operations on scalars.* You should perform these kinds of
+   operations in Python.
+
 The ``DataFrame.eval`` method
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -393,7 +401,7 @@ evaluate an expression in the "context" of a ``DataFrame``.
 
 .. ipython:: python
 
-   df = DataFrame(randn(10, 2), columns=['a', 'b'])
+   df = DataFrame(randn(5, 2), columns=['a', 'b'])
    df.eval('a + b')
 
 
@@ -410,7 +418,7 @@ You can refer to local variables the same way you would in vanilla Python
 
 .. ipython:: python
 
-   df = DataFrame(randn(10, 2), columns=['a', 'b'])
+   df = DataFrame(randn(5, 2), columns=['a', 'b'])
    newcol = randn(len(df))
    df.eval('b + newcol')
 
@@ -419,16 +427,22 @@ You can refer to local variables the same way you would in vanilla Python
 The one exception is when you have a local (or global) with the same name as
 a column in the ``DataFrame``
 
-.. ipython:: python
-   :okexcept:
+.. code-block:: python
 
-   df = DataFrame(randn(10, 2), columns=['a', 'b'])
+   df = DataFrame(randn(5, 2), columns=['a', 'b'])
    a = randn(len(df))
    df.eval('a + b')
+   NameResolutionError: resolvers and locals overlap on names ['a']
+
 
 To deal with these conflicts, a special syntax exists for referring
 variables with the same name as a column
 
+.. ipython:: python
+   :suppress:
+
+   a = randn(len(df))
+
 .. ipython:: python
 
    df.eval('@a + b')
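
The note added to enhancingperf.rst says that operations such as ``1 and 2`` should be performed in Python. One plausible reason, inferred from ``_bool_op_map`` in pandas/computation/ops.py (which maps ``and``/``or`` to ``&``/``|``) and not stated in the commit itself, is that the two spellings give different answers for non-boolean scalars. A minimal plain-Python sketch of that difference:

```python
# Plain Python: `and` / `or` short-circuit and return one of the operands.
python_and = 1 and 2   # evaluates to 2
python_or = 1 or 2     # evaluates to 1

# Bitwise operators (what `and` / `or` would become after the
# substitution in _bool_op_map) combine the bits of both operands.
bitwise_and = 1 & 2    # 0b01 & 0b10 == 0
bitwise_or = 1 | 2     # 0b01 | 0b10 == 3

print(python_and, bitwise_and)
print(python_or, bitwise_or)
```

With ``bool`` operands the two forms agree, which is consistent with the note allowing scalars of type ``bool`` or ``np.bool_``.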

doc/source/indexing.rst

+12-17
@@ -1014,8 +1014,7 @@ The :meth:`~pandas.DataFrame.query` Method
 .. versionadded:: 0.13
 
 :class:`~pandas.DataFrame` objects have a :meth:`~pandas.DataFrame.query`
-method that allows selection using a string consisting of columns of the
-calling :class:`~pandas.DataFrame`.
+method that allows selection using a boolean expression.
 
 You can get the value of the frame where column ``b`` has values
 between the values of columns ``a`` and ``c``.
@@ -1027,7 +1026,7 @@ between the values of columns ``a`` and ``c``.
 
 .. ipython:: python
 
-   n = 20
+   n = 10
    df = DataFrame(rand(n, 3), columns=list('abc'))
    df
    df[(df.a < df.b) & (df.b < df.c)]
@@ -1038,7 +1037,7 @@ with the name ``a``.
 
 .. ipython:: python
 
-   df = DataFrame(randint(n, size=(n, 2)), columns=list('bc'))
+   df = DataFrame(randint(n / 2, size=(n, 2)), columns=list('bc'))
    df.index.name = 'a'
    df
    df.query('a < b and b < c')
@@ -1075,13 +1074,14 @@ You can also use the levels of a ``DataFrame`` with a
 
    import pandas.util.testing as tm
 
-   colors = tm.choice(['red', 'green'], size=10)
-   foods = tm.choice(['eggs', 'ham'], size=10)
+   n = 10
+   colors = tm.choice(['red', 'green'], size=n)
+   foods = tm.choice(['eggs', 'ham'], size=n)
    colors
    foods
 
    index = MultiIndex.from_arrays([colors, foods], names=['color', 'food'])
-   df = DataFrame(randn(10, 2), index=index)
+   df = DataFrame(randn(n, 2), index=index)
    df
    df.query('color == "red"')
 
@@ -1091,8 +1091,7 @@ special names:
 
 .. ipython:: python
 
-   index.names = [None, None]
-   df = DataFrame(randn(10, 2), index=index)
+   df.index.names = [None, None]
    df
    df.query('ilevel_0 == "red"')
 
@@ -1111,9 +1110,9 @@ having to specify which frame you're interested in querying
 
 .. ipython:: python
 
-   df = DataFrame(randint(n, size=(n, 2)), columns=list('bc'))
+   df = DataFrame(randint(n / 2, size=(n, 2)), columns=list('bc'))
    df.index.name = 'a'
-   df2 = DataFrame(randint(n + 10, size=(n + 10, 3)), columns=list('abc'))
+   df2 = DataFrame(randint(n + 5, size=(n + 5, 3)), columns=list('abc'))
    df2
    expr = 'a < b & b < c'
    map(lambda frame: frame.query(expr), [df, df2])
@@ -1141,7 +1140,7 @@ Full numpy-like syntax
 
 .. ipython:: python
 
-   df = DataFrame(randint(n, size=(n, 3)), columns=list('abc'))
+   df = DataFrame(randint(n / 2, size=(n, 3)), columns=list('abc'))
    df
    df['(a < b) & (b < c)']
    df[(df.a < df.b) & (df.b < df.c)]
@@ -1164,10 +1163,6 @@ Pretty close to how you might write it on paper
 
    df['a < b < c']
 
-As you can see, these are all equivalent ways to express the same operation (in
-fact, they are all ultimately parsed into something very similar to the first
-example of the indexing syntax above).
-
 The ``in`` and ``not in`` operators
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -1184,7 +1179,7 @@ The ``in`` and ``not in`` operators
 .. ipython:: python
 
    # get all rows where columns "a" and "b" have overlapping values
-   df = DataFrame({'a': list('aaaabbbbcccc'), 'b': list('aabbccddeeff'),
+   df = DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
                    'c': randint(5, size=12), 'd': randint(9, size=12)})
    df
    df['a in b']
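
To make the operand order in ``'a in b'`` concrete, here is a rough plain-Python approximation using the column values from the updated example. It assumes the expression behaves like an element-wise ``a.isin(b)``, as the reordered call in ``_in`` (pandas/computation/ops.py) suggests. Plain lists stand in for the DataFrame columns, and the names ``a_in_b`` and ``b_in_a`` are purely illustrative.

```python
# Column values taken from the updated indexing.rst example.
a = list('aabbccddeeff')
b = list('aaaabbbbcccc')

# "a in b": one boolean per row of column "a", True where that value occurs in "b".
b_values = set(b)
a_in_b = [value in b_values for value in a]

# Reversing the operands answers a different question:
# which rows of "b" hold a value that occurs in "a".
a_values = set(a)
b_in_a = [value in a_values for value in b]

print(a_in_b)
print(b_in_a)
```

The two masks differ (the first selects only the rows holding ``a``, ``b`` or ``c``, the second selects every row), so the direction of the ``isin`` call is not interchangeable.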

pandas/computation/expr.py

+8-7
@@ -461,10 +461,10 @@ def _rewrite_membership_op(self, node, left, right):
                 name = self.env.add_tmp([right.value])
                 right = self.term_type(name, self.env)
 
-            # swap the operands so things like a == [1, 2] are translated to
-            # [1, 2] in a -> a.isin([1, 2])
-            if right_list or right_str:
-                left, right = right, left
+            if left_str:
+                self.env.remove_tmp(left.name)
+                name = self.env.add_tmp([left.value])
+                left = self.term_type(name, self.env)
 
         op = self.visit(op_instance)
         return op, op_instance, left, right
@@ -662,13 +662,14 @@ def visitor(x, y):
         return reduce(visitor, operands)
 
 
-_python_not_supported = frozenset(['Assign', 'Tuple', 'Dict', 'Call',
-                                   'BoolOp', 'In', 'NotIn'])
+_python_not_supported = frozenset(['Assign', 'Dict', 'Call', 'BoolOp',
+                                   'In', 'NotIn'])
 _numexpr_supported_calls = frozenset(_reductions + _mathops)
 
 
 @disallow((_unsupported_nodes | _python_not_supported) -
-          (_boolop_nodes | frozenset(['BoolOp', 'Attribute', 'In', 'NotIn'])))
+          (_boolop_nodes | frozenset(['BoolOp', 'Attribute', 'In', 'NotIn',
+                                      'Tuple'])))
 class PandasExprVisitor(BaseExprVisitor):
     def __init__(self, env, engine, parser,
                  preparser=lambda x: _replace_locals(_replace_booleans(x))):

pandas/computation/ops.py

+36-8
@@ -193,6 +193,13 @@ def name(self):
     def name(self, new_name):
         self._name = new_name
 
+    @property
+    def ndim(self):
+        try:
+            return self._value.ndim
+        except AttributeError:
+            return 0
+
 
 class Constant(Term):
     def __init__(self, value, env, side=None, encoding=None):
@@ -207,6 +214,7 @@ def name(self):
         return self.value
 
 
+
 _bool_op_map = {'not': '~', 'and': '&', 'or': '|'}
 
 
@@ -236,29 +244,39 @@ def return_type(self):
             return np.bool_
         return np.result_type(*(term.type for term in com.flatten(self)))
 
+    @property
+    def isscalar(self):
+        return all(operand.isscalar for operand in self.operands)
+
 
 def _in(x, y):
     """Compute the vectorized membership of ``x in y`` if possible, otherwise
     use Python.
     """
     try:
-        return y.isin(x)
+        return x.isin(y)
     except AttributeError:
+        if com.is_list_like(x):
+            try:
+                return y.isin(x)
+            except AttributeError:
+                pass
         return x in y
-    except TypeError:
-        return y.isin([x])
 
 
 def _not_in(x, y):
     """Compute the vectorized membership of ``x not in y`` if possible,
     otherwise use Python.
     """
     try:
-        return ~y.isin(x)
+        return ~x.isin(y)
     except AttributeError:
+        if com.is_list_like(x):
+            try:
+                return ~y.isin(x)
+            except AttributeError:
+                pass
         return x not in y
-    except TypeError:
-        return ~y.isin([x])
 
 
 _cmp_ops_syms = '>', '<', '>=', '<=', '==', '!=', 'in', 'not in'
@@ -322,14 +340,17 @@ def __init__(self, op, lhs, rhs, **kwargs):
         self.lhs = lhs
         self.rhs = rhs
 
+        self._disallow_scalar_only_bool_ops()
+
         self.convert_values()
 
         try:
             self.func = _binary_ops_dict[op]
         except KeyError:
-            keys = _binary_ops_dict.keys()
+            # has to be made a list for python3
+            keys = list(_binary_ops_dict.keys())
             raise ValueError('Invalid binary operator {0!r}, valid'
-                             ' operators are {1}'.format(op, keys))
+                             ' operators are {1}'.format(op, keys))
 
     def __call__(self, env):
         """Recursively evaluate an expression in Python space.
@@ -425,6 +446,13 @@ def stringify(value):
                 v = v.tz_convert('UTC')
             self.lhs.update(v)
 
+    def _disallow_scalar_only_bool_ops(self):
+        if ((self.lhs.isscalar or self.rhs.isscalar) and
+            self.op in _bool_ops_dict and
+            (not (issubclass(self.rhs.return_type, (bool, np.bool_)) and
+                  issubclass(self.lhs.return_type, (bool, np.bool_))))):
+            raise NotImplementedError("cannot evaluate scalar only bool ops")
+
 
 class Div(BinOp):
     """Div operator to special case casting.
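
The fallback order in the revised ``_in`` can be tried outside pandas with a small stand-alone sketch. Everything below is illustrative and not part of the commit: ``Column`` is a hypothetical stand-in that only models an ``isin`` method, ``is_list_like`` is a simplified substitute for ``com.is_list_like``, and ``membership`` mirrors the control flow of ``_in`` shown in the diff.

```python
class Column:
    """Hypothetical stand-in for a Series-like object; only ``isin`` is modelled."""

    def __init__(self, values):
        self.values = list(values)

    def isin(self, other):
        # Element-wise membership, one boolean per stored value.
        return [value in other for value in self.values]


def is_list_like(obj):
    # Simplified assumption: any non-string iterable counts as list-like.
    return hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes))


def membership(x, y):
    """Mirror the control flow of the revised ``_in``."""
    try:
        # Preferred path: the left operand is vectorized.
        return x.isin(y)
    except AttributeError:
        if is_list_like(x):
            try:
                # Left operand is a plain list; let the right operand vectorize.
                return y.isin(x)
            except AttributeError:
                pass
        # Neither side is vectorized: fall back to Python's ``in``.
        return x in y


print(membership(Column([1, 2, 3]), [1, 2]))  # vectorized left operand
print(membership([1, 2], Column([1, 2, 3])))  # list on the left, vectorized right
print(membership(1, [1, 2]))                  # plain Python membership
```

Trying the left operand first keeps ``x in y`` reading in the same direction as the Python operator, and the swap only happens when the left side is a plain list-like with no ``isin`` of its own.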
