ENH: fix eval scoping issues #6366

Merged
merged 6 commits into from
Feb 20, 2014
158 changes: 96 additions & 62 deletions doc/source/enhancingperf.rst
@@ -300,7 +300,7 @@ Expression Evaluation via :func:`~pandas.eval` (Experimental)

.. versionadded:: 0.13

The top-level function :func:`~pandas.eval` implements expression evaluation of
The top-level function :func:`pandas.eval` implements expression evaluation of
:class:`~pandas.Series` and :class:`~pandas.DataFrame` objects.

.. note::
@@ -336,11 +336,11 @@ engine in addition to some extensions available only in pandas.
Supported Syntax
~~~~~~~~~~~~~~~~

These operations are supported by :func:`~pandas.eval`:
These operations are supported by :func:`pandas.eval`:

- Arithmetic operations except for the left shift (``<<``) and right shift
(``>>``) operators, e.g., ``df + 2 * pi / s ** 4 % 42 - the_golden_ratio``
- Comparison operations, e.g., ``2 < df < df2``
- Comparison operations, including chained comparisons, e.g., ``2 < df < df2``
- Boolean operations, e.g., ``df < df2 and df3 < df4 or not df_bool``
- ``list`` and ``tuple`` literals, e.g., ``[1, 2]`` or ``(1, 2)``
- Attribute access, e.g., ``df.a``
@@ -373,9 +373,9 @@ This Python syntax is **not** allowed:
:func:`~pandas.eval` Examples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:func:`~pandas.eval` works wonders for expressions containing large arrays
:func:`pandas.eval` works well with expressions containing large arrays

First let's create 4 decent-sized arrays to play with:
First let's create a few decent-sized arrays to play with:

.. ipython:: python

@@ -441,8 +441,10 @@ Now let's do the same thing but with comparisons:
The ``DataFrame.eval`` method (Experimental)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to the top level :func:`~pandas.eval` function you can also
evaluate an expression in the "context" of a ``DataFrame``.
.. versionadded:: 0.13

In addition to the top level :func:`pandas.eval` function you can also
evaluate an expression in the "context" of a :class:`~pandas.DataFrame`.

.. ipython:: python
:suppress:
@@ -462,10 +464,10 @@ evaluate an expression in the "context" of a ``DataFrame``.
df = DataFrame(randn(5, 2), columns=['a', 'b'])
df.eval('a + b')

Any expression that is a valid :func:`~pandas.eval` expression is also a valid
``DataFrame.eval`` expression, with the added benefit that *you don't have to
prefix the name of the* ``DataFrame`` *to the column(s) you're interested in
evaluating*.
Any expression that is a valid :func:`pandas.eval` expression is also a valid
:meth:`DataFrame.eval` expression, with the added benefit that you don't have to
prefix the name of the :class:`~pandas.DataFrame` to the column(s) you're
interested in evaluating.
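
As a rough illustration of the point above (a sketch that is not part of the PR; the frame and its values are made up for the example), the two spellings below should compute the same sum, with the method form dropping the ``df.`` prefix:

```python
import pandas as pd

# Illustrative data only; the column names mirror the docs' ``a`` and ``b``.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Method form: bare column names are resolved against ``df`` itself.
method_result = df.eval("a + b")

# Top-level form: the frame has to be named explicitly in the expression.
function_result = pd.eval("df.a + df.b")

print(method_result.tolist())
print(function_result.tolist())
```

Both calls go through the same expression machinery; the method form only adds the frame's columns as an extra place to look up names.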

In addition, you can perform assignment of columns within an expression.
This allows for *formulaic evaluation*. Only a single assignment is permitted.
@@ -480,55 +482,75 @@ it must be a valid Python identifier.
df.eval('a = 1')
df

The equivalent in standard Python would be

.. ipython:: python

df = DataFrame(dict(a=range(5), b=range(5, 10)))
df['c'] = df.a + df.b
df['d'] = df.a + df.b + df.c
df['a'] = 1
df

Local Variables
~~~~~~~~~~~~~~~

You can refer to local variables the same way you would in vanilla Python
In pandas version 0.14 the local variable API has changed. In pandas 0.13.x,
you could refer to local variables the same way you would in standard Python.
For example,

.. ipython:: python
.. code-block:: python

df = DataFrame(randn(5, 2), columns=['a', 'b'])
newcol = randn(len(df))
df.eval('b + newcol')

.. note::
UndefinedVariableError: name 'newcol' is not defined

The one exception is when you have a local (or global) with the same name as
a column in the ``DataFrame``
As you can see from the exception generated, this syntax is no longer allowed.
You must *explicitly reference* any local variable that you want to use in an
expression by placing the ``@`` character in front of the name. For example,

.. code-block:: python
.. ipython:: python

df = DataFrame(randn(5, 2), columns=['a', 'b'])
a = randn(len(df))
df.eval('a + b')
NameResolutionError: resolvers and locals overlap on names ['a']
df = DataFrame(randn(5, 2), columns=list('ab'))
newcol = randn(len(df))
df.eval('b + @newcol')
df.query('b < @newcol')

If you don't prefix the local variable with ``@``, pandas will raise an
exception telling you the variable is undefined.
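
A minimal sketch of both cases, using small made-up data rather than the random frames from the docs. It assumes the "undefined variable" error derives from ``NameError``, so the example catches that base class instead of importing a pandas-specific exception:

```python
import numpy as np
import pandas as pd

# Illustrative data; ``newcol`` is a local variable, not a column of ``df``.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
newcol = np.array([0.5, 1.5, 2.5])

# With the '@' prefix the local variable is looked up explicitly.
with_prefix = df.eval("b + @newcol")

# Without the prefix, ``newcol`` is only searched for among the columns.
try:
    df.eval("b + newcol")
    missing_prefix_error = None
except NameError as exc:  # assumption: the pandas error subclasses NameError
    missing_prefix_error = exc

print(with_prefix.tolist())
print(type(missing_prefix_error).__name__)
```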

To deal with these conflicts, a special syntax exists for referring
variables with the same name as a column
When using :meth:`DataFrame.eval` and :meth:`DataFrame.query`, this allows you
to have a local variable and a :class:`~pandas.DataFrame` column with the same
name in an expression.

.. ipython:: python
:suppress:

a = randn(len(df))
.. ipython:: python

.. ipython:: python
a = randn()
df.query('@a < a')
df.loc[a < df.a] # same as the previous expression

df.eval('@a + b')
With :func:`pandas.eval` you cannot use the ``@`` prefix *at all*, because it
isn't defined in that context. ``pandas`` will let you know this if you try to
use ``@`` in a top-level call to :func:`pandas.eval`. For example,

The same is true for :meth:`~pandas.DataFrame.query`
.. ipython:: python
:okexcept:

.. ipython:: python
a, b = 1, 2
pd.eval('@a + b')

df.query('@a < b')
In this case, you should simply refer to the variables like you would in
standard Python.

.. ipython:: python
:suppress:
.. ipython:: python

del a
pd.eval('a + b')
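
The contrast can be sketched as follows. This is an illustration only: the helper name ``top_level_demo`` is made up, and catching ``SyntaxError`` assumes that is the exception type pandas uses when it rejects the ``@`` prefix in a top-level call:

```python
import pandas as pd


def top_level_demo():
    """Return the rejection error name and the result of the plain expression."""
    a, b = 1, 2
    try:
        pd.eval("@a + b")  # '@' is not accepted by the top-level function
    except SyntaxError as exc:  # assumption: pandas raises SyntaxError here
        rejected = type(exc).__name__
    else:
        rejected = None
    # Plain names are resolved from the calling scope, as in standard Python.
    plain = pd.eval("a + b")
    return rejected, plain


rejected, plain = top_level_demo()
print(rejected, plain)
```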


:func:`~pandas.eval` Parsers
:func:`pandas.eval` Parsers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are two different parsers and two different engines you can use as
@@ -568,7 +590,7 @@ The ``and`` and ``or`` operators here have the same precedence that they would
in vanilla Python.


:func:`~pandas.eval` Backends
:func:`pandas.eval` Backends
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There's also the option to make :func:`~pandas.eval` operate identical to plain
@@ -577,12 +599,12 @@ ol' Python.
.. note::

Using the ``'python'`` engine is generally *not* useful, except for testing
other :func:`~pandas.eval` engines against it. You will achieve **no**
performance benefits using :func:`~pandas.eval` with ``engine='python'``.
other evaluation engines against it. You will achieve **no** performance
benefits using :func:`~pandas.eval` with ``engine='python'`` and in fact may
incur a performance hit.

You can see this by using :func:`~pandas.eval` with the ``'python'`` engine is
actually a bit slower (not by much) than evaluating the same expression in
Python:
You can see this by using :func:`pandas.eval` with the ``'python'`` engine. It
is a bit slower (not by much) than evaluating the same expression in Python

.. ipython:: python

@@ -593,15 +615,15 @@ Python:
%timeit pd.eval('df1 + df2 + df3 + df4', engine='python')
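
Timing aside, the engines are meant to agree on the result. A small sketch of that check, with made-up frames that are much smaller than the ones timed above; the default engine depends on whether ``numexpr`` is installed:

```python
import numpy as np
import pandas as pd

# Small illustrative frames; the timing examples use far larger ones.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((1000, 4)))
df2 = pd.DataFrame(rng.standard_normal((1000, 4)))

plain = df1 + df2                                       # ordinary Python
via_python_engine = pd.eval("df1 + df2", engine="python")
via_default_engine = pd.eval("df1 + df2")               # numexpr if available

print(np.allclose(plain, via_python_engine))
print(np.allclose(plain, via_default_engine))
```

Because the results match, ``engine='python'`` is mainly useful as a reference to test other engines against, which is the point the note makes.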


:func:`~pandas.eval` Performance
:func:`pandas.eval` Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:func:`~pandas.eval` is intended to speed up certain kinds of operations. In
particular, those operations involving complex expressions with large
``DataFrame``/``Series`` objects should see a significant performance benefit.
Here is a plot showing the running time of :func:`~pandas.eval` as a function of
the size of the frame involved in the computation. The two lines are two
different engines.
:class:`~pandas.DataFrame`/:class:`~pandas.Series` objects should see a
significant performance benefit. Here is a plot showing the running time of
:func:`pandas.eval` as a function of the size of the frame involved in the
computation. The two lines are two different engines.


.. image:: _static/eval-perf.png
@@ -618,19 +640,31 @@ different engines.
This plot was created using a ``DataFrame`` with 3 columns each containing
floating point values generated using ``numpy.random.randn()``.

Technical Minutia
~~~~~~~~~~~~~~~~~
- Expressions that would result in an object dtype (including simple
variable evaluation) have to be evaluated in Python space. The main reason
for this behavior is to maintain backwards compatibility with versions of
numpy < 1.7. In those versions of ``numpy`` a call to ``ndarray.astype(str)``
will truncate any strings that are more than 60 characters in length. Second,
we can't pass ``object`` arrays to ``numexpr`` thus string comparisons must
be evaluated in Python space.
- The upshot is that this *only* applies to object-dtype'd expressions. So,
if you have an expression--for example--that's a string comparison
``and``-ed together with another boolean expression that's from a numeric
comparison, the numeric comparison will be evaluated by ``numexpr``. In fact,
in general, :func:`~pandas.query`/:func:`~pandas.eval` will "pick out" the
subexpressions that are ``eval``-able by ``numexpr`` and those that must be
evaluated in Python space transparently to the user.
Technical Minutia Regarding Expression Evaluation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expressions that would result in an object dtype or involve datetime operations
(because of ``NaT``) must be evaluated in Python space. The main reason for
this behavior is to maintain backwards compatibility with versions of numpy <
1.7. In those versions of ``numpy`` a call to ``ndarray.astype(str)`` will
truncate any strings that are more than 60 characters in length. Second, we
can't pass ``object`` arrays to ``numexpr`` thus string comparisons must be
evaluated in Python space.

The upshot is that this *only* applies to object-dtype'd expressions. So, if
you have an expression--for example

.. ipython:: python

df = DataFrame({'strings': np.repeat(list('cba'), 3),
'nums': np.repeat(range(3), 3)})
df
df.query('strings == "a" and nums == 1')

the numeric part of the comparison (``nums == 1``) will be evaluated by
``numexpr``.

In general, :meth:`DataFrame.query`/:func:`pandas.eval` will
evaluate the subexpressions that *can* be evaluated by ``numexpr`` and those
that must be evaluated in Python space transparently to the user. This is done
by inferring the result type of an expression from its arguments and operators.
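
From the user's side the split between ``numexpr`` and Python space is invisible: a mixed string/numeric query should select the same rows as the equivalent boolean mask. A sketch of that equivalence, reusing the frame from the example above but comparing against ``nums == 2`` so that the result is non-empty:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"strings": np.repeat(list("cba"), 3),
                   "nums": np.repeat(np.arange(3), 3)})

# String comparison and numeric comparison combined in one expression.
via_query = df.query('strings == "a" and nums == 2')

# The same selection written with ordinary boolean indexing.
via_mask = df[(df["strings"] == "a") & (df["nums"] == 2)]

print(via_query.index.tolist())
print(via_query.equals(via_mask))
```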
19 changes: 19 additions & 0 deletions doc/source/release.rst
@@ -83,9 +83,26 @@ API Changes
- ``pd.infer_freq()``
- ``pd.infer_freq()`` will now raise a ``TypeError`` if given an invalid ``Series/Index`` type (:issue:`6407`)

- Local variable usage has changed in
  :func:`pandas.eval`/:meth:`DataFrame.eval`/:meth:`DataFrame.query`
  (:issue:`5987`). For the :class:`~pandas.DataFrame` methods, two things have
  changed

  - Column names are now given precedence over locals
  - Local variables must be referred to explicitly. This means that even if
    you have a local variable that is *not* a column you must still refer to
    it with the ``'@'`` prefix.
  - You can have an expression like ``df.query('@a < a')`` with no complaints
    from ``pandas`` about ambiguity of the name ``a``.

  - The top-level :func:`pandas.eval` function does not allow you to use the
    ``'@'`` prefix and provides you with an error message telling you so.
  - ``NameResolutionError`` was removed because it isn't necessary anymore.

Experimental Features
~~~~~~~~~~~~~~~~~~~~~


Improvements to existing features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -144,6 +161,8 @@ Bug Fixes
- Bug in DataFrame.dropna with duplicate indices (:issue:`6355`)
- Regression in chained getitem indexing with embedded list-like from 0.12 (:issue:`6394`)
- ``Float64Index`` with nans not comparing correctly
- ``eval``/``query`` expressions with strings containing the ``@`` character
will now work (:issue:`6366`).

pandas 0.13.1
-------------
16 changes: 16 additions & 0 deletions doc/source/v0.14.0.txt
@@ -51,6 +51,22 @@ API changes
s.year
s.index.year

- Local variable usage has changed in
  :func:`pandas.eval`/:meth:`DataFrame.eval`/:meth:`DataFrame.query`
  (:issue:`5987`). For the :class:`~pandas.DataFrame` methods, two things have
  changed

  - Column names are now given precedence over locals
  - Local variables must be referred to explicitly. This means that even if
    you have a local variable that is *not* a column you must still refer to
    it with the ``'@'`` prefix.
  - You can have an expression like ``df.query('@a < a')`` with no complaints
    from ``pandas`` about ambiguity of the name ``a``.

  - The top-level :func:`pandas.eval` function does not allow you to use the
    ``'@'`` prefix and provides you with an error message telling you so.
  - ``NameResolutionError`` was removed because it isn't necessary anymore.

MultiIndexing Using Slicers
~~~~~~~~~~~~~~~~~~~~~~~~~~~

2 changes: 2 additions & 0 deletions pandas/compat/__init__.py
@@ -54,6 +54,8 @@
    import pickle as cPickle
    import http.client as httplib

from pandas.compat.chainmap import DeepChainMap


if PY3:
    def isidentifier(s):
26 changes: 26 additions & 0 deletions pandas/compat/chainmap.py
@@ -0,0 +1,26 @@
try:
    from collections import ChainMap
except ImportError:
    from pandas.compat.chainmap_impl import ChainMap


class DeepChainMap(ChainMap):
    def __setitem__(self, key, value):
        for mapping in self.maps:
            if key in mapping:
                mapping[key] = value
                return
        self.maps[0][key] = value

    def __delitem__(self, key):
        for mapping in self.maps:
            if key in mapping:
                del mapping[key]
                return
        raise KeyError(key)

    # override because the m parameter is introduced in Python 3.4
    def new_child(self, m=None):
        if m is None:
            m = {}
        return self.__class__(m, *self.maps)
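
To make the difference from a plain ``ChainMap`` concrete, here is a standalone sketch. It re-declares the two overridden methods on top of the standard-library ``ChainMap`` so that it runs without pandas; the dictionaries and keys are made up for the example:

```python
from collections import ChainMap


class DeepChainMap(ChainMap):
    """Copy of the class above, limited to the two overridden mutators."""

    def __setitem__(self, key, value):
        for mapping in self.maps:
            if key in mapping:
                mapping[key] = value
                return
        self.maps[0][key] = value

    def __delitem__(self, key):
        for mapping in self.maps:
            if key in mapping:
                del mapping[key]
                return
        raise KeyError(key)


# A plain ChainMap always writes to the first mapping, shadowing deeper keys.
plain_inner = {"x": 1}
plain = ChainMap({}, plain_inner)
plain["x"] = 10

# DeepChainMap updates the key where it already lives.
deep_inner = {"x": 1}
deep = DeepChainMap({}, deep_inner)
deep["x"] = 10      # rewrites deep_inner["x"]
deep["z"] = 3       # unknown key, so it lands in the first mapping
snapshot = dict(deep_inner)
del deep["x"]       # removed from deep_inner, not from the first mapping

print(plain.maps, plain_inner)
print(deep.maps, snapshot)
```

With the plain ``ChainMap`` the inner dictionary keeps its old value and is merely shadowed, whereas ``DeepChainMap`` writes and deletes through to whichever mapping already holds the key.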