
WIP: add top-level eval function #4037


Closed
wants to merge 37 commits

Conversation


@cpcloud cpcloud commented Jun 26, 2013

closes #3393.

This is a ways from being mergeable, but I thought I would put up the PR for people to play with and to get some other eyes on it. Tire-kick at will. A consolidated list of things to be completed follows.

eval Roadmap

Documentation

  • enhancing performance section
  • eval docstring
  • Warn users that this is not a magic bullet
    • E.g., x = 3; y = 2; pd.eval('x // y', engine='python') is 1000 times slower than the same operation in actual Python.
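      For illustration, a minimal sketch of that overhead using the stdlib timeit module (numbers vary by machine; the point is just the order-of-magnitude gap):

      import timeit

      setup = "import pandas as pd; x, y = 3, 2"
      # plain Python floor division: roughly nanoseconds per call
      print(timeit.timeit('x // y', setup=setup, number=1000))
      # the same expression through pd.eval: parsing and dispatch overhead dominates
      print(timeit.timeit("pd.eval('x // y', engine='python')", setup=setup, number=1000))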

Engines

Should python be the default engine for small frames?

Functions

  • toplevel isexpr
  • move eval to top-level pandas

Error Handling

  • syntax error handling
  • name error handling

Alignment

  • Algorithm (a rough sketch of the idea follows this list)
    1. flatten the tree of terms
    2. resolve the variables, special-casing only numpy arrays/scalars/single unary operations (probably room for more abstraction here)
    3. align by joining on the indices using the current broadcasting behavior
    4. update the original namespace with the newly aligned version of the underlying numpy array
    5. reconstruct the final object with the aforementioned joined indices and type
  • punt to 'python' engine when a Series and DataFrame both have DatetimeIndexes (I would rather not do this and just enforce the to-be-deprecated index-aligning-if-both-indices-are-datetimeindex behavior)
  • PeriodIndexes don't work (well, they won't pass my tests) because of a stack overflow bug when joining a DatetimeIndex and a PeriodIndex (that bug is slated for 0.13, but a design decision needs to be made)
  • alignment for NDFrame (is this really necessary?)
  • unary alignment
  • binary alignment
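
A rough, simplified sketch of the idea behind steps 3-5 of the alignment algorithm (not the actual implementation; the helper name here is made up):

import pandas as pd

def naive_aligned_add(a, b):
    # step 3: join on the indices (Series.align does the index join)
    a2, b2 = a.align(b, join='outer')
    # step 4: operate on the underlying numpy arrays
    values = a2.values + b2.values
    # step 5: reconstruct the final object with the joined index and type
    return pd.Series(values, index=a2.index)

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])
naive_aligned_add(s1, s2)  # NaN where labels don't overlap, just like s1 + s2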

Scope

  • deal with global scoping issues (not sure if this is tested well enough...which means probably not)
  • allow Expr objects to be passed to eval (Expr will pull in the local and global variables from the calling stack frame)

Miscellaneous

  • replace current evaluate function
  • Expr in NDFrame.__getitem__ cc @jreback
  • reconstruct the final object to return the expected output type, e.g., df + df2 should return a DataFrame if df and df2 are both DataFrame objects.
  • change Expression to Expr
  • add truediv keyword to Expr, default to True

Operators

  • %
  • **
  • unary ops don't play well when they are operands of binary ops

Tests

  • DatetimeIndex and PeriodIndex objects are untested because of a few join and alignment issues.
  • rewrite a few tests in the python evaluator to make sure the // operator works as it should (it works, but there may be edge cases that I haven't thought of yet)
  • //: numexpr only supports scalar evaluation here; variables are not allowed, so this will always eval in Python space (half-assed it a bit in the tests here, need to fix that)
  • truediv keyword tests
  • add tests for nans
  • fix ** associativity issue in testing

Near Future

  • allow and, or and not aliases for the operators &, | and ~, respectively
  • attribute accessing e.g., df.A syntax
  • reductions (sum and prod)
  • math ops (sin, cos, tan, sqrt, etc.)

Far Future

  • non-python indexing (or just punt to python)

return isinstance(obj, pd.Panel)


def is_pd_obj(obj):

note to self: these should most likely be removed, or should replace scattered functions that are similar to these

@ghost ghost assigned cpcloud Jun 26, 2013

jreback commented Jun 26, 2013

quick pass: I didn't see a doc-string on eval (it's prob there...but I didn't delve too deep)

also a quick writeup for the enhancedperf section?

just a couple of examples


cpcloud commented Jun 28, 2013

@jreback

question:

  1. i think using eval under the hood for pytables evaluation should replace all of the various incarnations of expressions here...is that ok?

result being that only strings are allowed (which will be converted to Expr and passed to an evaluating function).

going to take much longer to finish this up to support all of those expression forms, but if it needs to be done that's ok too


jreback commented Jun 28, 2013

for a string expression, that is exactly right (it's conceptually simpler, because only scalars can be on the rhs in any expression); however, we also need to allow something like this:

E('index>',Timestamp('20130101'))

where E replaces Term and is a top-level that you evaluate

unless you think that you can evaluate:

index>Timestamp('20130101') ?

I think that the current Term should be a sub-class of Expr that just spits out a numexpr evaluable string (and does not actually evaluate it)
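
A hypothetical sketch of that idea (class and method names here are invented for illustration, not the actual API):

class Expr(object):
    def __init__(self, lhs, op, rhs):
        self.lhs, self.op, self.rhs = lhs, op, rhs

class StringTerm(Expr):
    # spit out a numexpr-evaluable condition string instead of evaluating it
    def tostring(self):
        return '({0}) {1} ({2})'.format(self.lhs, self.op, self.rhs)

StringTerm('index', '>', "Timestamp('20130101')").tostring()
# -> "(index) > (Timestamp('20130101'))"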


cpcloud commented Jun 28, 2013

ok i think i'm starting to see the bigger picture here. convert_value and TermValue will be superseded by eval.

in this code in Selection.select

return self.table.table.readWhere(self.condition, start=self.start, stop=self.stop)

self.condition should be generated by engine.convert()

is that right?


cpcloud commented Jun 28, 2013

and queryables is basically an environment internal to the table that has, for example, the index part of index > 20130101


jreback commented Jun 28, 2013

that's about right.

in Selection I am just using a list of Terms that are anded together

queryables is used for validation on the lhs of the expressions (it's analogous to looking up a variable in the enclosing/global scope, e.g. df > 1: if df is not defined then you raise; same here)

and no alignment is necessary, just some conversions on rhs terms

and self.condition is generated by engine.convert(), correct

self.condition is equiv to this:

pd.eval('index>Timestamp("20130101")',queryables=queryables,engine='pytables')
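
A rough sketch of the lhs-validation role queryables plays (hypothetical helper, for illustration only):

def validate_lhs(name, queryables):
    # analogous to raising for 'df > 1' when df isn't defined: if the name
    # isn't a queryable column/index of the table, refuse the expression
    if name not in queryables:
        raise ValueError('{0!r} is not a queryable of this table'.format(name))
    return queryables[name]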


cpcloud commented Jun 28, 2013

cool maybe this will be easier than i thought...to the pytables-mobile!


jreback commented Jun 30, 2013

advertising!

http://stackoverflow.com/questions/17390886/how-to-speed-up-pandas-multilevel-dataframe-sum/17394064#17394064

might be a good benchmark.....and maybe add this 'example' to the EnhancingPerformance section?


cpcloud commented Jun 30, 2013

nice! thanks. i gave u +1


cpcloud commented Jul 4, 2013

just an update on the and or not deal

essentially:

what you want to write if u like python

pd.eval('not x and y and z or w')

if you want to be terse

pd.eval('~(x & y & z | w)')

what the engine actually sees (you could write this too, if you're practicing typing parens fast w/o looking!)

pd.eval('~((((x) & (y)) & (z)) | (w))')

and of course u can mix and match if you like partial clarity + terseness

what's nice is that the python precedence is preserved, e.g., not x and y rewrites to ~((x) & (y)) (not binds loosely) whereas ~x and y rewrites to '(~(x)) & (y)' (~ binds tightly)

for reference: operator precedence

if u want i can give more details, but essentially the repr of each op is '({self.lhs}) {self.op} ({self.rhs})'.format(self=self) and this happens recursively.

the ast manip is to "rewrite" the ast.BoolOp nodes as their respective Bit* node (really i'm just saying, interpret this as a bitwise op), but the python ast takes care of precedence for u so the precedence semantics remain.

here was the trickiest part

def visit_BoolOp(self, node):
    op = self.visit(node.op)

    def visitor(x, y):
        # visit each operand, falling back to the raw value if it has
        # already been converted to a term
        try:
            lhs = self.visit(x)
        except TypeError:
            lhs = x

        try:
            rhs = self.visit(y)
        except TypeError:
            rhs = y

        return op(lhs, rhs)

    # fold the n-ary BoolOp operands into nested binary ops, left to right
    operands = node.values
    return reduce(visitor, operands)

the key part is the last two lines: because the python parser treats boolean expressions similarly to how lisp does, you have to reconstruct the (and x1 x2 ... xN) style op into (and (... (and (and x1 x2) ...) ...) xN)
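
A quick standalone illustration of why the reduce is needed (plain stdlib ast, nothing pandas-specific):

import ast

node = ast.parse('x and y and z', mode='eval').body
print(type(node).__name__)  # BoolOp -- one node for the whole chain
print(len(node.values))     # 3 -- the parser keeps a flat list of operands
# reduce(visitor, node.values) rebuilds the nested form ((x & y) & z)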


jreback commented Jul 4, 2013

I would have an engine option, maybe parsing=strict as the default
if you want to support python booleans then you could pass parsing=extended or something like that
I don't think we want to turn this on by default even though it looks pretty -


cpcloud commented Jul 4, 2013

ok. there's not really any automatic way to test that though since and, or and not are not part of the operator module, i suppose python's eval could work though (for testing only)


jreback commented Jul 4, 2013

I would put it on the back burner anyhow (i know u did it)
but too much of a change right now


cpcloud commented Jul 4, 2013

sounds good, not worth too much time.

the pytables changes are quite daunting. i know those are most important for this to be merged, still sifting through how it works...reading tests and trying to understand what the classes do...which ones are important etc...

plus the issues with dt and period indexes

i think after those two then this might warrant some serious tire kicking


cpcloud commented Jul 4, 2013

I think i'm going to allow callable parsing in the 'pytables' engine for now, as it will make the transition to using eval and friends under the hood much easier. I'll just subclass the current parser with something like TableExprVisitor and overload the visit_Call method...
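
A hypothetical sketch of that subclassing idea (the base-class internals here are invented for illustration):

import ast

class ExprVisitor(ast.NodeVisitor):
    # simplified stand-in for the current parser: disallow calls by default
    def visit_Call(self, node):
        raise NotImplementedError('function calls are not supported')

class TableExprVisitor(ExprVisitor):
    # pytables engine: allow a small whitelist of callables, e.g. Timestamp
    _allowed_calls = frozenset(['Timestamp'])

    def visit_Call(self, node):
        name = getattr(node.func, 'id', None)
        if name not in self._allowed_calls:
            raise ValueError('call to {0!r} is not allowed'.format(name))
        # hand the whitelisted call back for whatever conversion the engine does
        return self.generic_visit(node)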


cpcloud commented Jul 4, 2013

i think in 0.14 (not a typo) that Term should be removed...deprecate in 0.13 in favor of what I'm calling TableExpr which is a subclass of Expr designed to work with HDFStore et al.


cpcloud commented Jul 4, 2013

@jreback should arbitrary expressions be allowed with 'pytables'?

using filter will be easy if only binary ops are allowed, but will be more complicated if arbitrary exprs are allowed


cpcloud commented Jul 4, 2013

nvm. should allow them.


jreback commented Jul 4, 2013

whatever is parsable by ne should be allowed
in reality though only simple expressions are prob needed


cpcloud commented Jul 7, 2013

@jreback i feel a bit silly asking this but how do i pull down ur changes?


jreback commented Jul 7, 2013

@cpcloud i think you have to add me as a remote, then cherry-pick

i pseudo have it working....but maybe you can help with this.....

say I have a call like this:

Term('index=df.index[0:4]')...where Term is basically Expr

I can't seem to get the variable scope for df....(once I am in the object creation call, sys._getframe(2).f_locals is just the local stuff)

any idea?
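
For reference, a minimal sketch of the kind of frame-walking lookup being discussed (general technique only, not the pandas code):

import sys

def find_in_calling_scopes(name, start=1):
    # walk outward from the caller's frame, checking locals then globals
    frame = sys._getframe(start)
    while frame is not None:
        if name in frame.f_locals:
            return frame.f_locals[name]
        if name in frame.f_globals:
            return frame.f_globals[name]
        frame = frame.f_back
    raise NameError(name)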


cpcloud commented Jul 7, 2013

can u point me to a commit?


jreback commented Jul 7, 2013

fbeb99d


cpcloud commented Jul 7, 2013

oh same one


jreback commented Jul 7, 2013

somewhat old .....will push again in a bit


jreback commented Jul 7, 2013

6b67214

new one....still very WIP though!


jreback commented Jul 7, 2013

figured out the scoping issue.....I AM capturing scope, but then I was throwing it away because I allow a 'loose' syntax..

e.g.

[ 'columns=A',Term('index=df.index[0:4]')]

is translated to:

columns=A & index=df.index[0:4] as a string. (which I can't deal with yet...but getting there)
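
A rough sketch of that translation (hypothetical helper, not the actual code):

def terms_to_string(terms):
    # each element is either a raw condition string or a Term-like object
    # whose str() is one; '&' them together into a single expression string
    return ' & '.join('({0})'.format(t) for t in terms)

terms_to_string(['columns=A', 'index=df.index[0:4]'])
# -> '(columns=A) & (index=df.index[0:4])'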


cpcloud commented Jul 7, 2013

wait where does that commit live? it's on master but not a branch but a tree?

how are you parsing the slice?


cpcloud commented Jul 7, 2013

@jreback i've added u as a remote but how can i cherry-pick that commit? it looks like it's on master, but where is it being pushed? is it detached or something?


jreback commented Jul 7, 2013

it's on a separate branch of my master

try this: https://github.com/jreback/pandas/tree/eval

it's built on top of your branch (which at some point i will rebase; i am a few commits behind you)

eventually you can just grab some of my commits and add to your tree as you have the PR

not 100% sure how to do that, but since we can't push to each other I think it's the only way

though once you have a PR to pandas master I think I can push to that (I think this is a pseudo private branch....
if I do a PR I think my commits will be added after yours.....)


jreback commented Jul 7, 2013

on #4151 now


jreback commented Jul 7, 2013

of course going to fail all tests...but still a WIP


cpcloud commented Jul 7, 2013

i like how github pulse double counts my commits here....not :|


jreback commented Jul 7, 2013

https://github.com/newville/asteval/blob/master/lib/asteval.py

nice ref for implementing the ast handlers


cpcloud commented Jul 7, 2013

nice. i think we can use slice, subscript, attribute access, numexpr supported functions, and possibly chained comparisons if we want. x if y else z is covered by where

would be nice to handle certain pandas functions like Timestamp, for pytables stuff.


jreback commented Jul 7, 2013

all pretty straightforward

I have a new branch that does most of this

could move some into base class


cpcloud commented Jul 7, 2013

great! can u post a link to a branch?


jreback commented Jul 7, 2013

already #4155 up, extension of your branch....all tests pass!

take a look. I think some of what I did should actually be in your branch (e.g. see my code clean up point).
Need a better validator on visit in ExprVisitor; in fact, maybe create a base class which basically implements everything, then let a sub-class (for NumExpr and your stuff) turn on which ops it needs (and occasionally override some); I already inherit this class

This can also consolidate the initialization stuff of the binary/math ops as well

other thing to note: I added an ops.Value which is very much like Constant but doesn't do name validation (as I can't know that names on the rhs are invalid, so I must accept them); this includes stuff like a list of strings, or even an index

I didn't really check whether your tests break! more likely is that some 'invalid' ops would slip by

I also eliminated the call to generic_visitor, instead raising on an invalid op, not sure if that works for you or not


jreback commented Jul 8, 2013

hmm..this is not showing my commits?


cpcloud commented Jul 8, 2013

this won't show your commits because i've changed the remote of my local branch here...

the remote for this pr is my origin which i don't know what you call that on ur box.
the new eval-3393 is on my upstream (your origin)

AFAIK i don't think you can track origin/eval-3393 and upstream/eval-3393 in the same branch, so i've chosen to just use upstream, e.g. if i make a change it won't show up in two places when i push...

one way might be: you could push to your origin/eval-3393 (my upstream), i could fetch from that, you could fetch from this, and i'll push to this; problem is i think that creates a cycle in the commit graph which essentially breaks git


jreback commented Jul 8, 2013

hmm...I thought that you were creating a master branch that is push/pullable from either of us and sort of independent?

maybe close that PR and just open another one from that branch (e.g. merge from eval-3393 to master) rather than from your branch to master?


cpcloud commented Jul 8, 2013

so i think we should close our respective PRs, i'll open a new one with upstream/eval-3393 and we push to the branch on pydata/pandas until we're finished and ready to merge? sound good?


jreback commented Jul 8, 2013

yes.....had same thoughts (and I would basically copy your top-level lists and then I will do the same), for ToDo's and such


cpcloud commented Jul 8, 2013

right. a single PR to rule them all. i'll merge our TODO lists and other stuff in the initial text of the PR.


jreback commented Jul 8, 2013

great....


cpcloud commented Jul 8, 2013

hopefully, this will scale down the bloated pulse report of my commits

@cpcloud cpcloud closed this Jul 8, 2013
@cpcloud cpcloud deleted the eval-3393 branch February 27, 2014 18:05
Successfully merging this pull request may close these issues.

ENH: eval function