
WIP: add top-level eval function #4037


Closed
wants to merge 37 commits

Conversation


@cpcloud cpcloud commented Jun 26, 2013

closes #3393.

This is a ways from being mergeable, but I thought I would put up the PR for people to play with and to get some other eyes on it. Tire-kick at will. A consolidated list of things to be completed follows.

eval Roadmap

Documentation

  • enhancing performance section
  • eval docstring
  • Warn users that this is not a magic bullet
    • E.g., x = 3; y = 2; pd.eval('x // y', engine='python') is 1000 times slower than the same operation in actual Python.
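      For illustration, a minimal sketch of that overhead using the stdlib timeit module (numbers vary by machine; the point is just the order-of-magnitude gap):

      import timeit

      setup = "import pandas as pd; x, y = 3, 2"
      # plain Python floor division: roughly nanoseconds per call
      print(timeit.timeit('x // y', setup=setup, number=1000))
      # the same expression through pd.eval: parsing and dispatch overhead dominates
      print(timeit.timeit("pd.eval('x // y', engine='python')", setup=setup, number=1000))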

Engines

Should python be the default engine for small frames?

Functions

  • toplevel isexpr
  • move eval to top-level pandas

Error Handling

  • syntax error handling
  • name error handling

Alignment

  • Algorithm (a rough sketch of the idea follows this list)
    1. flatten the tree of terms
    2. resolve the variables, special-casing only numpy arrays/scalars/single unary operations (probably room for more abstraction here)
    3. align by joining on the indices using the current broadcasting behavior
    4. update the original namespace with the newly aligned version of the underlying numpy array
    5. reconstruct the final object with the aforementioned joined indices and type
  • punt to 'python' engine when a Series and DataFrame both have DatetimeIndexes (I would rather not do this and just enforce the to-be-deprecated index-aligning-if-both-indices-are-datetimeindex behavior)
  • PeriodIndexes don't work (well, they won't pass my tests) because of a stack overflow bug when joining a DatetimeIndex and a PeriodIndex (that bug is slated for 0.13, but a design decision needs to be made)
  • alignment for NDFrame (is this really necessary?)
  • unary alignment
  • binary alignment
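
A rough, simplified sketch of the idea behind steps 3-5 of the alignment algorithm (not the actual implementation; the helper name here is made up):

import pandas as pd

def naive_aligned_add(a, b):
    # step 3: join on the indices (Series.align does the index join)
    a2, b2 = a.align(b, join='outer')
    # step 4: operate on the underlying numpy arrays
    values = a2.values + b2.values
    # step 5: reconstruct the final object with the joined index and type
    return pd.Series(values, index=a2.index)

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])
naive_aligned_add(s1, s2)  # NaN where labels don't overlap, just like s1 + s2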

Scope

  • deal with global scoping issues (not sure if this is tested well enough...which means probably not)
  • allow Expr objects to be passed to eval (Expr will pull in the local and global variables from the calling stack frame)

Miscellaneous

  • replace current evaluate function
  • Expr in NDFrame.__getitem__ cc @jreback
  • reconstruct the final object to return the expected output type, e.g., df + df2 should return a DataFrame if df and df2 are both DataFrame objects.
  • change Expression to Expr
  • add truediv keyword to Expr, default to True

Operators

  • %
  • **
  • unary ops don't play well when they are operands of binary ops

Tests

  • DatetimeIndex and PeriodIndex objects are untested because of a few join and alignment issues.
  • rewrite a few tests in the python evaluator to make sure the // operator works as it should (it works, but there may be edge cases that I haven't thought of yet)
  • //: numexpr only supports scalar evaluation here; variables are not allowed, so this will always eval in Python space (half-assed it a bit in the tests here, need to fix that)
  • truediv keyword tests
  • add tests for nans
  • fix ** associativity issue in testing

Near Future

  • allow and, or and not aliases for the operators &, | and ~, respectively
  • attribute accessing e.g., df.A syntax
  • reductions (sum and prod)
  • math ops (sin, cos, tan, sqrt, etc.)

Far Future

  • non-python indexing (or just punt to python)

return isinstance(obj, pd.Panel)


def is_pd_obj(obj):

note to self: these should most likely be removed, or should replace scattered functions that are similar to these

@ghost ghost assigned cpcloud Jun 26, 2013

jreback commented Jun 26, 2013

quick pass: I didn't see a doc-string on eval (it's prob there...but I didn't delve too deep)

also a quick writeup for the enhancedperf section?

just a couple of examples


cpcloud commented Jun 28, 2013

@jreback

question:

  1. i think using eval under the hood for pytables evaluation should replace all of the various incarnations of expressions here...is that ok?

result being that only strings are allowed (which will be converted to Expr and passed to an evaluating function).

going to take much longer to finish this up to support all of those expression forms, but if it needs to be done that's ok too


jreback commented Jun 28, 2013

for a string expression, that is exactly right (it's conceptually simpler, because only scalars can be on the rhs in any expression); however, we also need to allow something like this:

E('index>',Timestamp('20130101'))

where E replaces Term and is a top-level that you evaluate

unless you think that you can evaluate:

index>Timestamp('20130101') ?

I think that the current Term should be a sub-class of Expr that just spits out a numexpr evaluable string (and does not actually evaluate it)
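
A hypothetical sketch of that idea (class and method names here are invented for illustration, not the actual API):

class Expr(object):
    def __init__(self, lhs, op, rhs):
        self.lhs, self.op, self.rhs = lhs, op, rhs

class StringTerm(Expr):
    # spit out a numexpr-evaluable condition string instead of evaluating it
    def tostring(self):
        return '({0}) {1} ({2})'.format(self.lhs, self.op, self.rhs)

StringTerm('index', '>', "Timestamp('20130101')").tostring()
# -> "(index) > (Timestamp('20130101'))"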


cpcloud commented Jun 28, 2013

ok i think i'm starting to see the bigger picture here. convert_value and TermValue will be superseded by eval.

in this code in Selection.select

return self.table.table.readWhere(self.condition, start=self.start, stop=self.stop)

self.condition should be generated by engine.convert()

is that right?


cpcloud commented Jun 28, 2013

and queryables is basically an environment internal to the table that has, for example, the index part of index > 20130101


jreback commented Jun 28, 2013

that's about right.

in Selection I am just using a list of Terms that are anded together

queryables is used for validation on the lhs of the expressions (it's analogous to looking up a variable in the enclosing/global scope, e.g. df > 1: if df is not defined then you raise; same here)

and no alignment is necessary, just some conversions on rhs terms

and self.condition is generated by engine.convert(), correct

self.condition is equiv to this:

pd.eval('index>Timestamp("20130101")',queryables=queryables,engine='pytables')
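
A rough sketch of the lhs-validation role queryables plays (hypothetical helper, for illustration only):

def validate_lhs(name, queryables):
    # analogous to raising for 'df > 1' when df isn't defined: if the name
    # isn't a queryable column/index of the table, refuse the expression
    if name not in queryables:
        raise ValueError('{0!r} is not a queryable of this table'.format(name))
    return queryables[name]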


cpcloud commented Jun 28, 2013

cool maybe this will be easier than i thought...to the pytables-mobile!


jreback commented Jun 30, 2013

advertising!

http://stackoverflow.com/questions/17390886/how-to-speed-up-pandas-multilevel-dataframe-sum/17394064#17394064

might be a good benchmark.....and maybe add this 'example' to the EnhancingPerformance section?


cpcloud commented Jun 30, 2013

nice! thanks. i gave u +1


cpcloud commented Jul 4, 2013

just an update on the and or not deal

essentially:

what you want to write if u like python

pd.eval('not x and y and z or w')

if you want to be terse

pd.eval('~(x & y & z | w)')

what the engine actually sees (you could write this too, if you're practicing typing parens fast w/o looking!)

pd.eval('~((((x) & (y)) & (z)) | (w))')

and of course u can mix and match if you like partial clarity + terseness

what's nice is that the python precedence is preserved, e.g., not x and y rewrites to ~((x) & (y)) (not binds loosely) whereas ~x and y rewrites to '(~(x)) & (y)' (~ binds tightly)

for reference: operator precedence

if u want i can give more details, but essentially the repr of each op is '({self.lhs}) {self.op} ({self.rhs})'.format(self=self) and this happens recursively.

the ast manip is to "rewrite" the ast.BoolOp nodes as their respective Bit* node (really i'm just saying, interpret this as a bitwise op), but the python ast takes care of precedence for u so the precedence semantics remain.

here was the trickiest part

def visit_BoolOp(self, node):
    op = self.visit(node.op)

    def visitor(x, y):
        # visit each operand, falling back to the raw value if it has
        # already been converted to a term
        try:
            lhs = self.visit(x)
        except TypeError:
            lhs = x

        try:
            rhs = self.visit(y)
        except TypeError:
            rhs = y

        return op(lhs, rhs)

    # fold the n-ary BoolOp operands into nested binary ops, left to right
    operands = node.values
    return reduce(visitor, operands)

the key part is the last two lines: because the python parser treats boolean expressions similarly to how lisp does, you have to reconstruct the (and x1 x2 ... xN) style op into (and (... (and (and x1 x2) ...) ...) xN)
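
A quick standalone illustration of why the reduce is needed (plain stdlib ast, nothing pandas-specific):

import ast

node = ast.parse('x and y and z', mode='eval').body
print(type(node).__name__)  # BoolOp -- one node for the whole chain
print(len(node.values))     # 3 -- the parser keeps a flat list of operands
# reduce(visitor, node.values) rebuilds the nested form ((x & y) & z)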


jreback commented Jul 4, 2013

I would have an engine option, maybe parsing=strict as the default
if you want to support python booleans then you could pass parsing=extended or something like that
I don't think we want to turn this on by default even though it looks pretty -


cpcloud commented Jul 4, 2013

ok. there's not really any automatic way to test that though since and, or and not are not part of the operator module, i suppose python's eval could work though (for testing only)


jreback commented Jul 4, 2013

I would put it on the back burner anyhow (i know u did it)
but too much of a change right now


cpcloud commented Jul 4, 2013

sounds good, not worth too much time.

the pytables changes are quite daunting. i know those are most important for this to be merged, still sifting through how it works...reading tests and trying to understand what the classes do...which ones are important etc...

plus the issues with dt and period indexes

i think after those two then this might warrant some serious tire kicking


cpcloud commented Jul 4, 2013

I think i'm going to allow callable parsing in the 'pytables' engine for now, as it will make the transition to using eval and friends under the hood much easier. I'll just subclass the current parser with something like TableExprVisitor and overload the visit_Call method...
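
A hypothetical sketch of that subclassing idea (the base-class internals here are invented for illustration):

import ast

class ExprVisitor(ast.NodeVisitor):
    # simplified stand-in for the current parser: disallow calls by default
    def visit_Call(self, node):
        raise NotImplementedError('function calls are not supported')

class TableExprVisitor(ExprVisitor):
    # pytables engine: allow a small whitelist of callables, e.g. Timestamp
    _allowed_calls = frozenset(['Timestamp'])

    def visit_Call(self, node):
        name = getattr(node.func, 'id', None)
        if name not in self._allowed_calls:
            raise ValueError('call to {0!r} is not allowed'.format(name))
        # hand the whitelisted call back for whatever conversion the engine does
        return self.generic_visit(node)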


cpcloud commented Jul 4, 2013

i think in 0.14 (not a typo) that Term should be removed...deprecate in 0.13 in favor of what I'm calling TableExpr which is a subclass of Expr designed to work with HDFStore et al.


cpcloud commented Jul 4, 2013

@jreback should arbitrary expressions be allowed with 'pytables'?

using filter will be easy if only binary ops are allowed, but will be more complicated if arbitrary exprs are allowed


cpcloud commented Jul 4, 2013

nvm. should allow them.


jreback commented Jul 4, 2013

whatever is parsable by ne should be allowed
in reality though only simple expressions are prob needed


cpcloud commented Jul 7, 2013

@jreback i feel a bit silly asking this but how do i pull down ur changes?


jreback commented Jul 7, 2013

@cpcloud i think you have to add me as a remote, then cherry-pick

i pseudo have it working....but maybe you can help with this.....

say I have a call like this:

Term('index=df.index[0:4]')...where Term is basically Expr

I can't seem to get the variable scope for df....(once I am in the object creation call, sys._getframe(2).f_locals is just the local stuff)

any idea?
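
For reference, a minimal sketch of the kind of frame-walking lookup being discussed (general technique only, not the pandas code):

import sys

def find_in_calling_scopes(name, start=1):
    # walk outward from the caller's frame, checking locals then globals
    frame = sys._getframe(start)
    while frame is not None:
        if name in frame.f_locals:
            return frame.f_locals[name]
        if name in frame.f_globals:
            return frame.f_globals[name]
        frame = frame.f_back
    raise NameError(name)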


cpcloud commented Jul 7, 2013

can u point me to a commit?


jreback commented Jul 7, 2013

fbeb99d


cpcloud commented Jul 7, 2013

oh same one


jreback commented Jul 7, 2013

somewhat old .....will push again in a bit


jreback commented Jul 7, 2013

6b67214

new one....still very WIP though!


jreback commented Jul 7, 2013

figured out the scoping issue.....I AM capturing scope, but then I was throwing it away because I allow a 'loose' syntax..

e.g.

[ 'columns=A',Term('index=df.index[0:4]')]

is translated to:

columns=A & index=df.index[0:4] as a string. (which I can't deal with yet...but getting there)
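
A rough sketch of that translation (hypothetical helper, not the actual code):

def terms_to_string(terms):
    # each element is either a raw condition string or a Term-like object
    # whose str() is one; '&' them together into a single expression string
    return ' & '.join('({0})'.format(t) for t in terms)

terms_to_string(['columns=A', 'index=df.index[0:4]'])
# -> '(columns=A) & (index=df.index[0:4])'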


cpcloud commented Jul 7, 2013

wait where does that commit live? it's on master but not a branch but a tree?

how are you parsing the slice?


cpcloud commented Jul 7, 2013

@jreback i've added u as a remote but how can i cherry-pick that commit? it looks like it's on master, but where is it being pushed? is it detached or something?


jreback commented Jul 7, 2013

it's on a separate branch of my master

try this: https://github.com/jreback/pandas/tree/eval

it's built on top of your branch (which at some point i will rebase; i am a few commits behind you)

eventually you can just grab some of my commits and add to your tree as you have the PR

not 100% sure how to do that, but since we can't push to each other I think it's the only way

though once you have a PR to pandas master I think I can push to that (I think this is a pseudo private branch....
if I do a PR I think my commits will be added after yours.....)


jreback commented Jul 7, 2013

on #4151 now


jreback commented Jul 7, 2013

of course going to fail all tests...but still a WIP


cpcloud commented Jul 7, 2013

i like how github pulse double counts my commits here....not :|


jreback commented Jul 7, 2013

https://github.com/newville/asteval/blob/master/lib/asteval.py

nice ref for implementing the ast handlers


cpcloud commented Jul 7, 2013

nice. i think we can use slice, subscript, attribute access, numexpr supported functions, and possibly chained comparisons if we want. x if y else z is covered by where

would be nice to handle certain pandas functions like Timestamp, for pytables stuff.


jreback commented Jul 7, 2013

all pretty straightforward

I have a new branch that does most of this

could move some into base class


cpcloud commented Jul 7, 2013

great! can u post a link to a branch?


jreback commented Jul 7, 2013

already #4155 up, extension of your branch....all tests pass!

take a look. I think some of what I did should actually be in your branch (e.g. see my code clean up point).
Need a better validator on visit in ExprVisitor; in fact, maybe create a base class which basically implements everything, then let a sub-class (for NumExpr and your stuff) turn on which ops it needs (and occasionally override some); I already inherit this class

This can also consolidate the initialization stuff of the binary/math ops as well

other thing to note: I added an ops.Value which is very much like Constant but doesn't do name validation (as I can't know that names on the rhs are invalid, so I must accept them); this includes stuff like a list of strings, or even an index

I didn't really check whether your tests break! more likely is that some 'invalid' ops would slip by

I also eliminated the call to generic_visitor, instead raising on an invalid op, not sure if that works for you or not


jreback commented Jul 8, 2013

hmm..this is not showing my commits?


cpcloud commented Jul 8, 2013

this won't show your commits because i've changed the remote of my local branch here...

the remote for this pr is my origin which i don't know what you call that on ur box.
the new eval-3393 is on my upstream (your origin)

AFAIK i don't think you can track origin/eval-3393 and upstream/eval-3393 in the same branch, so i've chosen to just use upstream, e.g. if i make a change it won't show up in two places when i push...

one way might be: you could push to your origin/eval-3393 (my upstream), i could fetch from that, you could fetch from this, and i'll push to this; problem is i think that creates a cycle in the commit graph which essentially breaks git


jreback commented Jul 8, 2013

hmm...I thought that you were creating a master branch that is push/pullable from either of us and sort of independent?

maybe close that PR and just open another one from that branch (e.g. merge from eval-3393 to master) rather than from your branch to master?


cpcloud commented Jul 8, 2013

so i think we should close our respective PRs, i'll open a new one with upstream/eval-3393 and we push to the branch on pydata/pandas until we're finished and ready to merge? sound good?


jreback commented Jul 8, 2013

yes.....had same thoughts (and I would basically copy your top-level lists and then I will do the same), for ToDo's and such


cpcloud commented Jul 8, 2013

right. a single PR to rule them all. i'll merge our TODO lists and other stuff in the initial text of the PR.


jreback commented Jul 8, 2013

great....


cpcloud commented Jul 8, 2013

hopefully, this will scale down the bloated pulse report of my commits

@cpcloud cpcloud closed this Jul 8, 2013
@cpcloud cpcloud deleted the eval-3393 branch February 27, 2014 18:05
Successfully merging this pull request may close these issues.

ENH: eval function