-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: add expression evaluation functionality via eval #4164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ok....was able to pull/push just fine now.... maybe the original rebase caused me to look at something different? |
ok so no rebasing until later...agreed? |
this eval-3393 push to eval-3393 is good (was not there previously). maybe having a different local name was a problem somehow |
why don't you try rebasing those last 2 commits into one? |
see what happens |
ok let's see... |
i picked up the change..... so I guess as we don't rebase at the same time or while one is working, maybe ok |
going to be hard to coordinate that though |
in any event....let's just push for now...can fix commits later you are welcome to create |
not as long as we push/pull.....just don't rebase |
right yeah that's what i meant |
should the numexpr engine allow the |
i would make |
got a throwing a |
hm maybe query for now and indexing later. seems to break a ton of stuff i don't have it down just yet what the rules should be for indexing with something that could be an expression |
actually only a few tests were broken by indexing, my refactor seems to have broken the other things |
ok.....lmk when I should take a look....I am pretty much done for now....(maybe will start on docs soon) |
thanks for the pytables stuff! |
it was built on a nice base! |
ah shucks, thanks :) |
i will support chained comparisons--they aren't a big deal to add...although things will be evaluated twice by e.g., more generally
so inner terms will be evaluated 2 * (N - 2) times something like reducer = lambda x, y: BinOp(op=BitAnd, operands=[Compare(op=node.op, comparators=[x, y])])
reduce(reducer, node.comparators) |
i wish github had latex |
are u doing this transform directly? I don't think me will mind the extras prob optimize them away anyhow |
yes i would do it in the base class essentially build a new node and then visit it similar idea to assignment. compare nodes will be passed over that shouldn't be a big deal for parsing unless you have a ridiculously large chained comparison like 1000 compares or something. |
if its easy ok otherwise u can just detect it for now and disallow |
@jreback just added the i think indexing should be off limits for now b/c of date strings, since |
I agree and I like query / eval better |
@jreback resolution is now there. i opted for a |
@jreback can any of ur list above be checked off? |
both good ideas! |
I'm saying that df['[1,2] in c'] must filter on col c since there is no other column passed. However, it does not follow the pattern of df['a in b'] which filters on col a, the first object before the 'in'. |
Hey guys, Sorry it's taken me so long to get to reviewing this. This seems like a really good start on adding this kind of functionality to the library. I feel like we should treat the actual query syntax as experimental in 0.13.x while we solicit some more feedback and see how it plays out in actual use. I have to say though, I'm kind of 👎 on pushing expression evaluation into I'm pretty stretched thin on taking this out for a serious spin but I'm fine with this being merged in and roped off as experimental until we have some time to see what the implications of various design decisions are. Feel free to disagree of course but I think starting a dialogue with the community about this sort of thing in general is a good idea (we've been talking about it for ages...glad you got around to actually doing it) |
which is syntactically equivalent of '='
Also added tests for nan in and not in and disallowed ops like pd.eval('1 or 2') since that should be performed in regular Python
Yes. That's correct. I don't see why this is an issue. Why would |
Not an issue; possibly confusing when comparing what col is filtered on with different usages of in. Ex: |
Just an FYI, as per @wesm's suggestion I've removed the Right now Otherwise, things work as if you had called |
Sounds good. I think the query method will make for much easier reading. 👍 |
ENH: add expression evaluation functionality via eval
thanks to @cpcloud for implemented much of this. When it was started, this was a tiny little function! |
475 comments! new record! who-hoo |
Awesome - great work @cpcloud :) |
closes #3393
closes #2560
cc @jreback
eval
RoadmapPyTables
allows natural syntax for queries, with in-line variables allowed
some examples
"ts>=Timestamp('2012-02-01') & users=['a','b','c']"
['A > 0']
'(index>df.index[3] & index<=df.index[6]) | string="bar"'
Todos:
|
and~
operatorsExpr
base class to opt in/out of allowed operations(e.g.
Numexpr
doesn't want to allow certain operations, butPyTables
needs them,so maybe define in the base class for simplicity, and just not allow them (checked in
visit
),which right now doesn't allow a
generic_visitor
)ops.Value
be equiv ofops.Constant
?query
methodquery
docstring to show use of__getitem__
and then add toindexing.rst
query
docstringDataFrame.eval
methodDocumentation
in
andnot in
eval
docstringx = 3; y = 2; pd.eval('x // y', engine='python')
is many times slower than the same operation in actual Python.Engines
python
engine for testing (basically a restricted eval, should make sure nothing dangerous a laeval('os.system("sudo rm -rf /")')
) is possible herenumexpr
engineParsers
python
parser is currrently not failing on complex bool ops involving subscripting like it should (potentially couldeval
the compiled AST instead of dispatching to function calls)pandas
parser should handle subscripting, i.e.,df.query(...) == pd.eval('df[...]')
pandas
parserpython
parserpytables
engine and parser (internal)Should python be the default engine for small frames?
Functions
isexpr
eval
to top-level pandasError Handling
expr.py
. should fail fast.Alignment
Scope
@a
syntax for referencing local variables with the same name as a columnExpr
objects to be passed toeval
(Expr
will pull in the local and global variables from the calling stack frame)Miscellaneous
MultiIndex
queriesstr
ops (implemented usingisin
)Timestamp
anddatetime
functionalitypd.eval('1')
orpd.eval('df')
df + df2
should return aDataFrame
ifdf
anddf2
are bothDataFrame
objects.Expression
toExpr
truediv
keyword toExpr
, default toTrue
and
,or
andnot
aliases for the operators&
,|
and~
, respectivelydf.A
Operators
%
**
Tests
PeriodIndex
objects are untested at large (a few scattered tests for basic things)df['@awesome_domain']
,df['df["$10"]']
align.py
once BUG: fix period index object instantiation when joining with self #4379 is mergedDataFrame
with aPeriodIndex
in the columnsDatetimeIndex
objects//
operator works as it should (it works, but there may be edge cases that I haven't though of yet)//
numexpr
only supports scalar evaluation here: variables are not allowed so this will always eval in python space (half-assed it a bit in the tests here need to fix that)**
associativity issue in testingSyntax (Easier)
in
,not in
(was thinkingimplemented asa in b
->b.isin(a)
anda not in b
->not a in b
->~b.isin(a)
)Syntax (More difficult)
sum
andprod
)numexpr
supportsfunction calls(we're not building a Python interpreter here 😉)Optimizations
There's a quite a bit of recursion going on here, where maybe something stack-based might be more efficient. For example, an expression like
df['f in g and a < b < (c ** 2 + b ** 2 - a ** 3) and d in b not in e']
will make 3 recursive calls to eval (not including the top level call). One for the firstin
expression and two for thein
/not in
chain. A stack-based implementation of the parser (instead of recursively visiting nodes) could possible shave off a few milliseconds. Not sure if this is worth it though.Others
alignment forGeneric alignment function and new module core.align #4336NDFrame
replace currentevaluate
function