BUG: Add type promotion support for eval() expressions with many properties #6205
Conversation
If someone could chime in to just double check my understanding of the behavior of numpy.result_type, that would be nice. I think this is an important fix; otherwise it is essentially impossible to evaluate expressions whose tree representation has more than 32 leaves. Given that this limit seems imposed by NumPy, I assume it was not a conscious decision in the design of Pandas.
@havoc-io Gimme some time to review
I wasn't aware of this limit. Thanks for catching it.
# use reduce instead of passing the full generator as *args to
# result_type, because for large expressions there are often more than
# NPY_MAXARGS (32) types
return reduce(np.result_type, |
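For context, here is a minimal sketch (mine, not taken from the PR diff) of the two promotion strategies being compared: passing every type to np.result_type at once, which is subject to the NPY_MAXARGS cap, versus folding the types pairwise with reduce, which has no operand-count limit. The dtypes list is a placeholder; the real code pulls the types from the expression tree's terms.

from functools import reduce
import numpy as np

# Placeholder list of term types for illustration only.
dtypes = [np.dtype('float64'), np.dtype('int64'), np.dtype('float32')]

# All-at-once promotion: limited to NPY_MAXARGS (32 at the time) operands.
all_at_once = np.result_type(*dtypes)

# Pairwise promotion via reduce: works for any number of types.
pairwise = reduce(np.result_type, dtypes)

assert all_at_once == pairwise == np.dtype('float64')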
You should move reduce to pandas.compat and import from there, since there's no builtin reduce in Python 3.
It looks like reduce is actually already in pandas.compat, so I can go ahead and change to that. Do you have any qualms about just doing:
from pandas.compat import reduce
and overriding the built-in reduce for Python 2.x? It's actually the same as the built-in for 2.x, but I can rename it something like compat_reduce depending on your coding preferences.
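For reference, a hedged sketch of the import being discussed; it assumes pandas.compat re-exported reduce at the time, which may not hold for current pandas versions, hence the fallback.

try:
    from pandas.compat import reduce  # on Python 2 this is the same object as the builtin
except ImportError:
    from functools import reduce      # where reduce lives on Python 3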
No rush or anything on the review. I want to make sure that my understanding of the result_type function (particularly in the context of the eval() infrastructure) is correct, so this doesn't start returning some weird types from eval.
I've gone ahead and added a reduce import from pandas.compat as well and squashed it into this commit.
@havoc-io pls add the above test (which should fail without this change, and pass with it)
Importing from compat is fine
Added a test which fails without the change and succeeds with it. The test will raise an exception under the old code. Do you prefer to catch and then manually trigger failure, or just let the test framework catch the exceptions and mark them as E?
# there is no way to get the value programmatically inside Python, so
# we just hardcode a value > 32 here
df.eval('*'.join(('a') * 33))
no, you need to assert that this works, e.g. compare with a known value and use assert_frame_equal
by definition w/o the change this will fail but with the change will return the correct value (and thus the test will pass)
so construct the expected result and compare
Also, just use pd.eval here. There's no need to construct a DataFrame. Just something like:
def check_many_exprs(parser, engine):
    a = 1
    expr = ' * '.join('a' * 33)
    expected = 1
    res = pd.eval(expr, parser=parser, engine=engine)
    tm.assert_equal(res, expected)

def test_many_exprs():
    for parser, engine in PARSERS_ENGINES:
        yield check_many_exprs, parser, engine
We generally frown upon yield-style nose tests, but they are useful for avoiding repetition of test functions that differ only in their arguments.
Does implementing the test in TestEvalNumexprPandas not implement it for all parser/engine combinations? If not, where are you suggesting this test code be put?
It doesn't automatically give pd.eval the parser and engine; they're class attributes (as opposed to instance attributes). You have to pass self.parser and self.engine when you call pd.eval. You can probably just copy and paste what I wrote above and put that into test_eval.py. I'm not sure if PARSERS_ENGINES is called that, it might be something else, but it's just itertools.product(PARSERS, ENGINES). PARSERS and ENGINES are tuples or lists (can't remember) of the supported parsers/engines.
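For reference, a small sketch of the pairing described above; the concrete parser and engine names are my assumptions for illustration, not values read from test_eval.py.

import itertools

PARSERS = ('pandas', 'python')    # assumed supported parsers
ENGINES = ('numexpr', 'python')   # assumed supported engines

# Every parser/engine combination, as described in the comment above.
PARSERS_ENGINES = list(itertools.product(PARSERS, ENGINES))
# [('pandas', 'numexpr'), ('pandas', 'python'),
#  ('python', 'numexpr'), ('python', 'python')]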
I've updated the commit. The issue was marginally more extensive than I originally anticipated, so rather than scatter calls to reduce throughout the code, I've left in-place calls to numpy.result_type where the argument count can't exceed the limit and routed the remaining cases through the pairwise reduction. In addition, I've added the test recommended by @cpcloud (thanks!).
Another thing to check is the associativity and commutativity of result_type. If it's neither, then using reduce isn't going to give the same result as result_type. Intuitively I would expect it to be both, akin to a sort of max operation on dtypes, but maybe there's some edge case that I'm not thinking of.
That was my intuition as well, and it does appear to be the case. The NumPy result_type call stack (since NumPy 1.6) effectively looks like:
The C code is a bit difficult to parse from a conceptual standpoint, but the documentation for the latter two functions gives a more concise explanation, which does indeed portray the functions as an effective "max" operation on dtypes. I don't think it would be possible to write an exhaustive test to confirm this, but I think the algorithm described in the documentation is quite clearly symmetric and associative. Aside from pinging the numpy developers, I'm not really sure how else to verify.
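A non-exhaustive spot check along the lines discussed above, assuming plain numeric dtypes (my own sketch, not part of the PR): it verifies that np.result_type is order-insensitive and that reduce-style pairwise promotion matches the all-at-once call for a handful of types.

import itertools
import numpy as np

dtypes = [np.int8, np.int64, np.float32, np.float64, np.complex128]

for a, b, c in itertools.permutations(dtypes, 3):
    # commutativity / symmetry of pairwise promotion
    assert np.result_type(a, b) == np.result_type(b, a)
    # pairwise promotion matches the single call with all three operands
    assert np.result_type(np.result_type(a, b), c) == np.result_type(a, b, c)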
try:
    first = next(arg_iter)
except StopIteration:
    raise ValueError('at least one array or dtype is required')
Just do
if not arrays_and_dtypes:
    raise ValueError("your message")
Creating an iterator and catching StopIteration is a bit overkill.
I understand why you originally did that, but you don't need to pass the initial argument to reduce; it'll use the first element of the sequence if you don't pass anything.
The problem is that if the argument sequence has a length of only 1, then reduce will never pass it to numpy.result_type, and consequently it won't come back as a dtype; it'll just be whatever the original argument is.
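Putting the review points together, here is a minimal sketch of a helper along these lines; the name and the exact fallback condition are my assumptions, not necessarily what the PR merged.

from functools import reduce
import numpy as np

def result_type_many(*arrays_and_dtypes):
    """Like np.result_type, but tolerant of more than NPY_MAXARGS operands."""
    if not arrays_and_dtypes:
        raise ValueError('at least one array or dtype is required')
    try:
        # Common case: let NumPy do the promotion in one call. This also
        # guarantees a dtype comes back even for a single argument.
        return np.result_type(*arrays_and_dtypes)
    except ValueError:
        # Too many operands for NumPy's argument limit: fold pairwise instead.
        return reduce(np.result_type, arrays_and_dtypes)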
Okay, done, have a peek at the revised version.
looks ok 2 me ... i'll merge when it passes
@havoc-io pls add a release note (you will have to rebase as I just added the 0.14.0 section); add it under bug fixes pls (you can ref this PR)
I've added an entry to the 0.14.0 bug fix log.
Looks like the CI build passed - good to merge?
pls rebase on master; the release notes are in conflict
as soon as you do I can merge
…erties

This commit modifies the call to numpy.result_type to get around the
NPY_MAXARGS limit, which at the moment is 32. Instead of passing a
generator of all types involved in an expression, the type promotion
is done on a pair-wise basis with a call to reduce.

This fixes bugs for code such as the following:

from numpy.random import randn
from pandas import DataFrame

d = DataFrame(randn(10, 2), columns=list('ab'))

# Evaluates fine
print(d.eval('*'.join(['a'] * 32)))

# Fails to evaluate due to NumPy argument limits
print(d.eval('*'.join(['a'] * 33)))
Should be sorted now. Thanks.
BUG: Add type promotion support for eval() expressions with many properties
perfect thank you!
Looks good to me.
gr8! this didn't show up till the windows builds started failing...they don't have numexpr installed at all