
BUG: Add type promotion support for eval() expressions with many properties #6205


Merged
merged 1 commit into pandas-dev:master
Feb 6, 2014

Conversation

xenoscopic
Contributor

This commit modifies the call to numpy.result_type to get around the
NPY_MAXARGS limit, which at the moment is 32. Instead of passing a
generator of all types involved in an expression, the type promotion
is done on a pair-wise basis with a call to reduce.

This fixes bugs for code such as the following:

    from numpy.random import randn
    from pandas import DataFrame

    d = DataFrame(randn(10, 2), columns=list('ab'))

    # Evaluates fine
    print(d.eval('*'.join(['a'] * 32)))

    # Fails to evaluate due to NumPy argument limits
    print(d.eval('*'.join(['a'] * 33)))
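
For context, the underlying limit can be reproduced against NumPy directly. A minimal sketch, assuming a NumPy version where NPY_MAXARGS is 32 (as was the case when this was written); the exact error message may vary:

    from functools import reduce

    import numpy as np

    # numpy.result_type accepts at most NPY_MAXARGS (32) arguments;
    # one more makes it raise a ValueError
    dtypes = [np.dtype('float64')] * 33
    try:
        np.result_type(*dtypes)
    except ValueError as exc:
        print(exc)

    # pairwise promotion via reduce sidesteps the argument limit
    print(reduce(np.result_type, dtypes))  # float64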

@xenoscopic
Contributor Author

If someone could chime in to just double check my understanding of the behavior of numpy.result_type, that would be nice.

I think this is an important fix; otherwise it is essentially impossible to evaluate expressions whose tree representation has more than 32 leaves. Given that this limit seems to be imposed by NumPy, I assume it was not a conscious decision in the design of Pandas.

@cpcloud
Member

cpcloud commented Jan 31, 2014

@havoc-io Gimme some time to review

@cpcloud
Member

cpcloud commented Jan 31, 2014

I wasn't aware of this limit. Thanks for catching it.

    # use reduce instead of passing the full generator as *args to
    # result_type, because for large expressions there are often more than
    # NPY_MAXARGS (32) types
    return reduce(np.result_type,
Member

You should move reduce to pandas.compat and import it from there, since there's no built-in reduce in Python 3.

Contributor Author

It looks like reduce is actually already in pandas.compat, so I can go ahead and change to that. Do you have any qualms about just doing:

    from pandas.compat import reduce

and overriding the built-in reduce for Python 2.x? It's actually the same as the built-in for 2.x, but I can rename it to something like compat_reduce depending on your coding preferences.

@xenoscopic
Contributor Author

No rush or anything on the review. I want to make sure that my understanding of the result_type function (particularly in the context of the eval() infrastructure) is correct so this doesn't start returning some weird types from eval.

@xenoscopic
Contributor Author

I've gone ahead and added a reduce import from pandas.compat as well and squashed it into this commit.

@jreback
Contributor

jreback commented Jan 31, 2014

@havoc-io pls add the above test (which should fail without this change, and pass with it)

@cpcloud
Member

cpcloud commented Jan 31, 2014

Importing from compat is fine

@xenoscopic
Contributor Author

Added a test which fails without the change and succeeds with it. The test will raise an exception under the old code. Do you prefer to catch it and manually trigger a failure, or just let the test framework catch the exception and mark the test as an error (E)?

    # there is no way to get the value programmatically inside Python, so
    # we just hardcode a value > 32 here
    df.eval('*'.join(('a') * 33))

Contributor

No, you need to assert that this works, e.g. compare with a known value and use assert_frame_equal. By definition, without the change this will fail, but with the change it will return the correct value (and thus the test will pass).

So construct the expected result in some way and compare.

Member

Also, just use pd.eval here. There's no need to construct a DataFrame. Just something like

def check_many_exprs(parser, engine):
    a = 1
    expr = ' * '.join('a' * 33)
    expected = 1
    res = pd.eval(expr, parser=parser, engine=engine)
    tm.assert_equal(res, expected)

def test_many_exprs():
    for parser, engine in PARSERS_ENGINES:
        yield check_many_exprs, parser, engine

We generally frown upon yield-style nose tests, but they are useful to avoid repeating test functions that differ only in their arguments.

Contributor Author

Does implementing the test in TestEvalNumexprPandas not implement it for all parsers/engines? If not, where are you suggesting this test code be put?

Member

It doesn't automatically give pd.eval the parser and engine; those are class attributes (as opposed to instance attributes), so you have to pass self.parser and self.engine when you call pd.eval. You can probably just copy and paste what I wrote above into test_eval.py. I'm not sure if PARSERS_ENGINES is actually called that (it might be something else), but it's just itertools.product(PARSERS, ENGINES), where PARSERS and ENGINES are tuples or lists (can't remember) of the supported parsers/engines.
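
For reference, that parametrization would look roughly like this; PARSERS, ENGINES, and PARSERS_ENGINES are assumed names based on the comment above, not verbatim from test_eval.py:

    import itertools

    # hypothetical module-level constants: the supported parsers and
    # engines, plus their cross product driving the yield-style test
    PARSERS = ('python', 'pandas')
    ENGINES = ('python', 'numexpr')
    PARSERS_ENGINES = tuple(itertools.product(PARSERS, ENGINES))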

@xenoscopic
Contributor Author

I've updated the commit. The issue was marginally more extensive than I originally anticipated, so rather than scatter calls to reduce everywhere, I've added a pandas.computation.common._result_type_many utility function as a wrapper for calling numpy.result_type when passing an indefinite and potentially large number of arguments. It also slightly changes the behavior of my previous code to ensure that the call correctly succeeds for argument lists of length 1.

I've left in-place calls to numpy.result_type which have a small, definite number of arguments, since the wrapper function is not necessary.

In addition, I've added the test recommended by @cpcloud (thanks!).
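
A minimal sketch of what such a wrapper might look like, assuming the try/except shape that emerges later in this thread (hypothetical; the actual implementation in pandas.computation.common may differ in details):

    from functools import reduce

    import numpy as np

    def _result_type_many(*arrays_and_dtypes):
        """numpy.result_type with a workaround for the NPY_MAXARGS (32)
        argument limit."""
        try:
            return np.result_type(*arrays_and_dtypes)
        except ValueError:
            # more than NPY_MAXARGS terms in the expression: fall back
            # to pairwise promotion (length-1 inputs are handled by the
            # direct call above and never reach this branch)
            return reduce(np.result_type, arrays_and_dtypes)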

@cpcloud
Member

cpcloud commented Feb 3, 2014

Another thing to check is the associativity and commutativity of result_type. If it isn't both, then using reduce isn't guaranteed to give the same result as result_type. Intuitively I would expect it to be both, akin to a sort of max operation on dtypes, but maybe there's some edge case that I'm not thinking of.

@xenoscopic
Contributor Author

That was my intuition as well, and it does appear to be the case. The NumPy result_type call stack (since NumPy 1.6) effectively looks like:

  1. result_type (an alias for the C function array_result_type, which effectively just repackages the input array to be C-friendly), which calls
  2. PyArray_ResultType, which calls
  3. PyArray_PromoteTypes

The C code is a bit difficult to parse from a conceptual standpoint, but the documentation for the latter two functions gives a more concise explanation, which does indeed portray them as an effective "max" operation on dtypes. PyArray_PromoteTypes even explicitly states that it is symmetric and associative.

I don't think it would be possible to write an exhaustive test to confirm this, but the algorithm described in the documentation is quite clearly symmetric and associative. Aside from pinging the NumPy developers, I'm not really sure how else to verify.
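
A non-exhaustive spot-check of that claim is possible, though, comparing the pairwise reduction over every ordering against the direct call (a sketch, not a proof):

    from functools import reduce
    from itertools import permutations

    import numpy as np

    # if promotion is symmetric and associative, every ordering must
    # reduce to the same dtype as the direct joint promotion
    sample = [np.dtype(t) for t in ('int8', 'uint16', 'int32', 'float32')]
    direct = np.result_type(*sample)
    assert all(reduce(np.result_type, perm) == direct
               for perm in permutations(sample))
    print(direct)  # float64

All 24 orderings of that sample promote to float64, consistent with the documented symmetry and associativity.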

    try:
        first = next(arg_iter)
    except StopIteration:
        raise ValueError('at least one array or dtype is required')
Member

Just do

if not arrays_and_dtypes:
    raise ValueError("your message")

Creating an iterator and catching StopIteration is a bit overkill.

Member

I understand why you originally did that, but you don't need to pass the initial argument to reduce; it'll use the first element of the sequence if you don't pass anything.

Contributor Author

The problem is that if the argument sequence has a length of only 1, then reduce will never pass it to numpy.result_type; consequently it won't come back as a dtype, it'll just be whatever the original argument was.
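
A quick illustration of that edge case (reduce over a one-element sequence returns the element itself without ever calling the function):

    from functools import reduce

    import numpy as np

    arr = np.zeros(3, dtype='float32')

    # with a single element and no initializer, reduce never calls
    # np.result_type, so the raw ndarray comes back instead of a dtype
    print(type(reduce(np.result_type, [arr])))  # <class 'numpy.ndarray'>
    print(np.result_type(arr))                  # float32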

@xenoscopic
Contributor Author

Okay, done, have a peek at the revised version.

@cpcloud
Member

cpcloud commented Feb 3, 2014

looks ok 2 me ... i'll merge when it passes

@jreback
Contributor

jreback commented Feb 3, 2014

@havoc-io pls add a release note (you will have to rebase as I just added the 0.14.0 section); add it under bug fixes pls (you can ref this PR)

@xenoscopic
Contributor Author

I've added an entry to the 0.14.0 bug fix log.

@xenoscopic
Contributor Author

Looks like CI build passed - good to merge?

@jreback
Contributor

jreback commented Feb 6, 2014

pls rebase on master; the release notes are in conflict

@jreback
Contributor

jreback commented Feb 6, 2014

as soon as you do I can merge

@xenoscopic
Contributor Author

Should be sorted now. Thanks.

jreback added a commit that referenced this pull request Feb 6, 2014
BUG: Add type promotion support for eval() expressions with many properties
jreback merged commit 352db84 into pandas-dev:master Feb 6, 2014
@jreback
Contributor

jreback commented Feb 6, 2014

perfect thank you!

jreback added a commit that referenced this pull request Feb 6, 2014
@jreback
Contributor

jreback commented Feb 6, 2014

@havoc-io FYI, I had to move this to a class (rather than the global checker), as it didn't properly disable numexpr when necessary (e.g. when it was not available):

b0ba12a

it works, but pls confirm that it is still correctly checking

@xenoscopic
Contributor Author

Looks good to me.

@jreback
Contributor

jreback commented Feb 6, 2014

gr8! this didn't show up till the windows builds started failing...they don't have numexpr installed at all
