DataFrame.query produces inconsistent results on Windows 8.1 and higher #12055

Closed
bridwell opened this issue Jan 15, 2016 · 2 comments
Labels
Compat pandas objects compatability with Numpy or Python functions Duplicate Report Duplicate issue or pull request

Comments

@bridwell

I have had issues with DataFrame.query() producing erroneous results on machines with Windows 8.1, Windows Server 2012 R2 and Windows 10.

The issue can be reproduced with the code below, which creates a simple data frame, queries it repeatedly, and checks the results. The issue appears to be intermittent: running the script multiple times shows the query failing on different tests. It also seems to depend on the data frame size: it does not seem to occur for data frames below ~1,000 rows, data frames with ~1,000 to 100,000 rows return the wrong results, and data frames with more than 100,000 rows commonly return no results.

import numpy as np
import pandas as pd

# create a data frame to test with
num_groups = 100
rows_per_group = 1000

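# set the evaluation engine: True passes engine='python' to query(),
# False uses the default engine (numexpr when it is installed)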
use_python_engine = True

df = pd.DataFrame({
    'grp_col': np.arange(num_groups).repeat(rows_per_group),
    'val_col': np.random.rand(num_groups * rows_per_group)
    })

# check counts via group by
print df.groupby('grp_col').size()

# look for problems in DataFrame.query
tests_per_group = 1000

print '-------------------------'
print ' expected rows per group: ' + str(rows_per_group)
print '-------------------------'

bad_result = []

for i in range(num_groups):

    print 'group: ' + str(i)
    query_str = 'grp_col == ' + str(i)

    for j in range(0, tests_per_group):

        if use_python_engine:
            result = df.query(query_str, engine='python')
        else:
            result = df.query(query_str)
        min_grp = result.grp_col.min()
        max_grp = result.grp_col.max()

        if len(result) != rows_per_group or i != min_grp or i != max_grp:
            bad_result.append(result)
            print ' bad query result on test #{}, num rows: {}'.format(str(j), str(len(result)))
            print ' min group: ' + str(min_grp)
            print ' max group: ' + str(max_grp)

I believe the problem perhaps lies with the numexpr evaluation. In the above script, changing the evaluation engine to python does not return any errors. Also, the script below demonstrates that the issue occurs when using numexpr by itself:

import numpy as np
import numexpr as ne

# create an array to test with
num_items = 10000
arr = np.array(np.random.randint(0, 100, num_items))

# get a value to query from the 1st entry in the array
query_value = arr[0]
print 'query value: ' + str(query_value)

# get the corresponding query result
count = len(arr[arr == query_value])
print 'query result count: ' + str(count)
print ''

num_tests = 100

for i in range(num_tests):
    curr_test = len(arr[ne.evaluate("arr == " + str(query_value))])

    if curr_test != count:
        print "***Query problem:"
        print "   on test: " + str(i)
        print "   returned count: " + str(curr_test)
@jreback
Contributor

jreback commented Jan 15, 2016

dupe of #12023 (and a couple of others)

simply upgrade to numexpr 2.4.6
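
For anyone landing here, a minimal sketch of how one might check whether the installed numexpr predates the 2.4.6 release mentioned above. The helper names `parse_version` and `needs_upgrade` are hypothetical and only for illustration, and treating 2.4.6 as the minimum safe version is an assumption based on this thread rather than on the numexpr changelog.

```python
import re

# Version recommended in this thread; assumed here to be the minimum safe one.
MIN_NUMEXPR = (2, 4, 6)


def parse_version(version):
    """Turn a string such as '2.4.6' or '2.4.6.dev0' into a tuple of ints."""
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            # stop at the first non-numeric component (e.g. 'dev0')
            break
        parts.append(int(match.group()))
    return tuple(parts)


def needs_upgrade(version, minimum=MIN_NUMEXPR):
    """Return True when `version` is older than `minimum`."""
    return parse_version(version) < minimum


if __name__ == '__main__':
    try:
        import numexpr
    except ImportError:
        print('numexpr is not installed')
    else:
        if needs_upgrade(numexpr.__version__):
            print('numexpr %s is older than 2.4.6' % numexpr.__version__)
        else:
            print('numexpr %s is 2.4.6 or newer' % numexpr.__version__)
```

If the check reports an older version, `pip install --upgrade numexpr` (or the equivalent for your package manager) should pull in a newer release. Until then, passing `engine='python'` to `DataFrame.query`, as in the first script above, avoids numexpr entirely.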

@jreback jreback closed this as completed Jan 15, 2016
@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label Jan 15, 2016
@bridwell
Author

Awesome. Thanks.

@jorisvandenbossche jorisvandenbossche added the Duplicate Report Duplicate issue or pull request label Jan 15, 2016
@jorisvandenbossche jorisvandenbossche added this to the No action milestone Jan 15, 2016