DataFrame.query gives false positives #11743

ardoi · 2015-12-02T19:16:05Z

Noticing some strange behaviour when using the query method. I'm expecting empty results from the query but once in a while I'm getting a non-empty dataframe returned.

import pandas as pd

testdf = pd.DataFrame(pd.np.random.randint(100, size=(10000, 2)))
testdf.columns = ['a', 'b']

missing_value = -2000
print missing_value in testdf['a'].values
>>>False

Doing this 10000 times the classical way:

found_count = 0
for i in xrange(10000):
    qq = testdf[testdf['a']==missing_value]
    if not qq.empty:
        print "Value found"
        found_count += 1
print "Times found: {}".format(found_count)
>>>0

Using query:

found_count = 0
for i in xrange(1000):
    qq = testdf.query('a==-2000')
    if not qq.empty:
        print "Value found"
        found_count += 1
print "Times found: {}".format(found_count)
>>>Value found
>>>Value found
>>>Times found: 2

or

found_count = 0
for i in xrange(1000):
    qq = testdf.query('a==@missing_value')
    if not qq.empty:
        print "Value found"
        found_count += 1
print "Times found: {}".format(found_count)
>>>Value found
>>>Times found: 1

The number of non-empty results varies but tends to happen 0.2% of the time. Is this a bug or have I misunderstood what query is supposed to do? Version is 0.17.0.

jreback · 2015-12-02T22:19:43Z

pls include a np.random.seed, as this is impossible to replicate otherwise.

This is prob a floating point issue. numexpr (which is what is actually executing this) compares floats, I think.
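
A seeded variant of the reproduction could look like the sketch below. It is written for Python 3 and imports NumPy directly rather than through pd.np; the seed value, the run count, and the helper name count_false_positives are illustrative assumptions, not part of the original report.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed (assumption); fixes the generated data

testdf = pd.DataFrame(np.random.randint(100, size=(10000, 2)),
                      columns=['a', 'b'])
missing_value = -2000  # outside the generated range [0, 100)


def count_false_positives(frame, value, runs=1000):
    """Count how many repeated queries return rows for an absent value."""
    found_count = 0
    for _ in range(runs):
        # @value refers to the local variable of this function
        if not frame.query('a == @value').empty:
            found_count += 1
    return found_count


# Boolean-mask baseline: number of rows actually equal to the missing value
classic_hits = int((testdf['a'] == missing_value).sum())
# Repeated query calls; any non-zero count is a false positive
query_hits = count_false_positives(testdf, missing_value, runs=100)
print("classic hits: {}, query hits: {}".format(classic_hits, query_hits))
```

With the data fixed by the seed, any non-zero query count can only come from the evaluation itself and not from the random input.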

ardoi · 2015-12-02T22:48:42Z

Also happens with a dataframe containing strings:

testdf=pd.DataFrame([("one","two")]*10000)
testdf.columns=['a','b']

found_count = 0
for i in xrange(10000):
    qq = testdf.query('a=="three"')
    if not qq.empty:
        print "Value found"
        found_count += 1
print "Times found: {}".format(found_count)
>>>Value found
>>>Value found
>>>Value found
>>>...
>>>Times found: 13

jreback · 2015-12-03T00:42:48Z

pls show pd.show_versions()

this is not reproducible on master

ardoi · 2015-12-03T01:04:48Z

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 15.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: None
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)

jreback · 2015-12-03T01:12:18Z

so it might be numexpr 2.4.4 (though that is the latest in conda), you can try pip installing 2.4.6 which did fix some obscure indexing bugs.

I have quite similar versions of these packages & have not been able to repro on 0.17, 0.17.1, or master

does it always happen?

ardoi · 2015-12-03T01:41:02Z

Tried it on a different machine and also in a fresh IPython notebook, and the random query results do not happen there. The only place I can reproduce it is in one notebook that has several dataframes with >1e6 items each. In that particular notebook it always happens and I get random results from what should be a deterministic expression. I'll upgrade to the newer numexpr and see if it helps.

jreback · 2015-12-03T01:52:10Z

then it's almost certainly numexpr, as the indexing issues only happen with > 500k elements (note this was important information to have at the beginning).
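
One way to check whether numexpr is responsible is to run the same expression through the pure-Python evaluation engine, which bypasses numexpr. The following is a minimal sketch under assumptions: the frame size of 1e6 rows is chosen arbitrarily to exceed the 500k threshold mentioned above, and the seed and variable names are illustrative. It relies on the engine keyword that DataFrame.query forwards to pandas.eval.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed (assumption)

# Large frame (assumed size) so that any size-dependent bug can show up
bigdf = pd.DataFrame(np.random.randint(100, size=(1000000, 2)),
                     columns=['a', 'b'])

# engine='python' evaluates the expression without numexpr
python_hits = len(bigdf.query('a == -2000', engine='python'))
# Default engine: numexpr if it is installed, otherwise python
default_hits = len(bigdf.query('a == -2000'))

print("python engine rows: {}, default engine rows: {}".format(
    python_hits, default_hits))
```

If the two counts ever differ for the same frame, the discrepancy points at the numexpr code path rather than at query itself.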

pekaalto · 2015-12-20T08:58:52Z

I just reinstalled the latest version of Anaconda, which has numexpr 2.4.4, and then this started happening. As for the OP, it happens very rarely and is very hard to reproduce.

I previously had other issues with 2.4.4, but 2.4.6 has been OK so far.

pekaalto · 2015-12-20T09:06:00Z

Versions for the record.

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: fi_FI

pandas: 0.17.1
nose: 1.3.7
pip: 7.1.2
setuptools: 18.5
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: 4.0.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.0
openpyxl: 2.2.6
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
Jinja2: 2.8

jreback · 2015-12-29T19:46:58Z

closing as this is a numexpr issue. maybe we should put in a doc-warning that we recommend > 2.4.6, though not available on conda...hmm I'll push on that.

so if anyone wants to put up a PR for install.rst, great.
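
Until such a warning exists in the docs, a check along these lines could be run at the top of a notebook. This is only a sketch: the helper names parse_version and numexpr_is_recommended are hypothetical, and treating 2.4.6 itself as acceptable is an assumption based on the reports above that 2.4.6 works.

```python
import re
import warnings

# Version reported in this thread as fixing the indexing bugs (assumed minimum)
MIN_RECOMMENDED = (2, 4, 6)


def parse_version(text):
    """Turn '2.4.4' into (2, 4, 4), keeping only leading digits of each part."""
    parts = []
    for piece in text.split('.')[:3]:
        match = re.match(r'\d+', piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def numexpr_is_recommended(version_text):
    """Return True if the given numexpr version is at least MIN_RECOMMENDED."""
    return parse_version(version_text) >= MIN_RECOMMENDED


try:
    import numexpr
except ImportError:
    numexpr = None  # pandas falls back to the python engine in this case

if numexpr is not None and not numexpr_is_recommended(numexpr.__version__):
    warnings.warn("numexpr {} is older than 2.4.6; DataFrame.query results "
                  "may be unreliable on large frames".format(numexpr.__version__))
```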

jreback added the Can't Repro label Dec 3, 2015

jreback closed this as completed Dec 29, 2015

jreback added the Docs and Compat (pandas objects compatibility with Numpy or Python functions) labels Dec 29, 2015

jreback mentioned this issue Jan 12, 2016

Query errors in Windows #12023

Closed
