DataFrame.query gives false positives #11743

Closed
ardoi opened this issue Dec 2, 2015 · 10 comments
Labels
Compat pandas objects compatability with Numpy or Python functions Docs

Comments

@ardoi

ardoi commented Dec 2, 2015

Noticing some strange behaviour when using the query method. I'm expecting empty results from the query but once in a while I'm getting a non-empty dataframe returned.

import pandas as pd

testdf = pd.DataFrame(pd.np.random.randint(100, size=(10000, 2)))
testdf.columns = ['a', 'b']

missing_value = -2000
print missing_value in testdf['a'].values
>>>False

Doing this 10000 times the classical way:

found_count = 0
for i in xrange(10000):
    qq = testdf[testdf['a']==missing_value]
    if not qq.empty:
        print "Value found"
        found_count += 1
print "Times found: {}".format(found_count)
>>>0

Using query:

found_count = 0
for i in xrange(1000):
    qq = testdf.query('a==-2000')
    if not qq.empty:
        print "Value found"
        found_count += 1
print "Times found: {}".format(found_count)
>>>Value found
>>>Value found
>>>Times found: 2

or

found_count = 0
for i in xrange(1000):
    qq = testdf.query('a==@missing_value')
    if not qq.empty:
        print "Value found"
        found_count += 1
print "Times found: {}".format(found_count)
>>>Value found
>>>Times found: 1

The number of non-empty results varies, but it tends to happen about 0.2% of the time. Is this a bug, or have I misunderstood what query is supposed to do? Version is 0.17.0.

@jreback
Contributor

jreback commented Dec 2, 2015

pls include a np.random.seed, as this is impossible to replicate otherwise.

This is prob a floating point issue. numexpr (which is what is actually executing this) compares floats, I think.
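
A seeded version of the comparison in the report might look like the sketch below. It is written for Python 3 and a current pandas rather than the Python 2.7 / pandas 0.17.0 setup in the report, and the loop count of 100 and the counter names `mask_hits` and `query_hits` are assumptions chosen for illustration. On an unaffected install both counters are expected to stay at zero.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the frame is identical on every run

testdf = pd.DataFrame(np.random.randint(100, size=(10000, 2)), columns=["a", "b"])
missing_value = -2000  # never produced by randint(100), so no row should match

mask_hits = 0   # non-empty results from plain boolean indexing
query_hits = 0  # non-empty results from DataFrame.query
for _ in range(100):
    if not testdf[testdf["a"] == missing_value].empty:
        mask_hits += 1
    if not testdf.query("a == -2000").empty:
        query_hits += 1

print("boolean mask hits:", mask_hits)
print("query hits:", query_hits)
```

Any non-zero `query_hits` alongside a zero `mask_hits` would point at the query evaluation path rather than at the data.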

@ardoi
Author

ardoi commented Dec 2, 2015

Also happens with a dataframe containing strings:

testdf=pd.DataFrame([("one","two")]*10000)
testdf.columns=['a','b']
found_count = 0
for i in xrange(10000):
    qq = testdf.query('a=="three"')
    if not qq.empty:
        print "Value found"
        found_count += 1
print "Times found: {}".format(found_count)
>>>Value found
>>>Value found
>>>Value found
>>>...
>>>Times found: 13

@jreback
Contributor

jreback commented Dec 3, 2015

pls show pd.show_versions()

this is not reproducible on master

@ardoi
Author

ardoi commented Dec 3, 2015

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 15.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: None
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)

@jreback
Contributor

jreback commented Dec 3, 2015

so it might be numexpr 2.4.4 (though that is the latest in conda), you can try pip installing 2.4.6 which did fix some obscure indexing bugs.

I have quite similar versions of these packages & have not been able to repro on 0.17, 0.17.1, or master

does it always happen?
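
To see which numexpr an environment actually has, a check along these lines may help. It is only a sketch: the `MIN_NUMEXPR` constant and the `parse_version` helper are hypothetical names introduced here, and the 2.4.6 threshold is simply the version suggested in this thread, not an official requirement.

```python
import importlib.util

MIN_NUMEXPR = (2, 4, 6)  # version suggested in this thread (assumption, not an official minimum)


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(digits))
    return tuple(parts)


if importlib.util.find_spec("numexpr") is None:
    print("numexpr is not installed")
else:
    import numexpr

    installed = parse_version(numexpr.__version__)
    if installed < MIN_NUMEXPR:
        print("numexpr", numexpr.__version__, "is older than 2.4.6; consider upgrading")
    else:
        print("numexpr", numexpr.__version__, "is at least 2.4.6")
```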

@ardoi
Author

ardoi commented Dec 3, 2015

Tried it on a different machine and also in a fresh IPython notebook, and the random query results do not happen there. The only place I can reproduce it is in one notebook that has several dataframes with >1e6 items each. In that particular notebook it always happens, and I get random results from what should be a deterministic expression. I'll upgrade to the newer numexpr and see if it helps.

@jreback
Contributor

jreback commented Dec 3, 2015

then it's almost certainly numexpr, as the indexing issues only happen with > 500k elements (note this was important information to have at the beginning).
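
One way to check whether numexpr is involved is to run the same expression with `engine="python"`, which `DataFrame.query` passes through to `pandas.eval`, and compare it with the default engine. The sketch below assumes Python 3 and a current pandas; the frame size of 600,000 rows is an arbitrary choice to get past the 500k elements mentioned above, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# 600,000 rows x 2 columns, i.e. more than 500k elements
testdf = pd.DataFrame(np.random.randint(100, size=(600_000, 2)), columns=["a", "b"])

expr = "a == -2000"  # no row can match, since values are in [0, 100)

default_result = testdf.query(expr)                  # numexpr if installed, otherwise python
python_result = testdf.query(expr, engine="python")  # evaluates without numexpr

print("default engine rows:", len(default_result))
print("python engine rows:", len(python_result))
print("engines agree:", default_result.equals(python_result))
```

If the default engine occasionally returns rows while the python engine never does, that would be consistent with the numexpr explanation given here.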

@pekaalto

I just reinstalled the latest version of Anaconda, which has numexpr 2.4.4, and then this started happening. As for the OP, it happens very rarely and is very hard to reproduce.

I previously had other issues with 2.4.4, but 2.4.6 has been ok so far.

@pekaalto

Versions for the record.

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: fi_FI

pandas: 0.17.1
nose: 1.3.7
pip: 7.1.2
setuptools: 18.5
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: 4.0.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.0
openpyxl: 2.2.6
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
Jinja2: 2.8

@jreback
Contributor

jreback commented Dec 29, 2015

closing as this is a numexpr issue. maybe we should put in a doc-warning that we recommend >= 2.4.6, though it is not available on conda...hmm I'll push on that.

so if anyone wants to put up a PR for install.rst, great.

@jreback jreback closed this as completed Dec 29, 2015
@jreback jreback added Docs Compat pandas objects compatability with Numpy or Python functions labels Dec 29, 2015

3 participants