DataFrame.query raises ValueError when comparing columns with nullable dtypes #31913

daviewales · 2020-02-12T04:33:05Z

Code Sample

In [2]: df1 = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 2]})

In [3]: df1.dtypes
Out[3]:
A    int64
B    int64
dtype: object

In [4]: df2 = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 2]}, dtype='Int64')

In [5]: df2.dtypes
Out[5]:
A    Int64
B    Int64
dtype: object

In [6]: df1.query('A == B')
Out[6]:
   A  B
0  1  1
2  2  2

In [7]: df2.query('A == B')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-8efe41b297d7> in <module>
----> 1 df2.query('A == B')

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
   3229         kwargs["level"] = kwargs.pop("level", 0) + 1
   3230         kwargs["target"] = None
-> 3231         res = self.eval(expr, **kwargs)
   3232
   3233         try:

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in eval(self, expr, inplace, **kwargs)
   3344         kwargs["resolvers"] = kwargs.get("resolvers", ()) + tuple(resolvers)
   3345
-> 3346         return _eval(expr, inplace=inplace, **kwargs)
   3347
   3348     def select_dtypes(self, include=None, exclude=None) -> "DataFrame":

~/anaconda3/lib/python3.6/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
    335         eng = _engines[engine]
    336         eng_inst = eng(parsed_expr)
--> 337         ret = eng_inst.evaluate()
    338
    339         if parsed_expr.assigner is None:

~/anaconda3/lib/python3.6/site-packages/pandas/core/computation/engines.py in evaluate(self)
     71
     72         # make sure no names in resolvers and locals/globals clash
---> 73         res = self._evaluate()
     74         return reconstruct_object(
     75             self.result_type, res, self.aligned_axes, self.expr.terms.return_type

~/anaconda3/lib/python3.6/site-packages/pandas/core/computation/engines.py in _evaluate(self)
    112         scope = env.full_scope
    113         _check_ne_builtin_clash(self.expr)
--> 114         return ne.evaluate(s, local_dict=scope)
    115
    116

~/anaconda3/lib/python3.6/site-packages/numexpr/necompiler.py in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
    820     # Create a signature
    821     signature = [(name, getType(arg)) for (name, arg) in
--> 822                  zip(names, arguments)]
    823
    824     # Look up numexpr if possible.

~/anaconda3/lib/python3.6/site-packages/numexpr/necompiler.py in <listcomp>(.0)
    819
    820     # Create a signature
--> 821     signature = [(name, getType(arg)) for (name, arg) in
    822                  zip(names, arguments)]
    823

~/anaconda3/lib/python3.6/site-packages/numexpr/necompiler.py in getType(a)
    701     if kind == 'S':
    702         return bytes
--> 703     raise ValueError("unknown type %s" % a.dtype.name)
    704
    705

ValueError: unknown type object

Problem description

DataFrame.query raises ValueError: unknown type object for boolean comparisons when the dtype is one of the new nullable types. (I have tested this for both Int64 and string dtypes.)

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.8.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 17.7.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_AU.UTF-8
LOCALE           : en_AU.UTF-8

pandas           : 1.0.1
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 45.2.0.post20200210
Cython           : 0.29.15
pytest           : 5.3.5
hypothesis       : 5.4.1
sphinx           : 2.4.0
blosc            : None
feather          : None
xlsxwriter       : 1.2.7
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.12.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : 1.3.1
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.1.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 5.3.5
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.13
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.2.0
xlsxwriter       : 1.2.7
numba            : 0.48.0


zzapzzap · 2020-02-12T05:57:04Z

I can see the difference between 'df1' and 'df2'.
It seems to be a problem with the int64 of df1, not with the Int64 of df2.
I'll check the 'frame.py' file soon.
In the meantime, to make this code work, you can use int64 instead of Int64 in df2.

zzapzzap · 2020-02-12T06:54:28Z

What version of pandas are you using?
The released version (0.25.1) does not fix this bug, but the current development version (1.1.0.dev0+419.g625441b31) has fixed it.

I hope it helps you.

daviewales · 2020-02-12T07:22:48Z

@zzapzzap I'm using pandas 1.0.1 as shown in the output of pd.show_versions().
The problem is definitely with df2 and Int64, because that's the one throwing the error.

jorisvandenbossche · 2020-02-12T09:53:16Z

@daviewales Thanks for the report!
In general, the new nullable dtypes have not yet been tested much with query. Investigation into the underlying issue or contributions for better support are certainly welcome.

TomAugspurger · 2020-02-12T12:38:22Z

This is going through numexpr, which generally won't know about pd.NA. We would need to work with the numexpr devs, or more likely avoid passing NA there in the first place.

jreback · 2020-02-12T13:17:58Z

The nullable types should force the python engine path, as numexpr can't support anything that is not NumPy-based.
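Until that is handled inside pandas, a caller can request the python engine explicitly. This is a minimal sketch of a workaround, assuming the affected versions reported in this thread; it relies only on the `engine` keyword that `DataFrame.query` forwards to `pandas.eval`:

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 2]}, dtype='Int64')

# engine='python' is forwarded by query() to pandas.eval(), so the
# comparison is evaluated by pandas itself instead of being handed to numexpr.
result = df2.query('A == B', engine='python')
print(result)
```

The selected rows should match what `df1.query('A == B')` returns for the plain int64 frame.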

sumanau7 · 2020-02-26T11:05:28Z

I'm interested in picking this up. Should we set
kwargs["engine"] = "python" for nullable types before calling self.eval in core/frame.py?

Also, what is the recommended way to check whether a DataFrame has nullable dtypes? I didn't find a suitable method exposed in dtypes/api.py.

And if this is the right way to proceed, where else in the code should this check happen?

@jreback
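One possible shape for such a check, written as a user-level sketch rather than the change that eventually went into pandas: treat any column whose dtype is a `pandas.api.extensions.ExtensionDtype` as needing the python engine. The helper names `has_extension_dtypes` and `query_with_python_fallback` are hypothetical and are not part of the pandas API.

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype


def has_extension_dtypes(df):
    """Return True if any column uses an extension dtype (e.g. Int64, string)."""
    return any(isinstance(dtype, ExtensionDtype) for dtype in df.dtypes)


def query_with_python_fallback(df, expr, **kwargs):
    """Hypothetical wrapper: use the python engine when numexpr cannot cope."""
    if has_extension_dtypes(df):
        # Only set the engine if the caller did not choose one explicitly.
        kwargs.setdefault("engine", "python")
    return df.query(expr, **kwargs)


df1 = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 2]})
df2 = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 2]}, dtype='Int64')

print(has_extension_dtypes(df1), has_extension_dtypes(df2))
print(query_with_python_fallback(df2, 'A == B'))
```

This check is deliberately broad: it also matches extension dtypes that are not nullable (categorical, for example), which seems consistent with the remark that numexpr can only handle NumPy-based data.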

sumanau7 · 2020-02-26T11:07:55Z

take

eduardopaul · 2021-03-18T18:23:38Z

I've just run across this issue using pandas 1.2.3. Has there been any development here?

marianoju · 2021-10-25T12:23:09Z

I've just run across this issue using pandas 1.3.2. dtype Int64 causes ValueError: unknown type object.

jorisvandenbossche added the Bug, NA - MaskedArrays (Related to pd.NA and nullable extension arrays), expressions (pd.eval, query), and ExtensionArray (Extending pandas with custom dtypes or arrays) labels and removed the NA - MaskedArrays label Feb 12, 2020

jorisvandenbossche added this to the Contributions Welcome milestone Feb 12, 2020

github-actions bot assigned sumanau7 Feb 26, 2020

jbrockmendel added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Dec 21, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

phofl mentioned this issue Jan 18, 2023

BUG: eval and query not working with ea dtypes #50764

Merged


mroeschke closed this as completed in #50764 Feb 9, 2023
