BUG: Query on Int64Dtype raises ValueError #50261

Closed
2 of 3 tasks
anthonyaag opened this issue Dec 14, 2022 · 0 comments · Fixed by #50764
Labels: Bug; expressions (pd.eval, query); NA - MaskedArrays (related to pd.NA and nullable extension arrays)

Comments

@anthonyaag

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

# One nullable-integer column (Int64Dtype) and one plain NumPy int column.
df = pd.DataFrame(
    [[1, 1], [2, 2], [3, 3]],
    columns=["pd_dtype", "np_dtype"],
).astype({"pd_dtype": pd.Int64Dtype(), "np_dtype": int})
ref = {2, 3}
print(pd.__version__, np.__version__)
print(df.query("np_dtype in @ref"))  # works
print(df.query("pd_dtype in @ref"))  # raises ValueError

Issue Description

Based on debugging, this appears to be the execution path:

column in @env_variable -> pd.Series containing True/False -> necompiler -> np.array containing True/False -> df.loc[nparray]

At some point during the query, the Int64Dtype column produces an intermediate pd.Series whose dtype is pandas.core.arrays.boolean.BooleanDtype. In the working example, the intermediate pd.Series has dtype numpy.dtype[bool_], which NumPy is perfectly happy with.

NumPy doesn't know how to handle BooleanDtype, so np.asarray returns an np.array with dtype 'O', and the type check raises an exception:
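The dtype difference described above can be inspected outside of query(). The following is a minimal sketch, not the code path pandas itself takes: it assumes that calling Series.isin directly approximates the `in` step of the query, and the names np_mask and pd_mask are made up for illustration. The last line prints what NumPy sees for the nullable mask; this report observed object on pandas 1.5.1, and other versions may differ.

```python
import numpy as np
import pandas as pd

# Same frame as the reproducible example in this report.
df = pd.DataFrame(
    [[1, 1], [2, 2], [3, 3]],
    columns=["pd_dtype", "np_dtype"],
).astype({"pd_dtype": "Int64", "np_dtype": int})
ref = {2, 3}

# Membership test on each column, done directly instead of through query().
# Assumption: this approximates the intermediate Series built for "col in @ref".
np_mask = df["np_dtype"].isin(ref)  # plain NumPy-backed mask
pd_mask = df["pd_dtype"].isin(ref)  # nullable (masked) boolean mask

print(np_mask.dtype)  # bool
print(pd_mask.dtype)  # boolean

# What NumPy receives when handed the nullable mask (version dependent).
print(np.asarray(pd_mask).dtype)
```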

.../python3.10/site-packages/numexpr/necompiler.py:692, in getType(a)
    690 if kind == 'U':
    691     raise ValueError('NumExpr 2 does not support Unicode as a dtype.')
--> 692 raise ValueError("unknown type %s" % a.dtype.name)

ValueError: unknown type object

I went back and ran the "bad" scenario on an earlier pandas version, and there the intermediate pd.Series had the normal NumPy bool dtype. According to git blame, none of the relevant query/eval/numexpr code seems to have changed in roughly the past four years, so something in pandas 1.5 must be converting results to BooleanDtype more aggressively.

Expected Behavior

I would expect both Int64Dtype and int to work in a query, and both to return:

   pd_dtype  np_dtype
1         2         2
2         3         3
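Until the query path handles nullable dtypes, the expected rows can be obtained by building the mask by hand. This is a possible workaround sketched under assumptions, not something proposed in the report: it bypasses query() entirely and converts the nullable mask to a plain NumPy bool array, which is only safe when the column holds no pd.NA values. The names mask and result are illustrative.

```python
import pandas as pd

# Same frame as the reproducible example in this report.
df = pd.DataFrame(
    [[1, 1], [2, 2], [3, 3]],
    columns=["pd_dtype", "np_dtype"],
).astype({"pd_dtype": "Int64", "np_dtype": int})
ref = {2, 3}

# Build the membership mask directly and force it to a NumPy bool array.
# Assumption: the column has no missing values; with pd.NA present,
# to_numpy(dtype=bool) would need an explicit na_value.
mask = df["pd_dtype"].isin(ref).to_numpy(dtype=bool)

# Select rows with the plain boolean mask instead of df.query(...).
result = df.loc[mask]
print(result)
```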

Installed Versions

INSTALLED VERSIONS

commit : 91111fd
python : 3.10.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.103.hrtdev
Version : #1 SMP Tue Mar 8 11:50:55 EST 2022
machine : x86_64
processor :
byteorder : little
LC_ALL :
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.1
numpy : 1.23.2
pytz : 2022.1
dateutil : 2.8.2
setuptools : 62.1.0
pip : 22.3.1
Cython : 0.29.30
pytest : 7.1.2
hypothesis : 6.52.3
sphinx : 5.0.2
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.5
brotli : 1.0.9
fastparquet : 0.8.1
fsspec : 2022.5.0
gcsfs : None
matplotlib : 3.5.2+hrt1
numba : 0.56.2
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : 1.0.9
s3fs : None
scipy : 1.9.1
snappy : None
sqlalchemy : 1.4.39
tables : 3.7.0
tabulate : 0.8.10
xarray : 2022.6.0
xlrd : 2.0.1
xlwt : None
zstandard : 0.18.0
tzdata : 2022.1

anthonyaag added the Bug and Needs Triage labels on Dec 14, 2022.
phofl added the expressions (pd.eval, query) and NA - MaskedArrays labels and removed the Needs Triage label on Dec 15, 2022.