-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: maybe_downcast_numeric interferes with PintPandas #55824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
The comment `# if we have any nulls, then we are done` is not consistent with the test `if isna(arr).any()` because `arr` is constructed only from the first element (`r[0]`) not the full ravel'd list of values. Moreover, calling `np.array()` on some random type can have surprising consequences. So instead, do the early-out test as intended, just using `r[0]` without going through `np.array()`. Then test other things about `r[0]`. Only then should we test all the values (and if we have any nulls, then we are done). See pandas-dev#55824 Signed-off-by: Michael Tiemann <[email protected]>
When I do:
I get Which isn't to say that the changes aren't welcome. But I think it'd be helpful to describe what you think is wrong with |
Pint is happy to convert dimensionless quantities to integers:
Numpy is happy to store object-type objects if they are so declared (even if Pint isn't thrilled about it):
Pint is willing to allow pd.isna() to examine magnitudes for NA-ness, but it draws the line at casting quantities as integers. Perhaps Numpy should check whether its arguments should be treated as number-like (and thus allow attempts to cast them to numeric, as The changes I made (which can be interpreted as an answer to the question "what do I think is wrong with |
Sure this qualifies as an assumption, but "not raising" seems pretty benign as far as assumptions go. I'm happy to work with you to look for solutions, but i think this assumption is going to be implicit in a lot of places in pandas. |
It's OK for the assumption to be implicit in a lot of places, and it may yet be the case that we find something really tricky where either Pint or Pandas needs to make a big call as to what's right. To make that big call, we'd probably want a much better example or many more examples of how this assumption affects Pint and/or Pandas. The PR for this bug doesn't suggest making such a call on either side...just reordering the conditional tests and simplifying the code so we don't have to make that call now. |
* Reorder tests in maybe_downcast_numeric The comment `# if we have any nulls, then we are done` is not consistent with the test `if isna(arr).any()` because `arr` is constructed only from the first element (`r[0]`) not the full ravel'd list of values. Moreover, calling `np.array()` on some random type can have surprising consequences. So instead, do the early-out test as intended, just using `r[0]` without going through `np.array()`. Then test other things about `r[0]`. Only then should we test all the values (and if we have any nulls, then we are done). See #55824 Signed-off-by: Michael Tiemann <[email protected]> * Simplify/optimize tests for downcasting If the first element of `result` is an array, ravel that to get element we will test. Otherwise use it as is. We only need to check whether `result` is all non-null once. Signed-off-by: Michael Tiemann <[email protected]> * Update cast.py Don't use deprecated array indexing on ExtensionArrays. We need to now us `iloc`. Signed-off-by: Michael Tiemann <[email protected]> * Eliminate need to call `ravel` When processing a multidimensional `ndarray`, we can get the first element by calling `result.item(0)` and completely avoid the copying needed by `ravel` to get the first element that way. We can also eliminates an additional conditional check. Signed-off-by: Michael Tiemann <[email protected]> --------- Signed-off-by: Michael Tiemann <[email protected]> Co-authored-by: Richard Shadrach <[email protected]>
Closed by #55825 |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Pint (version 0.22) treats a generic attempt to construct an array from list of Quantities as a request for just their magnitudes, and then it complains that it cannot convert a quantified type into a dimensionless quantity. One can argue that it could have better semantics in that case. Nevertheless, I think Pandas is to blame for trying to assume what might or might not happen when calling
np.array
on some random type.When I investigated what was going wrong, I found that the implementation of
maybe_downcast_numeric
was not following the comments in the code. I slightly modified the code to better follow what the comments seemed to intend, and this also fixed the fatal problems with Pint.The changes still elicit a warning about stripping units, specifically when
isna(r[0])
digs into the Quantity, doesn't see it as a scalar, but does ultimately see an array method it can call on the Quantity. That call returns the magnitude of the quantity (stripped of its units), and isna properly reports whether the magnitude is NA or not. I think that Pint has this default array behavior specifically to makeisna
work in this way.@hgrecco @FannySternfeld
Expected Behavior
We expect that the program will complete and print 'ok'
Installed Versions
INSTALLED VERSIONS
commit : d98e6f0
python : 3.11.5.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Wed Jul 5 22:21:56 PDT 2023; root:xnu-8796.141.3~6/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.0.dev0+391.gd98e6f00ca.dirty
numpy : 1.24.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.1.2
pip : 23.2.1
Cython : 0.29.33
pytest : 7.4.2
hypothesis : 6.84.2
sphinx : 6.2.1
blosc : 1.11.1
feather : None
xlsxwriter : 3.1.2
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.7
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat: None
fastparquet : 2023.8.0
fsspec : 2023.9.0
gcsfs : 2023.9.0
matplotlib : 3.7.2
numba : 0.57.1
numexpr : 2.8.5
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : 1.2.3
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2023.9.0
scipy : 1.11.2
sqlalchemy : 2.0.20
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.8.0
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None
The text was updated successfully, but these errors were encountered: