BUG: maybe_downcast_numeric interferes with PintPandas #55824

MichaelTiemannOSC · 2023-11-04T01:37:25Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as	np
import pint
from pint import Quantity as Q_

ureg = pint.get_application_registry()
data = pd.DataFrame({'m':[1, 2], 'unit':['m', 'kg']})
result = data['m'].combine(data['unit'], lambda m, u: Q_(m, u))
expected = np.array([Q_(1, 'm'), Q_(2, 'kg')], dtype=object)
assert all(result == expected)
print("ok")

Issue Description

Pint (version 0.22) treats a generic attempt to construct an array from list of Quantities as a request for just their magnitudes, and then it complains that it cannot convert a quantified type into a dimensionless quantity. One can argue that it could have better semantics in that case. Nevertheless, I think Pandas is to blame for trying to assume what might or might not happen when calling np.array on some random type.

When I investigated what was going wrong, I found that the implementation of maybe_downcast_numeric was not following the comments in the code. I slightly modified the code to better follow what the comments seemed to intend, and this also fixed the fatal problems with Pint.

The changes still elicit a warning about stripping units, specifically when isna(r[0]) digs into the Quantity, doesn't see it as a scalar, but does ultimately see an array method it can call on the Quantity. That call returns the magnitude of the quantity (stripped of its units), and isna properly reports whether the magnitude is NA or not. I think that Pint has this default array behavior specifically to make isna work in this way.

@hgrecco @FannySternfeld

Expected Behavior

We expect that the program will complete and print 'ok'

Installed Versions

INSTALLED VERSIONS

commit : d98e6f0
python : 3.11.5.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Wed Jul 5 22:21:56 PDT 2023; root:xnu-8796.141.3~6/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.0.dev0+391.gd98e6f00ca.dirty
numpy : 1.24.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.1.2
pip : 23.2.1
Cython : 0.29.33
pytest : 7.4.2
hypothesis : 6.84.2
sphinx : 6.2.1
blosc : 1.11.1
feather : None
xlsxwriter : 3.1.2
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.7
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat: None
fastparquet : 2023.8.0
fsspec : 2023.9.0
gcsfs : 2023.9.0
matplotlib : 3.7.2
numba : 0.57.1
numexpr : 2.8.5
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : 1.2.3
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2023.9.0
scipy : 1.11.2
sqlalchemy : 2.0.20
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.8.0
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

The text was updated successfully, but these errors were encountered:

The comment `# if we have any nulls, then we are done` is not consistent with the test `if isna(arr).any()` because `arr` is constructed only from the first element (`r[0]`) not the full ravel'd list of values. Moreover, calling `np.array()` on some random type can have surprising consequences. So instead, do the early-out test as intended, just using `r[0]` without going through `np.array()`. Then test other things about `r[0]`. Only then should we test all the values (and if we have any nulls, then we are done). See pandas-dev#55824 Signed-off-by: Michael Tiemann <[email protected]>

rhshadrach · 2023-11-05T23:48:28Z

Nevertheless, I think Pandas is to blame for trying to assume what might or might not happen when calling np.array on some random type.

When I do:

np.array([Q_(1, 'm')])

I get ValueError: setting an array element with a sequence.. In general, if you want to store a Python object v in pandas, it seems to me that np.array([v]) doesn't raise is a reasonable assumption.

Which isn't to say that the changes aren't welcome. But I think it'd be helpful to describe what you think is wrong with maybe_downcast_numeric, and what impact the changes might have from a user perspective.

MichaelTiemannOSC · 2023-11-06T02:10:40Z

Pint is happy to convert dimensionless quantities to integers:

>>> np.array([ureg.Quantity(1, 'dimensionless')])
array([1])

Numpy is happy to store object-type objects if they are so declared (even if Pint isn't thrilled about it):

>>> np.array([ureg.Quantity(1, 'm')], dtype='object')
<stdin>:1: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
array([<Quantity(1, 'meter')>], dtype=object)

Pint is willing to allow pd.isna() to examine magnitudes for NA-ness, but it draws the line at casting quantities as integers. Perhaps Numpy should check whether its arguments should be treated as number-like (and thus allow attempts to cast them to numeric, as np.array() attempts to do). But this feels to me like a question that needs some deeper discussions by people who understand these APIs much better than I do.

The changes I made (which can be interpreted as an answer to the question "what do I think is wrong with maybe_downcast_numeric?") prioritize first checking that we are dealing directly with numeric types, and if so try downcasting to better numeric types. And if not, it properly checks for NaNs across they whole array (as the comment says), which Pint is happy to support. There is no need to call np.array at all, so the API question need not be answered in order to progress this issue/pr.

jbrockmendel · 2023-11-17T23:15:15Z

Nevertheless, I think Pandas is to blame for trying to assume what might or might not happen when calling np.array on some random type.

Sure this qualifies as an assumption, but "not raising" seems pretty benign as far as assumptions go. I'm happy to work with you to look for solutions, but i think this assumption is going to be implicit in a lot of places in pandas.

MichaelTiemannOSC · 2023-11-18T04:00:14Z

It's OK for the assumption to be implicit in a lot of places, and it may yet be the case that we find something really tricky where either Pint or Pandas needs to make a big call as to what's right. To make that big call, we'd probably want a much better example or many more examples of how this assumption affects Pint and/or Pandas. The PR for this bug doesn't suggest making such a call on either side...just reordering the conditional tests and simplifying the code so we don't have to make that call now.

* Reorder tests in maybe_downcast_numeric The comment `# if we have any nulls, then we are done` is not consistent with the test `if isna(arr).any()` because `arr` is constructed only from the first element (`r[0]`) not the full ravel'd list of values. Moreover, calling `np.array()` on some random type can have surprising consequences. So instead, do the early-out test as intended, just using `r[0]` without going through `np.array()`. Then test other things about `r[0]`. Only then should we test all the values (and if we have any nulls, then we are done). See #55824 Signed-off-by: Michael Tiemann <[email protected]> * Simplify/optimize tests for downcasting If the first element of `result` is an array, ravel that to get element we will test. Otherwise use it as is. We only need to check whether `result` is all non-null once. Signed-off-by: Michael Tiemann <[email protected]> * Update cast.py Don't use deprecated array indexing on ExtensionArrays. We need to now us `iloc`. Signed-off-by: Michael Tiemann <[email protected]> * Eliminate need to call `ravel` When processing a multidimensional `ndarray`, we can get the first element by calling `result.item(0)` and completely avoid the copying needed by `ravel` to get the first element that way. We can also eliminates an additional conditional check. Signed-off-by: Michael Tiemann <[email protected]> --------- Signed-off-by: Michael Tiemann <[email protected]> Co-authored-by: Richard Shadrach <[email protected]>

MichaelTiemannOSC · 2024-02-27T22:36:45Z

Closed by #55825

MichaelTiemannOSC added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 4, 2023

MichaelTiemannOSC mentioned this issue Nov 4, 2023

Reorder tests in maybe_downcast_numeric #55825

Merged

4 tasks

rhshadrach added Dtype Conversions Unexpected or buggy dtype conversions Needs Info Clarification about behavior needed to assess issue labels Nov 5, 2023

rhshadrach removed Needs Info Clarification about behavior needed to assess issue Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 7, 2023

MichaelTiemannOSC closed this as completed Feb 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: maybe_downcast_numeric interferes with PintPandas #55824

BUG: maybe_downcast_numeric interferes with PintPandas #55824

MichaelTiemannOSC commented Nov 4, 2023 •

edited

Loading

INSTALLED VERSIONS

rhshadrach commented Nov 5, 2023

MichaelTiemannOSC commented Nov 6, 2023

jbrockmendel commented Nov 17, 2023

MichaelTiemannOSC commented Nov 18, 2023

MichaelTiemannOSC commented Feb 27, 2024

BUG: maybe_downcast_numeric interferes with PintPandas #55824

BUG: maybe_downcast_numeric interferes with PintPandas #55824

Comments

MichaelTiemannOSC commented Nov 4, 2023 • edited Loading

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

rhshadrach commented Nov 5, 2023

MichaelTiemannOSC commented Nov 6, 2023

jbrockmendel commented Nov 17, 2023

MichaelTiemannOSC commented Nov 18, 2023

MichaelTiemannOSC commented Feb 27, 2024

MichaelTiemannOSC commented Nov 4, 2023 •

edited

Loading