Skip to content

BUG: maybe_downcast_numeric interferes with PintPandas #55824

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
MichaelTiemannOSC opened this issue Nov 4, 2023 · 5 comments
Closed
3 tasks done

BUG: maybe_downcast_numeric interferes with PintPandas #55824

MichaelTiemannOSC opened this issue Nov 4, 2023 · 5 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions

Comments

@MichaelTiemannOSC
Copy link
Contributor

MichaelTiemannOSC commented Nov 4, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as	np
import pint
from pint import Quantity as Q_

ureg = pint.get_application_registry()
data = pd.DataFrame({'m':[1, 2], 'unit':['m', 'kg']})
result = data['m'].combine(data['unit'], lambda m, u: Q_(m, u))
expected = np.array([Q_(1, 'm'), Q_(2, 'kg')], dtype=object)
assert all(result == expected)
print("ok")

Issue Description

Pint (version 0.22) treats a generic attempt to construct an array from list of Quantities as a request for just their magnitudes, and then it complains that it cannot convert a quantified type into a dimensionless quantity. One can argue that it could have better semantics in that case. Nevertheless, I think Pandas is to blame for trying to assume what might or might not happen when calling np.array on some random type.

When I investigated what was going wrong, I found that the implementation of maybe_downcast_numeric was not following the comments in the code. I slightly modified the code to better follow what the comments seemed to intend, and this also fixed the fatal problems with Pint.

The changes still elicit a warning about stripping units, specifically when isna(r[0]) digs into the Quantity, doesn't see it as a scalar, but does ultimately see an array method it can call on the Quantity. That call returns the magnitude of the quantity (stripped of its units), and isna properly reports whether the magnitude is NA or not. I think that Pint has this default array behavior specifically to make isna work in this way.

@hgrecco @FannySternfeld

Expected Behavior

We expect that the program will complete and print 'ok'

Installed Versions

INSTALLED VERSIONS

commit : d98e6f0
python : 3.11.5.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Wed Jul 5 22:21:56 PDT 2023; root:xnu-8796.141.3~6/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.0.dev0+391.gd98e6f00ca.dirty
numpy : 1.24.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.1.2
pip : 23.2.1
Cython : 0.29.33
pytest : 7.4.2
hypothesis : 6.84.2
sphinx : 6.2.1
blosc : 1.11.1
feather : None
xlsxwriter : 3.1.2
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.7
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat: None
fastparquet : 2023.8.0
fsspec : 2023.9.0
gcsfs : 2023.9.0
matplotlib : 3.7.2
numba : 0.57.1
numexpr : 2.8.5
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : 1.2.3
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2023.9.0
scipy : 1.11.2
sqlalchemy : 2.0.20
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.8.0
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

@MichaelTiemannOSC MichaelTiemannOSC added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 4, 2023
MichaelTiemannOSC added a commit to MichaelTiemannOSC/pandas that referenced this issue Nov 4, 2023
The comment `# if we have any nulls, then we are done` is not consistent with the test `if isna(arr).any()` because `arr` is constructed only from the first element (`r[0]`) not the full ravel'd list of values.  Moreover, calling `np.array()` on some random type can have surprising consequences.

So instead, do the early-out test as intended, just using `r[0]` without going through `np.array()`.  Then test other things about `r[0]`.  Only then should we test all the values (and if we have any nulls, then we are done).

See pandas-dev#55824

Signed-off-by: Michael Tiemann <[email protected]>
@rhshadrach
Copy link
Member

Nevertheless, I think Pandas is to blame for trying to assume what might or might not happen when calling np.array on some random type.

When I do:

np.array([Q_(1, 'm')])

I get ValueError: setting an array element with a sequence.. In general, if you want to store a Python object v in pandas, it seems to me that np.array([v]) doesn't raise is a reasonable assumption.

Which isn't to say that the changes aren't welcome. But I think it'd be helpful to describe what you think is wrong with maybe_downcast_numeric, and what impact the changes might have from a user perspective.

@rhshadrach rhshadrach added Dtype Conversions Unexpected or buggy dtype conversions Needs Info Clarification about behavior needed to assess issue labels Nov 5, 2023
@MichaelTiemannOSC
Copy link
Contributor Author

Pint is happy to convert dimensionless quantities to integers:

>>> np.array([ureg.Quantity(1, 'dimensionless')])
array([1])

Numpy is happy to store object-type objects if they are so declared (even if Pint isn't thrilled about it):

>>> np.array([ureg.Quantity(1, 'm')], dtype='object')
<stdin>:1: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
array([<Quantity(1, 'meter')>], dtype=object)

Pint is willing to allow pd.isna() to examine magnitudes for NA-ness, but it draws the line at casting quantities as integers. Perhaps Numpy should check whether its arguments should be treated as number-like (and thus allow attempts to cast them to numeric, as np.array() attempts to do). But this feels to me like a question that needs some deeper discussions by people who understand these APIs much better than I do.

The changes I made (which can be interpreted as an answer to the question "what do I think is wrong with maybe_downcast_numeric?") prioritize first checking that we are dealing directly with numeric types, and if so try downcasting to better numeric types. And if not, it properly checks for NaNs across they whole array (as the comment says), which Pint is happy to support. There is no need to call np.array at all, so the API question need not be answered in order to progress this issue/pr.

@rhshadrach rhshadrach removed Needs Info Clarification about behavior needed to assess issue Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 7, 2023
@jbrockmendel
Copy link
Member

Nevertheless, I think Pandas is to blame for trying to assume what might or might not happen when calling np.array on some random type.

Sure this qualifies as an assumption, but "not raising" seems pretty benign as far as assumptions go. I'm happy to work with you to look for solutions, but i think this assumption is going to be implicit in a lot of places in pandas.

@MichaelTiemannOSC
Copy link
Contributor Author

It's OK for the assumption to be implicit in a lot of places, and it may yet be the case that we find something really tricky where either Pint or Pandas needs to make a big call as to what's right. To make that big call, we'd probably want a much better example or many more examples of how this assumption affects Pint and/or Pandas. The PR for this bug doesn't suggest making such a call on either side...just reordering the conditional tests and simplifying the code so we don't have to make that call now.

mroeschke pushed a commit that referenced this issue Nov 22, 2023
* Reorder tests in maybe_downcast_numeric

The comment `# if we have any nulls, then we are done` is not consistent with the test `if isna(arr).any()` because `arr` is constructed only from the first element (`r[0]`) not the full ravel'd list of values.  Moreover, calling `np.array()` on some random type can have surprising consequences.

So instead, do the early-out test as intended, just using `r[0]` without going through `np.array()`.  Then test other things about `r[0]`.  Only then should we test all the values (and if we have any nulls, then we are done).

See #55824

Signed-off-by: Michael Tiemann <[email protected]>

* Simplify/optimize tests for downcasting

If the first element of `result` is an array, ravel that to get element we will test.  Otherwise use it as is.

We only need to check whether `result` is all non-null once.

Signed-off-by: Michael Tiemann <[email protected]>

* Update cast.py

Don't use deprecated array indexing on ExtensionArrays.  We need to now us `iloc`.

Signed-off-by: Michael Tiemann <[email protected]>

* Eliminate need to call `ravel`

When processing a multidimensional `ndarray`, we can get the first element by calling `result.item(0)` and completely avoid the copying needed by `ravel` to get the first element that way.  We can also eliminates an additional conditional check.

Signed-off-by: Michael Tiemann <[email protected]>

---------

Signed-off-by: Michael Tiemann <[email protected]>
Co-authored-by: Richard Shadrach <[email protected]>
@MichaelTiemannOSC
Copy link
Contributor Author

Closed by #55825

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

No branches or pull requests

3 participants