API: should to_numpy() by default return the corresponding type, and raise otherwise? #48891

Closed
MarcoGorelli opened this issue Sep 30, 2022 · 8 comments · Fixed by #55058
Labels: API Design; NA - MaskedArrays (related to pd.NA and nullable extension arrays)

Comments

@MarcoGorelli
Member

Currently, to_numpy() from a nullable type always returns dtype object, regardless of whether the Series contains missing values.
E.g.

In [3]: pd.Series([1,2,3], dtype='Int64').to_numpy()
Out[3]: array([1, 2, 3], dtype=object)

This isn't great for users/downstream libraries. E.g. matplotlib/matplotlib#23991 (comment)

Users can specify a dtype in to_numpy(), but if they want to mirror the current behaviour of pandas whereby one has:

In [4]: pd.Series([1,2,3], dtype='int64').to_numpy()
Out[4]: array([1, 2, 3])

In [5]: pd.Series([1,2,3], dtype='float64').to_numpy()
Out[5]: array([1., 2., 3.])

then they would need to add extra logic to their code.
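A minimal sketch of the kind of extra logic this refers to. It assumes the nullable dtype exposes a `numpy_dtype` attribute, as the masked Int64/Float64/boolean dtypes do; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3], dtype="Int64")

# Masked nullable dtypes (Int64, Float64, boolean) expose the matching
# NumPy dtype as `numpy_dtype`; plain NumPy dtypes have no such attribute.
target = getattr(ser.dtype, "numpy_dtype", None)

# Fall back to the default conversion when there is no nullable dtype involved.
arr = ser.to_numpy() if target is None else ser.to_numpy(dtype=target)
print(arr.dtype)  # int64
```

If the Series contained pd.NA, the explicit int64 conversion would raise a ValueError, so callers would also need to decide how to handle missing values.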

I'd suggest that instead, to_numpy() just always convert to the corresponding numpy type. Then:

  • if a Series has no NA, then to_numpy() will just convert to the corresponding numpy type:
In [4]: pd.Series([1,2,3], dtype='Int64').to_numpy()
Out[4]: array([1, 2, 3])

In [5]: pd.Series([1,2,3], dtype='Float64').to_numpy()
Out[5]: array([1., 2., 3.])
  • if a Series does have NA, then unless it's being converted to float (which can handle nan), it'll raise:
In [8]: pd.Series([1,2,pd.NA], dtype='Int64').to_numpy(int)
---------------------------------------------------------------------------
ValueError: cannot convert to '<class 'int'>'-dtype NumPy array with missing values.
Please either:
- convert to 'float'
- convert to 'object'
- specify an appropriate 'na_value' for this dtype
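The two bullets above can be emulated on top of the existing API. This is only a sketch of the proposed semantics, not the pandas implementation; `proposed_to_numpy` is a hypothetical helper, and it assumes masked dtypes expose `numpy_dtype`:

```python
import numpy as np
import pandas as pd


def proposed_to_numpy(ser: pd.Series) -> np.ndarray:
    """Hypothetical helper emulating the behaviour proposed in this issue."""
    numpy_dtype = getattr(ser.dtype, "numpy_dtype", None)
    if numpy_dtype is None:
        # Not a masked nullable dtype: keep the default conversion.
        return ser.to_numpy()
    if ser.isna().any():
        if numpy_dtype.kind == "f":
            # Floats can represent missing values as NaN.
            return ser.to_numpy(dtype=numpy_dtype, na_value=np.nan)
        raise ValueError(
            f"cannot convert to {numpy_dtype} with missing values; "
            "pass an explicit dtype or na_value"
        )
    # No missing values: convert to the corresponding NumPy dtype.
    return ser.to_numpy(dtype=numpy_dtype)


print(proposed_to_numpy(pd.Series([1, 2, 3], dtype="Int64")).dtype)
```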

This would achieve:

@phofl
Member

phofl commented Sep 30, 2022

Could you include how you would handle df.values/ser.values?

@MarcoGorelli
Member Author

I don't think they'd need to change

@mroeschke
Member

  • specify an appropriate 'na_value' for this dtype

Does an "appropriate" na_value include one that could be cast to the dtype losslessly, e.g. dtype="Int64" & na_value=1.0 -> np.int64?
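One possible reading of "losslessly" for integer targets is a round-trip check. The sketch below is an assumption about what such a rule could look like, not something pandas is known to implement; `fits_integer_dtype_losslessly` is a hypothetical name:

```python
import numpy as np


def fits_integer_dtype_losslessly(na_value, dtype) -> bool:
    """Hypothetical check: does na_value survive a cast to an integer dtype?"""
    scalar_type = np.dtype(dtype).type
    try:
        # 1.0 -> 1 compares equal to 1.0; 1.5 -> 1 does not compare equal to 1.5.
        return bool(scalar_type(na_value) == na_value)
    except (TypeError, ValueError, OverflowError):
        # e.g. NaN cannot be converted to an integer at all.
        return False


print(fits_integer_dtype_losslessly(1.0, "int64"))
print(fits_integer_dtype_losslessly(1.5, "int64"))
```

Because NaN never compares equal to itself, this check is only meaningful for integer targets.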

@jreback
Contributor

jreback commented Oct 1, 2022

this looks reasonable to do
it's basically best efforts conversion

@tacaswell
Contributor

What would

pd.Series([1,2,pd.NA], dtype='Int64').to_numpy()

do? I can see a case both for "promote to float" and "raise".

@MarcoGorelli
Member Author

Thanks for weighing in - I'm suggesting it'd raise, because numpy doesn't have a missing value for int64, so it'd be up to the user to be explicit about what they want to happen.
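For illustration, being explicit could look like any of the three options named in the error message quoted earlier. This sketch uses only the existing `dtype` and `na_value` arguments of `to_numpy()`; the sentinel value -1 is an arbitrary choice for the example:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, pd.NA], dtype="Int64")

# Option 1: promote to float and represent the missing value as NaN.
as_float = ser.to_numpy(dtype="float64", na_value=np.nan)

# Option 2: keep pd.NA by converting to object dtype.
as_object = ser.to_numpy(dtype=object)

# Option 3: stay in int64 by substituting a sentinel for the missing value.
as_int = ser.to_numpy(dtype="int64", na_value=-1)

print(as_float.dtype, as_object.dtype, as_int.dtype)
```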

@MarcoGorelli
Member Author

MarcoGorelli commented Dec 9, 2022

Numpy is going to error on equality comparisons between object ndarrays which contain pd.NA, so this is kinda important to solve before too long

In [1]: np.array([1, 3]) == np.array([4, pd.NA])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 1
----> 1 np.array([1, 3]) == np.array([4, pd.NA])

File ~/pandas-dev/pandas/_libs/missing.pyx:413, in pandas._libs.missing.NAType.__bool__()
    411 
    412     def __bool__(self):
--> 413         raise TypeError("boolean value of NA is ambiguous")
    414 
    415     def __hash__(self):

TypeError: boolean value of NA is ambiguous
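Two ways to sidestep the object-dtype comparison, sketched under the assumption that either NaN semantics or pandas' masked semantics are acceptable for the caller:

```python
import numpy as np
import pandas as pd

left = np.array([1, 3])
right = pd.array([4, pd.NA], dtype="Int64")

# Compare in NumPy: convert to float so the missing value becomes NaN,
# which compares unequal to everything.
as_float = right.to_numpy(dtype="float64", na_value=np.nan)
numpy_result = left == as_float

# Compare in pandas: the result is a nullable boolean array in which
# the missing value propagates as pd.NA.
masked_result = pd.array(left, dtype="Int64") == right

print(numpy_result)
print(masked_result)
```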

@lukemanley
Member

+1 for avoiding object dtype where possible, e.g. at least in these cases:

In [2]: pd.array([1, 2], dtype="Int64").to_numpy().dtype
Out[2]: dtype('O')

In [3]: pd.array([1, pd.NA], dtype="Int64").to_numpy(na_value=2).dtype
Out[3]: dtype('O')

ArrowExtensionArray.to_numpy now has this behavior:

In [4]: pd.array([1, 2], dtype="int64[pyarrow]").to_numpy().dtype
Out[4]: dtype('int64')

In [5]: pd.array([1, pd.NA], dtype="int64[pyarrow]").to_numpy(na_value=2).dtype
Out[5]: dtype('int64')
