ExtensionDtype API for preferred type when converting to NumPy array #22791


Open · TomAugspurger opened this issue Sep 20, 2018 · 13 comments
Labels: Dtype Conversions (Unexpected or buggy dtype conversions) · Enhancement · ExtensionArray (Extending pandas with custom dtypes or arrays)

Comments

@TomAugspurger (Contributor) commented Sep 20, 2018:

This is coming out of SparseArray. Right now if you have a homogeneous DataFrame of EAs, then doing something like np.asarray(df) will always convert to object.

    def _interleave(self):
        """
        Return ndarray from blocks with specified item order
        Items must be contained in the blocks
        """
        dtype = _interleaved_dtype(self.blocks)
        result = np.empty(self.shape, dtype=dtype)
        ...

Should we give ExtensionDtype some input on the dtype used here? This would allow for some consistency between np.asarray(series) and np.asarray(homogeneous_dataframe).
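For concreteness, a small reproduction of the inconsistency using the nullable Int64 dtype (exact behavior may vary across pandas versions):

```python
import numpy as np
import pandas as pd

arr = pd.array([1, 2, 3], dtype="Int64")
ser = pd.Series(arr)
df = pd.DataFrame({"a": arr, "b": arr})

# The 1D conversion goes through IntegerArray.__array__,
# which produces an object array:
print(np.asarray(ser).dtype)  # object

# The 2D conversion goes through BlockManager._interleave instead,
# where the ExtensionDtype currently gets no say:
print(np.asarray(df).dtype)
```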

@TomAugspurger added the Dtype Conversions and ExtensionArray labels Sep 20, 2018
@TomAugspurger added this to the 0.24.0 milestone Sep 20, 2018
@jorisvandenbossche (Member) commented:

Not sure if it was from the same discussion, but I recall a discussion where we said that we basically wanted to know the "numpy dtype it would become if converted to a numpy array" of an ExtensionArray, without actually converting it.
That seems useful in general, and it could then be used to determine the best interleaved dtype in the use case above?

@jorisvandenbossche (Member) commented:

It was indeed from the sparse discussion, my comment there: #22325 (comment)

Basically, we want to know np.array(EA).dtype without doing the conversion to array.

We could add a numpy_dtype attribute to the Dtype object?

This might also overlap with #22224
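A rough sketch of what such an attribute could look like on an ExtensionDtype subclass (the numpy_dtype name and the toy MyDtype class are hypothetical illustrations, not existing pandas API):

```python
import numpy as np
from pandas.api.extensions import ExtensionDtype

class MyDtype(ExtensionDtype):
    """Toy dtype illustrating a hypothetical ``numpy_dtype`` attribute."""
    name = "my_dtype"
    type = float

    @classmethod
    def construct_array_type(cls):
        # A real implementation would return the matching ExtensionArray
        # subclass; omitted here for brevity.
        raise NotImplementedError

    @property
    def numpy_dtype(self) -> np.dtype:
        # The dtype np.asarray(my_array) would produce, answered
        # without materializing the array.
        return np.dtype("float64")

print(MyDtype().numpy_dtype)  # float64
```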

@TomAugspurger (Contributor, Author) commented:

There's some overlap with #22224, but I don't know if we can reuse the same attribute for both, unfortunately: IntegerArray's numpy_type is an integer dtype, which can't represent NaNs.

@jorisvandenbossche (Member) commented:

Yes, I don't think we can reuse it, but the numpy_dtype attribute of IntegerArray can of course still be changed.

The other issue is more about the dtype of the array(s) backing a composite array (which could be multiple arrays, and hence multiple dtypes), while here it is the dtype you would get when converting to a numpy array.

So I would rather reserve the numpy_dtype name for the use case of this issue.

@jorisvandenbossche (Member) commented:

Although the question is: what if this dtype depends on the data?
For example, for IntegerArray we decided for now to return an object array. But you could also opt to actually return an integer array if there are no NAs, and a float array otherwise.

@TomAugspurger (Contributor, Author) commented:

> But, you could also opt for actually returning the integer array

I assume you meant integer dtype.

That's a good point. It's not clear what's best here. I think IntegerArray may not be the best one to consider here, since the "best type" so clearly depends on the values. There's no way to put that on the ExtensionDtype.

Let's consider, say, Categorical and Sparse. In this case, there's always a clearly best dtype for the resulting ndarray, CategoricalDtype.categories.dtype and SparseDtype.subdtype (or whatever we end up calling the type of sp_values).
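Those two cases can be checked directly: both dtypes already carry the needed information, independent of the stored values (shown here with the actual SparseDtype.subtype spelling):

```python
import numpy as np
import pandas as pd

cat = pd.Categorical([1, 2, 1])
sp = pd.arrays.SparseArray([1.0, 0.0, 2.0])

# Both dtypes can name their "natural" ndarray dtype up front,
# independently of the values:
print(cat.dtype.categories.dtype)  # int64
print(sp.dtype.subtype)            # float64

# For the sparse case, np.asarray agrees with the advertised subtype:
print(np.asarray(sp).dtype)        # float64
```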

@jorisvandenbossche (Member) commented:

> But, you could also opt for actually returning the integer array
>
> I assume you meant integer dtype.

Sorry, I was speaking about the return value of np.array(EA); that was not fully clear. So I did mean "integer (numpy) array".

The question then is: what should the value of this attribute be for IntegerArray? None, to indicate that it is not known?

@jreback (Contributor) commented Jan 4, 2019:

I think we have this now? E.g. you can just specify dtype= in np.asarray(). Or is something else needed here?

@jorisvandenbossche (Member) commented:

I don't think this is solved. It is not about passing a dtype to np.asarray(..); the original question is about predicting, for a given array, what the dtype of np.array(arr) will be (i.e. the default when no dtype= kwarg is passed).

One use case is to define the interleaved dtype (see top post)

(but it's not necessarily a blocker for 0.24.0)

@TomAugspurger (Contributor, Author) commented Jan 4, 2019 via email

@jbrockmendel (Member) commented:

interleaved_dtype goes through find_common_type, for which EA authors can specify behavior via _get_common_dtype.

There are some standard cases where find_common_type isn't quite right due to values-dependent-behavior. Recently we've been collecting the values-dependent-behavior in dtypes.cast functions find_result_type and common_dtype_categorical_compat. The latter could/should probably be extended to handle IntegerArray/BooleanArray with NAs. I'd like to whittle these down into just find_result_type, but no immediate plans to do so.

So I'm thinking we could implement something like EA._get_result_dtype_values_dependent, which would relate to find_result_type the way _get_common_dtype relates to find_common_type.
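For illustration, the existing half of that analogy in action (note that find_common_type is a pandas-internal helper, so the import path below may change between versions):

```python
import pandas as pd
from pandas.core.dtypes.cast import find_common_type

# find_common_type consults ExtensionDtype._get_common_dtype, which
# EA authors can override; the masked Int dtypes resolve Int64/Int32
# to Int64 this way.
common = find_common_type([pd.Int64Dtype(), pd.Int32Dtype()])
print(common)  # Int64
```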

@JBGreisman (Contributor) commented:

This proposal sounds related to #22224 -- would the proposed plan be to use something like EA.find_result_type here:

    # TODO: https://github.com/pandas-dev/pandas/issues/22791
    # Give EAs some input on what happens here. Sparse needs this.
    if isinstance(dtype, SparseDtype):
        dtype = dtype.subtype
        dtype = cast(np.dtype, dtype)
    elif isinstance(dtype, ExtensionDtype):
        dtype = np.dtype("object")
    elif is_dtype_equal(dtype, str):
        dtype = np.dtype("object")

in the clause for ExtensionDtype?
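A runnable sketch of what an EA hook for that clause might look like (the _numpy_dtype_for_interleave name is invented for illustration; only the SparseDtype/ExtensionDtype fallbacks mirror the quoted pandas code):

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype

def interleaved_numpy_dtype(dtype):
    """Resolve a dtype to the numpy dtype used for interleaving,
    preferring a hypothetical EA-provided hook."""
    # Hypothetical hook; not actual pandas API.
    hook = getattr(dtype, "_numpy_dtype_for_interleave", None)
    if hook is not None:
        return hook() if callable(hook) else hook
    # Fallbacks mirroring the quoted pandas code:
    if isinstance(dtype, pd.SparseDtype):
        return dtype.subtype
    if isinstance(dtype, ExtensionDtype):
        return np.dtype("object")
    return dtype

print(interleaved_numpy_dtype(pd.SparseDtype("float64")))  # float64
print(interleaved_numpy_dtype(pd.CategoricalDtype()))      # object
```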

@MarcoGorelli (Member) commented:

> _get_result_dtype_values_dependent

is this really desirable though? wouldn't it go against this comment:

> We try to avoid value-dependent behavior where the metadata (shape, dtype, etc.) depend on the values of the array.

I'd be more inclined to do:

  1. find the interleaved type. E.g. if Int64Dtype() and Int32Dtype(), then this would be Int64Dtype()
  2. convert to the corresponding numpy type (int64)
  3. if there are missing values (which can't be represented by int64), then raise. It's then up to the user to provide a suitable dtype (e.g. float64) and/or a suitable na_value

This would be the 2D analogue of #48891
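Step 3 already has a 1D counterpart in to_numpy, which raises for unrepresentable missing values unless the user supplies a suitable dtype/na_value; a sketch of the proposed behavior from the user's side:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, None], dtype="Int64")

# With missing values, int64 can't represent the data, so an
# explicit request for it raises (step 3 of the proposal):
try:
    ser.to_numpy(dtype="int64")
except ValueError as exc:
    print("raised:", exc)

# The user then supplies a representable dtype and/or na_value:
out = ser.to_numpy(dtype="float64", na_value=np.nan)
print(out.dtype)  # float64
```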


No branches or pull requests

7 participants