ExtensionDtype API for preferred type when converting to NumPy array #22791


Open · TomAugspurger opened this issue Sep 20, 2018 · 13 comments
Labels: Dtype Conversions (Unexpected or buggy dtype conversions) · Enhancement · ExtensionArray (Extending pandas with custom dtypes or arrays)

Comments

@TomAugspurger (Contributor) commented Sep 20, 2018:

This is coming out of SparseArray. Right now if you have a homogeneous DataFrame of EAs, then doing something like np.asarray(df) will always convert to object.

    def _interleave(self):
        """
        Return ndarray from blocks with specified item order
        Items must be contained in the blocks
        """
        dtype = _interleaved_dtype(self.blocks)
        result = np.empty(self.shape, dtype=dtype)
        ...

Should we give ExtensionDtype some input on the dtype used here? This would allow for some consistency between np.asarray(series) and np.asarray(homogeneous_dataframe).
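For concreteness, a small reproduction of the inconsistency using the nullable Int64 dtype (exact behavior may vary across pandas versions):

```python
import numpy as np
import pandas as pd

arr = pd.array([1, 2, 3], dtype="Int64")
ser = pd.Series(arr)
df = pd.DataFrame({"a": arr, "b": arr})

# The 1D conversion goes through IntegerArray.__array__,
# which produces an object array:
print(np.asarray(ser).dtype)  # object

# The 2D conversion goes through BlockManager._interleave instead,
# where the ExtensionDtype currently gets no say:
print(np.asarray(df).dtype)
```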

@TomAugspurger added the Dtype Conversions and ExtensionArray labels Sep 20, 2018
@TomAugspurger added this to the 0.24.0 milestone Sep 20, 2018
@jorisvandenbossche (Member) commented:

Not sure if it was from the same discussion, but I recall a discussion where we said that we basically wanted to know the "numpy dtype it would become if converted to a numpy array" of an ExtensionArray, without actually converting it.
That seems useful in general, and it could then be used to determine the best interleaved dtype in the use case above?

@jorisvandenbossche (Member) commented:

It was indeed from the sparse discussion, my comment there: #22325 (comment)

Basically, we want to know np.array(EA).dtype without doing the conversion to array.

We could add a numpy_dtype attribute to the Dtype object?

This might also overlap with #22224
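A rough sketch of what such an attribute could look like on an ExtensionDtype subclass (the numpy_dtype name and the toy MyDtype class are hypothetical illustrations, not existing pandas API):

```python
import numpy as np
from pandas.api.extensions import ExtensionDtype

class MyDtype(ExtensionDtype):
    """Toy dtype illustrating a hypothetical ``numpy_dtype`` attribute."""
    name = "my_dtype"
    type = float

    @classmethod
    def construct_array_type(cls):
        # A real implementation would return the matching ExtensionArray
        # subclass; omitted here for brevity.
        raise NotImplementedError

    @property
    def numpy_dtype(self) -> np.dtype:
        # The dtype np.asarray(my_array) would produce, answered
        # without materializing the array.
        return np.dtype("float64")

print(MyDtype().numpy_dtype)  # float64
```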

@TomAugspurger (Contributor, Author) commented:

There's some overlap with #22224, but I don't know if we can reuse the same attribute for both, unfortunately: IntegerArray's numpy_type is an integer dtype, which can't represent NaNs.

@jorisvandenbossche (Member) commented:

Yes, I don't think we can reuse it, but the numpy_dtype attribute of IntegerArray can of course still be changed.

The other issue is more about the dtype of the array(s) backing a composite array (which could be multiple arrays, and hence multiple dtypes), while here it is the dtype you would get when converting to a numpy array.

So I would rather reserve the numpy_dtype name for the use case of this issue.

@jorisvandenbossche (Member) commented:

Although the question is: what if this dtype depends on the data?
For example, for IntegerArray we decided for now to return an object array. But you could also opt to actually return an integer array if there are no NAs, and a float array otherwise.

@TomAugspurger (Contributor, Author) commented:

> But, you could also opt for actually returning the integer array

I assume you meant integer dtype.

That's a good point. It's not clear what's best here. I think IntegerArray may not be the best one to consider here, since the "best type" so clearly depends on the values. There's no way to put that on the ExtensionDtype.

Let's consider, say, Categorical and Sparse. In this case, there's always a clearly best dtype for the resulting ndarray, CategoricalDtype.categories.dtype and SparseDtype.subdtype (or whatever we end up calling the type of sp_values).
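Those two cases can be checked directly: both dtypes already carry the needed information, independent of the stored values (shown here with the actual SparseDtype.subtype spelling):

```python
import numpy as np
import pandas as pd

cat = pd.Categorical([1, 2, 1])
sp = pd.arrays.SparseArray([1.0, 0.0, 2.0])

# Both dtypes can name their "natural" ndarray dtype up front,
# independently of the values:
print(cat.dtype.categories.dtype)  # int64
print(sp.dtype.subtype)            # float64

# For the sparse case, np.asarray agrees with the advertised subtype:
print(np.asarray(sp).dtype)        # float64
```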

@jorisvandenbossche (Member) commented:

> But, you could also opt for actually returning the integer array
>
> I assume you meant integer dtype.

Sorry, I was speaking about the return value of np.array(EA); that was not fully clear. So I did mean "integer (numpy) array".

The question then is: what should the value of this attribute be for IntegerArray? None, to indicate that it is not known?

@jreback (Contributor) commented Jan 4, 2019:

I think we have this now? E.g. you can just specify dtype= in np.asarray(). Or is something else needed here?

@jorisvandenbossche (Member) commented:

I don't think this is solved. It is not about passing a dtype to np.asarray(..); the original question is about predicting, for a given array, what the dtype of np.array(arr) will be (i.e. the default when no dtype= kwarg is passed).

One use case is to define the interleaved dtype (see top post)

(but it's not necessarily a blocker for 0.24.0)

@TomAugspurger (Contributor, Author) commented Jan 4, 2019 via email

@jbrockmendel (Member) commented:

interleaved_dtype goes through find_common_type, for which EA authors can specify behavior via _get_common_dtype.

There are some standard cases where find_common_type isn't quite right due to values-dependent-behavior. Recently we've been collecting the values-dependent-behavior in dtypes.cast functions find_result_type and common_dtype_categorical_compat. The latter could/should probably be extended to handle IntegerArray/BooleanArray with NAs. I'd like to whittle these down into just find_result_type, but no immediate plans to do so.

So I'm thinking we could implement something like EA._get_result_dtype_values_dependent, which would relate to find_result_type the way _get_common_dtype relates to find_common_type.
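For illustration, the existing half of that analogy in action (note that find_common_type is a pandas-internal helper, so the import path below may change between versions):

```python
import pandas as pd
from pandas.core.dtypes.cast import find_common_type

# find_common_type consults ExtensionDtype._get_common_dtype, which
# EA authors can override; the masked Int dtypes resolve Int64/Int32
# to Int64 this way.
common = find_common_type([pd.Int64Dtype(), pd.Int32Dtype()])
print(common)  # Int64
```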

@JBGreisman (Contributor) commented:

This proposal sounds related to #22224 -- would the proposed plan be to use something like EA.find_result_type here:

    # TODO: https://github.com/pandas-dev/pandas/issues/22791
    # Give EAs some input on what happens here. Sparse needs this.
    if isinstance(dtype, SparseDtype):
        dtype = dtype.subtype
        dtype = cast(np.dtype, dtype)
    elif isinstance(dtype, ExtensionDtype):
        dtype = np.dtype("object")
    elif is_dtype_equal(dtype, str):
        dtype = np.dtype("object")

in the clause for ExtensionDtype?
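A runnable sketch of what an EA hook for that clause might look like (the _numpy_dtype_for_interleave name is invented for illustration; only the SparseDtype/ExtensionDtype fallbacks mirror the quoted pandas code):

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype

def interleaved_numpy_dtype(dtype):
    """Resolve a dtype to the numpy dtype used for interleaving,
    preferring a hypothetical EA-provided hook."""
    # Hypothetical hook; not actual pandas API.
    hook = getattr(dtype, "_numpy_dtype_for_interleave", None)
    if hook is not None:
        return hook() if callable(hook) else hook
    # Fallbacks mirroring the quoted pandas code:
    if isinstance(dtype, pd.SparseDtype):
        return dtype.subtype
    if isinstance(dtype, ExtensionDtype):
        return np.dtype("object")
    return dtype

print(interleaved_numpy_dtype(pd.SparseDtype("float64")))  # float64
print(interleaved_numpy_dtype(pd.CategoricalDtype()))      # object
```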

@MarcoGorelli (Member) commented:

> _get_result_dtype_values_dependent

is this really desirable though? wouldn't it go against this comment:

> We try to avoid value-dependent behavior where the metadata (shape, dtype, etc.) depend on the values of the array.

I'd be more inclined to do:

  1. find the interleaved type. E.g. if Int64Dtype() and Int32Dtype(), then this would be Int64Dtype()
  2. convert to the corresponding numpy type (int64)
  3. if there are missing values (which can't be represented by int64), then raise. It's then up to the user to provide a suitable dtype (e.g. float64) and/or a suitable na_value

This would be the 2D analogue of #48891
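Step 3 already has a 1D counterpart in to_numpy, which raises for unrepresentable missing values unless the user supplies a suitable dtype/na_value; a sketch of the proposed behavior from the user's side:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, None], dtype="Int64")

# With missing values, int64 can't represent the data, so an
# explicit request for it raises (step 3 of the proposal):
try:
    ser.to_numpy(dtype="int64")
except ValueError as exc:
    print("raised:", exc)

# The user then supplies a representable dtype and/or na_value:
out = ser.to_numpy(dtype="float64", na_value=np.nan)
print(out.dtype)  # float64
```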


No branches or pull requests

7 participants