API: getting the "densified" values of a Categorical (preserving categories dtype) ? #26410

Open
jorisvandenbossche opened this issue May 15, 2019 · 3 comments
Labels
Categorical Categorical Data Type Enhancement

Comments

@jorisvandenbossche
Member

jorisvandenbossche commented May 15, 2019

Assume we have a Categorical, and want to convert to a dense array (not encoded). We have np.asarray(..) and the to_dense() method (which uses asarray under the hood):

In [1]: cat = pd.Categorical(['a', 'b', 'a'])

In [2]: np.asarray(cat) 
Out[2]: array(['a', 'b', 'a'], dtype=object)

In [3]: cat.to_dense() 
Out[3]: array(['a', 'b', 'a'], dtype=object)

In addition, we also have get_values:

In [4]: cat.get_values() 
Out[4]: array(['a', 'b', 'a'], dtype=object)

get_values is mostly the same, with two exceptions: it returns an Index for datetime/period/timedelta categories, and for integer categories with missing values it returns an object array instead of a float array:

In [10]: cat = pd.Categorical(pd.date_range("2012", periods=3))

In [11]: cat.to_dense()
Out[11]: 
array(['2012-01-01T00:00:00.000000000', '2012-01-02T00:00:00.000000000',
       '2012-01-03T00:00:00.000000000'], dtype='datetime64[ns]')

In [12]: cat.get_values()
Out[12]: DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq='D')

As a result, get_values preserves the dtype somewhat better, although only for datetime-like categories; it does not do so for an arbitrary ExtensionArray (EA).
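To make the dtype loss concrete, here is a minimal sketch that uses only np.asarray (get_values is left out, since it is being deprecated). The integer-with-missing-value and tz-aware inputs are illustrative choices, not taken from the session above.

```python
import numpy as np
import pandas as pd

# Integer categories with a missing value: the dense result is float64,
# because NaN cannot be stored in a NumPy integer array.
int_cat = pd.Categorical([1, 2, None])
int_dense = np.asarray(int_cat)
print(int_cat.categories.dtype, "->", int_dense.dtype)

# Categories backed by an extension dtype (tz-aware datetimes here):
# np.asarray always returns a plain ndarray, so the extension dtype is lost.
tz_cat = pd.Categorical(pd.date_range("2012", periods=3, tz="UTC"))
tz_dense = np.asarray(tz_cat)
print(tz_cat.categories.dtype, "->", tz_dense.dtype)
```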

While looking into the deprecation of get_values (#26409), I was wondering: do we want some method to actually get a "dense" version of the array, but with the exact same dtype? (so returning an EA in case the categories have an extension dtype)
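For illustration only, one way to get such values today is to cast the Categorical to the dtype of its own categories. The helper name densify_preserving_dtype is hypothetical and not a pandas API; this is a sketch of the desired behaviour, not a proposed implementation.

```python
import pandas as pd


def densify_preserving_dtype(cat):
    """Hypothetical helper: dense values with the dtype of the categories."""
    # Categorical.astype with a non-categorical dtype returns a dense array;
    # passing the categories' own dtype keeps an extension dtype intact.
    return cat.astype(cat.categories.dtype)


# tz-aware datetimes with one missing value (illustrative input)
values = pd.to_datetime(["2012-01-01", None, "2012-01-03"], utc=True)
cat = pd.Categorical(values)
dense = densify_preserving_dtype(cat)
print(type(dense).__name__, dense.dtype)
```

This does not cover every case: integer categories with missing values cannot be densified to the same NumPy integer dtype, since that dtype has no missing-value representation.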

And should we deprecate to_dense() ?

@topper-123
Contributor

I'd be +1 to such a method, so that we can roundtrip with other ExtensionArrays. It could be useful to simplify going to/from MultiIndex and DataFrame columns.

to_dense is just a wrapper around np.asarray, so it is quite useless. I'm +1 to deprecating it.

@jbrockmendel
Member

> And should we deprecate to_dense() ?

I'm on board for getting rid of Categorical.to_dense in its current form. If deprecation cycles weren't an issue, I would actually be inclined to alias to_dense to _internal_get_values, since to_dense is a more descriptive name, and that matches the SparseArray behavior.

FWIW in the ongoing work to get rid of _internal_get_values, it looks like Categorical is going to be the sticking point.

@jorisvandenbossche
Member Author

I see you deprecated Categorical.to_dense (#32639).

I was just thinking, if we find to_dense the best name for this "get densified values" (I am not yet fully sure about it, but I also can't think of a better alternative), we can actually keep the method but only deprecate the default behaviour.

More specifically, we could introduce a new keyword to get the new behaviour (returning a densified array, potentially extension array), and only raise the deprecation warning if that keyword is not specified (and then internally we just have to always use the keyword).
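A rough sketch of that deprecation pattern, written as a standalone function rather than the real method. The keyword name preserve_dtype and the warning text are assumptions made up for this example; nothing here is an existing pandas signature.

```python
import warnings

import numpy as np
import pandas as pd


def to_dense(cat, preserve_dtype=None):
    """Hypothetical stand-in for Categorical.to_dense with an opt-in keyword."""
    if preserve_dtype is None:
        # Keyword not specified: keep the old behaviour, but warn.
        warnings.warn(
            "to_dense() without 'preserve_dtype' is deprecated; "
            "pass preserve_dtype=True to opt in to the new behaviour.",
            FutureWarning,
            stacklevel=2,
        )
        preserve_dtype = False
    if preserve_dtype:
        # New behaviour: dense values, potentially an extension array.
        return cat.astype(cat.categories.dtype)
    # Old behaviour: plain NumPy array.
    return np.asarray(cat)


cat = pd.Categorical(pd.date_range("2012", periods=3, tz="UTC"))
new_style = to_dense(cat, preserve_dtype=True)  # keyword given, no warning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = to_dense(cat)  # keyword omitted, emits FutureWarning
```

Internal callers would always pass the keyword explicitly, so only user code relying on the old default sees the warning.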
