-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: Should factorize(categorical) return a Categorical for uniques? #19721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Shouldn't it rather give the same as:
(the fact that it returns [0, 1] as the unique labels clearly is a bug I think, it seems to be factorizing the codes) When factorizing a Categorical, I would expect to get back (codes, categories), not (codes, categorical) |
Yeah returning the factorized codes is certainly a bug.
I ask because we need to think about how to handle other 3rd party extension array types, and we’re currently inconsistent internally.
…________________________________
From: Joris Van den Bossche <[email protected]>
Sent: Friday, February 16, 2018 1:57:30 AM
To: pandas-dev/pandas
Cc: Tom Augspurger; Author
Subject: Re: [pandas-dev/pandas] API: Should factorize(categorical) return a Categorical for uniques? (#19721)
Shouldn't it rather give the same as:
In [17]: pd.factorize(['a', 'a', 'c'])
Out[17]: (array([0, 0, 1]), array(['a', 'c'], dtype=object))
(the fact that it returns [0, 1] as the unique labels clearly is a bug I think, it seems to be factorizing the codes)
When factorizing a Categorical, I would expect to get back (codes, categories), not (codes, categorical)
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#19721 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABQHIrPbUAM5lHs1b0b0kGieUhZegY8Eks5tVTTpgaJpZM4SHhOm>.
|
Some other weirdness In [1]: import pandas as pd
In [2]: pd.factorize(pd.Categorical(['a', 'a', 'b']))
Out[2]: (array([0, 0, 1]), array([0, 1]))
In [3]: pd.factorize(pd.Index(pd.Categorical(['a', 'a', 'b'])))
Out[3]:
(array([0, 0, 1]),
CategoricalIndex([nan, nan], categories=['a', 'b'], ordered=False, dtype='category'))
In [4]: pd.factorize(pd.Series(pd.Categorical(['a', 'a', 'b'])))
Out[4]: (array([0, 0, 1]), Int64Index([0, 1], dtype='int64')) Out[3] should be Out[4] should be similar. |
In general, I think the type of the uniques should be the same array-like as the input (or in case of Series/Index, the array-like that is backing the Series). But, I would personally deviate from this rule for Categoricals, and in those cases return the array-like of which the categories are made up, instead of a Categorical itself. |
the return values now are all as expected |
hmm actually these do look a bit odd |
Jeff, did you look at the examples Tom showed? It is clearly not returning the " uniques (categories)" in case of categorical input. And the general question is not only about the current behaviour, but what to do for the new extension arrays |
Cross posted :) Let's limit it to just fixing In [2]: pd.factorize(pd.Series(pd.Categorical(['a', 'a'])))
Out[2]: (array([0, 0]), Int64Index([0], dtype='int64')) to return the unique categories. Out[2]: (array([0, 0]), array(['a'])) |
Yes, agreed the above should be fixed. But, I think we still need to discuss what it should do for ExtensionArrays.
But, for a new ExtensionArray, I don't think we want to return an Index? As for now I think we will not yet support an extension array to be stored in an Index? Shouldn't it rather be a instance of the ExtensionArray itself (holding the unique values)? |
Although, of course, in the idea that those unique values will typically be not that many in number (or at least much shorter than the input array), it is maybe not a problem to have a materialized ExtensionArray stored in Index. |
Correct, though there was interest in that on Twitter.
I see use-cases for either ndarray[object] or ExtensionArray, since they support different operations. Does this warrant a new keyword to |
I'm revisiting this now. I think that the rule should be that
In particular, we'd change Consider the case of unobserved categories. I'd like to treat these two arrays differently In [6]: c1 = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b'])
In [7]: c2 = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c']) If we return an
But if we return a In [10]: pd.factorize(pd.CategoricalIndex(c1))[1]
Out[10]: CategoricalIndex(['a', 'b'], categories=['a', 'b'], ordered=False, dtype='category')
In [11]: pd.factorize(pd.CategoricalIndex(c2))[1]
Out[11]: CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], ordered=False, dtype='category') It seems strange to be that the result dtype would change based on the container (to be clear, I'm OK with the output container changing from Categorical -> CategoricalIndex. But I'd like |
I still think that
Agreed that the dtype should be the same. |
100% agreed. If I were starting from scratch I would say I'll think about if there's a deprecation path, and if we even want to go down this path. |
Looks to now indeed return a Categorical. I guess this behavior seems right and could use a test if it doesn't exist
|
Returning a categorical feels more natural to me
That's kind of what we do for a
DatetimeIndex
with TZ:The text was updated successfully, but these errors were encountered: