Skip to content

API: Should factorize(categorical) return a Categorical for uniques? #19721

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Tracked by #3
TomAugspurger opened this issue Feb 15, 2018 · 15 comments · Fixed by #50784
Closed
Tracked by #3

API: Should factorize(categorical) return a Categorical for uniques? #19721

TomAugspurger opened this issue Feb 15, 2018 · 15 comments · Fixed by #50784
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

Comments

@TomAugspurger
Copy link
Contributor

Returning a categorical feels more natural to me

In [11]: pd.factorize(pd.Categorical(['a', 'a', 'c']))
Out[11]: (array([0, 0, 1]), array([0, 1]))

That's kind of what we do for a DatetimeIndex with TZ:

In [10]: pd.factorize(pd.Series(pd.DatetimeIndex(['2017', '2017'], tz='US/Eastern')))
Out[10]:
(array([0, 0]),
 DatetimeIndex(['2017-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None))
@TomAugspurger TomAugspurger added API Design Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Categorical Categorical Data Type labels Feb 15, 2018
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Feb 15, 2018
@jorisvandenbossche
Copy link
Member

Shouldn't it rather give the same as:

In [17]: pd.factorize(['a', 'a', 'c'])
Out[17]: (array([0, 0, 1]), array(['a', 'c'], dtype=object))

(the fact that it returns [0, 1] as the unique labels clearly is a bug I think, it seems to be factorizing the codes)

When factorizing a Categorical, I would expect to get back (codes, categories), not (codes, categorical)

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Feb 16, 2018 via email

@TomAugspurger
Copy link
Contributor Author

Some other weirdness

In [1]: import pandas as pd

In [2]: pd.factorize(pd.Categorical(['a', 'a', 'b']))
Out[2]: (array([0, 0, 1]), array([0, 1]))

In [3]: pd.factorize(pd.Index(pd.Categorical(['a', 'a', 'b'])))
Out[3]:
(array([0, 0, 1]),
 CategoricalIndex([nan, nan], categories=['a', 'b'], ordered=False, dtype='category'))

In [4]: pd.factorize(pd.Series(pd.Categorical(['a', 'a', 'b'])))
Out[4]: (array([0, 0, 1]), Int64Index([0, 1], dtype='int64'))

Out[3] should be CategoricalIndex(['a', 'b']) or Categorical(['a', 'b']), or ndarray(['a', 'b']),

Out[4] should be similar.

@jorisvandenbossche
Copy link
Member

In general, I think the type of the uniques should be the same array-like as the input (or in case of Series/Index, the array-like that is backing the Series).
So eg for DatetimeIndex with tz or Series of such values, this should in the future give a DatetimeTZArray (and that it now gives a DatetimeIndex is then logical I think). For numerical Series/Index/array, it is a normal numpy array.
So in the future for a certain extension type (Index with such values, Series with such values of Array class itself), would return an extension array of its type?

But, I would personally deviate from this rule for Categoricals, and in those cases return the array-like of which the categories are made up, instead of a Categorical itself.

@jreback
Copy link
Contributor

jreback commented Feb 16, 2018

the return values now are all as expected
by definition factorizing gives u a categorical return data ie codes and uniques (categories); further the categories are all array like

@jreback
Copy link
Contributor

jreback commented Feb 16, 2018

hmm actually these do look a bit odd
factorizing should be in the categories not the codes

@jorisvandenbossche
Copy link
Member

the return values now are all as expected
by definition factorizing gives u a categorical return data ie codes and uniques (categories); further the categories are all array like

Jeff, did you look at the examples Tom showed? It is clearly not returning the " uniques (categories)" in case of categorical input.

And the general question is not only about the current behaviour, but what to do for the new extension arrays

@TomAugspurger
Copy link
Contributor Author

Cross posted :)

Let's limit it to just fixing

In [2]: pd.factorize(pd.Series(pd.Categorical(['a', 'a'])))
Out[2]: (array([0, 0]), Int64Index([0], dtype='int64'))

to return the unique categories.

Out[2]: (array([0, 0]), array(['a']))

@jorisvandenbossche
Copy link
Member

Yes, agreed the above should be fixed.

But, I think we still need to discuss what it should do for ExtensionArrays.
Currently the docstring says that ndarray is returned for arrays, and Index for Index/Series, which is also what it does:

In [74]: for vals in [np.array([0, 1, 2]), np.array(['a', 'b', 'c']), pd.DatetimeIndex([1, 2, 3], tz='UTC'), pd.Categorical(['a', 'b', 'a'])]:
    ...:     for cont in [lambda x: x, pd.Series, pd.Index]:
    ...:         obj = cont(vals)
    ...:         codes, uniques = pd.factorize(obj)
    ...:         print(vals.dtype, type(obj).__name__, '  ->  ', type(uniques).__name__)
    ...:         
    ...:         
int64 ndarray   ->   ndarray
int64 Series   ->   Int64Index
int64 Int64Index   ->   Int64Index
<U1 ndarray   ->   ndarray
<U1 Series   ->   Index
<U1 Index   ->   Index
datetime64[ns, UTC] DatetimeIndex   ->   DatetimeIndex  # this deviation is normal since I used index as input (since there is no array-like yet)
datetime64[ns, UTC] Series   ->   DatetimeIndex
datetime64[ns, UTC] DatetimeIndex   ->   DatetimeIndex
category Categorical   ->   ndarray
category Series   ->   Int64Index
category CategoricalIndex   ->   CategoricalIndex

But, for a new ExtensionArray, I don't think we want to return an Index? As for now I think we will not yet support an extension array to be stored in an Index? Shouldn't it rather be a instance of the ExtensionArray itself (holding the unique values)?

@jorisvandenbossche
Copy link
Member

But, for a new ExtensionArray, I don't think we want to return an Index? As for now I think we will not yet support an extension array to be stored in an Index? Shouldn't it rather be a instance of the ExtensionArray itself (holding the unique values)?

Although, of course, in the idea that those unique values will typically be not that many in number (or at least much shorter than the input array), it is maybe not a problem to have a materialized ExtensionArray stored in Index.
(on the other hand, I think array-likes would be a more logical return value here, and I think we only return Index because of the absence of arrays likes for all types up to now)

@TomAugspurger
Copy link
Contributor Author

As for now I think we will not yet support an extension array to be stored in an Index?

Correct, though there was interest in that on Twitter.

Shouldn't it rather be a instance of the ExtensionArray itself (holding the unique values)?

I see use-cases for either ndarray[object] or ExtensionArray, since they support different operations. Does this warrant a new keyword to factorize?

@TomAugspurger
Copy link
Contributor Author

I'm revisiting this now. I think that the rule should be that uniques has the same type as the input (except Series -> Index)

  • factorize(ndarray)[1] -> ndarray
  • factorize(ExtensionArray)[1] -> ExtensionArray
  • factorize(Serise/Indxe)[1] -> Index

In particular, we'd change pd.factorize(Categorical) to return a Categorical for uniques.

Consider the case of unobserved categories. I'd like to treat these two arrays differently

In [6]: c1 = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b'])

In [7]: c2 = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c'])

If we return an ndarray for uniques, these will be the same:

In [8]: pd.factorize(c1)[1]
Out[8]: array(['a', 'b'], dtype=object)

In [9]: pd.factorize(c2)[1]
Out[9]: array(['a', 'b'], dtype=object)

But if we return a Categorical, we can preserve the unobsesrved categories in uniques.dtype. That also would avoid the somewhat strange situation when moving to a CategoricalIndex

In [10]: pd.factorize(pd.CategoricalIndex(c1))[1]
Out[10]: CategoricalIndex(['a', 'b'], categories=['a', 'b'], ordered=False, dtype='category')

In [11]: pd.factorize(pd.CategoricalIndex(c2))[1]
Out[11]: CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

It seems strange to be that the result dtype would change based on the container (to be clear, I'm OK with the output container changing from Categorical -> CategoricalIndex. But I'd like pd.factorize(cat)[1].dtype to match pd.factorize(cat_index)[1].dtype).)

@jorisvandenbossche
Copy link
Member

I still think that factorize(Series[ExtensionArray])[1] -> ExtensionArray might make more sense than returning an Index.

It seems strange to be that the result dtype would change based on the container (to be clear, I'm OK with the output container changing from Categorical -> CategoricalIndex. But I'd like pd.factorize(cat)[1].dtype to match pd.factorize(cat_index)[1].dtype).)

Agreed that the dtype should be the same.
But here, I also find it strange why factorize(categorical) would return a Categorical, while factorize(Series[categorical]) would return CategoricalIndex. I mean, I don't see any justification or logic for this difference? (apart from historical reasons, which might be enough to keep it like that and do the same for ExtensionArrays ..)

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Feb 27, 2018

apart from historical reasons

100% agreed. If I were starting from scratch I would say factorize(series)[1] -> array (ndarray or ExtensionArray). I'm just not sure that changing it is worthwhile. And why stop there? Shouldn't factorize(index) return an array? Index and Series are both just containers, so why box one and not the other?

I'll think about if there's a deprecation path, and if we even want to go down this path.

@mroeschke
Copy link
Member

Looks to now indeed return a Categorical. I guess this behavior seems right and could use a test if it doesn't exist

In [2]: pd.factorize(pd.Categorical(['a', 'a', 'c']))
Out[2]:
(array([0, 0, 1]),
 ['a', 'c']
 Categories (2, object): ['a', 'c'])

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed API Design Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Categorical Categorical Data Type labels Jun 19, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants