API: Should factorize(categorical) return a Categorical for uniques? #19721

TomAugspurger · 2018-02-15T21:18:32Z

Returning a categorical feels more natural to me

In [11]: pd.factorize(pd.Categorical(['a', 'a', 'c']))
Out[11]: (array([0, 0, 1]), array([0, 1]))

That's kind of what we do for a DatetimeIndex with TZ:

In [10]: pd.factorize(pd.Series(pd.DatetimeIndex(['2017', '2017'], tz='US/Eastern')))
Out[10]:
(array([0, 0]),
 DatetimeIndex(['2017-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None))

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2018-02-16T07:57:25Z

Shouldn't it rather give the same as:

In [17]: pd.factorize(['a', 'a', 'c'])
Out[17]: (array([0, 0, 1]), array(['a', 'c'], dtype=object))

(the fact that it returns [0, 1] as the unique labels clearly is a bug I think, it seems to be factorizing the codes)

When factorizing a Categorical, I would expect to get back (codes, categories), not (codes, categorical)

TomAugspurger · 2018-02-16T11:29:39Z

Yeah returning the factorized codes is certainly a bug. I ask because we need to think about how to handle other 3rd party extension array types, and we’re currently inconsistent internally.

…

________________________________ From: Joris Van den Bossche <[email protected]> Sent: Friday, February 16, 2018 1:57:30 AM To: pandas-dev/pandas Cc: Tom Augspurger; Author Subject: Re: [pandas-dev/pandas] API: Should factorize(categorical) return a Categorical for uniques? (#19721) Shouldn't it rather give the same as: In [17]: pd.factorize(['a', 'a', 'c']) Out[17]: (array([0, 0, 1]), array(['a', 'c'], dtype=object)) (the fact that it returns [0, 1] as the unique labels clearly is a bug I think, it seems to be factorizing the codes) When factorizing a Categorical, I would expect to get back (codes, categories), not (codes, categorical) — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub<#19721 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABQHIrPbUAM5lHs1b0b0kGieUhZegY8Eks5tVTTpgaJpZM4SHhOm>.

TomAugspurger · 2018-02-16T12:17:57Z

Some other weirdness

In [1]: import pandas as pd

In [2]: pd.factorize(pd.Categorical(['a', 'a', 'b']))
Out[2]: (array([0, 0, 1]), array([0, 1]))

In [3]: pd.factorize(pd.Index(pd.Categorical(['a', 'a', 'b'])))
Out[3]:
(array([0, 0, 1]),
 CategoricalIndex([nan, nan], categories=['a', 'b'], ordered=False, dtype='category'))

In [4]: pd.factorize(pd.Series(pd.Categorical(['a', 'a', 'b'])))
Out[4]: (array([0, 0, 1]), Int64Index([0, 1], dtype='int64'))

Out[3] should be CategoricalIndex(['a', 'b']) or Categorical(['a', 'b']), or ndarray(['a', 'b']),

Out[4] should be similar.

jorisvandenbossche · 2018-02-16T12:40:15Z

In general, I think the type of the uniques should be the same array-like as the input (or in case of Series/Index, the array-like that is backing the Series).
So eg for DatetimeIndex with tz or Series of such values, this should in the future give a DatetimeTZArray (and that it now gives a DatetimeIndex is then logical I think). For numerical Series/Index/array, it is a normal numpy array.
So in the future for a certain extension type (Index with such values, Series with such values of Array class itself), would return an extension array of its type?

But, I would personally deviate from this rule for Categoricals, and in those cases return the array-like of which the categories are made up, instead of a Categorical itself.

jreback · 2018-02-16T12:58:37Z

the return values now are all as expected
by definition factorizing gives u a categorical return data ie codes and uniques (categories); further the categories are all array like

jreback · 2018-02-16T13:01:09Z

hmm actually these do look a bit odd
factorizing should be in the categories not the codes

jorisvandenbossche · 2018-02-16T13:01:22Z

the return values now are all as expected
by definition factorizing gives u a categorical return data ie codes and uniques (categories); further the categories are all array like

Jeff, did you look at the examples Tom showed? It is clearly not returning the " uniques (categories)" in case of categorical input.

And the general question is not only about the current behaviour, but what to do for the new extension arrays

TomAugspurger · 2018-02-16T13:02:43Z

Cross posted :)

Let's limit it to just fixing

In [2]: pd.factorize(pd.Series(pd.Categorical(['a', 'a'])))
Out[2]: (array([0, 0]), Int64Index([0], dtype='int64'))

to return the unique categories.

Out[2]: (array([0, 0]), array(['a']))

jorisvandenbossche · 2018-02-16T13:13:44Z

Yes, agreed the above should be fixed.

But, I think we still need to discuss what it should do for ExtensionArrays.
Currently the docstring says that ndarray is returned for arrays, and Index for Index/Series, which is also what it does:

In [74]: for vals in [np.array([0, 1, 2]), np.array(['a', 'b', 'c']), pd.DatetimeIndex([1, 2, 3], tz='UTC'), pd.Categorical(['a', 'b', 'a'])]:
    ...:     for cont in [lambda x: x, pd.Series, pd.Index]:
    ...:         obj = cont(vals)
    ...:         codes, uniques = pd.factorize(obj)
    ...:         print(vals.dtype, type(obj).__name__, '  ->  ', type(uniques).__name__)
    ...:         
    ...:         
int64 ndarray   ->   ndarray
int64 Series   ->   Int64Index
int64 Int64Index   ->   Int64Index
<U1 ndarray   ->   ndarray
<U1 Series   ->   Index
<U1 Index   ->   Index
datetime64[ns, UTC] DatetimeIndex   ->   DatetimeIndex  # this deviation is normal since I used index as input (since there is no array-like yet)
datetime64[ns, UTC] Series   ->   DatetimeIndex
datetime64[ns, UTC] DatetimeIndex   ->   DatetimeIndex
category Categorical   ->   ndarray
category Series   ->   Int64Index
category CategoricalIndex   ->   CategoricalIndex

But, for a new ExtensionArray, I don't think we want to return an Index? As for now I think we will not yet support an extension array to be stored in an Index? Shouldn't it rather be a instance of the ExtensionArray itself (holding the unique values)?

jorisvandenbossche · 2018-02-16T13:16:24Z

But, for a new ExtensionArray, I don't think we want to return an Index? As for now I think we will not yet support an extension array to be stored in an Index? Shouldn't it rather be a instance of the ExtensionArray itself (holding the unique values)?

Although, of course, in the idea that those unique values will typically be not that many in number (or at least much shorter than the input array), it is maybe not a problem to have a materialized ExtensionArray stored in Index.
(on the other hand, I think array-likes would be a more logical return value here, and I think we only return Index because of the absence of arrays likes for all types up to now)

TomAugspurger · 2018-02-16T13:22:09Z

As for now I think we will not yet support an extension array to be stored in an Index?

Correct, though there was interest in that on Twitter.

Shouldn't it rather be a instance of the ExtensionArray itself (holding the unique values)?

I see use-cases for either ndarray[object] or ExtensionArray, since they support different operations. Does this warrant a new keyword to factorize?

TomAugspurger · 2018-02-27T20:10:33Z

I'm revisiting this now. I think that the rule should be that uniques has the same type as the input (except Series -> Index)

factorize(ndarray)[1] -> ndarray
factorize(ExtensionArray)[1] -> ExtensionArray
factorize(Serise/Indxe)[1] -> Index

In particular, we'd change pd.factorize(Categorical) to return a Categorical for uniques.

Consider the case of unobserved categories. I'd like to treat these two arrays differently

In [6]: c1 = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b'])

In [7]: c2 = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c'])

If we return an ndarray for uniques, these will be the same:

In [8]: pd.factorize(c1)[1]
Out[8]: array(['a', 'b'], dtype=object)

In [9]: pd.factorize(c2)[1]
Out[9]: array(['a', 'b'], dtype=object)

But if we return a Categorical, we can preserve the unobsesrved categories in uniques.dtype. That also would avoid the somewhat strange situation when moving to a CategoricalIndex

In [10]: pd.factorize(pd.CategoricalIndex(c1))[1]
Out[10]: CategoricalIndex(['a', 'b'], categories=['a', 'b'], ordered=False, dtype='category')

In [11]: pd.factorize(pd.CategoricalIndex(c2))[1]
Out[11]: CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

It seems strange to be that the result dtype would change based on the container (to be clear, I'm OK with the output container changing from Categorical -> CategoricalIndex. But I'd like pd.factorize(cat)[1].dtype to match pd.factorize(cat_index)[1].dtype).)

jorisvandenbossche · 2018-02-27T20:34:52Z

I still think that factorize(Series[ExtensionArray])[1] -> ExtensionArray might make more sense than returning an Index.

It seems strange to be that the result dtype would change based on the container (to be clear, I'm OK with the output container changing from Categorical -> CategoricalIndex. But I'd like pd.factorize(cat)[1].dtype to match pd.factorize(cat_index)[1].dtype).)

Agreed that the dtype should be the same.
But here, I also find it strange why factorize(categorical) would return a Categorical, while factorize(Series[categorical]) would return CategoricalIndex. I mean, I don't see any justification or logic for this difference? (apart from historical reasons, which might be enough to keep it like that and do the same for ExtensionArrays ..)

TomAugspurger · 2018-02-27T20:41:05Z

apart from historical reasons

100% agreed. If I were starting from scratch I would say factorize(series)[1] -> array (ndarray or ExtensionArray). I'm just not sure that changing it is worthwhile. And why stop there? Shouldn't factorize(index) return an array? Index and Series are both just containers, so why box one and not the other?

I'll think about if there's a deprecation path, and if we even want to go down this path.

mroeschke · 2021-06-19T04:41:34Z

Looks to now indeed return a Categorical. I guess this behavior seems right and could use a test if it doesn't exist

In [2]: pd.factorize(pd.Categorical(['a', 'a', 'c']))
Out[2]:
(array([0, 0, 1]),
 ['a', 'c']
 Categories (2, object): ['a', 'c'])

TomAugspurger added API Design Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Categorical Categorical Data Type labels Feb 15, 2018

TomAugspurger added this to the Next Major Release milestone Feb 15, 2018

TomAugspurger mentioned this issue Feb 28, 2018

REF/BUG/API: factorizing categorical data #19938

Merged

mroeschke added the Bug label May 13, 2020

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed API Design Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Categorical Categorical Data Type labels Jun 19, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

phofl mentioned this issue Nov 16, 2022

TST: Fixed issues that need tests noatamir/pyladies-berlin-sprints#3

Open

17 tasks

luke396 mentioned this issue Jan 17, 2023

TST: Add test_factorize_mixed_values #50784

Merged

1 task

mroeschke closed this as completed in #50784 Jan 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: Should factorize(categorical) return a Categorical for uniques? #19721

API: Should factorize(categorical) return a Categorical for uniques? #19721

TomAugspurger commented Feb 15, 2018

jorisvandenbossche commented Feb 16, 2018

TomAugspurger commented Feb 16, 2018 via email

TomAugspurger commented Feb 16, 2018

jorisvandenbossche commented Feb 16, 2018

jreback commented Feb 16, 2018

jreback commented Feb 16, 2018

jorisvandenbossche commented Feb 16, 2018

TomAugspurger commented Feb 16, 2018

jorisvandenbossche commented Feb 16, 2018

jorisvandenbossche commented Feb 16, 2018

TomAugspurger commented Feb 16, 2018

TomAugspurger commented Feb 27, 2018

jorisvandenbossche commented Feb 27, 2018

TomAugspurger commented Feb 27, 2018 •

edited

Loading

mroeschke commented Jun 19, 2021

API: Should factorize(categorical) return a Categorical for uniques? #19721

API: Should factorize(categorical) return a Categorical for uniques? #19721

Comments

TomAugspurger commented Feb 15, 2018

jorisvandenbossche commented Feb 16, 2018

TomAugspurger commented Feb 16, 2018 via email

TomAugspurger commented Feb 16, 2018

jorisvandenbossche commented Feb 16, 2018

jreback commented Feb 16, 2018

jreback commented Feb 16, 2018

jorisvandenbossche commented Feb 16, 2018

TomAugspurger commented Feb 16, 2018

jorisvandenbossche commented Feb 16, 2018

jorisvandenbossche commented Feb 16, 2018

TomAugspurger commented Feb 16, 2018

TomAugspurger commented Feb 27, 2018

jorisvandenbossche commented Feb 27, 2018

TomAugspurger commented Feb 27, 2018 • edited Loading

mroeschke commented Jun 19, 2021

TomAugspurger commented Feb 27, 2018 •

edited

Loading