ENH: Add categorize method to Categoricals #10374

Closed
TomAugspurger opened this issue Jun 17, 2015 · 4 comments
@TomAugspurger
Contributor

I may be missing an existing method here. Didn't see anything though...

Given a categorical Series

In [10]: s = pd.Series(['A', 'B', 'C', 'A'], dtype='category')

In [11]: s
Out[11]:
0    A
1    B
2    C
3    A
dtype: category
Categories (3, object): [A, B, C]

And given some codes, new_codes = pd.Series([0, 1, 0, 2], index=[4, 5, 6, 7]), I'd like an easy way to turn them back into categories. I think this is essentially

In [22]: catmap = dict(enumerate(s.cat.categories))

In [23]: new_codes.map(catmap).astype('category')
Out[23]:
4    A
5    B
6    A
7    C
dtype: category
Categories (3, object): [A, B, C]

That's not quite the same: if new_codes doesn't contain every code from s, the corresponding categories go missing from the result, so we'd need to make sure the categories and codes are identical to those of s. But the basic idea is there.


My use-case here is going from Categorical -> codes -> scikit-learn classifier -> prediction (codes). It'd be nice if the column could easily transform that result back.

Naming-wise, I'd say categorize or decode. Should we also add a symmetrical encode method to go from category to code?
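For the encode direction, existing pandas features may already cover most of it. The sketch below assumes that s.cat.codes is the encoding wanted for the original values, and that new values can be encoded by building a Categorical with the same categories; the names new_values and encoded are illustrative only.

```python
import pandas as pd

s = pd.Series(["A", "B", "C", "A"], dtype="category")

# Codes for the values already in s.
existing_codes = s.cat.codes

# Encode new values against the categories of s.
# A value that is not among the categories ("D" here) gets the code -1.
new_values = ["C", "A", "D"]
encoded = pd.Categorical(new_values, categories=s.cat.categories).codes
```

Note that -1 is the sentinel pandas uses for a missing value in categorical codes, so unseen values are silently encoded as missing rather than raising.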

If new_codes contains a previously unseen code, I'd say raise, or maybe insert NaN (perhaps make this an option?)
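A minimal sketch of the raise-or-NaN option, built on pd.Categorical.from_codes. The decode function and its errors parameter are hypothetical names for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd


def decode(codes, categories, errors="raise"):
    """Hypothetical helper: map integer codes back to categories.

    errors="raise" rejects codes outside 0..len(categories)-1;
    errors="coerce" turns them into missing values instead.
    """
    codes = np.asarray(codes)
    unseen = (codes < 0) | (codes >= len(categories))
    if unseen.any():
        if errors == "raise":
            raise ValueError(f"unseen codes: {sorted(set(codes[unseen].tolist()))}")
        # -1 is the code pandas uses for a missing value.
        codes = np.where(unseen, -1, codes)
    return pd.Categorical.from_codes(codes, categories=categories)


categories = pd.Index(["A", "B", "C"])
coerced = decode([0, 3, 2], categories, errors="coerce")
```

Here code 3 has no matching category, so with errors="coerce" the second element comes back as NaN, while the default errors="raise" would raise a ValueError for the same input.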

@TomAugspurger TomAugspurger modified the milestones: 0.17.0, Next Major Release Jun 17, 2015
@TomAugspurger
Contributor Author

Hmm, just saw pd.Categorical.from_codes in the docs. So it'd be pd.Categorical.from_codes(new_codes, s.cat.categories), which isn't terrible to type. Just wasn't the first place I looked. I'll close this unless anyone else thinks adding essentially that method to the .cat accessor, just with categories already assigned, would be useful.

so

>>> s.cat.categorize(new_codes)
# rather than
>>> pd.Categorical.from_codes(new_codes, s.cat.categories)
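As a runnable sketch of the from_codes route: from_codes returns a bare Categorical, so the example wraps it in a Series to keep the index of new_codes. Passing dtype=s.dtype instead of the categories is an assumption made here so that the ordered flag of s would carry over as well.

```python
import pandas as pd

s = pd.Series(["A", "B", "C", "A"], dtype="category")
new_codes = pd.Series([0, 1, 0, 2], index=[4, 5, 6, 7])

# Decode the codes using the categories (and orderedness) of s,
# then reattach the index of new_codes.
decoded = pd.Series(
    pd.Categorical.from_codes(new_codes.to_numpy(), dtype=s.dtype),
    index=new_codes.index,
)
```

Unlike the map-then-astype approach, the result always has the full set of categories from s, even if some codes never appear in new_codes.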

@jreback
Contributor

jreback commented Jun 17, 2015

This is how I would do it (there is currently a bug, reported in #10324)

In [40]: new_codes.map(s)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-40-f6a8e88422bb> in <module>()
----> 1 new_codes.map(s)

/Users/jreback/pandas/pandas/core/series.pyc in map(self, arg, na_action)
   2010 
   2011             indexer = arg.index.get_indexer(values)
-> 2012             new_values = com.take_1d(arg.values, indexer)
   2013             return self._constructor(new_values,
   2014                                      index=self.index).__finalize__(self)

/Users/jreback/pandas/pandas/core/common.pyc in take_nd(arr, indexer, axis, out, fill_value, mask_info, allow_fill)
    829         out_shape[axis] = len(indexer)
    830         out_shape = tuple(out_shape)
--> 831         if arr.flags.f_contiguous and axis == arr.ndim - 1:
    832             # minor tweak that can make an order-of-magnitude difference
    833             # for dataframes initialized directly from 2-d ndarrays

AttributeError: 'Categorical' object has no attribute 'flags'

@kokes
Contributor

kokes commented Dec 14, 2018

@jreback the bug has been resolved, so your proposed solution works just fine. Should this be closed? I'm not sure there's a need for an additional API for this.
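One caveat worth checking before relying on new_codes.map(s): Series.map with a Series argument looks values up in the index of s, not in its category codes. The two agree in the example from this thread only because the first occurrences in s happen to follow category order. The sketch below uses a reordered, made-up s to show the difference.

```python
import pandas as pd

# Same categories (A, B, C), but s no longer starts in category order.
s = pd.Series(["B", "A", "C", "A"], dtype="category")
new_codes = pd.Series([0, 1, 0, 2], index=[4, 5, 6, 7])

# Looks up each code as an index label of s: 0 -> s[0], 1 -> s[1], ...
by_index = new_codes.map(s)

# Looks up each code as a position in s.cat.categories.
by_code = new_codes.map(dict(enumerate(s.cat.categories)))
```

Here by_index holds B, A, B, C while by_code holds A, B, A, C, so decoding by code needs the categories (or from_codes) rather than s itself.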

@jreback
Contributor

jreback commented Dec 14, 2018

yep this looks good.

@jreback jreback closed this as completed Dec 14, 2018