
Recoding as numerical categories with multiple columns #14242


Closed
lesshaste opened this issue Sep 18, 2016 · 2 comments
Labels
Categorical (Categorical Data Type), Duplicate Report (Duplicate issue or pull request), Usage Question

Comments

@lesshaste

lesshaste commented Sep 18, 2016

Code Sample, a copy-pastable example if possible

This SO question asks how to recode strings in a data frame as numerical categories: http://stackoverflow.com/questions/39475187/how-to-speed-up-recoding-into-integers

The pandas solution x = df.apply(lambda x: x.astype('category').cat.codes) is by far the fastest. However, it doesn't give a consistent encoding across columns when the data frame has more than one column.

E.g.

g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h

gets recoded to:

   0  1
0  4  6
1  0  4
2  2  5
3  6  3
4  3  5
5  5  4
6  1  1
7  3  2
8  5  0
9  3  4

Notice that 'd' is mapped to 3 in the first column but 2 in the second.
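
To reproduce the mismatch, here is a minimal sketch. It assumes the sample above is read as headerless CSV (header=None), so the columns are labelled 0 and 1; the names raw, codes, d_in_col0 and d_in_col1 are only illustrative.

```python
import io

import pandas as pd

# The sample data from above, treated as CSV without a header row (an assumption).
raw = """g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h
"""
df = pd.read_csv(io.StringIO(raw), header=None)

# Each column is converted on its own, so each gets its own sorted category list.
codes = df.apply(lambda col: col.astype('category').cat.codes)

# Look up the code assigned to 'd' in each column.
d_in_col0 = codes.loc[df[0] == 'd', 0].unique()
d_in_col1 = codes.loc[df[1] == 'd', 1].unique()
print(d_in_col0, d_in_col1)
```

The two codes differ because 'd' sits at a different position in each column's own sorted list of distinct values.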

It would be great if pandas could do this recoding consistently.

Expected Output

output of pd.show_versions()

@jorisvandenbossche
Member

@lesshaste The fact that df.apply(lambda x: x.astype('category').cat.codes) does this column by column is expected.
But see #12860 for some discussion on how to be able to do this on multiple columns at once (using the same categories for all columns).

The workaround listed over there is:

uniques = np.sort(pd.unique(df.values.ravel()))
df.apply(lambda x: x.astype('category', categories=uniques))
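
For readers on newer pandas releases, where passing categories= directly to astype is, as far as I know, no longer accepted, the same idea can be sketched with pd.CategoricalDtype. This is an adaptation under that assumption, not the code from #12860; the frame is rebuilt from the sample in the original post, and the names shared and codes are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the original post, columns labelled 0 and 1.
df = pd.DataFrame({0: list("gacjdibdid"), 1: list("khieihbdah")})

# One sorted set of categories shared by every column.
uniques = np.sort(pd.unique(df.to_numpy().ravel()))
shared = pd.CategoricalDtype(categories=uniques)

# Convert each column with the shared dtype, then take the integer codes.
codes = df.apply(lambda col: col.astype(shared).cat.codes)
print(codes)
```

Because every column uses the same category list, a given string receives the same integer wherever it appears; dropping .cat.codes keeps the columns as categoricals instead.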

jorisvandenbossche added the Usage Question, Duplicate Report, and Categorical labels Sep 18, 2016
jorisvandenbossche added this to the No action milestone Sep 18, 2016
@lesshaste
Author

That is very nice and I had no idea you could do that. Thank you.

jreback closed this as completed Sep 18, 2016