DataFrame.apply() silently converting columns to non-categorical type #11208

pganssle · 2015-09-30T19:40:45Z

Per this SO question, using apply() to convert a subset of DataFrame columns to categorical does not work: the converted columns come back as object dtype unless every column is converted to categorical.

An MWE demonstrating the issue can be found at this gist for easy copy-paste.

Using this for example data:

import pandas as pd

pdf = pd.DataFrame(dict(name=       ('Earl', 'Eve', 'Alan', 'Randall', 'Danielle'),
                        age=        (    29,    17,     73,        31,         62),
                        gender=     (   'M',   'F',    'M',       'M',        'F'),
                        nationality=(  'US',  'UK',  'CAN',     'CAN',       'US'),
                        height=     ( 182.9, 167.6,  175.3,     170.2,      172.8)),
                   columns=('name', 'age', 'gender', 'nationality', 'height'))
pdf = pdf.set_index('name')
>>> print(pdf)

          age gender nationality  height
name
Earl       29      M          US   182.9
Eve        17      F          UK   167.6
Alan       73      M         CAN   175.3
Randall    31      M         CAN   170.2
Danielle   62      F          US   172.8

I tried to use pdf.apply() to convert 'gender' and 'nationality' to categorical columns:

cat_list = {'gender', 'nationality'}
set_cat_list = lambda x: x.astype('category') if x.name in cat_list else x
dfa = pdf.apply(set_cat_list)

>>> print('Applied to subset: dtype={}'.format(dfa['gender'].dtype))
Applied to subset: dtype=object

To make sure that the problem isn't just that I'm never reaching the x.astype('category') branch of the lambda expression, I added in an alert:

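import sys

in_cl = lambda x: x.name in cat_list
# Return the converted column, and as a side effect print whether the category branch was taken
set_cat_list_alert = lambda x: (set_cat_list(x),
                                sys.stdout.write('{}: {}\n'.format(x.name, in_cl(x))))[0]
dfa = pdf.apply(set_cat_list_alert)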
>>> print('Applied to subset: dtype={}'.format(dfa['gender'].dtype))
age: False
age: False
gender: True
nationality: True
height: False
Applied to subset: dtype=object

To verify that it's not just a problem setting any columns as categorical, I tried setting all columns to categorical, which works just fine:

set_cat = lambda x: x.astype('category')
dfb = pdf.apply(set_cat)

>>> print('Applied to whole frame: dtype={}'.format(dfb['gender'].dtype))
Applied to whole frame: dtype=category

Finally, I tried just using a for loop to duplicate the final result, to make sure that mixed categorical / non-categorical columns can coexist like this:

dfc = pdf.copy()
for cat in cat_list:
    dfc[cat] = pdf[cat].astype('category')

>>> print('For loop: dtype={}'.format(dfc['gender'].dtype))
For loop: dtype=category
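
For comparison, the loop above can probably be collapsed into a single call that bypasses apply() entirely. This is only a sketch: it assumes a pandas version in which DataFrame.astype accepts a column-to-dtype mapping (not something tested in this report, and likely unavailable in the release this issue was filed against), and the name dfd is introduced here purely for illustration.

```python
import pandas as pd

# Same example data as above, rebuilt so the snippet is self-contained.
pdf = pd.DataFrame({
    'name': ['Earl', 'Eve', 'Alan', 'Randall', 'Danielle'],
    'age': [29, 17, 73, 31, 62],
    'gender': ['M', 'F', 'M', 'M', 'F'],
    'nationality': ['US', 'UK', 'CAN', 'CAN', 'US'],
    'height': [182.9, 167.6, 175.3, 170.2, 172.8],
}).set_index('name')

cat_list = ['gender', 'nationality']

# Assumption: DataFrame.astype accepts a {column: dtype} mapping.
# Columns not named in the mapping keep their original dtype.
dfd = pdf.astype({col: 'category' for col in cat_list})

print(dfd.dtypes)
```

If the mapping form is not available, the for loop above remains the fallback, since it assigns each converted column back individually.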

Is this the desired behavior?

jreback · 2015-09-30T19:56:03Z

This is a duplicate of #9573, which has been closed for the soon-to-be-released 0.17.0.

thanks for the report

jreback closed this as completed Sep 30, 2015

jreback added the Reshaping and Categorical labels Sep 30, 2015

jreback added this to the 0.17.0 milestone Sep 30, 2015
