WIP/DO NOT MERGE: Categorical improvements #7444

jankatins · 2014-06-12T20:02:16Z

This is a PR to make discussing the doc changes easier. See #7217 for the main PR.

TODO List: now in #7217

The Docs (updated 1st july, 4pm CEST)

Categorical

New in version 0.15.

Note

While there was a pandas.Categorical in earlier versions, the ability to use Categorical data in Series and DataFrame is new.

This is a short introduction to the pandas Categorical type, including a short comparison with R’s factor.

Categoricals are a pandas data type that corresponds to categorical variables in statistics: variables that can take on only a limited, and usually fixed, number of possible values (commonly called levels). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.

In contrast to statistical categorical variables, a Categorical might have an order (e.g. ‘strongly agree’ vs ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, ...) are not possible.

All values of the Categorical are either in levels or np.nan. Order is defined by the order of the levels, not lexical order of the values. Internally, the data structure consists of a levels array and an integer array of level_codes which point to the real value in the levels array.
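A minimal pure-Python sketch of that layout may help. It is not the pandas implementation: the helper name `encode` is hypothetical, and using -1 as the code for a value that is not in the levels is an assumption taken from the old-style constructor description further down.

```python
def encode(values, levels=None):
    """Split values into (levels, level_codes); -1 marks a value not in levels."""
    if levels is None:
        # Infer the levels as the sorted unique values (NaN is skipped: NaN != NaN).
        levels = sorted({v for v in values if v == v})
    position = {level: i for i, level in enumerate(levels)}
    level_codes = [position.get(v, -1) for v in values]
    return list(levels), level_codes


levels, level_codes = encode(["a", "b", "c", "a"])

# Each code points back at the real value in the levels list.
restored = [levels[code] for code in level_codes]
```

Storing each distinct value once and repeating only small integers is also where the memory saving for string columns comes from.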

Categoricals are useful in the following cases:

A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the levels, sorting and min/max will use the logical order instead of the lexical order.
As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
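The difference between lexical and logical order in the second case can be sketched in plain Python, without pandas. The `rank` lookup below is only an illustration of "order is defined by the order of the levels" and is an assumption about how one might model it, not the actual implementation.

```python
levels = ["one", "two", "three"]          # logical order
rank = {level: i for i, level in enumerate(levels)}
values = ["three", "one", "two", "one"]

lexical = sorted(values)                   # plain string comparison
logical = sorted(values, key=rank.__getitem__)

# min/max follow the same level order.
smallest = min(values, key=rank.__getitem__)
largest = max(values, key=rank.__getitem__)
```

Here `lexical` puts "three" before "two", while `logical` keeps the order given by the levels.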

See also the API docs on Categoricals.

Object Creation

Categorical Series or columns in a DataFrame can be created in several ways:

By passing a Categorical object to a Series or assigning it to a DataFrame:

In [1]: raw_cat = pd.Categorical(["a","b","c","a"])
In [2]: s = pd.Series(raw_cat)
In [3]: s

Out[3]: 

0    a

1    b

2    c

3    a

dtype: category
In [4]: df = pd.DataFrame({"A":["a","b","c","a"]})
In [5]: df["B"] = raw_cat
In [6]: df

Out[6]: 

   A  B

0  a  a

1  b  b

2  c  c

3  a  a

By converting an existing Series or column to a category type:

In [7]: df = pd.DataFrame({"A":["a","b","c","a"]})
In [8]: df["B"] = df["A"].astype('category')
In [9]: df

Out[9]: 

   A  B

0  a  a

1  b  b

2  c  c

3  a  a

By using some special functions:

In [10]: df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
In [11]: labels = [ "{0} - {1}".format(i, i + 9) for i in range(0, 100, 10) ]
In [12]: df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
In [13]: df.head(10)

Out[13]: 

   value    group

0     65  60 - 69

1     49  40 - 49

2     56  50 - 59

3     43  40 - 49

4     43  40 - 49

5     91  90 - 99

6     32  30 - 39

7     87  80 - 89

8     36  30 - 39

9      8    0 - 9

Categoricals have a specific category dtype:

In [14]: df.dtypes
Out[14]: 
value       int32
group    category
dtype: object

Note

In contrast to R’s factor function, a Categorical does not convert input values to strings, and levels will end up with the same data type as the original values.

Note

In contrast to R’s factor function, there is currently no way to assign/change labels at creation time. Use levels to change the levels after creation time.

To get back to the original Series or numpy array, use Series.astype(original_dtype) or np.asarray(categorical):

In [15]: s = pd.Series(["a","b","c","a"])
In [16]: s

Out[16]: 

0    a

1    b

2    c

3    a

dtype: object
In [17]: s2 = s.astype('category')
In [18]: s2

Out[18]: 

0    a

1    b

2    c

3    a

dtype: category
In [19]: s3 = s2.astype('string')
In [20]: s3

Out[20]: 

0    a

1    b

2    c

3    a

dtype: object
In [21]: np.asarray(s2.cat)

Out[21]: array(['a', 'b', 'c', 'a'], dtype=object)

Working with levels

Categoricals have a levels property, which lists their possible values. If you don’t manually specify levels, they are inferred from the passed-in values. Series of type category expose the same interface via their cat property.

In [22]: raw_cat = pd.Categorical(["a","b","c","a"])
In [23]: raw_cat.levels

Out[23]: Index([u'a', u'b', u'c'], dtype='object')
In [24]: raw_cat.ordered

Out[24]: True
# Series of type "category" also expose this interface via the .cat property:

In [25]: s = pd.Series(raw_cat)
In [26]: s.cat.levels

Out[26]: Index([u'a', u'b', u'c'], dtype='object')
In [27]: s.cat.ordered

Out[27]: True

Note

New Categoricals are automatically ordered if the passed-in values are sortable or a levels argument is supplied. This differs from R’s factors, which are unordered unless explicitly told to be ordered (ordered=TRUE).

It’s also possible to pass in the levels in a specific order:

In [28]: raw_cat = pd.Categorical(["a","b","c","a"], levels=["c","b","a"])
In [29]: s = pd.Series(raw_cat)
In [30]: s.cat.levels

Out[30]: Index([u'c', u'b', u'a'], dtype='object')
In [31]: s.cat.ordered

Out[31]: True

Note

Passing in a levels argument implies ordered=True.

Any value omitted in the levels argument will be replaced by np.nan:

In [32]: raw_cat = pd.Categorical(["a","b","c","a"], levels=["a","b"])
In [33]: s = pd.Series(raw_cat)
In [34]: s.cat.levels

Out[34]: Index([u'a', u'b'], dtype='object')
In [35]: s

Out[35]: 

0      a

1      b

2    NaN

3      a

dtype: category

Renaming levels is done by assigning new values to the Categorical.levels or Series.cat.levels property:

In [36]: s = pd.Series(pd.Categorical(["a","b","c","a"]))
In [37]: s

Out[37]: 

0    a

1    b

2    c

3    a

dtype: category
In [38]: s.cat.levels = ["Group %s" % g for g in s.cat.levels]
In [39]: s

Out[39]: 

0    Group a

1    Group b

2    Group c

3    Group a

dtype: category
In [40]: s.cat.levels = [1,2,3]
In [41]: s

Out[41]: 

0    1

1    2

2    3

3    1

dtype: category

Note

In contrast to R’s factor function, a Categorical can have levels of other types than string.

Levels must be unique or a ValueError is raised:

In [42]: try:
   ....:     s.cat.levels = [1,1,1]
   ....: except ValueError as e:
   ....:     print("ValueError: " + str(e))
   ....: 
ValueError: Categorical levels must be unique

Appending a level can be done by assigning a levels list longer than the current levels:

In [43]: s.cat.levels = [1,2,3,4]
In [44]: s.cat.levels

Out[44]: Int64Index([1, 2, 3, 4], dtype='int64')
In [45]: s

Out[45]: 

0    1

1    2

2    3

3    1

dtype: category

Removing a level is also possible, but only the last level(s) can be removed by assigning a shorter list than current levels. Values which are omitted are replaced by np.nan.

In [46]: s.cat.levels = [1,2]
In [47]: s

Out[47]: 

0      1

1      2

2    NaN

3      1

dtype: category

Note

It’s only possible to remove or add a level at the last position. If that’s not where you want to remove an old or add a new level, use the Categorical.reorder_levels(new_order) or Series.cat.reorder_levels(new_order) methods before or after.

Removing unused levels can also be done:

In [48]: raw = pd.Categorical(["a","b","a"], levels=["a","b","c","d"])
In [49]: c = pd.Series(raw)
In [50]: raw

Out[50]: 

 a

 b

 a

Levels (4): Index(['a', 'b', 'c', 'd'], dtype=object), ordered
In [51]: raw.remove_unused_levels()
In [52]: raw

Out[52]: 

 a

 b

 a

Levels (2): Index(['a', 'b'], dtype=object), ordered
In [53]: c.cat.remove_unused_levels()
In [54]: c

Out[54]: 

0    a

1    b

2    a

dtype: category

Note

In contrast to R’s factor function, passing a Categorical as the sole input to the Categorical constructor will not remove unused levels but create a new Categorical which is equal to the passed in one!
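What removing unused levels has to do can be sketched on a plain levels list plus integer codes. This is an illustration only, not the code behind remove_unused_levels(); the function name and the use of -1 for missing values are assumptions.

```python
def drop_unused_levels(levels, codes):
    """Keep only levels that some code points at, and renumber the codes."""
    used = sorted({code for code in codes if code >= 0})
    remap = {old: new for new, old in enumerate(used)}
    new_levels = [levels[code] for code in used]
    new_codes = [remap[code] if code >= 0 else -1 for code in codes]
    return new_levels, new_codes


# Values a, b, a with levels a, b, c, d: "c" and "d" are unused.
trimmed = drop_unused_levels(["a", "b", "c", "d"], [0, 1, 0])

# Values d, b, d: the surviving levels keep their relative order.
shifted = drop_unused_levels(["a", "b", "c", "d"], [3, 1, 3])
```

The codes have to be renumbered because positions in the shortened levels list shift.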

Ordered or not...

If a Categorical is ordered (cat.ordered == True), then the order of the levels has a meaning and certain operations are possible. If the Categorical is unordered, these operations raise a TypeError.

In [55]: s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
In [56]: try:

   ....:     s.sort()

   ....: except TypeError as e:

   ....:     print("TypeError: " + str(e))

   ....:

TypeError: Categorical not ordered
In [57]: s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=True))
In [58]: s.sort()
In [59]: s

Out[59]: 

0    a

3    a

1    b

2    c

dtype: category
In [60]: print(s.min(), s.max())

('a', 'c')

Note

ordered=True is not strictly necessary in the second case, as lists of strings are sortable and so the resulting Categorical is ordered.

Sorting will use the order defined by levels, not any lexical order present on the data type. This is even true for strings and numeric data:

In [61]: s = pd.Series(pd.Categorical([1,2,3,1]))
In [62]: s.cat.levels = [2,3,1]
In [63]: s

Out[63]: 

0    2

1    3

2    1

3    2

dtype: category
In [64]: s.sort()
In [65]: s

Out[65]: 

0    2

3    2

1    3

2    1

dtype: category
In [66]: print(s.min(), s.max())

(2, 1)

Reordering the levels is possible via the Categorical.reorder_levels(new_levels) or Series.cat.reorder_levels(new_levels) methods:

In [67]: s2 = pd.Series(pd.Categorical([1,2,3,1]))
In [68]: s2.cat.reorder_levels([2,3,1])
In [69]: s2

Out[69]: 

0    1

1    2

2    3

3    1

dtype: category
In [70]: s2.sort()
In [71]: s2

Out[71]: 

1    2

2    3

0    1

3    1

dtype: category
In [72]: print(s2.min(), s2.max())

(2, 1)

Note

Note the difference between assigning new level names and reordering the levels: the first renames levels and therefore the individual values in the Series, but if the first position was sorted last, the renamed value will still be sorted last. Reordering means that the way values are sorted is different afterwards, but not that individual values in the Series are changed.
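The distinction can be made concrete with a small sketch on a levels list and integer codes. It mirrors the two examples above (levels 1, 2, 3 and the new list 2, 3, 1), but the helper names are hypothetical and this is not how pandas implements either operation.

```python
def rename(levels, codes, new_levels):
    """Renaming swaps the level names; the codes are untouched."""
    return list(new_levels), list(codes)


def reorder(levels, codes, new_order):
    """Reordering permutes the levels and remaps codes so values stay the same."""
    if sorted(new_order) != sorted(levels):
        raise ValueError("new_order must contain exactly the existing levels")
    new_position = {level: i for i, level in enumerate(new_order)}
    new_codes = [new_position[levels[c]] if c >= 0 else -1 for c in codes]
    return list(new_order), new_codes


def decode(levels, codes):
    return [levels[c] if c >= 0 else None for c in codes]


levels, codes = [1, 2, 3], [0, 1, 2, 0]        # values 1, 2, 3, 1
renamed = rename(levels, codes, [2, 3, 1])     # values become 2, 3, 1, 2
reordered = reorder(levels, codes, [2, 3, 1])  # values stay 1, 2, 3, 1

# Sorting by code gives the level order: 2, 3, 1, 1 after reordering.
sorted_values = decode(reordered[0], sorted(reordered[1]))
```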

Operations

The following operations are possible with categorical data:

Getting the minimum and maximum, if the categorical is ordered:

In [73]: s = pd.Series(pd.Categorical(["a","b","c","a"], levels=["c","a","b","d"]))
In [74]: print(s.min(), s.max())

('c', 'b')

Note

If the Categorical is not ordered, Categorical.min() and Categorical.max() and the corresponding operations on Series will raise TypeError.

The mode:

In [75]: raw_cat = pd.Categorical(["a","b","c","c"], levels=["c","a","b","d"])
In [76]: s = pd.Series(raw_cat)
In [77]: raw_cat.mode()

Out[77]: 

 c

Levels (4): Index(['c', 'a', 'b', 'd'], dtype=object), ordered
In [78]: s.mode()

Out[78]: 

0    c

dtype: category

Note

Numeric operations like +, -, *, / and operations based on them (e.g. .median(), which would need to compute the mean between two values if the length of an array is even) do not work and raise a TypeError.

Series methods like Series.value_counts() will use all levels, even if some levels are not present in the data:

In [79]: s = pd.Series(pd.Categorical(["a","b","c","c"], levels=["c","a","b","d"]))
In [80]: s.value_counts()

Out[80]: 

c    2

b    1

a    1

d    0

dtype: int64
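The "count every level, even unused ones" behaviour can be sketched with the standard library. The function below is a hypothetical stand-in, not Series.value_counts(), and it returns counts in level order rather than sorted by frequency.

```python
from collections import Counter


def counts_per_level(values, levels):
    """Count occurrences for every level, reporting 0 for unused levels."""
    seen = Counter(values)
    return {level: seen.get(level, 0) for level in levels}


counts = counts_per_level(["a", "b", "c", "c"], levels=["c", "a", "b", "d"])
```

Starting from the levels rather than from the observed values is what makes the unused level "d" show up with a count of 0.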

Groupby will also show “unused” levels:

In [81]: cats = pd.Categorical(["a","b","b","b","c","c","c"], levels=["a","b","c","d"])
In [82]: df = pd.DataFrame({"cats":cats,"values":[1,2,2,2,3,4,5]})
In [83]: df.groupby("cats").mean()

Out[83]: 

      values

cats        

a          1

b          2

c          4

d        NaN
In [84]: cats2 = pd.Categorical(["a","a","b","b"], levels=["a","b","c"])
In [85]: df2 = pd.DataFrame({"cats":cats2,"B":["c","d","c","d"], "values":[1,2,3,4]})
# This doesn't work yet with two columns -> see failing unittests

In [86]: df2.groupby(["cats","B"]).mean()

Out[86]: 

        values

cats B        

a    c       1

     d       2

b    c       3

     d       4

Pivot tables:

In [87]: raw_cat = pd.Categorical(["a","a","b","b"], levels=["a","b","c"])
In [88]: df = pd.DataFrame({"A":raw_cat,"B":["c","d","c","d"], "values":[1,2,3,4]})
In [89]: pd.pivot_table(df, values='values', index=['A', 'B'])

Out[89]: 

A  B

a  c    1

   d    2

b  c    3

   d    4

Name: values, dtype: int64

Data munging

The optimized pandas data access methods .loc, .iloc, .ix, .at, and .iat work as normal. The only differences are the return type (for getting) and that only values already in the levels can be assigned.

Getting

If the slicing operation returns either a DataFrame or a column of type Series, the category dtype is preserved.

In [90]: cats = pd.Categorical(["a","b","b","b","c","c","c"], levels=["a","b","c"])
In [91]: idx = pd.Index(["h","i","j","k","l","m","n",])
In [92]: values= [1,2,2,2,3,4,5]
In [93]: df = pd.DataFrame({"cats":cats,"values":values}, index=idx)
In [94]: df.iloc[2:4,:]

Out[94]: 

  cats  values

j    b       2

k    b       2
In [95]: df.iloc[2:4,:].dtypes

Out[95]: 

cats      category

values       int64

dtype: object
In [96]: df.loc["h":"j","cats"]

Out[96]: 

h    a

i    b

j    b

Name: cats, dtype: category
In [97]: df.ix["h":"j",0:1]

Out[97]: 

  cats

h    a

i    b

j    b
In [98]: df[df["cats"] == "b"]

Out[98]: 

  cats  values

i    b       2

j    b       2

k    b       2

An example where the Categorical is not preserved is if you take one single row: the resulting Series is of dtype object:

# get the complete "h" row as a Series
In [99]: df.loc["h", :]
Out[99]: 
cats      a
values    1
Name: h, dtype: object

Returning a single item from a Categorical will also return the value, not a Categorical of length “1”.

In [100]: df.iat[0,0]
Out[100]: 'a'
In [101]: df["cats"].cat.levels = ["x","y","z"]
In [102]: df.at["h","cats"] # returns a string

Out[102]: 'x'

Note

This is a difference to R’s factor function, where factor(c(1,2,3))[1] returns a single value factor.

To get a single value Series of type category pass in a single value list:

In [103]: df.loc[["h"],"cats"]
Out[103]: 
h    x
Name: cats, dtype: category

Setting

Setting values in a categorical column (or Series) works as long as the value is included in the levels:

In [104]: cats = pd.Categorical(["a","a","a","a","a","a","a"], levels=["a","b"])
In [105]: idx = pd.Index(["h","i","j","k","l","m","n"])
In [106]: values = [1,1,1,1,1,1,1]
In [107]: df = pd.DataFrame({"cats":cats,"values":values}, index=idx)
In [108]: df.iloc[2:4,:] = [["b",2],["b",2]]
In [109]: df

Out[109]: 

  cats  values

h    a       1

i    a       1

j    b       2

k    b       2

l    a       1

m    a       1

n    a       1
In [110]: try:

   .....:     df.iloc[2:4,:] = [["c",3],["c",3]]

   .....: except ValueError as e:

   .....:     print("ValueError: " + str(e))

   .....:

ValueError: cannot setitem on a Categorical with a new level, set the levels first
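The rule that only existing levels may be assigned can be sketched with a tiny stand-in class. `TinyCategorical` is hypothetical and far simpler than pandas.Categorical; only the error message is copied from the output above.

```python
class TinyCategorical:
    """Hypothetical minimal container: a levels list plus integer codes."""

    def __init__(self, values, levels):
        self.levels = list(levels)
        self._position = {level: i for i, level in enumerate(self.levels)}
        self.codes = [self._position[v] for v in values]

    def __getitem__(self, index):
        return self.levels[self.codes[index]]

    def __setitem__(self, index, value):
        # Reject anything that is not already a level.
        if value not in self._position:
            raise ValueError("cannot setitem on a Categorical with a new level, "
                             "set the levels first")
        self.codes[index] = self._position[value]


cat = TinyCategorical(["a", "a", "a"], levels=["a", "b"])
cat[1] = "b"            # fine: "b" is a level

try:
    cat[2] = "c"        # "c" is not a level
    error = None
except ValueError as e:
    error = str(e)
```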

Setting values by assigning a Categorical will also check that the levels match:

In [111]: df.loc["j":"k","cats"] = pd.Categorical(["a","a"], levels=["a","b"])
In [112]: df

Out[112]: 

  cats  values

h    a       1

i    a       1

j    a       2

k    a       2

l    a       1

m    a       1

n    a       1
In [113]: try:

   .....:     df.loc["j":"k","cats"] = pd.Categorical(["b","b"], levels=["a","b","c"])

   .....: except ValueError as e:

   .....:     print("ValueError: " + str(e))

   .....:

ValueError: cannot set a Categorical with another, without identical levels

Assigning a Categorical to parts of a column of other types will use the values:

In [114]: df = pd.DataFrame({"a":[1,1,1,1,1], "b":["a","a","a","a","a"]})
In [115]: df.loc[1:2,"a"] = pd.Categorical(["b","b"], levels=["a","b"])
In [116]: df.loc[2:3,"b"] = pd.Categorical(["b","b"], levels=["a","b"])
In [117]: df

Out[117]: 

   a  b

0  1  a

1  b  a

2  b  b

3  1  b

4  1  a
In [118]: df.dtypes

Out[118]: 

a    object

b    object

dtype: object

Merging

You can concatenate two DataFrames containing categorical data, but the levels of these Categoricals need to be the same:

In [119]: cat = pd.Categorical(["a","b"], levels=["a","b"])
In [120]: vals = [1,2]
In [121]: df = pd.DataFrame({"cats":cat, "vals":vals})
In [122]: res = pd.concat([df,df])
In [123]: res

Out[123]: 

  cats  vals

0    a     1

1    b     2

0    a     1

1    b     2
In [124]: res.dtypes

Out[124]: 

cats    category

vals       int64

dtype: object
In [125]: df_different = df.copy()
In [126]: df_different["cats"].cat.levels = ["a","b","c"]
In [127]: try:

   .....:     pd.concat([df,df_different])

   .....: except ValueError as e:

   .....:     print("ValueError: " + str(e))

   .....:

The same applies to df.append(df_different).
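Why the levels have to match can be seen in a sketch that concatenates two (levels, codes) pairs. The function name and the error message are hypothetical; this is not the pandas concat code.

```python
def concat_levels_and_codes(first, second):
    """Concatenate two (levels, codes) pairs that share identical levels."""
    levels_a, codes_a = first
    levels_b, codes_b = second
    # Codes are only meaningful relative to their own levels list, so the
    # two lists must be identical, including their order.
    if list(levels_a) != list(levels_b):
        raise ValueError("levels differ, align them before concatenating")
    return list(levels_a), list(codes_a) + list(codes_b)


same = concat_levels_and_codes((["a", "b"], [0, 1]), (["a", "b"], [0, 1]))

try:
    concat_levels_and_codes((["a", "b"], [0, 1]), (["a", "b", "c"], [0, 1]))
    mismatch = None
except ValueError as e:
    mismatch = str(e)
```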

Getting Data In/Out

Writing data (Series, Frames) to an HDF store and reading it back in its entirety works. Querying the HDF store does not yet work.

In [128]: hdf_file = "test.h5"
In [129]: s = pd.Series(pd.Categorical(['a', 'b', 'b', 'a', 'a', 'c'], levels=['a','b','c','d']))
In [130]: df = pd.DataFrame({"s":s, "vals":[1,2,3,4,5,6]})
In [131]: df.to_hdf(hdf_file, "frame")
In [132]: df2 = pd.read_hdf(hdf_file, "frame")
In [133]: df2

Out[133]: 

   s  vals

0  a     1

1  b     2

2  b     3

3  a     4

4  a     5

5  c     6
In [134]: try:

   .....:     pd.read_hdf(hdf_file, "frame", where = ['index>2'])

   .....: except TypeError as e:

   .....:     print("TypeError: " + str(e))

   .....:

TypeError: cannot pass a where specification when reading from a Fixed format store. this store must be selected in its entirety

Writing to a csv file will convert the data, effectively removing any information about the Categorical (levels and ordering). So if you read back the csv file you have to convert the relevant columns back to category and assign the right levels and level ordering.

In [135]: s = pd.Series(pd.Categorical(['a', 'b', 'b', 'a', 'a', 'd']))
# rename the levels

In [136]: s.cat.levels = ["very good", "good", "bad"]
# add new levels at the end

In [137]: s.cat.levels = list(s.cat.levels) + ["medium", "very bad"]
# reorder the levels

In [138]: s.cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
In [139]: df = pd.DataFrame({"s":s, "vals":[1,2,3,4,5,6]})
In [140]: df.to_csv(csv_file)

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-140-72bb1b843e60> in <module>()

----> 1 df.to_csv(csv_file)
c:\data\external\pandas\pandas\util\decorators.pyc in wrapper(*args, **kwargs)

     58                 else:

     59                     kwargs[new_arg_name] = old_arg_value

---> 60             return func(*args, **kwargs)

     61         return wrapper

     62     return _deprecate_kwarg
c:\data\external\pandas\pandas\core\frame.pyc in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, **kwds)

   1139                                      doublequote=doublequote,

   1140                                      escapechar=escapechar)

-> 1141         formatter.save()

   1142 

   1143         if path_or_buf is None:
c:\data\external\pandas\pandas\core\format.pyc in save(self)

   1312 

   1313             else:

-> 1314                 self._save()

   1315 

   1316         finally:
c:\data\external\pandas\pandas\core\format.pyc in _save(self)

   1412                 break

   1413 

-> 1414             self._save_chunk(start_i, end_i)

   1415 

   1416     def _save_chunk(self, start_i, end_i):
c:\data\external\pandas\pandas\core\format.pyc in _save_chunk(self, start_i, end_i)

   1424             d = b.to_native_types(slicer=slicer, na_rep=self.na_rep,

   1425                                   float_format=self.float_format,

-> 1426                                   date_format=self.date_format)

   1427 

   1428             for col_loc, col in zip(b.mgr_locs, d):
c:\data\external\pandas\pandas\core\internals.pyc in to_native_types(self, slicer, na_rep, **kwargs)

    446         values = self.values

    447         if slicer is not None:

--> 448             values = values[:, slicer]

    449         values = np.array(values, dtype=object)

    450         mask = isnull(values)
c:\data\external\pandas\pandas\core\categorical.pyc in __getitem__(self, key)

    669                 return self.levels[i]

    670         else:

--> 671             return Categorical(values=self._codes[key], levels=self.levels,

    672                                ordered=self.ordered, fastpath=True)

    673 
IndexError: too many indices
In [141]: df2 = pd.read_csv(csv_file)

---------------------------------------------------------------------------

CParserError                              Traceback (most recent call last)

<ipython-input-141-8d612f40488f> in <module>()

----> 1 df2 = pd.read_csv(csv_file)
c:\data\external\pandas\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)

    450                     infer_datetime_format=infer_datetime_format)

    451 

--> 452         return _read(filepath_or_buffer, kwds)

    453 

    454     parser_f.__name__ = name
c:\data\external\pandas\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)

    232 

    233     # Create the parser.

--> 234     parser = TextFileReader(filepath_or_buffer, **kwds)

    235 

    236     if (nrows is not None) and (chunksize is not None):
c:\data\external\pandas\pandas\io\parsers.pyc in __init__(self, f, engine, **kwds)

    540             self.options['has_index_names'] = kwds['has_index_names']

    541 

--> 542         self._make_engine(self.engine)

    543 

    544     def _get_options_with_defaults(self, engine):
c:\data\external\pandas\pandas\io\parsers.pyc in _make_engine(self, engine)

    677     def _make_engine(self, engine='c'):

    678         if engine == 'c':

--> 679             self._engine = CParserWrapper(self.f, **self.options)

    680         else:

    681             if engine == 'python':
c:\data\external\pandas\pandas\io\parsers.pyc in __init__(self, src, **kwds)

   1039         kwds['allow_leading_cols'] = self.index_col is not False

   1040 

-> 1041         self._reader = _parser.TextReader(src, **kwds)

   1042 

   1043         # XXX
c:\data\external\pandas\pandas\parser.pyd in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4629)()
c:\data\external\pandas\pandas\parser.pyd in pandas.parser.TextReader._get_header (pandas\parser.c:6092)()
CParserError: Passed header=0 but only 0 lines in file
In [142]: df2.dtypes

Out[142]: 

s       category

vals       int64

dtype: object
In [143]: df2["vals"]

Out[143]: 

0    1

1    2

2    3

3    4

4    5

5    6

Name: vals, dtype: int64
# Redo the category

In [144]: df2["vals"] = df2["vals"].astype("category")
In [145]: df2["vals"].cat.levels = list(df2["vals"].cat.levels) + ["medium", "very bad"]
In [146]: df2["vals"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-146-d48b87b27d90> in <module>()

----> 1 df2["vals"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
c:\data\external\pandas\pandas\core\categorical.pyc in reorder_levels(self, new_levels, ordered)

    342 

    343         if len(new_levels) != len(self._levels):

--> 344             raise ValueError('Reordered levels must be of same length as old levels')

    345         if len(new_levels-self._levels):

    346             raise ValueError('Reordered levels be the same as the original levels')
ValueError: Reordered levels must be of same length as old levels
In [147]: df2.dtypes

Out[147]: 

s       category

vals    category

dtype: object
In [148]: df2["vals"]

Out[148]: 

0    1

1    2

2    3

3    4

4    5

5    6

Name: vals, dtype: category
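Independently of the session above, the CSV round trip can be sketched with the standard library alone: the file only carries the string values, so the levels and their order have to be supplied again after reading. The level names reuse the example; everything else (the in-memory buffer, variable names) is an assumption made for illustration.

```python
import csv
import io

levels = ["very bad", "bad", "medium", "good", "very good"]
values = ["good", "very good", "bad"]

# Writing: only the plain values reach the file, not the levels or their order.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["s"])
writer.writerows([v] for v in values)

# Reading: we get plain strings back and must re-attach the levels ourselves.
buffer.seek(0)
rows = list(csv.reader(buffer))
restored_values = [row[0] for row in rows[1:]]

position = {level: i for i, level in enumerate(levels)}
restored_codes = [position[v] for v in restored_values]
by_level = sorted(restored_values, key=position.__getitem__)
```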

Missing Data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.

There are two ways np.nan can be represented in a Categorical: either the value is not available or np.nan is a valid level.

In [149]: s = pd.Series(pd.Categorical(["a","b",np.nan,"a"]))
In [150]: s

Out[150]: 

0      a

1      b

2    NaN

3      a

dtype: category
# only two levels

In [151]: s.cat.levels

Out[151]: Index([u'a', u'b'], dtype='object')
In [152]: s2 = pd.Series(pd.Categorical(["a","b","c","a"]))
In [153]: s2.cat.levels = [1,2,np.nan]
In [154]: s2

Out[154]: 

0     1

1     2

2   NaN

3     1

dtype: category
# three levels, np.nan included

# Note: as int arrays can't hold NaN the levels were converted to float

In [155]: s2.cat.levels

Out[155]: Float64Index([1.0, 2.0, nan], dtype='float64')

Gotchas

Categorical is not a numpy array

Currently, Categorical and the corresponding category Series are implemented as Python objects and not as a low-level numpy array dtype. This leads to some problems.

numpy itself doesn’t know about the new dtype:

In [156]: try:
   .....:     np.dtype("category")
   .....: except TypeError as e:
   .....:      print("TypeError: " + str(e))
   .....: 
TypeError: data type "category" not understood
In [157]: dtype = pd.Categorical(["a"]).dtype
In [158]: try:

   .....:     np.dtype(dtype)

   .....: except TypeError as e:

   .....:      print("TypeError: " + str(e))

   .....:

TypeError: data type not understood
# dtype comparisons work:

In [159]: dtype == np.str_

Out[159]: False
In [160]: np.str_ == dtype

Out[160]: False

Using numpy functions on a Series of type category should not work as Categoricals are not numeric data (even in the case that .levels is numeric).

In [161]: s = pd.Series(pd.Categorical([1,2,3,4]))
In [162]: try:

   .....:     np.sum(s)

   .....: except TypeError as e:

   .....:      print("TypeError: " + str(e))

   .....:

TypeError: Categorical cannot perform the operation sum

Note

If such a function works, please file a bug at https://github.com/pydata/pandas!

Side effects

Constructing a Series from a Categorical will not copy the input Categorical. This means that changes to the Series will in most cases change the original Categorical:

In [163]: cat = pd.Categorical([1,2,3,10], levels=[1,2,3,4,10])
In [164]: s = pd.Series(cat, name="cat")
In [165]: cat

Out[165]: 

  1

  2

  3

 10

Levels (5): Int64Index([ 1,  2,  3,  4, 10], dtype=int64), ordered
In [166]: s.iloc[0:2] = 10
In [167]: cat

Out[167]: 

 10

 10

  3

 10

Levels (5): Int64Index([ 1,  2,  3,  4, 10], dtype=int64), ordered
In [168]: df = pd.DataFrame(s)
In [169]: df["cat"].cat.levels = [1,2,3,4,5]
In [170]: cat

Out[170]: 

 5

 5

 3

 5

Levels (5): Int64Index([1, 2, 3, 4, 5], dtype=int64), ordered

Use copy=True to prevent such a behaviour:

In [171]: cat = pd.Categorical([1,2,3,10], levels=[1,2,3,4,10])
In [172]: s = pd.Series(cat, name="cat", copy=True)
In [173]: cat

Out[173]: 

  1

  2

  3

 10

Levels (5): Int64Index([ 1,  2,  3,  4, 10], dtype=int64), ordered
In [174]: s.iloc[0:2] = 10
In [175]: cat

Out[175]: 

  1

  2

  3

 10

Levels (5): Int64Index([ 1,  2,  3,  4, 10], dtype=int64), ordered

Note

This also happens in some cases when you supply a numpy array instead of a Categorical: using an int array (e.g. np.array([1,2,3,4])) will exhibit the same behaviour, but using a string array (e.g. np.array(["a","b","c","a"])) will not.
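The underlying mechanism is ordinary Python aliasing, which a sketch without pandas can show. `Holder` and its `copy_input` flag are hypothetical names standing in for a container that either keeps a reference to the caller's data or copies it.

```python
import copy


class Holder:
    """Hypothetical stand-in for a container that wraps caller-supplied data."""

    def __init__(self, data, copy_input=False):
        # Without a copy, self.data is the very same list object the caller holds.
        self.data = copy.copy(data) if copy_input else data


shared = [1, 2, 3, 10]
aliasing = Holder(shared)
aliasing.data[0] = 10          # also changes `shared`

original = [1, 2, 3, 10]
isolated = Holder(original, copy_input=True)
isolated.data[0] = 10          # `original` is untouched
```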

Danger of confusion

Both Series and Categorical have a method .reorder_levels(), but they do different things. For a Series of type category, this means there is some danger of confusing the two methods.

In [176]: s = pd.Series(pd.Categorical([1,2,3,4]))
In [177]: print(s.cat.levels)

Int64Index([1, 2, 3, 4], dtype='int64')
# wrong and raises an error:

In [178]: try:

   .....:     s.reorder_levels([4,3,2,1])

   .....: except Exception as e:

   .....:     print("Exception: " + str(e))

   .....:

Exception: Can only reorder levels on a hierarchical axis.
# right

In [179]: s.cat.reorder_levels([4,3,2,1])
In [180]: print(s.cat.levels)

Int64Index([4, 3, 2, 1], dtype='int64')

See also the API documentation for pandas.Series.reorder_levels() and pandas.Categorical.reorder_levels().

Old style constructor usage

In earlier versions, a Categorical could be constructed by passing in precomputed level_codes (then called labels) instead of values with levels. The level_codes are interpreted as pointers to the levels, with -1 as NaN. This usage is now deprecated and not available unless compat=True is passed to the constructor of Categorical.

In [181]: cat = pd.Categorical([1,2], levels=[1,2,3], compat=True)
In [182]: cat.get_values()

Out[182]: array([2, 3], dtype=int64)

In the default case (compat=False) the first argument is interpreted as values.

In [183]: cat = pd.Categorical([1,2], levels=[1,2,3], compat=False)
In [184]: cat.get_values()

Out[184]: array([1, 2], dtype=int64)

Warning

Using Categorical with precomputed level_codes and levels is deprecated and a FutureWarning is raised. Please change your code to use one of the proper constructor modes instead of adding compat=False.
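How precomputed level_codes are interpreted can be sketched in a few lines. The function name is hypothetical and this is not the Categorical constructor; it only reproduces the "pointer into levels, -1 means NaN" rule stated above.

```python
import math


def from_level_codes(level_codes, levels):
    """Look each code up in levels; -1 stands for a missing value."""
    return [levels[code] if code != -1 else math.nan for code in level_codes]


# Same inputs as the compat=True example: codes 1 and 2 point at levels 2 and 3.
old_style = from_level_codes([1, 2], levels=[1, 2, 3])

# A -1 code becomes NaN.
with_missing = from_level_codes([0, -1], levels=[1, 2, 3])
```

This is why the same first argument, [1, 2], yields the values 2 and 3 when read as codes but 1 and 2 when read as values.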

No categorical index

There is currently no index of type category, so setting the index to a Categorical will convert the Categorical to a normal numpy array first and therefore remove any custom ordering of the levels:

In [185]: cats = pd.Categorical([1,2,3,4], levels=[4,2,3,1])
In [186]: strings = ["a","b","c","d"]
In [187]: values = [4,2,3,1]
In [188]: df = pd.DataFrame({"strings":strings, "values":values}, index=cats)
In [189]: df.index

Out[189]: Int64Index([1, 2, 3, 4], dtype='int64')
# This should sort by levels but does not as there is no CategoricalIndex!

In [190]: df.sort_index()

Out[190]: 

  strings  values

1       a       4

2       b       2

3       c       3

4       d       1

Note

This could change if a CategoricalIndex is implemented (see #7629).

dtype in apply

Pandas currently does not preserve the dtype in apply functions: if you apply along rows, you get a Series of object dtype (the same as getting a row, where getting one element returns a basic type), and applying along columns also converts to object.

In [191]: df = pd.DataFrame({"a":[1,2,3,4], "b":["a","b","c","d"], "cats":pd.Categorical([1,2,3,2])})
In [192]: df.apply(lambda row: type(row["cats"]), axis=1)

Out[192]: 

0    <type 'long'>

1    <type 'long'>

2    <type 'long'>

3    <type 'long'>

dtype: object
In [193]: df.apply(lambda col: col.dtype, axis=0)

Out[193]: 

a       object

b       object

cats    object

dtype: object

Future compatibility

As Categorical is not a native numpy dtype, the implementation details of Series.cat can change if such a numpy dtype is implemented.

jreback · 2014-06-12T20:11:51Z

can u show the rendered one in the top section?

jreback · 2014-06-12T20:14:27Z

doc/build/html/categorical.html

@@ -0,0 +1,721 @@
+
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"


it's not useful to actually include this file; I only wanted to see the rendered version in the top section of the PR, which could simply be an ipython session and/or a jpg of the rendered html

jreback · 2014-06-12T20:35:43Z

I am not a big fan of the long Series accessors, e.g. category_levels

however, easy enough to do something like

Series.cat.levels or Series.category.levels (or Series.values.levels, which u can do now)

jankatins · 2014-06-12T21:00:14Z

EDIT: Doc removed and new version added below

jankatins · 2014-06-12T21:04:02Z

@jreback: I like the idea with Series.cat. What should Series.cat do when the series is not a category? Same as Series.str (-> convert) or raise a TypeError?

jreback · 2014-06-12T21:05:30Z

naw will raise a TypeError

jreback · 2014-06-12T21:10:28Z

very easy impl

@property
def cat(self):
    if self.dtype != 'category':
       raise TypeError('not a category')
    return self.values

jankatins · 2014-06-21T09:39:35Z

I've updated the docs and added new unittests for slicing (lots of failures and crashes :-( )

Edit: I also changed a typo in one of my earlier commits

jreback · 2014-06-21T19:59:00Z

I updated the main PR; all of the slicing FIXME's are done: #7217

jankatins · 2014-06-21T21:57:52Z

Thanks! I started to fix that at the completely wrong place and got stuck right away :/

jankatins · 2014-06-22T23:02:33Z

@jreback I've fixed some screwups in my slicing testcases and added more slicing tests.

I also added assigning testcases (which of course fail because assigning is not yet implemented). I do hope that I got the testcases right...

jreback · 2014-06-22T23:04:09Z

ok thanks

jreback · 2014-06-23T15:24:11Z

Assign value already in levels

> df = orig.copy()
(Pdb) n
> /mnt/home/jreback/pandas/pandas/tests/test_categorical.py(979)test_assigning_ops()
-> df.iloc[2,0] = "b"
(Pdb) p df
  cats  values
h    a       1
i    a       1
j    b       1
k    a       1
l    a       1
m    a       1
n    a       1

Assign value NOT in levels, just add the levels to the end
(or is this a Value Error?)

(Pdb) !df.iloc[2,0]
'b'
(Pdb) !df.iloc[2,0] = 'c'
(Pdb) p df
  cats  values
h    a       1
i    a       1
j    c       1
k    a       1
l    a       1
m    a       1
n    a       1
(Pdb) p df['cats'].cat.levels
Index([u'a', u'b', u'c'], dtype='object')
(Pdb) p df['cats'].cat.labels
array([0, 0, 2, 0, 0, 0, 0])
(Pdb)

jankatins · 2014-06-23T17:16:32Z

Assigning values not in levels should raise: if you want to have them included, you need to first add such a level and then assign the value.

If the levels and their ordering have a meaning, it makes no sense to add a level just because a value was assigned: s.levels = ["best", "good", "medium", "bad", "worst"]; s.ordered=True. Now assigning "milk" (or a float/int) makes no sense.
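
The rule argued for here (raise on unknown values, add the level explicitly first) can be sketched with a toy model. This is an illustration only, not the pandas implementation: `ToyCategorical` and `add_level` are hypothetical names, and the model just stores integer codes that point into a list of levels.

```python
class ToyCategorical:
    """Toy model: values are stored as integer codes into a list of levels."""

    def __init__(self, values, levels):
        self.levels = list(levels)
        self.codes = [self.levels.index(v) for v in values]

    def __setitem__(self, pos, value):
        # Assigning a value that is not a known level is an error;
        # the level set is never extended implicitly.
        if value not in self.levels:
            raise ValueError("cannot assign %r: not in levels" % (value,))
        self.codes[pos] = self.levels.index(value)

    def add_level(self, level):
        # Extending the level set is an explicit, separate step.
        if level in self.levels:
            raise ValueError("level %r already exists" % (level,))
        self.levels.append(level)

    def to_list(self):
        return [self.levels[code] for code in self.codes]


s = ToyCategorical(["good", "bad", "good"],
                   levels=["best", "good", "medium", "bad", "worst"])
s[0] = "best"            # fine: "best" is an existing level

try:
    s[1] = "milk"        # not a level -> rejected
    rejected = False
except ValueError:
    rejected = True

s.add_level("no answer")  # add the level first ...
s[1] = "no answer"        # ... then the assignment works
print(s.to_list())
```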

jreback · 2014-06-23T17:18:19Z

ok gr8....thxs

jankatins · 2014-06-23T17:19:50Z

This is what R does:

> factor(c(1,2,3,4))
[1] 1 2 3 4
Levels: 1 2 3 4
> a = factor(c(1,2,3,4))
> a[1]
[1] 1
Levels: 1 2 3 4
> a[1] <- 6
Warning message:
In `[<-.factor`(`*tmp*`, 1, value = 6) : invalid factor level, NA generated

(One difference: a single value is also of type factor...)

jreback · 2014-06-23T17:23:23Z

ok....changed to raising (easy enough for say an ordered=False to allow this, just append to the level set)

jreback · 2014-06-24T16:56:57Z

yep...that's what I do from your branch

jankatins · 2014-06-24T16:58:31Z

[removed and added in the first comment]

jreback · 2014-06-24T17:12:21Z

I changed newstyle -> compat (as that's the usual argument). So one can set it if it's an issue to preserve old behavior. good

jankatins · 2014-06-24T17:34:56Z

But this would mean that the default is changed to "new style", which will break old code?

jreback · 2014-06-24T17:37:07Z

I think I changed the semantics, but weren't you +1 on just making newstyle (compat=False) the default anyhow?

jankatins · 2014-06-24T17:58:33Z

I'm fine with it, but wasn't sure if you intended to break old code with a rename :-)

If you are ok with the API change, I would remove the codepath completely and just add something similar to the fastpath=True case.

jreback · 2014-06-24T18:00:21Z

ok....i'll leave for now, pls review after I push again (i'll let u know)

jreback · 2014-06-24T18:26:18Z

@JanSchulz

got fillna working with a value (I assume it's an error if it's NOT in the levels)

for method='pad'/'bfill' it's an error if it's not an ordered Categorical
then I guess it's normal propagation, yes?

jankatins · 2014-06-24T18:35:39Z

As far as I understood pad it just takes the value at position x-1, so I think it should also work for unordered categoricals. [1,2,nan,3] -> [1,2,2,3] because 2 was at the second position?
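
A minimal pure-Python sketch of this reading of pad: each missing entry takes the last seen value by position, so no ordering of the levels is needed. The function name `pad` is used only for this illustration and is not the pandas implementation.

```python
import math


def pad(values):
    """Forward-fill: replace each NaN with the last non-missing value before it."""
    filled = []
    have_last = False
    last = None
    for value in values:
        is_missing = isinstance(value, float) and math.isnan(value)
        if is_missing and have_last:
            filled.append(last)
        else:
            # Leading NaNs have nothing to copy from and stay as they are.
            filled.append(value)
            if not is_missing:
                last, have_last = value, True
    return filled


result = pad([1, 2, float("nan"), 3])
print(result)
```

The fill depends only on position, which is why the same logic applies to unordered data such as `["b", nan, "a"]`.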

jreback · 2014-07-01T13:50:38Z

updated

jankatins · 2014-07-01T15:04:42Z

@jreback fix for describe -> This fixes IMO these two entries under "FIXME":

Series(categorical).describe() / Categorical.unique() -> should this return all levels or only used levels?
Series(cat).describe() -> show information about the levels?

Re level information on print: If series prints dtype, I think it would be nice to also print level information if Series.values is a categorical. I will add such a commit next.

Re "Documentation" -> There should be two more entries:
[ ] Add a API change note about the group_agg and factor_agg
[ ] Add a API change note about Categorical.labels -> Categorical.codes

BTW: where should these API change notes go? v0.14.1.txt or should I open a new v0.15.0.txt?

jreback · 2014-07-01T15:16:21Z

rebase: I added a v0.15.0.txt file: 0b6ea39

jankatins · 2014-07-01T21:42:59Z

rebased and added a commit to print level information when printing a Series of type category. I'm now writing some release notes.

From my standpoint, only the groupby/pivot testcases and getting to_csv working remain (the latter is, as far as I can tell, a more general problem, as the slicer is used in a way which raises in categorical).

jreback · 2014-07-01T22:14:47Z

I am not 100% sure about printing the levels, maybe just print the number of levels?
otherwise then you would have to truncate the levels if they are too long

jankatins · 2014-07-01T22:27:59Z

Release notes added.

I will add truncating to the level footer

I'm also thinking of changing the guard in reorder_levels from "same items in levels" to "must include all old items". That would mean that you can add new (unused) levels on reordering.

Current:

# Rename the levels and simultaneously add the missing levels at the end
df["grade"].cat.levels = ["very good", "good", "very bad", "bad", "medium"]
# Reorder the levels
df["grade"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])

Then:

# Rename the levels 
df["grade"].cat.levels = ["very good", "good", "very bad"]
# Reorder the levels and simultaneously add the missing levels
df["grade"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])

Thoughts?
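
The proposed guard can be sketched in pure Python as follows. This is an assumption-laden illustration, not the code in the PR: the function `reorder` is hypothetical and models a categorical as a list of integer codes plus a list of levels.

```python
def reorder(codes, old_levels, new_levels):
    """Reorder levels; new levels may be added, but no old level may be dropped."""
    if len(set(new_levels)) != len(new_levels):
        raise ValueError("levels must be unique")
    missing = [level for level in old_levels if level not in new_levels]
    if missing:
        raise ValueError("new levels must include all old levels, "
                         "missing: %r" % (missing,))
    # Remap the codes so that every element keeps its value.
    new_position = {level: i for i, level in enumerate(new_levels)}
    new_codes = [new_position[old_levels[code]] for code in codes]
    return new_codes, list(new_levels)


old_levels = ["very good", "good", "very bad"]
codes = [0, 1, 2, 0]   # very good, good, very bad, very good

new_codes, new_levels = reorder(
    codes, old_levels,
    ["very bad", "bad", "medium", "good", "very good"])
print(new_codes)

try:
    reorder(codes, old_levels, ["very bad", "good"])  # drops "very good"
    dropped_ok = True
except ValueError:
    dropped_ok = False
```

Under the old guard ("same items in levels") the first call would be rejected, because "bad" and "medium" are new; under the relaxed guard only the second call fails.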

jankatins · 2014-07-02T15:01:50Z

Added the new reorder_levels feature. Now you can add levels both with assigning to levels and with reorder_levels

jreback · 2014-07-02T18:41:11Z

updated

jankatins · 2014-07-02T21:06:41Z

@jreback I've added a different solution to the "max level problem": b606c24

jreback · 2014-07-02T21:09:46Z

@JanSchulz gr8! that looks better than what I was doing

jreback · 2014-07-02T21:19:26Z

In [4]: pd.Series(pd.Categorical(list('abcdefghijklmno'), name="a"))
Out[4]: 
0     a
1     b
2     c
3     d
4     e
5     f
6     g
7     h
8     i
9     j
10    k
11    l
12    m
13    n
14    o
Levels (15, ordered, object): a < b < c < d < e < ... < k < l < m < n < o,
Name: a

How about something like this for the last line (more like how you do unordered)? Then they are the same except for the truncation separator.

Levels (15, ordered, object): [a, b, c, d <<< k, l, m, n, o],
Name: a

jankatins · 2014-07-02T21:32:37Z

I actually modeled that after the way R prints factors:

> factor(c(5,2,3,4))
[1] 5 2 3 4
Levels: 2 3 4 5
> factor(c(5,2,3,4), ordered=T)
[1] 5 2 3 4
Levels: 2 < 3 < 4 < 5

If I change only the truncation separator, there is no difference in case the levels are not truncated, which IMO will mostly be the case (e.g. Likert scales have at most 7 levels + "no answer").

I like the [....] idea, as that will clearly show empty levels.

At one point I had the Series printout in the same order as the Categorical printout (currently the Categorical first prints Name, ... and then levels, and the Series is the other way around, like for frequencies). Any thoughts on this, or simply leave as is?

jreback · 2014-07-02T21:37:31Z

Hmm, try to conform to the Series as much as possible (it can be changed, but should be a deliberate action). For a future PR.

I think the R way of printing ordered factors is just too verbose (you can just use '...' as the truncation separator for both); after all, it DOES say ordered/unordered already.

jankatins · 2014-07-03T14:15:32Z

Actually, < is used both during truncating and when all levels are shown:

In[3]: import pandas as pd
In[4]: pd.Series(pd.Categorical([1,2,3,4,5,6]))
Out[4]: 
0    1
1    2
2    3
3    4
4    5
5    6
Levels (6, ordered, int64): 1 < 2 < 3 < 4 < 5 < 6,

I like "<" because it clearly shows which way the levels are ordered: reading Levels (6, ordered, int64): 6, 5, 4, 3, 2, 1 would not show that in this case the lexical order is reversed.

My suggestion would be to change truncating to a < b < c < d ... l < m < n < o (i.e. s/ < ... < / ... /) and reduce display.max_levels to 8. [edit] One could also remove the "ordered" word in the parentheses.[/]

[Edit: yikes, a trailing ,. I'm starting to hate display formatting :-( ]
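
A small sketch of the suggested footer format, assuming the truncation keeps the first and last `max_levels // 2` levels. The function `format_levels` and that split rule are assumptions made for this illustration, not the code from the PR.

```python
def format_levels(levels, max_levels=8):
    """Join ordered levels with ' < ', truncating the middle with ' ... '."""
    labels = [str(level) for level in levels]
    if len(labels) <= max_levels:
        return " < ".join(labels)
    keep = max_levels // 2
    head = " < ".join(labels[:keep])
    tail = " < ".join(labels[-keep:])
    return head + " ... " + tail


truncated = format_levels(list("abcdefghijklmno"))
full = format_levels([1, 2, 3, 4, 5, 6])
print(truncated)
print(full)
```

With 15 levels and `max_levels=8` this yields the `a < b < c < d ... l < m < n < o` form from the suggestion, while six levels are printed in full.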

jreback · 2014-07-03T14:33:32Z

@JanSchulz that looks fine, and yes I would reduce max_levels
and removing ordered/unordered makes it a bit shorter; I like that too

jreback · 2014-07-07T19:17:13Z

@JanSchulz http://stackoverflow.com/questions/24617793/scikit-learn-categorical-variables-in-regression

GH3943, GH5313, GH5314, GH7444

ENH: delegate _reduction and ops from Series to the categorical to support min/max and raise TypeError on other ops (numerical) and reduction

Add Categorical Properties to Series

Default to 'ordered' Categoricals if values are ordered

Categorical: add level assignments and reordering + changed default for ordered

Add a `Categorical.reorder_levels()` method. Change some naming in `Series`, so that the methods do not clash with established standards and rename the other categorical methods accordingly. Also change the default for `ordered` to True if values + levels are passed in at creation time.

Initial doc version for working with Categorical data

Categorical: add Categorical.mode() and use that in Series.mode()

Categorical: implement remove_unused_levels()

Categorical: implement value_count() for categorical series

Categorical: make Series.astype("category") work

ENH: add setitem to Categorical

BUG: assigning to levels not in level set now raises ValueError

API: disallow numpy ufuncs with categoricals

Categorical: Categorical assignment to int/obj column

ENH: add support for fillna to Categoricals

API: deprecate old style categorical constructor usage and change default

Before it was possible to pass in precomputed labels/pointer and the corresponding levels (e.g.: `Categorical([0,1,2], levels=["a","b","c"])`). This could lead to subtle errors in case of integer categoricals: the following could be both interpreted as "precomputed pointers and levels" or "values and levels", but converting it back to an integer array would result in different arrays:

`np.array(Categorical([1,2], levels=[1,2,3]))`
interpreted as pointers: `[2,3]`
interpreted as values: `[1,2]`

Up to now we would favour old style "pointer and levels" if these values could be interpreted as such (see code for details...). With this commit we favour new style "values and levels" and only attempt to interpret them as "pointers and levels" if "compat=True" is passed to the constructor.

BREAKS: This will break code which uses Categoricals with "pointer and levels". A short google search and a search on stackoverflow revealed no such usage.

Categorical: document constructor changes and small fixes

Categorical: document that inappropriate numpy functions won't work anymore

ENH: concat support
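
The ambiguity described in the commit message can be reproduced with plain lists. This sketch does not call the pandas constructor; it only shows the two readings of the same input, with illustrative variable names.

```python
levels = [1, 2, 3]
data = [1, 2]

# Old style: the data are positions ("pointers") into the levels.
as_pointers = [levels[position] for position in data]

# New style: the data are the values themselves, which must be levels.
assert all(value in levels for value in data)
as_values = list(data)

print(as_pointers)
print(as_values)
```

Both readings are valid for this input, yet they yield different arrays, which is why a constructor has to pick one interpretation as the default.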


jreback · 2014-07-14T21:46:02Z

closed by #7217

jankatins mentioned this pull request Jun 12, 2014

WIP: categoricals as an internal CategoricalBlock GH5313 #7217

Merged

31 tasks

jreback reviewed Jun 12, 2014

jreback added Categorical labels Jun 12, 2014

jreback added this to the 0.15.0 milestone Jun 12, 2014

jreback modified the milestones: 0.15.0, 0.15.1 Jul 6, 2014

jreback and others added 10 commits July 9, 2014 18:56

Categorical: Thanks for Jan Schulz for much of the work on Categoricals

b8972bb

Doc: Add Release notes for pandas-dev#7217

DOC: update v0.15.0 notes

e474c68

Categorical: .codes should be immutable

89c42a3

ERR: codes modification raises ValueError always Categorical: use Categorical.from_codes() in a few places Categorical: Fix assigning a Categorical to an existing string column

CLN: CategoricalDtype repr now yields category

97aece1

DISPLAY: show dtype when displaying Categorical series (for consistency)

BUG: fix groupby with multiple non-compressed categoricals

6fdb400

Categorical: minor doc cleanups

7788f46

ENH: add a metaclass to CategoricalDtype to provide issubclass support (for select_dtypes)

e6abfcf

TST: io/pytables.py tests now raise NotImplementedError for dtype==category

2a3278c

DOC: document the new category dtype in select_dtypes

b96cf3c

jreback closed this Jul 14, 2014
