Commit 8020bf5

DOC: whatsnew and docs for multiple accessors
Also add some docs in text.rst to mention the performance gains when using ``s_cat.str`` vs ``s.str``.
1 parent a7c65ed commit 8020bf5

3 files changed: +62 -0 lines changed

doc/source/categorical.rst

+44
@@ -515,6 +515,50 @@ To get a single value `Series` of type ``category`` pass in a list with a single

     df.loc[["h"],"cats"]

+String and datetime accessors
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. versionadded:: 0.17.1
+
+The accessors ``.dt`` and ``.str`` will work if the ``s.cat.categories`` are of an appropriate
+type:
+
+
+.. ipython:: python
+
+    str_s = pd.Series(list('aabb'))
+    str_cat = str_s.astype('category')
+    str_cat.str.contains("a")
+
+    date_s = pd.Series(pd.date_range('1/1/2015', periods=5))
+    date_cat = date_s.astype('category')
+    date_cat.dt.day
+
+.. note::
+
+    The returned ``Series`` (or ``DataFrame``) is of the same type as if you used the
+    ``.str.<method>`` / ``.dt.<method>`` on a ``Series`` of that type (and not of
+    type ``category``!).
+
+This means that the values returned by methods and properties on the accessors of a
+``Series`` are equal to those returned by the same methods and properties on that
+``Series`` converted to type ``category``:
+
+.. ipython:: python
+
+    ret_s = str_s.str.contains("a")
+    ret_cat = str_cat.str.contains("a")
+    ret_s.dtype == ret_cat.dtype
+    ret_s == ret_cat
+
+.. note::
+
+    The work is done on the ``categories`` and then a new ``Series`` is constructed. This has
+    performance implications if you have a ``Series`` of type string in which many elements
+    are repeated (i.e. the number of unique elements in the ``Series`` is much smaller than the
+    length of the ``Series``). In this case it can be faster to convert the original ``Series``
+    to one of type ``category`` and use ``.str.<method>`` or ``.dt.<property>`` on that.
+
 Setting
 ~~~~~~~

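The second note in this hunk says the work is done on the ``categories`` and a new ``Series`` is then constructed. A minimal sketch of that idea follows, using illustrative data that is not part of the commit and assuming a recent pandas with no missing values in the ``Series``. It is only an approximation of the mechanism, not the actual pandas implementation:

```python
import pandas as pd

# Illustrative data (an assumption, not from the commit):
# only three unique strings, each repeated many times.
s = pd.Series(["apple", "banana", "cherry"] * 1000)
cat = s.astype("category")

# Step 1: run the string method once per unique category (3 values here).
per_category = pd.Series(cat.cat.categories).str.upper().to_numpy()

# Step 2: map every row to its category's result via the integer codes.
# This assumes no missing values, which would show up as code -1.
manual = pd.Series(per_category[cat.cat.codes.to_numpy()], index=cat.index)

via_accessor = cat.str.upper()  # accessor on the categorical Series
plain = s.str.upper()           # accessor on the plain string Series
```

Under these assumptions, ``manual``, ``via_accessor`` and ``plain`` hold the same values, and ``via_accessor`` is not of type ``category``. The string method is applied to three categories rather than to 3000 elements.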

doc/source/text.rst

+16
@@ -63,6 +63,22 @@ and replacing any remaining whitespaces with underscores:
     df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
     df

+.. note::
+
+    If you do a lot of string munging and have a ``Series`` in which many elements are repeated
+    (i.e. the number of unique elements in the ``Series`` is much smaller than the length of the
+    ``Series``), it can be faster to convert the original ``Series`` to one of type
+    ``category`` and then use ``.str.<method>`` or ``.dt.<property>`` on that. The
+    performance difference comes from the fact that, for a ``Series`` of type ``category``, the
+    string operations are done on the ``.categories`` and not on each element of the
+    ``Series``. Please note that a ``Series`` of type ``category`` with string ``.categories`` has
+    some limitations compared to a ``Series`` of type string (e.g. you can't add strings to
+    each other: ``s + " " + s`` won't work if ``s`` is a ``Series`` of type ``category``). Also,
+    ``.str`` methods which operate on elements of type ``list`` are not available on such a
+    ``Series``. If you are interested in having these performance gains on all string ``Series``,
+    please look at `this bug report <https://github.com/pydata/pandas/issues/8640>`_.
+
+
 Splitting and Replacing Strings
 -------------------------------

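The note added to ``text.rst`` mentions that ``s + " " + s`` does not work on a ``Series`` of type ``category``. A small sketch of that limitation follows, with sample data invented for illustration. The ``astype(str)`` step is just one possible workaround and is not something the commit prescribes:

```python
import pandas as pd

# Illustrative data (an assumption, not from the commit).
s = pd.Series(["a", "b", "a", "b"])
cat = s.astype("category")

# Concatenation works on the plain string Series.
joined = s + " " + s

# On the categorical Series the same expression is expected to raise
# TypeError, because categoricals do not define arithmetic operators.
try:
    cat + " " + cat
    add_failed = False
except TypeError:
    add_failed = True

# One possible workaround: convert back to strings before concatenating.
as_str = cat.astype(str)
joined_from_cat = as_str + " " + as_str
```

Converting back gives up the per-category speedup for that step, so it may make sense to do the ``.str`` work on the categorical first and convert only when concatenation is needed.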

doc/source/whatsnew/v0.17.1.txt

+2
@@ -65,6 +65,8 @@ Enhancements

     pd.Index([1, np.nan, 3]).fillna(2)

+- Series of type ``"category"`` now make the ``.str.<...>`` and ``.dt.<...>`` accessor methods / properties available if the categories are of an appropriate type. (:issue:`10661`)
+
 - ``pivot_table`` now has a ``margins_name`` argument so you can use something other than the default of 'All' (:issue:`3335`)

 .. _whatsnew_0171.api:
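The whatsnew entry makes the accessors conditional on the type of the categories. A short sketch of what that condition looks like in practice follows, assuming a recent pandas. ``has_accessor`` is a hypothetical helper written for this illustration and is not a pandas function:

```python
import pandas as pd

# Illustrative categorical Series with string, datetime and integer categories.
str_cat = pd.Series(list("aabb")).astype("category")
date_cat = pd.Series(pd.date_range("2015-01-01", periods=3)).astype("category")
int_cat = pd.Series([1, 2, 1]).astype("category")


def has_accessor(series, name):
    """Hypothetical helper: True if `series.<name>` can be accessed."""
    try:
        getattr(series, name)
        return True
    except AttributeError:
        # pandas raises AttributeError when the accessor does not apply.
        return False


availability = {
    "str_cat.str": has_accessor(str_cat, "str"),
    "str_cat.dt": has_accessor(str_cat, "dt"),
    "date_cat.dt": has_accessor(date_cat, "dt"),
    "date_cat.str": has_accessor(date_cat, "str"),
    "int_cat.str": has_accessor(int_cat, "str"),
}
```

Only ``str_cat.str`` and ``date_cat.dt`` should be available here. The other combinations raise ``AttributeError`` because the categories are not of a matching type.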
