Commit 6eefe75

Merge branch 'JanSchulz-dt_str_accessor_for_cats'
2 parents 89f46a5 + 7a82ee3 commit 6eefe75

9 files changed, +331 -69 lines changed

doc/source/categorical.rst

+44
@@ -515,6 +515,50 @@ To get a single value `Series` of type ``category`` pass in a list with a single
515515
516516
df.loc[["h"],"cats"]
517517
518+
String and datetime accessors
519+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
520+
521+
.. versionadded:: 0.17.1
522+
523+
The accessors ``.dt`` and ``.str`` will work if the ``s.cat.categories`` are of an appropriate
524+
type:
525+
526+
527+
.. ipython:: python
528+
529+
str_s = pd.Series(list('aabb'))
530+
str_cat = str_s.astype('category')
531+
str_cat.str.contains("a")
532+
533+
date_s = pd.Series(date_range('1/1/2015', periods=5))
534+
date_cat = date_s.astype('category')
535+
date_cat.dt.day
536+
537+
.. note::
538+
539+
The returned ``Series`` (or ``DataFrame``) is of the same type as if you used the
540+
``.str.<method>`` / ``.dt.<method>`` on a ``Series`` of that type (and not of
541+
type ``category``!).
542+
543+
That means, that the returned values from methods and properties on the accessors of a
544+
``Series`` and the returned values from methods and properties on the accessors of this
545+
``Series`` transformed to one of type ``category`` will be equal:
546+
547+
.. ipython:: python
548+
549+
ret_s = str_s.str.contains("a")
550+
ret_cat = str_cat.str.contains("a")
551+
ret_s.dtype == ret_cat.dtype
552+
ret_s == ret_cat
553+
554+
.. note::
555+
556+
The work is done on the ``categories`` and then a new ``Series`` is constructed. This has
557+
some performance implication if you have a ``Series`` of type string, where lots of elements
558+
are repeated (i.e. the number of unique elements in the ``Series`` is a lot smaller than the
559+
length of the ``Series``). In this case it can be faster to convert the original ``Series``
560+
to one of type ``category`` and use ``.str.<method>`` or ``.dt.<property>`` on that.
561+
518562
Setting
519563
~~~~~~~
520564
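
The documentation added in this hunk can be tried outside the Sphinx build with a small standalone script. This is a minimal sketch, assuming a reasonably recent pandas is installed; the variable names follow the hunk above, while `same_dtype`, `same_values`, and `days` are names introduced here only for illustration.

```python
import pandas as pd

# String accessor on a categorical Series with string categories.
str_s = pd.Series(list("aabb"))
str_cat = str_s.astype("category")

ret_s = str_s.str.contains("a")
ret_cat = str_cat.str.contains("a")

# The result of the accessor is not categorical; it has the same dtype
# and the same values as the result on the plain Series.
same_dtype = bool(ret_s.dtype == ret_cat.dtype)
same_values = bool((ret_s == ret_cat).all())

# Datetime accessor on a categorical Series with datetime categories.
date_s = pd.Series(pd.date_range("1/1/2015", periods=5))
date_cat = date_s.astype("category")
days = date_cat.dt.day.tolist()

print(same_dtype, same_values, days)
```

Note that the script calls `pd.date_range` explicitly, whereas the hunk uses a bare `date_range`, which presumably relies on names imported elsewhere in the documentation build.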

doc/source/text.rst

+17
@@ -63,6 +63,23 @@ and replacing any remaining whitespaces with underscores:
6363
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
6464
df
6565
66+
.. note::
67+
68+
If you have a ``Series`` where lots of elements are repeated
69+
(i.e. the number of unique elements in the ``Series`` is a lot smaller than the length of the
70+
``Series``), it can be faster to convert the original ``Series`` to one of type
71+
``category`` and then use ``.str.<method>`` or ``.dt.<property>`` on that.
72+
The performance difference comes from the fact that, for ``Series`` of type ``category``, the
73+
string operations are done on the ``.categories`` and not on each element of the
74+
``Series``.
75+
76+
Please note that a ``Series`` of type ``category`` with string ``.categories`` has
77+
some limitations in comparison to a ``Series`` of type string (e.g. you can't add strings to
78+
each other: ``s + " " + s`` won't work if ``s`` is a ``Series`` of type ``category``). Also,
79+
``.str`` methods which operate on elements of type ``list`` are not available on such a
80+
``Series``.
81+
82+
6683
Splitting and Replacing Strings
6784
-------------------------------
6885
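
The performance note in this hunk says that string operations on a categorical ``Series`` are done on the ``.categories`` rather than on each element. The sketch below illustrates that idea by hand; it is not the pandas implementation, and the sample data and the names `manual`, `via_accessor`, and `matches` are assumptions made for the example. It also assumes the ``Series`` has no missing values, so every code is a valid position in the categories.

```python
import pandas as pd

# Many rows, very few distinct values.
s = pd.Series(["apple", "banana"] * 1000)
cat = s.astype("category")

n_rows = len(cat)
n_unique = len(cat.cat.categories)

# Apply the string method once per category (2 calls instead of 2000) ...
upper_categories = cat.cat.categories.str.upper()

# ... then expand back to full length using the integer codes.
codes = cat.cat.codes.to_numpy()
manual = pd.Series(upper_categories.take(codes).to_numpy(), index=cat.index)

# The accessor on the categorical Series gives the same values.
via_accessor = cat.str.upper()
matches = manual.tolist() == via_accessor.tolist() == s.str.upper().tolist()

print(n_rows, n_unique, matches)
```

Whether the categorical route is actually faster depends on the ratio of unique values to rows and on the cost of the conversion itself, so it is worth timing on real data.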

doc/source/whatsnew/v0.17.1.txt

+2
@@ -65,6 +65,8 @@ Enhancements
6565

6666
pd.Index([1, np.nan, 3]).fillna(2)
6767

68+
- Series of type ``"category"`` now make ``.str.<...>`` and ``.dt.<...>`` accessor methods / properties available, if the categories are of that type. (:issue:`10661`)
69+
6870
- ``pivot_table`` now has a ``margins_name`` argument so you can use something other than the default of 'All' (:issue:`3335`)
6971
- Implement export of ``datetime64[ns, tz]`` dtypes with a fixed HDF5 store (:issue:`11411`)
7072

pandas/core/series.py

-2
@@ -2706,12 +2706,10 @@ def _dir_deletions(self):
27062706

27072707
def _dir_additions(self):
27082708
rv = set()
2709-
# these accessors are mutually exclusive, so break loop when one exists
27102709
for accessor in self._accessors:
27112710
try:
27122711
getattr(self, accessor)
27132712
rv.add(accessor)
2714-
break
27152713
except AttributeError:
27162714
pass
27172715
return rv
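
The effect of dropping the `break` can be seen in isolation with a toy class. This is a hedged sketch: `FakeSeries`, its `available` argument, and the `stop_at_first` flag are hypothetical names invented for this illustration and are not part of pandas. The loop body mirrors the hunk above.

```python
class FakeSeries:
    """Hypothetical stand-in for a Series; only some accessors resolve."""

    # A tuple (not a set) so that iteration order is fixed in this demo.
    _accessors = ("cat", "str", "dt")

    def __init__(self, available):
        self._available = frozenset(available)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name in FakeSeries._accessors and name in self._available:
            return object()
        raise AttributeError(name)

    def dir_additions(self, stop_at_first=False):
        rv = set()
        for accessor in self._accessors:
            try:
                getattr(self, accessor)
                rv.add(accessor)
                if stop_at_first:
                    # Old behaviour: assume accessors are mutually exclusive.
                    break
            except AttributeError:
                pass
        return rv


# A categorical Series with string categories exposes both .cat and .str.
fake = FakeSeries({"cat", "str"})
old = fake.dir_additions(stop_at_first=True)
new = fake.dir_additions()
print(sorted(old), sorted(new))
```

With the early exit, only the first accessor that resolves is reported; without it, every accessor that resolves is reported, which matters once `.cat` and `.str` (or `.dt`) can coexist on the same object.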
