Make .str/.dt available for Series of type category with string/datetime #11582

Closed
44 changes: 44 additions & 0 deletions doc/source/categorical.rst
@@ -515,6 +515,50 @@ To get a single value `Series` of type ``category`` pass in a list with a single

    df.loc[["h"], "cats"]

String and datetime accessors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.17.1

The ``.str`` and ``.dt`` accessors work on a ``Series`` of type ``category`` if its
``s.cat.categories`` are of an appropriate type (string or datetime-like):


.. ipython:: python

    str_s = pd.Series(list('aabb'))
    str_cat = str_s.astype('category')
    str_cat.str.contains("a")

    date_s = pd.Series(pd.date_range('1/1/2015', periods=5))
    date_cat = date_s.astype('category')
    date_cat.dt.day

.. note::

    The returned ``Series`` (or ``DataFrame``) has the same type as if you had used
    ``.str.<method>`` / ``.dt.<method>`` on a ``Series`` of the underlying type; it is
    not of type ``category``.

This means that the values returned from methods and properties on the accessors of a
``Series`` are equal to the values returned from the same methods and properties on that
``Series`` converted to type ``category``:

.. ipython:: python

    ret_s = str_s.str.contains("a")
    ret_cat = str_cat.str.contains("a")
    ret_s.dtype == ret_cat.dtype
    ret_s == ret_cat

.. note::

    The work is done on the ``categories`` and then a new ``Series`` is constructed. This has
    some performance implications if you have a ``Series`` of type string, where lots of elements
Contributor

This is still not correct from a user perspective. What you want / need to know is that you are NOT getting a categorical type back.

The fact that this can equally be done ONLY on the categories, followed by manually constructing a new categorical, is important, but it should be kept separate from this.

Member

I think this is explained in the text above the example?
But I agree it can maybe be repeated explicitly in the note, as this note is what stands out from the text

Contributor Author

I wanted to have this note only on the performance implications, nothing to do with the return type. IMO this could also go into a note in the string method docs.

Contributor

Yes, I think this would be great in the docstrings as well (you can do a follow-up for that if you'd like).

Contributor Author

OK, will add the same paragraph on top of text.rst.

    are repeated (i.e. the number of unique elements in the ``Series`` is a lot smaller than the
    length of the ``Series``). In this case it can be faster to convert the original ``Series``
    to one of type ``category`` and use ``.str.<method>`` or ``.dt.<property>`` on that.
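
The performance claim in this note can be checked with a rough timing sketch. The data below (four hypothetical strings repeated many times) and the choice of ``str.upper`` are assumptions made for illustration only; actual timings depend on the pandas version and the data, so no figures are quoted here.

```python
import timeit

import pandas as pd

# Hypothetical data: only 4 unique strings in a Series of 100,000 elements.
str_s = pd.Series(["alpha", "beta", "gamma", "delta"] * 25_000)
str_cat = str_s.astype("category")

# The same accessor call on both Series.
ret_s = str_s.str.upper()
ret_cat = str_cat.str.upper()

# The values agree, and the categorical input does not produce a categorical result.
same_values = bool((ret_s == ret_cat).all())
result_is_categorical = isinstance(ret_cat.dtype, pd.CategoricalDtype)

# Rough timings; the categorical version only has to process the categories.
t_plain = timeit.timeit(lambda: str_s.str.upper(), number=3)
t_cat = timeit.timeit(lambda: str_cat.str.upper(), number=3)
print(f"plain strings: {t_plain:.4f}s, category: {t_cat:.4f}s")
```

Whether the conversion pays off also depends on the cost of ``astype('category')`` itself, which this sketch leaves outside the timed region.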

Setting
~~~~~~~

16 changes: 16 additions & 0 deletions doc/source/text.rst
@@ -63,6 +63,22 @@ and replacing any remaining whitespaces with underscores:
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
    df

.. note::

    If you do a lot of string munging and have a ``Series`` in which many elements are repeated
    (i.e. the number of unique elements is much smaller than the length of the ``Series``), it
    can be faster to convert the original ``Series`` to type ``category`` and then use
    ``.str.<method>`` or ``.dt.<property>`` on that. The performance difference arises because,
    for a ``Series`` of type ``category``, the string operations are done on the ``.categories``
    and not on each element of the ``Series``.

    Please note that a ``Series`` of type ``category`` with string ``.categories`` has some
    limitations compared to a ``Series`` of type string (e.g. you can't add strings to each
    other: ``s + " " + s`` won't work if ``s`` is a ``Series`` of type ``category``). Also,
    ``.str`` methods which operate on elements of type ``list`` are not available on such a
    ``Series``. If you are interested in having these performance gains on all string
    ``Series``, please look at `this bug report <https://github.com/pydata/pandas/issues/8640>`_.
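
The concatenation limitation mentioned in this note can be illustrated with a small sketch. Converting back with ``astype(object)`` is one possible workaround chosen here for illustration; it is not something the note itself prescribes, and the exception type caught below is an assumption about current pandas behaviour.

```python
import pandas as pd

s = pd.Series(list("aabb"))
cat_s = s.astype("category")

# Adding strings works on the plain string Series ...
plain_joined = s + " " + s

# ... but is expected to fail on the categorical Series
# (assumption: pandas raises TypeError here).
try:
    cat_s + " " + cat_s
    concat_failed = False
except TypeError:
    concat_failed = True

# One possible workaround: convert back to a non-categorical Series first.
as_obj = cat_s.astype(object)
joined = as_obj + " " + as_obj
```

The workaround gives up the per-category speed-up for that operation, so it is mainly useful when concatenation is needed only occasionally.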


Splitting and Replacing Strings
-------------------------------

2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.17.1.txt
@@ -65,6 +65,8 @@ Enhancements

    pd.Index([1, np.nan, 3]).fillna(2)

- ``Series`` of type ``"category"`` now make the ``.str.<...>`` and ``.dt.<...>`` accessor methods / properties available if the categories are of an appropriate type (string or datetime-like). (:issue:`10661`)

- ``pivot_table`` now has a ``margins_name`` argument so you can use something other than the default of 'All' (:issue:`3335`)

.. _whatsnew_0171.api:
2 changes: 0 additions & 2 deletions pandas/core/series.py
@@ -2704,12 +2704,10 @@ def _dir_deletions(self):

    def _dir_additions(self):
        rv = set()
        # these accessors are mutually exclusive, so break loop when one exists
        for accessor in self._accessors:
            try:
                getattr(self, accessor)
                rv.add(accessor)
                break
            except AttributeError:
                pass
        return rv
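
The hunk header reports two deletions and no additions in this function. Presumably the deleted lines are the "mutually exclusive" comment and the ``break``, because a categorical ``Series`` with string categories now exposes both ``.cat`` and ``.str``. The sketch below checks the user-visible effect through ``dir()``; it assumes a recent pandas in which ``dir()`` still lists only the accessors that apply.

```python
import pandas as pd

str_cat = pd.Series(list("aabb")).astype("category")

# dir() should list every accessor that applies to this Series,
# not just the first one found.
names = dir(str_cat)
has_cat = "cat" in names
has_str = "str" in names
has_dt = "dt" in names  # string categories are not datetime-like

print(has_cat, has_str, has_dt)
```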