Make .str/.dt available for Series of type category with string/datetime #11582

jankatins · 2015-11-12T13:21:27Z

If a series is of type category and the underlying Categorical has categories of type string or datetime, then make it possible to use the .str/.dt accessor on such a series.

The string/dt methods work on the categories (and are therefore fast if we have
only a few categories), but return a Series with a dtype other than
category (integer, boolean, string, ...), so that it makes no difference whether we use
.str / .dt on a series of type string/datetime or of type category.

The main reason for that is that I think things like s.str.slice(...) + s.str.slice(...) should work.
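
A minimal sketch of that expectation, assuming a pandas version that includes this change (the variable names are only illustrative). Because the accessor returns ordinary, non-category results, the two slices can be combined with `+` exactly as on a plain string Series:

```python
import pandas as pd

s = pd.Series(list("aabb"))
c = s.astype("category")

# Concatenate two slices of each value; on a plain string Series this just works.
plain = s.str.slice(0, 1) + s.str.slice(0, 1)

# The same expression on the categorical Series: the .str results are not of
# dtype category, so "+" is ordinary string concatenation here as well.
from_cat = c.str.slice(0, 1) + c.str.slice(0, 1)

same_values = from_cat.tolist() == plain.tolist()
is_categorical = isinstance(from_cat.dtype, pd.CategoricalDtype)
```

If the accessor returned a categorical instead, the `+` would not be a plain string concatenation, which is the case this PR wants to avoid.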

Closes: #10661

jankatins · 2015-11-12T13:56:15Z

I've no idea why .cat is not in dir(series_of_type_category)?!?

kawochen · 2015-11-12T20:59:38Z

@JanSchulz
https://github.com/pydata/pandas/blob/master/pandas/core/series.py#L2675

jankatins · 2015-11-12T23:42:03Z

@kawochen Thanks! 7a20c36

jreback · 2015-11-12T23:46:24Z

pandas/core/base.py

-            # but that isn't practical for performance reasons until we have a
+        if isinstance(self, Series) and not(
+                    (com.is_categorical_dtype(self.dtype) and
+                     com.is_object_dtype(self.values.categories)) or


use self.cat.categories, because otherwise you could be forcing a conversion here

Can do, but as I understand the code, I made sure that it is a categorical before using the categories on values: com.is_categorical_dtype(self.dtype) and com.is_object_dtype(self.values.categories)

jankatins · 2015-11-13T01:18:44Z

Yay, it's green :-)

jreback · 2015-11-13T13:17:59Z

pandas/tests/test_categorical.py

+        # https://github.com/pydata/pandas/issues/10661
+        from pandas.core.strings import StringMethods
+        s = Series(list('aabb'))
+        c = s.astype('category')


much better to have a loop here, that just looks at all of the StringMethods and tests against them, rather than writing out each one. (of course prob will need some special cases).

I'm not so sure: you would need to replace these six 3-liners with a much harder-to-read list:

tests = [("method", (args), {"kwarg":"whatever"}),...]

I still don't think it should be necessary to build tests for all string methods. As far as I can tell, I covered all code paths (_wrap_result and _wrap_result_expand).
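
For reference, the list-driven variant discussed above could look roughly like this. It is only a sketch: the entries in `cases` are illustrative and are not the list used in the PR, and it compares plain values instead of using the pandas test helpers.

```python
import pandas as pd

s = pd.Series(list("aabb"))
c = s.astype("category")

# (method name, positional args, keyword args); illustrative entries only.
cases = [
    ("upper", (), {}),
    ("len", (), {}),
    ("contains", ("a",), {}),
    ("startswith", ("a",), {}),
    ("slice", (0, 1), {}),
    ("pad", (3,), {"side": "left", "fillchar": "x"}),
]

results = {}
for name, args, kwargs in cases:
    # Call the same method on the plain and on the categorical Series.
    expected = getattr(s.str, name)(*args, **kwargs)
    actual = getattr(c.str, name)(*args, **kwargs)
    results[name] = actual.tolist() == expected.tolist()
```

Each tuple replaces one hand-written three-line test, at the cost of the readability concern raised above.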

jreback · 2015-11-13T13:19:03Z

can you update the doc as well (maybe a reference / example in the Categorical section to one in the TextMethods section)

jankatins · 2015-11-13T14:25:24Z

Added docs in categorical.rst

jankatins · 2015-11-15T21:03:47Z

Ok, @jreback, you were right, it needed more tests...

Rebased + new tests + fixes... Let's wait and see what Travis says...

jreback · 2015-11-16T12:19:35Z

pandas/tests/test_categorical.py

+        # Handcrafted by
+        # s = pandas.Series(list("abcd"))
+        # t = "('%s', (), {})"
+        # functs = [t % f for f in dir(s.str) if not f.startswith("_")]


so instead of listing these out (the list would be instantly outdated if something else is added),

create a list of the ones that need special arguments, then programmatically create the rest
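
One way to read this suggestion, as a hedged sketch rather than the code that ended up in the PR: keep a small dict of methods that need arguments (`special` below is hypothetical and its entries are illustrative), discover everything else from `dir(s.str)`, and use `inspect.signature` to decide whether a discovered method can be called without arguments. The helper names `needs_no_args` and `to_plain` are made up for this example.

```python
import inspect

import pandas as pd

s = pd.Series(list("abcd"))
c = s.astype("category")

# Hypothetical special cases: methods that need arguments, with the
# (args, kwargs) to call them with. Anything not listed here is discovered
# from dir(s.str) instead of being written out by hand.
special = {
    "cat": ((list("zyxw"),), {"sep": ","}),
    "contains": (("a",), {}),
    "slice": ((0, 1), {}),
    "startswith": (("a",), {}),
}


def needs_no_args(func):
    """Return True if every parameter of func has a default value."""
    params = inspect.signature(func).parameters.values()
    return all(
        p.default is not p.empty or p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
        for p in params
    )


def to_plain(result):
    """Reduce a Series/DataFrame/scalar result to plain Python objects."""
    if isinstance(result, pd.DataFrame):
        return result.values.tolist()
    if isinstance(result, pd.Series):
        return result.tolist()
    return result


compared, skipped, mismatches = [], [], []
for name in (n for n in dir(s.str) if not n.startswith("_")):
    method = getattr(s.str, name)
    if not callable(method):
        continue
    if name in special:
        args, kwargs = special[name]
    elif needs_no_args(method):
        args, kwargs = (), {}
    else:
        skipped.append(name)  # would need an entry in `special`
        continue
    expected = to_plain(method(*args, **kwargs))
    actual = to_plain(getattr(c.str, name)(*args, **kwargs))
    compared.append(name)
    if actual != expected:
        mismatches.append(name)
```

A newly added string method is then either picked up automatically or shows up in `skipped`, which makes the missing special case visible instead of silently untested.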

jankatins · 2015-11-16T13:58:37Z

Ok, two failures: https://travis-ci.org/pydata/pandas/builds/91286698, but I can't reproduce either error (on py3.5 and py2.7, and it seems I'm too stupid to get a proper py3.4 environment which can run `python setup.py build_ext --inplace` :-( ).

On py3.3/3.4:
nosetests pandas.io.tests.test_data

=====================================================================
ERROR: test_get_underlying_price (pandas.io.tests.test_data.TestYahooOptions)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/pydata/pandas/pandas/util/testing.py", line 1774, in wrapper
    return t(*args, **kwargs)
  File "/home/travis/build/pydata/pandas/pandas/io/tests/test_data.py", line 400, in test_get_underlying_price
    quote_price = options_object._underlying_price_from_root(root)
  File "/home/travis/build/pydata/pandas/pandas/io/data.py", line 737, in _underlying_price_from_root
    underlying_price = root.xpath('.//*[@class="time_rtq_ticker Fz-30 Fw-b"]')[0]\
IndexError: list index out of range

Py27, https://travis-ci.org/pydata/pandas/jobs/91286713
nosetests pandas.tseries.tests.test_timeseries

======================================================================
FAIL: test_datetime64_with_DateOffset (pandas.tseries.tests.test_timeseries.TestDatetimeIndex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/pydata/pandas/pandas/tseries/tests/test_timeseries.py", line 2668, in test_datetime64_with_DateOffset
    assert_func(klass([op + x for x in s]), op + s)
  File "/opt/python/2.7.9/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/travis/build/pydata/pandas/pandas/util/testing.py", line 2042, in assert_produces_warning
    % expected_warning.__name__)
AssertionError: Did not see expected warning of class 'PerformanceWarning'.

jreback · 2015-11-16T14:14:56Z

ignore the first one, 2nd I am fixing now.

jorisvandenbossche · 2015-11-17T10:22:54Z

doc/source/categorical.rst

+
+.. ipython:: python
+
+    str_s = Series(list('aabb'))


can you use pd.Series here instead of Series?

jorisvandenbossche · 2015-11-17T10:26:00Z

@JanSchulz Nice! I just added a few minor doc comments

jankatins · 2015-11-17T12:37:35Z

@jorisvandenbossche Addressed your comments. Thanks for the review!

jreback · 2015-11-17T13:13:45Z

doc/source/categorical.rst

+.. note::
+
+    The work is done on the ``categories`` and then a new ``Series`` is constructed. This has
+    some performance implication if you have a ``Series`` of type string, where lots of elements


This is still not correct from a user perspective. What you want / need to know is that you are NOT getting a categorical type back.

The fact that this can equally be done ONLY on the categories and manually constructing a new categorical is important, but should be separate from this.

I think this is explained in the text above the example?
But I agree it can maybe be repeated explicitly in the note, as this note is what stands out from the text

I wanted to have this note only on the performance implications, nothing to do with the return type. IMO this could also go into a note in the string method docs.

yes, I think this would be gr8 on the doc-strings as well (you can do a follow up) for that if you'd like

Ok, will add the same para on top of text.rst
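
The two points in this thread (work is done once per category, and the result is not categorical unless you build one yourself) can be sketched as follows. This assumes no missing values, so every code is a valid position in the categories; the data and variable names are illustrative.

```python
import pandas as pd

# Many rows, few distinct strings (no missing values, to keep the sketch simple).
s = pd.Series(["apple", "banana"] * 5)
c = s.astype("category")

# Work done once per category (2 strings) instead of once per row (10 strings).
upper_categories = c.cat.categories.str.upper()

# Expand back to one value per row via the integer codes.
manual = pd.Series(upper_categories.take(c.cat.codes.to_numpy()), index=c.index)

# What the accessor returns: the same values, and not of dtype category.
via_accessor = c.str.upper()

# If a categorical result is wanted, rename the categories explicitly instead.
kept_categorical = c.cat.rename_categories(list(upper_categories))
```

`manual` mirrors what the note describes, while `kept_categorical` is the separate "operate only on the categories and keep the categorical" route.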

jreback · 2015-11-17T13:20:20Z

couple comments. pls squash when finished.

jankatins · 2015-11-17T13:56:48Z

doc/source/categorical.rst

+The returned ``Series`` (or ``DataFrame``) is of the same type as if you used the
+``.str.<method>`` / ``.dt.<method>`` on a ``Series`` of that type. In the above case, the
+returned values of ``str_cat.str.contains("a")`` and ``str_s.str.contains("a")`` or
+``date_cat.dt.day`` and ``date_s.dt.day`` will be equal:


Would it be ok to make (half of) this paragraph into a note and insert an "(and not of type category)", like this:

.. note:: The returned ``Series`` (or ``DataFrame``) is of the same type as if you used the ``.str.<method>`` / ``.dt.<method>`` on a ``Series`` of that type (and not of type ``category``). That means, that in the above case, the returned values of ``str_cat.str.contains("a")`` and ``str_s.str.contains("a")`` or ``date_cat.dt.day`` and ``date_s.dt.day`` will be equal: ...
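
A small sketch of the equality the paragraph describes, reusing the names from the doc text (`str_s`, `str_cat`, `date_s`, `date_cat`). It compares values only and assumes a pandas version that includes this change:

```python
import pandas as pd

str_s = pd.Series(list("aabb"))
str_cat = str_s.astype("category")

date_s = pd.Series(pd.date_range("1/1/2015", periods=5))
date_cat = date_s.astype("category")

# Same values whether the accessor is used on the plain or the categorical Series.
contains_equal = str_cat.str.contains("a").tolist() == str_s.str.contains("a").tolist()
day_equal = date_cat.dt.day.tolist() == date_s.dt.day.tolist()

# Neither result is of dtype category.
result_dtypes = (str_cat.str.contains("a").dtype, date_cat.dt.day.dtype)
any_categorical = any(isinstance(d, pd.CategoricalDtype) for d in result_dtypes)
```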

jankatins · 2015-11-17T14:19:52Z

@jreback: please no squash: I really put some work into making each commit stand on its own, the unit of change is each commit (4 of them...) and not the complete PR.

jankatins · 2015-11-17T14:26:00Z

Ok, addressed the comments, please review the doc change if that is clearer now.

jreback · 2015-11-17T14:34:18Z

ok, lgtm. ping on green.

on the squash; ok for this one :) normally we just squashem as a matter of course.

jreback · 2015-11-17T15:00:46Z

@JanSchulz ok, ping when green (after any more doc changes you have)

If a series is of type category and the underlying Categorical has categories of type string, then make it possible to use the `.str` accessor on such a series. The string methods work on the categories (and are therefore fast if we have only a few categories), but return a Series with a dtype other than category (boolean, string, ...), so that it makes no difference whether we use `.str` on a series of type string or of type category.

If a series is of type category and the underlying Categorical has categories of type datetime, then make it possible to use the `.dt` accessor on such a series. The datetime methods work on the categories (and are therefore fast if we have only a few categories), but return a Series with a dtype other than category (integer, ...), so that it makes no difference whether we use `.dt` on a series of type datetime or of type category.

jankatins · 2015-11-17T15:10:47Z

Ok, added something to text.rst. Please review. It would also be good if someone with access to travis would cancel the other builds so that we have a few hours less to wait for this :-)

Also add some docs in text.rst to mention the performance gains when using ``s_cat.str`` vs ``s.str``.

jankatins · 2015-11-17T22:41:13Z

@jreback ping :-)

jreback · 2015-11-18T11:46:16Z

merged via 6eefe75

thanks!

as usual, pls review built docs for accuracy

jankatins · 2015-11-18T13:20:54Z

@jreback Thanks for all the reviews and for your patience! The PR got much better thanks to you and @jorisvandenbossche!

Will add a PR for this:

In [124]: date_s = pd.Series(date_range('1/1/2015', periods=5))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-124-f7c79f210480> in <module>()
----> 1 date_s = pd.Series(date_range('1/1/2015', periods=5))

NameError: name 'date_range' is not defined

jreback · 2015-11-18T13:39:25Z

@JanSchulz no thank you!

great PR as always from you.

I got the doc fix, was making some corrections anyhow.

jankatins mentioned this pull request Nov 12, 2015

API: Add str/dt accessors to categorical #10661

Closed

jankatins changed the title ~~Make .str available for Series of type category with strings~~ Make .str/.dt available for Series of type category with string/datetime Nov 12, 2015

jankatins force-pushed the dt_str_accessor_for_cats branch from 9047483 to c9d4a6c Compare November 12, 2015 23:41

jankatins mentioned this pull request Nov 12, 2015

Prevent adding new attributes to the accessors .str, .dt and .cat #11575

Merged

jreback reviewed Nov 12, 2015

jreback added the API Design, Categorical, and Enhancement labels Nov 12, 2015

jankatins force-pushed the dt_str_accessor_for_cats branch 2 times, most recently from 86c87e0 to d0a0cdd Compare November 13, 2015 00:15

jankatins force-pushed the dt_str_accessor_for_cats branch from d0a0cdd to 511e866 Compare November 13, 2015 10:11

jreback reviewed Nov 13, 2015

jankatins force-pushed the dt_str_accessor_for_cats branch from 511e866 to 41aaa1a Compare November 13, 2015 14:24

jankatins force-pushed the dt_str_accessor_for_cats branch from 41aaa1a to 0769c81 Compare November 15, 2015 21:02

jankatins force-pushed the dt_str_accessor_for_cats branch from 0769c81 to 4c97f9e Compare November 15, 2015 22:56

jreback reviewed Nov 16, 2015

jorisvandenbossche reviewed Nov 17, 2015

jankatins force-pushed the dt_str_accessor_for_cats branch from b622abd to 4496f82 Compare November 17, 2015 12:36

jreback reviewed Nov 17, 2015

jreback added this to the 0.17.1 milestone Nov 17, 2015

jankatins reviewed Nov 17, 2015

jankatins force-pushed the dt_str_accessor_for_cats branch from 4496f82 to 9ad797e Compare November 17, 2015 14:21

jankatins added 2 commits November 17, 2015 16:06

jankatins force-pushed the dt_str_accessor_for_cats branch from ce5dd6a to c7bb283 Compare November 17, 2015 15:08

DOC: whatsnew and docs for multiple accessors

8020bf5

Also add some docs in text.rst to mention the performance gains when using ``s_cat.str`` vs ``s.str``.

jankatins force-pushed the dt_str_accessor_for_cats branch from c7bb283 to 8020bf5 Compare November 17, 2015 15:14

jreback closed this Nov 18, 2015

nbonnotte mentioned this pull request Jan 26, 2016

CLN: Moving Series.rank and DataFrame.rank to generic.py #11924

Closed

jreback mentioned this pull request Jan 23, 2017

API: .str ops on category should return category if result is non-boolean #15198

Open
