-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Make .str/.dt available for Series of type category with string/datetime #11582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I've no idea why |
9047483
to
c9d4a6c
Compare
@kawochen Thanks! 7a20c36 |
# but that isn't practical for performance reasons until we have a | ||
if isinstance(self, Series) and not( | ||
(com.is_categorical_dtype(self.dtype) and | ||
com.is_object_dtype(self.values.categories)) or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use self.cat.categories
, because otherwise you could be forcing a conversion here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can do, but as I understand the code, I made sure that it is a categorical before using the categories on values
: com.is_categorical_dtype(self.dtype) and com.is_object_dtype(self.values.categories)
86c87e0
to
d0a0cdd
Compare
Yay, it's green :-) |
d0a0cdd
to
511e866
Compare
# https://github.com/pydata/pandas/issues/10661 | ||
from pandas.core.strings import StringMethods | ||
s = Series(list('aabb')) | ||
c = s.astype('category') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
much better to have a loop here, that just looks at all of the StringMethods and tests against them, rather than writing out each one. (of course prob will need some special cases).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not so sure: you would need to replace this 6 3liners by a much harder to read list:
tests = [("method", (args), {"kwarg":"whatever"}),...]
I still don't think it should be necessary to build tests for all string methods, as far as I can tell, I covered all code paths (_wrap_result
and _wrap_result_expand
)
can you update the doc as well (maybe a reference / example in the Categorical section to one in the TextMethods section) |
511e866
to
41aaa1a
Compare
Added docs in categorical.rst |
41aaa1a
to
0769c81
Compare
Ok, @jreback, you were right, it needed more tests... Rebased + new tests + fixes... Lets wait what travis says... |
0769c81
to
4c97f9e
Compare
# Handcrafted by | ||
# s = pandas.Series(list("abcd")) | ||
# t = "('%s', (), {})" | ||
# functs = [t % f for f in dir(s.str) if not f.startswith("_")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so instead of listing these out as this would be instantly out-dated if something else is added.
create a list of the ones that need special arguments, then programatically create the rest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Ok, two failures: https://travis-ci.org/pydata/pandas/builds/91286698, but I can't reproduce both errors (in py3.5 and py2.7, and it seems I'm too stupid to get a proper py3.4 environment which can run On py3.3/3.4:
Py27, https://travis-ci.org/pydata/pandas/jobs/91286713
|
ignore the first one, 2nd I am fixing now. |
|
||
.. ipython:: python | ||
|
||
str_s = Series(list('aabb')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use pd.Series
here instead of Series
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@JanSchulz Nice! I just added a few minor doc comments |
b622abd
to
4496f82
Compare
@jorisvandenbossche Addressed your comments. Thanks for the review! |
.. note:: | ||
|
||
The work is done on the ``categories`` and then a new ``Series`` is constructed. This has | ||
some performance implication if you have a ``Series`` of type string, where lots of elements |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still not correct from a user perspective. What you want / need to know is that you are NOT getting a categorical type back.
The fact that this can equally be done ONLY on the categories and manually constructing a new categorical is important ,but should be separate from this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is explained in the text above the example?
But I agree it can maybe be repeated explicitly in the note, as this note is what stands out from the text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to have this note only on the performance implications, nothing to do with the return type. IMO this could also go into a note in the string method docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think this would be gr8 on the doc-strings as well (you can do a follow up) for that if you'd like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will add the same para on to of text.rst
couple comments. pls squash when finished. |
The returned ``Series`` (or ``DataFrame``) is of the same type as if you used the | ||
``.str.<method>`` / ``.dt.<method>`` on a ``Series`` of that type. In the above case, the | ||
returned values of ``str_cat.str.contains("a")`` and ``str_s.str.contains("a")`` or | ||
``date_cat.dt.day`` and ``date_s.dt.day`` will be equal: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be ok to make (half of ) this paragraph into note and insert a (and not of type "category")
like this:
.. note::
The returned ``Series`` (or ``DataFrame``) is of the same type as if you used the
``.str.<method>`` / ``.dt.<method>`` on a ``Series`` of that type (and not of type ``category``).
That means, that in the above case, the returned values of ``str_cat.str.contains("a")`` and ``str_s.str.contains("a")`` or ``date_cat.dt.day`` and ``date_s.dt.day`` will be equal:
...
@jreback: please no squash: I really put some work into making each commit stand on its own, the unit of change is each commit (4 of them...) and not the complete PR. |
4496f82
to
9ad797e
Compare
Ok, addressed the comments, please review the doc change if that is clearer now. |
ok, lgtm. ping on green. on the squash; ok for this one :) normally we just squashem as a matter of course. |
@JanSchulz ok, ping when green (after any more doc changes you have) |
If a series is a type category and the underlying Categorical has categories of type string, then make it possible to use the `.str` assessor on such a series. The string methods work on the categories (and therefor fast if we have only a few categories), but return a Series with a dtype other than category (boolean, string,...), so that it is no different if we use `.str` on a series of type string or of type category.
If a series is a type category and the underlying Categorical has categories of type datetime, then make it possible to use the .dt assessor on such a series. The string methods work on the categories (and therefore fast if we have only a few categories), but return a Series with a dtype other than category (integer,...), so that it is no different if we use .dt on a series of type datetime or of type category.
ce5dd6a
to
c7bb283
Compare
Ok, added something to text.rst. Please review. It would also good if someone with access to travis would cancel the other builds so that we have a few hours less to wait for this :-) |
Also add some docs in text.rst to mention the performance gains when using ``s_cat.str`` vs ``s.str``.
c7bb283
to
8020bf5
Compare
@jreback ping :-) |
merged via 6eefe75 thanks! as usual, pls review built docs for accuracy |
@jreback Thanks for all the reviews and for your patience! The PR got much better thanks to you and @jorisvandenbossche! Will add a PR forthis:
|
@JanSchulz no thank you! great PR as always from you. I got the doc fix, was making some corrections anyhow. |
If a series is a type category and the underlying Categorical has categories of type string or datetime, then make it possible to use the
.str
/.dt
assessor on such a series.The string/dt methods work on the categories (and therefore fast if we have
only a few categories), but return a Series with a dtype other than
category (integer, boolean, string,...), so that it is no different if we use
.str
/.dt
on a series of type string or of type category.The main reason for that is that I think things like
s.str.slice(...) + s.str.slice(...)
should work.Closes: #10661