PERF: perform .str operations on categoricals #8627

jreback · 2014-10-24T20:27:02Z

There is a huge win in performing .str operations on the .categories of a Categorical (rather than on the full object array) when the number of categories is much smaller than the number of objects.

shoyer · 2014-10-24T22:11:00Z

I agree this is a good idea, but note that categoricals intentionally do not support arithmetic. That seems inconsistent to me. So I would either consider adding support for arithmetic with numeric categories, or create a more specialized "interned string" array type, which has a slightly different meaning than a categorical.

jreback · 2014-10-24T23:50:35Z

I was thinking more along the lines of this:

s.str.startswith('a') is MUCH more performant if the number of factorized categories is small relative to the length of the object array.
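A rough sketch of the idea, not the pandas implementation: run the string method once per category, then broadcast the per-category result back through the integer codes. The helper name `startswith_via_categories` is hypothetical, and treating missing values as False is a simplification made here.

```python
import numpy as np
import pandas as pd


def startswith_via_categories(s, prefix):
    """Hypothetical helper: evaluate startswith once per unique value."""
    cat = s.astype("category")
    # One string operation per category instead of one per row.
    per_category = np.asarray(cat.cat.categories.str.startswith(prefix), dtype=bool)
    codes = cat.cat.codes.to_numpy()
    # Missing values carry code -1; this sketch maps them to False.
    out = np.zeros(len(s), dtype=bool)
    valid = codes >= 0
    out[valid] = per_category[codes[valid]]
    return pd.Series(out, index=s.index)


s = pd.Series(["apple", "banana", "avocado", "apple", None])
result = startswith_via_categories(s, "a")
```

The gain depends on the ratio of categories to rows: with only a handful of unique strings, the Python-level string work shrinks to a handful of calls, and the rest is an integer-indexed lookup.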

FYI, cc @JanSchultz

I think it might be nice to have a .density method on Categoricals? Essentially
100*len(categories)/float(len(codes))? Is this called something in R land?
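The formula above could be sketched as a free function. `category_density` is a hypothetical name, not an existing pandas method, and returning 0.0 for an empty Categorical is an assumption made here.

```python
import pandas as pd


def category_density(cat):
    """Hypothetical helper: categories per 100 values in a Categorical."""
    n_values = len(cat.codes)
    if n_values == 0:
        # Assumption: define the density of an empty Categorical as 0.
        return 0.0
    return 100 * len(cat.categories) / float(n_values)


cat = pd.Categorical(["a", "b", "a", "a"])
density = category_density(cat)  # 2 categories over 4 values
```

A low value means few categories relative to the number of rows, which is the case where working on the categories pays off most.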

shoyer · 2014-10-25T01:09:25Z

@jreback I totally agree with you, but s + 1 has equivalent performance gains.

Here's what the categorical docs say about numeric operations:

In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, ...) are not possible.

Maybe this is more a practical statement than a principled one?

jreback · 2014-10-25T01:11:19Z

not talking about numeric ops
but rather str ops here (like in the decode issue)

jreback · 2014-10-25T01:12:07Z

side issue - are you interested in speaking at PyData NYC in November? (the 22nd-23rd, I think)

shoyer · 2014-10-25T01:33:12Z

@jreback OK, made a new issue for arithmetic. As for PyData NYC, I am sort of intrigued but already doing a lot of travel these days.

jreback · 2014-10-25T01:35:19Z

@shoyer ok cool np.

jankatins · 2014-10-26T19:12:17Z

Again, I think this is more of an argument for implementing a 'pandas-string' type. If one wants the above gains, one can already do np.asarray(cat.cat.set_categories(cat.cat.categories.str.whatever(...))). On the other hand, if this can essentially be done in 3 lines of additional code in the .str implementation, then why not...

PS: Schulz with 'z' and not 'tz'. :-)

jreback · 2014-10-26T19:50:54Z

@JanSchulz sorry about that, git is annoying with that :)

jankatins · 2015-11-12T23:48:28Z

IMO, this should be closed in favor of #8640...

jreback · 2015-11-13T13:32:39Z

@JanSchulz you mean #10661, right? (It mostly solves the problem, though it does blow back to object arrays.) I suspect you get almost all of the perf gains. In fact, want to add a little benchmark to the perf suite?

jankatins · 2015-11-13T13:58:42Z

Nope: as I understand this issue, you want all .str functions to work on a shrunk-down dataset of unique values.

You could actually do that now by simply calling codes, categories = factorize(values, sort=True) and using that as a kind of categorical. On the other hand, that would run a factorize on every string method call (and would not cache it), so the penalty is still there.
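Roughly, the factorize route looks like the sketch below. `str_op_via_factorize` and its `op` callback are hypothetical names, and the sketch assumes the input has no missing values (a code of -1 would otherwise index the wrong element).

```python
import numpy as np
import pandas as pd


def str_op_via_factorize(values, op):
    """Hypothetical helper: apply a string op to the unique values only.

    `op` receives a pandas Index of the unique values and returns one
    result per unique value. Assumes `values` contains no missing entries.
    """
    # Note: this factorize runs on every call; nothing is cached.
    codes, categories = pd.factorize(values, sort=True)
    per_category = np.asarray(op(pd.Index(categories)), dtype=object)
    # Broadcast the per-category results back to the original positions.
    return per_category[codes]


values = np.array(["b", "a", "b", "c"], dtype=object)
upper = str_op_via_factorize(values, lambda idx: idx.str.upper())
```

Because the factorize step is repeated for each string method, a chain of several methods pays the hashing cost several times over, which is the penalty mentioned above.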

Today (as of #10661) you can do s.astype("category").str.whatever() and get that behaviour (if you cache the cat yourself). But this has the problem that you get the categorical behaviour: no s+s, no arbitrary new string values, ...
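For illustration, a minimal sketch of that usage, with the conversion done once and reused. It uses "category", the dtype name pandas accepts for `astype`; the sample data and variable names are made up.

```python
import pandas as pd

s = pd.Series(["apple", "banana", "apple", "avocado"])

# Convert once and keep the result around; this is the "cache" step.
cat = s.astype("category")

# Each call goes through the .str accessor on the categorical Series.
starts_with_a = cat.str.startswith("a")
upper = cat.str.upper()
```

Keeping `cat` around avoids re-factorizing for every string method, at the cost of living with categorical semantics on that object.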

IMO the real solution is to build a "PandasString" class in the same spirit as "Categorical" and use that: s.astype("pdstring").str.whatever(). And that's #8640

Or get a real numpy string type...

jreback · 2015-11-18T11:47:32Z

closing in favor of #8640

jreback added the Performance, Strings, and Categorical labels Oct 24, 2014

jreback added this to the 0.16.0 milestone Oct 24, 2014

shoyer mentioned this issue Oct 25, 2014

Add arithmetic to categoricals? #8629

Closed

jreback mentioned this issue Oct 26, 2014

API/ENH: dtype='string' / pd.String #8640

Closed

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback mentioned this issue Jul 23, 2015

API: Add str/dt accessors to categorical #10661

Closed

jreback closed this as completed Nov 18, 2015

jorisvandenbossche modified the milestones: No action, Next Major Release Jul 21, 2016
