PERF: perform .str operations on categoricals #8627

Closed
jreback opened this issue Oct 24, 2014 · 13 comments
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance Strings String extension data type and string data

Comments

@jreback
Contributor

jreback commented Oct 24, 2014

It is a huge win to perform .str operations on the .categories of a Categorical (rather than on the full object array) when the number of categories is much smaller than the number of objects.

@jreback jreback added Performance Memory or execution speed performance Strings String extension data type and string data Categorical Categorical Data Type labels Oct 24, 2014
@jreback jreback added this to the 0.16.0 milestone Oct 24, 2014
@shoyer
Member

shoyer commented Oct 24, 2014

I agree this is a good idea, but note that categoricals intentionally do not support arithmetic. That seems inconsistent to me. So I would either consider adding support for arithmetic with numeric categories, or create a more specialized "interned string" array type, which has slightly different meaning than a categorical.

@jreback
Contributor Author

jreback commented Oct 24, 2014

I was thinking more along the lines of this:

s.str.startswith('a') is MUCH more performant if the number of factorized categories is small relative to the size of the object.
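
A minimal sketch of the idea, not the pandas implementation: evaluate the string method once per category, then expand the result through the integer codes. The sample data and the names `per_category`, `fast`, and `slow` are made up for illustration, and the sketch assumes there are no missing values (a code of -1 would need separate handling).

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows but only four distinct strings.
s = pd.Series(["apple", "avocado", "banana", "cherry"] * 25_000)
cat = s.astype("category")

# Baseline: the predicate runs on every element of the object array.
slow = s.str.startswith("a")

# Sketch: run the predicate once per category...
per_category = np.asarray(cat.cat.categories.str.startswith("a"))

# ...then look up each row's result via its integer code.
fast = pd.Series(per_category[cat.cat.codes.to_numpy()], index=s.index)
```

Here the predicate is evaluated four times instead of 100,000 times; the remaining cost is a single integer lookup per row.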

FYI, cc @JanSchultz

I think it might be nice to have a .density method on Categoricals? Essentially
100*len(categories)/float(len(codes)). Is this called something in R land?
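
For illustration, the proposed ratio could be computed with a small helper like the one below. `density` is a hypothetical function name, not an existing pandas method, and the sketch assumes a non-empty Categorical.

```python
import pandas as pd


def density(cat: pd.Categorical) -> float:
    """Hypothetical helper: number of categories as a percentage of the number of values."""
    return 100 * len(cat.categories) / float(len(cat.codes))


# 3 categories across 8 values.
cat = pd.Categorical(["a", "b", "a", "a", "c", "b", "a", "a"])
print(density(cat))
```

A low value suggests that working per category should pay off; a value near 100 means nearly every element is its own category.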

@shoyer
Member

shoyer commented Oct 25, 2014

@jreback I totally agree with you, but s + 1 has equivalent performance gains.

Here's what the categorical docs say about numeric operations:

In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, ...) are not possible.

Maybe this is more a practical statement than a principled one?

@jreback
Contributor Author

jreback commented Oct 25, 2014

not talking about numeric ops
but rather str ops here (like in the decode issue)

@jreback
Contributor Author

jreback commented Oct 25, 2014

side issue - you interested in speaking at pydata in November in NYC ? (I think. 22-23)

@shoyer
Member

shoyer commented Oct 25, 2014

@jreback OK, made a new issue for arithmetic. As for PyData NYC, I am sort of intrigued but already doing a lot of travel these days.

@jreback
Contributor Author

jreback commented Oct 25, 2014

@shoyer ok cool np.

@jankatins
Contributor

Again, I think that this is more an argument to implement a 'pandas-string' type. If one wants the above gains, one can already do np.asarray(cat.cat.set_categories(cat.cat.categories.str.whatever(...))). On the other hand, if this can essentially be done in 3 lines of additional code in the .str implementation, then why not...
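
One possible reading of that one-liner, as a sketch only: it uses `rename_categories` rather than `set_categories`, on the assumption that the transformed categories remain unique (as they do for `str.upper` on this made-up data).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data stored as a categorical Series.
s = pd.Series(["apple", "banana", "apple", "cherry"], dtype="category")

# Transform each category once; the renamed categories must stay unique.
upper = s.cat.rename_categories(s.cat.categories.str.upper())

# Convert back to a plain array of strings.
result = np.asarray(upper)
```

If the string method can map two categories to the same value (for example, slicing to the first character), renaming is not enough and the results have to be expanded through the codes instead.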

PS: Schulz with 'z' and not 'tz'. :-)

@jreback
Contributor Author

jreback commented Oct 26, 2014

@JanSchulz sorry about that, git is annoying with that :)

@jankatins
Contributor

IMO, this should be closed in favor of #8640...

@jreback
Contributor Author

jreback commented Nov 13, 2015

@JanSchulz you mean #10661, right? (That mostly solves the problem, though it does blow back to object arrays.) I suspect you get almost all of the perf gains. In fact, want to add a little benchmark to the perf suite?

@jankatins
Contributor

Nope: I understand this issue as wanting all .str functions to work on a shrunk-down dataset of unique values.

You could actually do that now by simply calling codes, categories = factorize(values, sort=True) and using that as a kind of categorical. On the other hand, that would do a factorize on every string method call (and does not cache it), so the penalty is still there.
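
A rough sketch of that factorize route, using made-up data and `str.upper` as a stand-in for any string method. It assumes no missing values, since `pd.factorize` marks those with a code of -1, which would silently pick the last category in the lookup below.

```python
import numpy as np
import pandas as pd

# Hypothetical object array with repeated strings.
values = np.array(["pear", "apple", "pear", "fig", "apple"], dtype=object)

# categories holds the sorted unique values; codes maps each row to its category.
codes, categories = pd.factorize(values, sort=True)

# Apply the string method to the unique values only...
transformed = np.asarray(pd.Index(categories).str.upper())

# ...and expand back to full length through the codes.
result = transformed[codes]
```

As the comment notes, the factorize step itself is repeated on every call unless the caller keeps `codes` and `categories` around.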

Today (as of #10661) you can do s.astype("category").str.whatever() and get that behaviour (if you cache the categorical yourself). But this has the problem that you get the categorical behaviour: no s+s, no arbitrary new string values, ...
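
A short usage sketch of that approach, with made-up data. The dtype string pandas accepts is `"category"`, and the conversion is done once and reused, since converting is the expensive step.

```python
import pandas as pd

# Hypothetical sample data.
s = pd.Series(["apple", "avocado", "banana"] * 3)

# Convert once and keep the result around for repeated .str calls.
cat = s.astype("category")

mask = cat.str.startswith("a")  # boolean result per row
upper = cat.str.upper()         # string result per row
```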

IMO the real solution is to build a "PandasString" class in the same spirit as "Categorical" and use that: s.astype("pdstring").str.whatever(). And that's #8640

Or get a real numpy string type...

@jreback
Contributor Author

jreback commented Nov 18, 2015

closing in favor of #8640

@jreback jreback closed this as completed Nov 18, 2015
@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Next Major Release Jul 21, 2016
4 participants