API: .str ops on category should return category if result is non-boolean #15198

Open
jreback opened this issue Jan 23, 2017 · 4 comments
Labels
Categorical Categorical Data Type Enhancement Needs Discussion Requires discussion from core team before further action Strings String extension data type and string data

Comments

@jreback
Contributor

jreback commented Jan 23, 2017

This concerns the PR implementing .str/.dt on Categoricals, #11582.

The following is perfectly reasonable. We perform the string op on the uniques. This routine produces a boolean result, so we return a boolean Series.

In [2]: s = pd.Series(list('aabb')).astype('category')

In [3]: s
Out[3]: 
0    a
1    a
2    b
3    b
dtype: category
Categories (2, object): [a, b]

In [4]: s.str.contains("a")
Out[4]: 
0     True
1     True
2    False
3    False
dtype: bool

However, I don't recall the rationale for performing the op on the uniques (as it's a categorical) but then returning an object dtype.

In [5]: s.str.upper()
Out[5]: 
0    A
1    A
2    B
3    B
dtype: object

These are by definition pure transforms, so returning a new categorical makes sense, e.g. in this case:

In [6]: pd.Series(pd.Categorical.from_codes(s.cat.codes, s.cat.categories.str.upper()))
Out[6]: 
0    A
1    A
2    B
3    B
dtype: category
Categories (2, object): [A, B]

This will be far more efficient than actually converting to object, since the string op only has to run on the unique categories.

@jreback jreback added API Design Strings String extension data type and string data labels Jan 23, 2017
@jreback
Contributor Author

jreback commented Jan 23, 2017

@jorisvandenbossche
Member

There is some discussion of this in issue #10661 (though no very strong arguments are listed there for either option).

@jankatins
Contributor

I think the main argument was that s.str + s.str should be possible, but it won't if .str (or any method under .str) returns a categorical. I think most of these arguments will go away if there is a new pd.String...
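
To make the `s.str + s.str` argument concrete, here is a minimal sketch (not taken from the thread). It assumes a pandas version where `.str` methods on a categorical Series return plain string results, as shown in the issue description, and where categorical Series do not support `+`. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(list("aabb")).astype("category")

# The .str results are ordinary string Series, so they concatenate elementwise.
combined = s.str.upper() + s.str.lower()

# Categorical data does not define addition, so the same pattern is
# expected to fail if the operands are categorical.
try:
    s + s
    categorical_add_ok = True
except TypeError:
    categorical_add_ok = False

print(combined.tolist())
print(categorical_add_ok)
```

If `.str.upper()` returned a categorical, the first expression would presumably hit the same failure as the second.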

@jorisvandenbossche
Member

Looking at that thread again, the argument I made there for returning strings and not categoricals is that it would be a lot simpler to just return an object Series, so that we don't have to deal with all the possible corner cases you can run into when manipulating the categoricals (e.g. what to do with pd.Series(list('abAb'), dtype='category').str.upper()? Merge categories?). I think that still holds, but I don't feel strongly about it.

If you want to change the categoricals, you can always act directly on the categories themselves (which is also more explicit), like
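
The corner case mentioned above can be sketched as follows. This is only an illustration of one possible "merge categories" behaviour, not something the thread settles on: it re-factorizes the transformed categories with `pd.factorize` and remaps the existing codes. The names `naive_ok`, `merged_codes`, and `merged` are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abAb"), dtype="category")

# Upper-casing the categories ['A', 'a', 'b'] yields a duplicate 'A'.
upper = s.cat.categories.str.upper()

# Reusing the old codes with the transformed categories is expected to fail,
# because categories must be unique.
try:
    pd.Categorical.from_codes(s.cat.codes, upper)
    naive_ok = True
except ValueError:
    naive_ok = False

# One possible resolution: collapse duplicate categories and remap the codes.
new_codes, new_categories = pd.factorize(upper)
codes = s.cat.codes.to_numpy()
# Keep -1 (the missing-value code) as -1; remap every other code.
merged_codes = np.where(codes == -1, -1, new_codes[codes])
merged = pd.Series(
    pd.Categorical.from_codes(merged_codes, new_categories), index=s.index
)

print(naive_ok)
print(merged.tolist())
```

Whether merging like this is the right default is exactly the open design question; the sketch only shows that it is mechanically possible.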

s.cat.categories = s.cat.categories.str.upper()
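
As far as I know, newer pandas releases no longer allow assigning to `s.cat.categories` directly. A roughly equivalent sketch, assuming a version that provides `Series.cat.rename_categories`, would be:

```python
import pandas as pd

s = pd.Series(list("aabb")).astype("category")

# rename_categories returns a new Series; the codes are kept and only the
# category labels change. The new labels must still be unique.
s = s.cat.rename_categories(s.cat.categories.str.upper())

print(s.tolist())
print(list(s.cat.categories))
```

Like the direct assignment, this fails if the transformed categories are not unique (for example, upper-casing categories that already contain both 'a' and 'A').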

@jorisvandenbossche jorisvandenbossche added the Needs Discussion Requires discussion from core team before further action label Mar 6, 2017
@mroeschke mroeschke added Enhancement Categorical Categorical Data Type labels May 3, 2020