DEPR: Deprecate .str accessor on object-dtype. #29710

Open
TomAugspurger opened this issue Nov 19, 2019 · 7 comments
Labels
Deprecate Functionality to remove in pandas Strings String extension data type and string data

Comments

@TomAugspurger
Contributor

This issue discusses a future deprecation of the .str accessor on object dtype.
It's probably blocked by making Series(['a', 'b', 'c']) infer StringDtype, rather
than object dtype.

In #29640, we split the implementation
of .str methods into two: one for the old object-dtype arrays and one for StringArray.
The StringArray implementation is much nicer since we know the result dtype statically
for most methods. We don't need to worry about the presence of NAs changing int to
floats or bool to object.
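
As a rough illustration of the point about result dtypes, the sketch below compares .str.len() on object dtype and on the "string" dtype. The variable names are made up for this example, and the exact dtypes may vary between pandas versions.

```python
import pandas as pd

# Object dtype: the result dtype depends on whether missing values are present.
obj_full = pd.Series(["a", "bb"], dtype=object)
obj_missing = pd.Series(["a", "bb", None], dtype=object)

# "string" dtype: the result dtype is known up front (nullable extension types).
strs = pd.Series(["a", "bb", None], dtype="string")

obj_full_len = obj_full.str.len()        # integer result, nothing missing
obj_missing_len = obj_missing.str.len()  # upcast to float to hold NaN
str_len = strs.str.len()                 # nullable integer, NA stays NA
str_upper = strs.str.isupper()           # nullable boolean

print(obj_full_len.dtype, obj_missing_len.dtype, str_len.dtype, str_upper.dtype)
```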

Given that it's nicer for the user (faster, more predictable) and it's nicer for us
(clean up old code), we should deprecate the .str accessor on object dtype.

There are a few complications. Certain .str methods don't actually operate
on scalar strings.

  1. .str.join can work with Series where each row is a list of strings.
  2. .str.decode works on bytes, not strings. It makes no sense on a StringArray (but
    we can pretty easily implement a BytesArray)
  3. .str.get is extremely flexible / complicated. It works on anything that
    implements __getitem__ or .get. So it's maybe useful for strings, but
    also for nested Series storing lists / dicts.

For these, we may want separate accessors. Or we can keep the .str accessor
and deprecate every method except for those.
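
For reference, the three cases above look roughly like this on object-dtype Series today. This is a small sketch with made-up data, not a statement about how any replacement accessor would behave.

```python
import pandas as pd

# 1. .str.join on a Series where each row is a list of strings
lists = pd.Series([["a", "b"], ["c", "d", "e"]])
joined = lists.str.join("-")

# 2. .str.decode on a Series of bytes
raw = pd.Series([b"caf\xc3\xa9", b"abc"])
decoded = raw.str.decode("utf-8")

# 3. .str.get on containers: dict keys and list positions
dicts = pd.Series([{"k": 1}, {"k": 2}])
by_key = dicts.str.get("k")
by_position = lists.str.get(0)

print(joined.tolist(), decoded.tolist(), by_key.tolist(), by_position.tolist())
```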

@WillAyd
Member

WillAyd commented Nov 19, 2019

Makes sense to me. As far as the complications go, I don't think str.decode is used often enough to worry about; str.get is a pretty weird method to use on containers.

str.join is also kind of a misnomer, and we don't have first-class support for lists stored in Series anyway, so again I don't think we should jump through hoops for that.

@jschendel
Member

str.get is a pretty weird method to use on containers

Not disagreeing, but note that the documentation for str.get explicitly states that containers are valid and gives example usage with list/tuple/dict. There's also an example of this in the Working with text data portion of the user guide. So there is some precedent for using str.get in such a way, and it's behavior we've explicitly supported.

@Liam3851
Contributor

In our firm .str.split(sep).str[n] is a common idiom to quickly parse the nth substring of a delimited string (e.g. to parse paths stored in a Series). As @jschendel noted it has been officially supported in the docs for years (since at least 0.19 if not earlier).
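
A minimal sketch of that idiom, using made-up paths: the first .str call splits each string into a list, and the second .str call indexes into those lists.

```python
import pandas as pd

paths = pd.Series(["usr/local/bin", "home/user/docs"])

# .str.split returns a Series of lists; .str[n] then picks the nth element.
parts = paths.str.split("/")
second = parts.str[1]

# .str.get(n) is the method form of the same lookup.
same = parts.str.get(1)

print(second.tolist())
```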

str.len is also currently supported in the docs as working on tuple, list, and dict, so I'd add that to the list above. str.slice also works in the logical way on sequences, but is not currently documented as supported for sequences.

Perhaps a .seq accessor would be appropriate for sequences and would support get/getitem/join/len/slice?
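
To make the suggestion concrete, here is a hypothetical sketch of such an accessor built on the existing pd.api.extensions.register_series_accessor hook. The name "seq", the class SeqAccessor, and its methods are assumptions for illustration only; nothing like this exists in pandas.

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("seq")
class SeqAccessor:
    """Hypothetical accessor for Series whose elements are sequences."""

    def __init__(self, series):
        self._series = series

    def get(self, key):
        # Return None where the key or position is not available.
        def _get(value):
            try:
                return value[key]
            except (KeyError, IndexError, TypeError):
                return None

        return self._series.map(_get)

    def len(self):
        return self._series.map(len)

    def join(self, sep):
        return self._series.map(sep.join)


rows = pd.Series([["a", "b"], ["c"]])
first = rows.seq.get(0)
lengths = rows.seq.len()
joined = rows.seq.join("-")
```

The existing rows.str.len() returns the same lengths today, which is the behavior such an accessor would presumably take over.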

@WillAyd
Member

WillAyd commented Nov 19, 2019

Thanks for the feedback @Liam3851 that is great.

So sounds like we would want to deprecate this. Just as a thought though instead of using accessors at all I wonder if a Series should just dispatch to the underlying array. We could take something like the DictArray in #29557 but make it even more generic (say a ContainerArray) and define a .get method directly against that.

So instead of today where you do something like:

pd.Series().str.split(",").str.get(...)

You could just do

pd.Series().split(",").get(...)

This assumes that pd.Series contains a StringArray which defines a split method that returns, say, a ListArray, which itself defines a get method.
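
A toy sketch of that dispatch idea in plain Python, with no pandas involved. ToySeries, ToyStringArray and ToyListArray are hypothetical stand-ins invented for this example; they are not the real Series, StringArray, or any proposed ListArray.

```python
class ToyListArray:
    """Hypothetical array of lists that defines its own get method."""

    def __init__(self, values):
        self.values = values

    def get(self, i):
        # Return None for missing rows and out-of-range positions.
        return [
            v[i] if v is not None and -len(v) <= i < len(v) else None
            for v in self.values
        ]


class ToyStringArray:
    """Hypothetical array of strings whose split returns a ToyListArray."""

    def __init__(self, values):
        self.values = values

    def split(self, sep):
        return ToyListArray([None if v is None else v.split(sep) for v in self.values])


class ToySeries:
    """Hypothetical container that forwards unknown methods to its array."""

    def __init__(self, array):
        self._array = array

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        attr = getattr(self._array, name)
        if not callable(attr):
            return attr

        def method(*args, **kwargs):
            result = attr(*args, **kwargs)
            # Re-wrap array results so calls can be chained.
            if isinstance(result, (ToyStringArray, ToyListArray)):
                return ToySeries(result)
            return result

        return method


ser = ToySeries(ToyStringArray(["a,b", "c,d", None]))
second = ser.split(",").get(1)
print(second)
```

The available methods change with the type of the underlying array, so get only exists after split has produced a list-valued array.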

@jorisvandenbossche
Member

I don't think we can deprecate this on a reasonably short term.

First, as Tom says himself, we first still need to switch to "string" dtype by default for strings (in pandas 2.0?). And as long as that is not done, I don't think we should deprecate .str on object.
But even then, using object dtype for strings is so ingrained that I am not sure we should deprecate it quickly (although it would be a way to push remaining people from object to string).

Just as a thought though instead of using accessors at all I wonder if a Series should just dispatch to the underlying array.

I would like to explore that idea very much (I mentioned that in #26710 (comment), but it was not further discussed there). Personally, I think it would be interesting if someone wrote up a roadmap proposal to discuss this (it's been on my to do list, but currently focusing on missing values :))

@minggli
Contributor

minggli commented Dec 2, 2019

Also, to add to the discussion: at the moment, the .str accessor silently casts non-string objects to NaN, when it should perhaps raise an error of some sort.

In [96]: class Dummy(object):
    ...:     pass
    ...:

In [97]: ser = pd.Series([Dummy(), Dummy()])
In [102]: ser
Out[102]:
0    <__main__.Dummy object at 0x7f4ace630518>
1    <__main__.Dummy object at 0x7f4ace630240>
dtype: object

In [103]: ser.str.lower()
Out[103]:
0   NaN
1   NaN
dtype: float64
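
Until something like that is built in, one possible workaround is to check what the Series holds before touching .str. The helper below is a sketch: require_strings is a made-up name, and it relies on pandas.api.types.infer_dtype to classify the values.

```python
import pandas as pd
from pandas.api.types import infer_dtype


def require_strings(ser):
    """Hypothetical guard: return ser.str only if every non-missing value is a string."""
    kind = infer_dtype(ser, skipna=True)
    if kind not in ("string", "empty"):
        raise TypeError(f"expected only strings, but the values were inferred as {kind!r}")
    return ser.str


class Dummy:
    pass


lowered = require_strings(pd.Series(["A", "b"])).lower()

try:
    require_strings(pd.Series([Dummy(), Dummy()]))
    raised = False
except TypeError:
    raised = True
```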

@ghost

ghost commented Nov 9, 2020

Just ran into the issue @minggli describes when using pd.concat on two CSVs with slightly different auto-dtype detection.

e.g.

In [1]: s1 = pd.Series(['1,000', '2,000'])
In [2]: s2 = pd.Series(['1000', '2000'], dtype=int)
In [3]: s_joined = pd.concat([s1, s2], ignore_index=True)
In [4]: s_joined
Out[4]:
0    1,000
1    2,000
2     1000
3     2000
dtype: object

In [5]: s_joined.str.replace(',', '')
Out[5]:
0     1000
1     2000
2      NaN
3      NaN
dtype: object
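
One way to sidestep this today, sketched under the assumption that every value should end up as a number: cast everything to str before calling .str.replace, then convert explicitly. The variable names are made up for this example.

```python
import pandas as pd

s1 = pd.Series(["1,000", "2,000"])  # parsed as text because of the commas
s2 = pd.Series([1000, 2000])        # parsed as integers
s_joined = pd.concat([s1, s2], ignore_index=True)

# Without a cast, the integer rows silently come back as missing values.
silent = s_joined.str.replace(",", "", regex=False)

# Casting first makes every row a string, so no row is dropped.
cleaned = s_joined.astype(str).str.replace(",", "", regex=False)
numbers = pd.to_numeric(cleaned)

print(silent.isna().tolist(), numbers.tolist())
```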

Very much support both the new String dtypes and the deprecation of the .str accessor on object dtypes!
