DEPR: Deprecate .str accessor on object-dtype. #29710
Comments
Makes sense to me. As far as the complications I don't know that
|
Not disagreeing, but note that the documentation for |
In our firm
Perhaps a |
Thanks for the feedback @Liam3851 that is great. So sounds like we would want to deprecate this. Just as a thought though instead of using accessors at all I wonder if a Series should just dispatch to the underlying array. We could take something like the DictArray in #29557 but make it even more generic (say a ContainerArray) and define a So instead of today where you do something like:
You could just do
This assumes that pd.Series contains a StringArray which defines a |
I don't think we can deprecate this on a reasonably short term. First, as Tom says himself, we first still need to switch to "string" dtype by default for strings (in pandas 2.0?). And as long as that is not done, I don't think we should deprecate
I would like to explore that idea very much (I mentioned that in #26710 (comment), but it was not further discussed there). Personally, I think it would be interesting if someone wrote up a roadmap proposal to discuss this (it's been on my to do list, but currently focusing on missing values :)) |
also to add to the discussion. at the moment,
In [96]: class Dummy(object):
    ...:     pass
    ...:
In [97]: ser = pd.Series([Dummy(), Dummy()])
In [102]: ser
Out[102]:
0 <__main__.Dummy object at 0x7f4ace630518>
1 <__main__.Dummy object at 0x7f4ace630240>
dtype: object
In [103]: ser.str.lower()
Out[103]:
0 NaN
1 NaN
dtype: float64 |
Just ran into the issue @minggli describes when using e.g.
In [1]: s1 = pd.Series(['1,000', '2,000'])
In [2]: s2 = pd.Series(['1000', '2000'], dtype=float)
In [3]: s_joined = pd.concat([s1, s2], ignore_index=True)
In [4]: s_joined
Out[4]:
0 1,000
1 2,000
2 1000
3 2000
dtype: object
In [5]: s_joined.str.replace(',', '')
Out[5]:
0 1000
1 2000
2 NaN
3 NaN
dtype: object
Very much support both the new String dtypes and the deprecation of the |
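The silent NaN in the example above comes from non-string values hiding in an object-dtype Series. As a minimal sketch (not something proposed in this thread), `pandas.api.types.infer_dtype` can be used to spot such a column before calling `.str` on it. The variable names below are illustrative only, and `s2` is written with float literals to mirror what `dtype=float` produces.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Rebuild the mixed Series from the comment above: two strings, two floats.
s1 = pd.Series(["1,000", "2,000"], dtype=object)
s2 = pd.Series([1000.0, 2000.0])
s_joined = pd.concat([s1, s2], ignore_index=True)

# infer_dtype inspects the values, not just the container dtype.
kind_strings = infer_dtype(s1)
kind_joined = infer_dtype(s_joined)
print(kind_strings, kind_joined)

# Locate the rows that are not str, i.e. the ones .str methods turn into NaN.
is_str = s_joined.map(lambda v: isinstance(v, str))
non_string_rows = s_joined[~is_str]
print(non_string_rows)
```

A check like this only reports the problem; deciding whether to cast, drop, or reject the non-string rows is left to the caller.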
This issue discusses a future deprecation of the .str accessor on object-dtype Series. It's probably blocked by making Series(['a', 'b', 'c']) infer StringDtype, rather than object dtype.
In #29640, we split the implementation of .str methods into two: one for the old object-dtype arrays and one for StringArray. The StringArray implementation is much nicer since we know the result dtype statically for most methods. We don't need to worry about the presence of NAs changing int to floats or bool to object.
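To illustrate the point about result dtypes, here is a small sketch comparing the same `.str` calls on an object-dtype Series and on a Series created with `dtype="string"`. It assumes pandas 1.0 or later, where the "string" dtype exists; the variable names are illustrative.

```python
import pandas as pd

# Same values, two dtypes. dtype=object is spelled out because the default
# inference for strings is exactly what this issue expects to change.
obj = pd.Series(["abc", None], dtype=object)
strs = pd.Series(["abc", None], dtype="string")

# Integer-valued method: the missing value forces float64 on object dtype,
# while the string dtype returns a nullable integer result.
obj_len = obj.str.len()
str_len = strs.str.len()
print(obj_len.dtype, str_len.dtype)

# Boolean-valued method: compare the result dtypes the same way.
obj_upper = obj.str.isupper()
str_upper = strs.str.isupper()
print(obj_upper.dtype, str_upper.dtype)

# Without any missing value, object dtype gives a plain integer result,
# so the dtype depends on the data rather than on the method.
no_na_len = pd.Series(["abc", "de"], dtype=object).str.len()
print(no_na_len.dtype)
```

On the string dtype the result dtype is the same whether or not missing values are present, which is the predictability the paragraph above describes.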
Given that it's nicer for the user (faster, more predictable) and it's nicer for us (clean up old code), we should deprecate the .str accessor on object dtype.
There are a few complications. Certain .str methods aren't actually methods called on scalar strings.
.str.join can work with Series where each row is a list of strings.
.str.decode works on bytes, not strings. It makes no sense on a StringArray (but we can pretty easily implement a BytesArray).
.str.get is extremely flexible / complicated. It works on anything that implements __getitem__ or .get. So it's maybe useful for strings, but also for nested Series storing lists / dicts.
For these, we may want separate accessors. Or we can keep the .str accessor and deprecate every method except for those.