Should pandas preserve NaNs in str methods? #5672
Comments
I'd be on board with that. Definitely gets annoying to constantly be |
You can do this. Though I agree with you both that this should be the default for the 'boolean' type of ops on strings.
|
Yes it should. 👍 |
Isn't there a legitimate distinction between missing values and empty strings to be preserved?
|
@y-p this only applies to Series.str methods (and only the ones that return bools) |
@jreback, isn't proper NaN propagation the right thing everywhere, even in the Series.str methods? An empty string is not a NaN, and it looks like, at least in the example given, if you don't mix up the two - There may be other cases; if so, let's look at those. |
I think the issue is that a method that is in effect a boolean method can return NaNs, so you always need to fill. Those methods should be guaranteed to return booleans, much like any/all always return a boolean result. |
they are methods that return boolean series, I don't think it's true nans have no intrinsic
...excepting the usual limitations of numpy arrays with respect to NaN for some dtypes. That's not a good
If you want nans to behave like empty strings, use empty strings instead. Problem solved. |
There is already an argument, na, to the string methods to fill the NaNs, so no interpretation is required. I think a method in Python space like startswith always returns a boolean result, and it makes sense for the default in pandas to return a boolean array (right now the default for na is False, the proposal is to make it True). I think there are 2 classes of string methods: boolean operators (startswith, contains) and transformers (split, extract). The first should fillna according to the na arg; the second should not. Booleans tend to be used in indexing, so you can argue whether the default should be True or False, but I think it's clear that it should be boolean. |
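As a minimal sketch of the na argument mentioned above: the example below calls Series.str.startswith with and without na=False. The sample values and the explicit dtype=object are assumptions chosen for illustration, and the exact dtype of the unfilled result can vary between pandas versions, so only the filled result is checked for being boolean.

```python
import numpy as np
import pandas as pd

# Illustrative data; dtype=object keeps the classic NaN-based missing value.
s = pd.Series(["apple", np.nan, "avocado", "banana"], dtype=object)

# Without na=..., the missing entry propagates into the result.
raw = s.str.startswith("a")

# With na=False, the missing entry is filled and the result is a plain bool Series.
filled = s.str.startswith("a", na=False)

print(raw.isna().tolist())
print(filled.tolist())
```

Passing na=True instead would treat missing entries as matches, which is the other default discussed in this thread.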
It makes sense to me that some questions have undefined as their answer.
|
It's a bit annoying that you have to do fillna, but at least it errors out when you try to use it as a selector:
>>> df = pd.DataFrame({"A":['a', np.nan]})
>>> df[df.A.str.startswith('a')]
Traceback
...
ValueError: cannot index with vector containing NA / NaN values. |
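For reference, one way to get a usable selector from the snippet above is to fill the missing entries before indexing. This is only a sketch: the extra rows and the explicit dtype=object are assumptions for illustration, and treating missing values as False is a choice, not something the thread has settled.

```python
import numpy as np
import pandas as pd

# Illustrative frame; dtype=object keeps NaN as the missing value.
df = pd.DataFrame({"A": pd.Series(["a", np.nan, "ab", "b"], dtype=object)})

# Fill the missing entry with False, then cast to a plain boolean mask.
mask = df["A"].str.startswith("a").fillna(False).astype(bool)

# The mask no longer contains NaN, so it can be used for row selection.
selected = df[mask]
print(selected)
```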
That's a more reasonable place to choose the semantics of indexing with nan. |
Why is the kwarg called |
@jorisvandenbossche could this be closed? Is this discussion superseded by the new BooleanArray and its default behaviors, or is this orthogonal, and could we still consider changing the existing behavior of the str methods so that they always return a boolean array and not object when np.NaN is present? |
I think this could also be superseded by the string extension dtype with NA support. Happy to reopen if we want to continue the original discussion |
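A short sketch of the behavior referred to here, assuming a pandas version that provides the "string" extension dtype and the nullable "boolean" dtype. The sample values are made up for illustration.

```python
import pandas as pd

# Illustrative data using the string extension dtype, where missing is pd.NA.
s = pd.Series(["apple", None, "banana"], dtype="string")

# The result uses the nullable boolean dtype, so the missing entry stays
# missing instead of forcing the whole result to object dtype.
starts = s.str.startswith("a")

# When a nullable boolean mask is used for selection, missing entries
# are treated as False rather than raising.
selected = s[starts]

print(starts.dtype)
print(selected.tolist())
```

With this dtype the missing value is preserved in the result while the mask remains usable for indexing, which addresses both sides of the discussion.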
Since the output of the string methods is primarily boolean, do we want to preserve NaNs? Use case: I load in some data, and empty strings (i.e., missing data in a column that's string data) get automatically converted to NaNs. I have to do fillna everywhere first, or I run into situations like this.
Since presumably the primary use case of a boolean series is indexing, do we want to just return False in place of NaNs? Just a thought.
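To illustrate the use case described above, here is a small sketch of how an empty field becomes NaN on load. The CSV content and column names are hypothetical; keep_default_na is an existing read_csv option and is shown only as one way to keep empty strings, not as a recommendation from this thread.

```python
import io

import pandas as pd

# Hypothetical CSV with one empty field in the "city" column.
csv_text = "name,city\nalice,Austin\nbob,\n"

# By default, the empty field is read as a missing value.
default_df = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False, the empty field stays an empty string.
kept_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df["city"].isna().tolist())
print(kept_df["city"].tolist())
```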