Should pandas preserve NaNs in str methods? #5672

jseabold · 2013-12-10T10:27:50Z

Since the output of the string methods is primarily boolean, do we want to preserve NaNs? Use case: I load some data, and empty strings (i.e., missing data in a string column) are automatically converted to NaNs. I have to call fillna everywhere first, or I run into situations like this.

import numpy as np
import pandas as pd

dta = pd.DataFrame([["blah"],[np.nan],["foobar"]], columns=['vari'])
dta.vari.str.startswith('f')
dta.ix[dta.vari.str.startswith('f')]

Since presumably the primary use case of a boolean series is indexing, do we want to just return False in place of NaNs? Just a thought.

jtratner · 2013-12-10T12:49:50Z

I'd be on board with that. Definitely gets annoying to constantly be
calling fillna() before/after bool-producing string methods.

jreback · 2013-12-10T14:51:30Z

You can do this. Though I agree with you both that this should be the default for the 'boolean' type of ops on strings.

In [6]: dta.ix[dta.vari.str.startswith('f',na=False)]
Out[6]: 
     vari
2  foobar

[1 rows x 1 columns]
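
For readers on a recent pandas, where `.ix` is no longer available, the same filter can be sketched with `.loc`. This is a minimal illustration that reuses the `dta` frame from the original post and assumes nothing beyond the `na` keyword shown above.

```python
import numpy as np
import pandas as pd

# Same data as in the original post: one missing value in a string column.
dta = pd.DataFrame([["blah"], [np.nan], ["foobar"]], columns=["vari"])

# na=False fills the missing position, so the mask is purely boolean.
mask = dta["vari"].str.startswith("f", na=False)

# .loc replaces the .ix indexer used in the thread.
result = dta.loc[mask]
print(result)
```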

dragoljub · 2013-12-14T06:22:38Z

Yes it should. 👍

ghost · 2013-12-14T16:21:11Z

Isn't there a legitimate distinction between missing values and empty strings to be preserved?

In [43]: pd.read_csv(StringIO("'123', '',, '123'"),dialect=csv.get_dialect("excel"),header=None)
Out[43]: 
       0    1   2       3
0  '123'   '' NaN   '123'

jreback · 2013-12-14T16:42:40Z

@y-p this only applies to Series.str methods (and only the ones that return bools)

ghost · 2013-12-14T16:55:14Z

@jreback, isn't proper nan propagation the right thing everywhere, even the Series.str methods?

An empty string is not a NaN, and it looks like, at least in the example given, the problem goes away if you don't mix up the two.

There may be other cases, if so, let's look at those.

jreback · 2013-12-14T17:06:10Z

I think the issue is that a method that is in effect a boolean method can return NaNs, so you always need to fill.

Those methods should be guaranteed to return booleans,

kind of like any/all always return a boolean result.

ghost · 2013-12-14T17:38:58Z

They are methods that return boolean series. I don't think it's true that NaNs have no intrinsic
meaning in a boolean series. Of course they do.

In [45]: np.array([False, np.nan, True])
Out[45]: array([  0.,  nan,   1.])

...excepting the usual limitations of numpy arrays with respect to NaN for some dtypes. That's not a good
reason, though.
any and all seem to have that behaviour on the surface, but it's because of the way they test
values, not because they explicitly ignore NaNs. Plus, I'm pretty sure that's not what you want here:

In [51]: np.array([np.nan,  np.nan]).any()
Out[51]: True
In [50]: bool(np.nan)
Out[50]: True

If you want nans to behave like empty strings, use empty strings instead. Problem solved.

jreback · 2013-12-14T17:48:40Z

There is already an argument, na, on the string methods to fill the NaNs, so no interpretation is required.

I think that, in Python space, startswith always returns a boolean result,

and it makes sense for the default to return a boolean array in pandas (right now the default for na is NaN; the proposal is to make it False).

I think there are 2 classes of string methods: boolean operators (startswith, contains) and transformers (split, extract).

the first should fillna according to the na arg, the second should not

Booleans tend to be used in indexing, so you can argue whether the default should be True or False, but I think it's clear that it should be boolean.
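
The two classes of methods described above can be illustrated with a short sketch. The sample strings are made up for illustration, and the series is forced to object dtype as an assumption, so that it matches the dtype under discussion in this thread.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value, kept as object dtype.
s = pd.Series(["foo,bar", np.nan, "baz"], dtype=object)

# Boolean operators: the na argument fills missing positions,
# so the result can be used directly as an indexer.
starts = s.str.startswith("f", na=False)
contains = s.str.contains("a", na=False)

# Transformers: the missing value is passed through unchanged.
parts = s.str.split(",")

print(starts.tolist())
print(contains.tolist())
print(parts.tolist())
```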

ghost · 2013-12-14T18:11:32Z

It makes sense to me that some questions have undefined as their answer.
Isn't np.nan a signal for "I can't answer that"?

In [87]: s=pd.Series([1])
    ...: s.str.startswith("a")
    ...: 
Out[87]: 
0   NaN
dtype: float64

jtratner · 2013-12-14T19:57:06Z

It's a bit annoying that you have to do fillna, but at least it errors out when you try to use it as a selector:

>>> df = pd.DataFrame({"A":['a', np.nan]})
>>> df[df.A.str.startswith('a')]
Traceback
    ...
ValueError: cannot index with vector containing NA / NaN values.
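
A sketch of the same check that catches the exception instead of showing a traceback. It assumes an object-dtype column (forced explicitly here) and only relies on the exception type, since the exact message wording may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Object dtype is forced so the NaN is propagated into the mask.
df = pd.DataFrame({"A": pd.Series(["a", np.nan], dtype=object)})
mask = df["A"].str.startswith("a")

try:
    df[mask]  # the mask contains NaN, so it is not a valid boolean indexer
    raised = False
except ValueError:
    raised = True

# Filling the missing position up front gives a usable mask.
safe = df[df["A"].str.startswith("a", na=False)]
print(raised, safe["A"].tolist())
```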

ghost · 2013-12-15T01:50:28Z

That's a more reasonable place to choose the semantics of indexing with nan.

jtratner · 2013-12-15T02:29:38Z

Why is the kwarg called na? Do we do that elsewhere?

simonjayhawkins · 2019-12-12T14:14:54Z

@jorisvandenbossche could this be closed?

Is this discussion superseded by the new BooleanArray and its default behaviors, or is this orthogonal, and we could still consider changing the existing behavior of the str methods so that they always return a boolean array and not object when np.NaN is present?

mroeschke · 2020-05-03T19:58:03Z

I think this could also be superseded by the string extension dtype with NA support. Happy to reopen if we want to continue the original discussion
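
As a rough illustration of the string extension dtype mentioned here, the sketch below assumes a pandas version that provides `dtype="string"` and the nullable `boolean` dtype, and reuses the values from the original post.

```python
import pandas as pd

# Same values as the original example, stored with the string extension dtype.
s = pd.Series(["blah", None, "foobar"], dtype="string")

# The result uses the nullable boolean dtype; the missing entry stays <NA>.
mask = s.str.startswith("f")

# When used as an indexer, <NA> entries in a nullable boolean mask
# are treated as False.
selected = s[mask]
print(mask)
print(selected)
```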

jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 1, 2015

sinhrks mentioned this issue Aug 2, 2016

ERR/API: non-string types #13877

Closed

datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018

jbrockmendel removed the Indexing label Apr 14, 2020

mroeschke closed this as completed May 3, 2020
