Should pandas preserve NaNs in str methods? #5672

Closed
jseabold opened this issue Dec 10, 2013 · 15 comments
Labels
API Design Strings String extension data type and string data

Comments

@jseabold
Contributor

Since the output of the string methods is primarily boolean, do we want to preserve NaNs? Use case: I load in some data, and empty strings (i.e., missing data in a column that's string data) get automatically converted to NaNs. I have to call fillna everywhere first, or I run into situations like this.

dta = pd.DataFrame([["blah"],[np.nan],["foobar"]], columns=['vari'])
dta.vari.str.startswith('f')
dta.ix[dta.vari.str.startswith('f')]

Since presumably the primary use case of a boolean series is indexing, do we want to just return False in place of NaNs? Just a thought.

@jtratner
Contributor

I'd be on board with that. Definitely gets annoying to constantly be
calling fillna() before/after bool-producing string methods.

@jreback
Contributor

jreback commented Dec 10, 2013

You can do this. Though I agree with you both that this should be the default for the 'boolean' type of ops on strings.

In [6]: dta.ix[dta.vari.str.startswith('f',na=False)]
Out[6]: 
     vari
2  foobar

[1 rows x 1 columns]
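
For readers on a current pandas, here is a minimal sketch of the same filter. It assumes a recent release in which the `.ix` indexer used above is no longer available (it was removed in pandas 1.0), so `.loc` is used instead. The column is forced to `dtype=object` so the result does not depend on which default string dtype the installed pandas infers.

```python
import numpy as np
import pandas as pd

# Same data as in the original report, kept as object dtype on purpose.
dta = pd.DataFrame({"vari": pd.Series(["blah", np.nan, "foobar"], dtype=object)})

# na=False maps the missing entry to False, so the mask contains only booleans.
mask = dta["vari"].str.startswith("f", na=False)

# .loc takes the place of the .ix indexer from the 2013 snippet.
selected = dta.loc[mask]
print(selected)
```

Only the `foobar` row is selected; the NaN row is treated as "does not start with f".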

@dragoljub

Yes it should. 👍

@ghost

ghost commented Dec 14, 2013

Isn't there a legitimate distinction between missing values and empty strings to be preserved?

In [43]: pd.read_csv(StringIO("'123', '',, '123'"),dialect=csv.get_dialect("excel"),header=None)
Out[43]: 
       0    1   2       3
0  '123'   '' NaN   '123'

@jreback
Contributor

jreback commented Dec 14, 2013

@y-p this only applies to Series.str methods (and only the ones that return bools)

@ghost

ghost commented Dec 14, 2013

@jreback, isn't proper nan propagation the right thing everywhere, even the Series.str methods?

An empty string is not a NaN, and it looks like, at least in the example given, if you don't mix up the two, the problem goes away.

There may be other cases, if so, let's look at those.

@jreback
Contributor

jreback commented Dec 14, 2013

I think the issue is that a method that is in effect a boolean method can return NaNs, so you always need to fill.

Those methods should be guaranteed to return booleans,

kind of like any/all always return a boolean result.

@ghost

ghost commented Dec 14, 2013

They are methods that return boolean series. I don't think it's true that NaNs have no intrinsic meaning in a boolean series. Of course they do.

In [45]: np.array([False, np.nan, True])
Out[45]: array([  0.,  nan,   1.])

...excepting the usual limitations of numpy arrays with respect to NaN for some dtypes. That's not a good reason though.
any and all seem to have that behaviour on the surface, but it's because of the way they test values, not because they explicitly ignore NaNs. Plus, I'm pretty sure that's not what you want here:

In [51]: np.array([np.nan,  np.nan]).any()
Out[51]: True
In [50]: bool(np.nan)
Out[50]: True

If you want nans to behave like empty strings, use empty strings instead. Problem solved.

@jreback
Contributor

jreback commented Dec 14, 2013

There is already an argument, na, on the string methods to fill the NaNs; no interpretation required.

I think that in Python space,

startswith always returns a boolean result

and it makes sense for the default in pandas to return a boolean array (right now the default for na is NaN; the proposal is to make it False)

I think there are 2 classes of string methods: boolean operators (startswith, contains) and transformers (split, extract).

The first should fillna according to the na arg; the second should not.

Booleans tend to be used in indexing, so you can argue whether the default should be True or False, but I think it's clear that it should be boolean.
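
A rough sketch of the two classes described here, on hypothetical data. It assumes a recent pandas in which `startswith` and `contains` accept `na=` while `split` does not. The Series is kept as object dtype so the behaviour matches what this thread discusses.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry.
s = pd.Series(["foo bar", np.nan, "baz"], dtype=object)

# Boolean-style methods: na= decides what a missing value maps to.
starts = s.str.startswith("f", na=False)  # missing -> False
has_z = s.str.contains("z", na=True)      # missing -> True

# Transformer-style method: there is no na= argument, the missing value
# simply propagates into the result.
parts = s.str.split(" ")

print(starts.tolist())
print(has_z.tolist())
print(parts.tolist())
```

Whether False or True is the better fill depends on how the mask is used afterwards, which is the part of the default that is still open to argument.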

@ghost

ghost commented Dec 14, 2013

It makes sense to me that some questions have undefined as their answer.
Isn't np.nan a signal for "I can't answer that"?

In [87]: s=pd.Series([1])
    ...: s.str.startswith("a")
    ...: 
Out[87]: 
0   NaN
dtype: float64
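
A hedged variant of this example for current pandas: as far as I know, recent releases refuse the `.str` accessor on an integer-dtype Series with an AttributeError, so the sketch below uses an object-dtype Series that mixes an integer with a string (hypothetical data). The non-string element still comes back as NaN, which is the "I can't answer that" signal.

```python
import pandas as pd

# Object-dtype Series mixing a non-string value with a string (made-up data).
s = pd.Series([1, "abc"], dtype=object)

# The integer cannot be asked whether it starts with "a", so the answer is NaN.
result = s.str.startswith("a")
print(result)
```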

@jtratner
Contributor

It's a bit annoying that you have to do fillna, but at least it errors out when you try to use it as a selector:

>>> df = pd.DataFrame({"A":['a', np.nan]})
>>> df[df.A.str.startswith('a')]
Traceback
    ...
ValueError: cannot index with vector containing NA / NaN values.
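
A small sketch of that failure mode and one way around it, assuming a recent pandas and an object-dtype column (forced explicitly here, since newer default string dtypes may fill the mask differently). The exact error message differs between versions, so only the exception type is relied on.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": pd.Series(["a", np.nan], dtype=object)})

# Without na=, the mask is object dtype and still carries the NaN.
mask = df["A"].str.startswith("a")

try:
    df[mask]
    raised = False
except ValueError as exc:
    # Indexing refuses a mask that contains missing values.
    raised = True
    print(exc)

# Passing na=False yields a clean boolean mask that can be used directly.
filtered = df[df["A"].str.startswith("a", na=False)]
print(filtered)
```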

@ghost

ghost commented Dec 15, 2013

That's a more reasonable place to choose the semantics of indexing with nan.

@jtratner
Contributor

Why is the kwarg called na? Do we do that elsewhere?

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 1, 2015
@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@simonjayhawkins
Member

@jorisvandenbossche could this be closed?

Is this discussion superseded by the new BooleanArray and its default behaviors, or is this orthogonal and we could still consider changing the existing behavior of the str methods so that they always return a boolean array and not object when np.NaN is present?

@jbrockmendel jbrockmendel removed the Indexing Related to indexing on series/frames, not to indexes themselves label Apr 14, 2020
@mroeschke
Member

I think this could also be superseded by the string extension dtype with NA support. Happy to reopen if we want to continue the original discussion
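
For illustration, a minimal sketch of what the string extension dtype looks like in practice, assuming pandas 1.0.2 or later. The data is the example from the top of the thread. To my understanding, the result is a nullable boolean Series in which the missing entry stays `<NA>`, and boolean indexing treats that `<NA>` as False rather than raising.

```python
import pandas as pd

# Same values as the original report, stored with the "string" extension dtype.
s = pd.Series(["blah", None, "foobar"], dtype="string")

# The result uses the nullable "boolean" dtype; the missing entry stays <NA>.
mask = s.str.startswith("f")
print(mask)

# Indexing with a nullable boolean mask treats <NA> as False.
selected = s[mask]
print(selected)
```

So the missing value is preserved in the mask itself, while the selection step behaves as if `na=False` had been passed.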

8 participants