str.split on np.nan gives np.nan in one column but None in another column #18450

JeroenDelcour · 2017-11-23T15:09:25Z

import pandas as pd
import numpy as np

s = pd.Series(['19HT|C2', np.nan, '20ZT|C1'])
print(s)

0    19HT|C2
1        NaN
2    20ZT|C1
dtype: object

s_split = s.str.split('|', expand=True)
print(s_split)

      0     1
0  19HT    C2
1   NaN  None
2  20ZT    C1

print(s_split.dtypes)

0    object
1    object
dtype: object

print(type(s_split.loc[1,0]))

<class 'float'>

print(type(s_split.loc[1,1]))

<class 'NoneType'>

Problem description

When np.nan gets split with expand=True, it becomes np.nan (of type float) in the first column but None (of type NoneType) in the second column. I'd consider this unexpected behavior. Why does splitting a single value of one type produce two values of different types?

Expected Output

      0     1
0  19HT    C2
1   NaN   NaN
2  20ZT    C1

Either np.nan or None in both columns, but not a mix of both. I'd say np.nan makes most sense, since that's the original value of the row.
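
Until both columns use the same sentinel, the mixed result can be normalized after the fact. A minimal sketch, assuming the expanded frame has object dtype; `normalize_missing` is a hypothetical helper name, not part of pandas, and the frame is built by hand so the example does not depend on which pandas version is installed:

```python
import numpy as np
import pandas as pd


def normalize_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every missing cell (None or NaN) is np.nan."""
    # Keep cells that are present; replace everything else with np.nan.
    return frame.where(frame.notna(), np.nan)


# Hand-built stand-in for the mixed output reported above (assumption:
# object dtype, NaN in column 0 and None in column 1 for the missing row).
mixed = pd.DataFrame(
    {0: ["19HT", np.nan, "20ZT"], 1: ["C2", None, "C1"]},
    dtype=object,
)

clean = normalize_missing(mixed)
print(clean)
print([type(v) for v in clean.loc[1]])
```

In practice the same call would be applied to the result of `s.str.split('|', expand=True)`.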

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.0.5
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
pyarrow: 0.7.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: 0.4.0
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.9999999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

WillAyd · 2017-11-24T01:17:26Z

Just to be clear, the issue here is how expand=True handles NaN values. You'd still get a mix of str and None instances from expansion if the split character isn't present in one of the records. For example,

s = pd.Series(['19HT|C2', np.nan, '20ZT|C1', 'foo'])
s.str.split("|", expand=True)

yields

      0     1
0  19HT    C2
1   NaN  None
2  20ZT    C1
3   foo  None

The str_split docstring is a little ambiguous because it says it should propagate NaN values, but I think that refers to the value returned irrespective of the expansion mechanism. I'll leave it to others here to comment on whether this is a bug in how expansion works or whether the docstring should be modified.

pandas/pandas/core/strings.py, line 1005 in 97fea48:

pattern, propagating NA values. Equivalent to :meth:`str.split`.
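
For comparison, the NaN propagation that the docstring describes is easy to see without expand=True. A small sketch using the Series from this comment (the variable name `parts` is only for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(['19HT|C2', np.nan, '20ZT|C1', 'foo'])

# Without expand=True each string becomes a list of pieces,
# while the missing entry is passed through as a missing value.
parts = s.str.split('|')

print(parts[0])           # the two pieces of the first record
print(pd.isna(parts[1]))  # whether the NaN row is still missing
print(parts[3])           # record without the separator
```

The question in this issue only arises once those per-row results are spread across columns by expand=True.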

jreback · 2017-11-24T01:27:13Z

this is a bug; we want to use np.nan as the missing value indicator

louis-red · 2018-06-27T13:22:27Z

FYI, this behavior (pd.Series.str.split with expand=True expands with Nones) is still present in my version of pandas: 0.23.1

markemus · 2018-12-19T15:27:07Z

+1 for me, 0.23.4
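
Since reports differ by version, one way to check what a given installation does is to list the Python types of the missing cells in the expanded frame. A minimal sketch; `missing_kinds` is a hypothetical helper, not a pandas function:

```python
import numpy as np
import pandas as pd


def missing_kinds(frame: pd.DataFrame) -> list:
    """Return the sorted type names of all missing cells in the frame."""
    cells = frame.to_numpy(dtype=object).ravel()
    return sorted({type(v).__name__ for v in cells if pd.isna(v)})


s = pd.Series(['19HT|C2', np.nan, '20ZT|C1', 'foo'])
kinds = missing_kinds(s.str.split('|', expand=True))

# The result depends on the installed pandas version.
print(pd.__version__, kinds)
```

A result containing both 'NoneType' and 'float' means the installation still mixes the two sentinels; a single entry means the missing cells are consistent.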

WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 24, 2017

Propogating NaN values when using str.split (pandas-dev#18450)

d881f94

WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 24, 2017

Propogating NaN values when using str.split (pandas-dev#18450)

d64995a

This was referenced Nov 24, 2017

Propogating NaN values when using str.split (#18450) #18462

Merged

assert_frame_equal not differentiating NaN and None within object dtype #18463

Open

jreback added the Bug, Missing-data, and Strings labels Nov 25, 2017

jreback added this to the 0.21.1 milestone Nov 25, 2017

WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 25, 2017

Propogating NaN values when using str.split (pandas-dev#18450)

a08e940

jreback closed this as completed in #18462 Nov 25, 2017

jreback pushed a commit that referenced this issue Nov 25, 2017

Propogating NaN values when using str.split (#18450) (#18462)

20f6512

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue Dec 8, 2017

Propogating NaN values when using str.split (pandas-dev#18450) (panda…

7a72e3e

…s-dev#18462) (cherry picked from commit 20f6512)

TomAugspurger pushed a commit that referenced this issue Dec 11, 2017

Propogating NaN values when using str.split (#18450) (#18462)

b91af62

(cherry picked from commit 20f6512)
