0.22.0 - Splitting blank string causes exception #20002

Closed
montanalow opened this issue Mar 5, 2018 · 3 comments · Fixed by #20067
Labels
Regression Functionality that used to work in a prior pandas version Strings String extension data type and string data
Comments

@montanalow
Contributor

montanalow commented Mar 5, 2018

Code Sample, a copy-pastable example if possible

import pandas
df = pandas.DataFrame({'test': ['']})
df.test.str.split(expand=True)

Problem description

Splitting a blank (empty or whitespace) string causes an exception when expand=True.

IndexError                                Traceback (most recent call last)
<ipython-input-22-a7876ad70d70> in <module>()
      1 import pandas
      2 df = pandas.DataFrame({'test': ['']})
----> 3 df.test.str.split(expand=True)
      4

~/.pyenv/versions/3.6.3/envs/picking/lib/python3.6/site-packages/pandas/core/strings.py in split(self, pat, n, expand)
   1479     def split(self, pat=None, n=-1, expand=False):
   1480         result = str_split(self._data, pat, n=n)
-> 1481         return self._wrap_result(result, expand=expand)
   1482
   1483     @copy(str_rsplit)

~/.pyenv/versions/3.6.3/envs/picking/lib/python3.6/site-packages/pandas/core/strings.py in _wrap_result(self, result, use_codes, name, expand)
   1427                 # propogate nan values to match longest sequence (GH 18450)
   1428                 max_len = max(len(x) for x in result)
-> 1429                 result = [x * max_len if x[0] is np.nan else x for x in result]
   1430
   1431         if not isinstance(expand, bool):

~/.pyenv/versions/3.6.3/envs/picking/lib/python3.6/site-packages/pandas/core/strings.py in <listcomp>(.0)
   1427                 # propogate nan values to match longest sequence (GH 18450)
   1428                 max_len = max(len(x) for x in result)
-> 1429                 result = [x * max_len if x[0] is np.nan else x for x in result]
   1430
   1431         if not isinstance(expand, bool):

IndexError: list index out of range
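
The traceback points at `x[0]` being evaluated on an empty list, which is what splitting a blank string produces. A minimal pure-Python sketch of that failure and of one possible guard follows. The names `NAN`, `unguarded`, and `guarded` are hypothetical, `NAN` stands in for `np.nan`, and the guard is only an illustration, not necessarily the change made in #20067.

```python
NAN = float("nan")  # stand-in for np.nan; compared by identity, as in the traceback


def unguarded(rows):
    # Mirrors the list comprehension shown in the traceback.
    max_len = max(len(x) for x in rows)
    return [x * max_len if x[0] is NAN else x for x in rows]


def guarded(rows):
    # Same logic, but x[0] is only read when the row is non-empty.
    max_len = max(len(x) for x in rows)
    return [x * max_len if len(x) > 0 and x[0] is NAN else x for x in rows]


rows = [[]]  # what ''.split() yields for the single blank row

try:
    unguarded(rows)
except IndexError as exc:
    error_message = str(exc)

guarded_result = guarded(rows)
```

With the guard in place the empty row passes through unchanged, while a row holding only the NaN sentinel is still repeated to match the longest row.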

Expected Output

Either nans or empty strings as placeholder values in the output dataframe.

Out[24]:
   0
0  ''

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.14.0
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.14
pymysql: None
psycopg2: 2.7.3.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@DylanDmitri
Contributor

I would argue the current behavior is correct: it is consistent with the way str.split() works in core Python.

>>> ''.split()
[]
>>> 'a'.split()
['a']
>>> 'a b'.split()
['a', 'b']
>>> pd.Series(['', 'a', 'a b']).str.split(expand=True)
      0     1
0  None  None
1     a  None
2     a     b

Thus the current behavior makes a lot of sense:

>>> pandas.DataFrame({'test': ['']}).test.str.split(expand=True)
Empty DataFrame
Columns: []
Index: [0]
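
For reference, here is a short sketch of how plain Python `str.split` treats blank input, which is why both empty and whitespace-only strings end up as empty lists when no separator is passed. The variable names are illustrative only.

```python
# With no separator, runs of whitespace are separators and empty pieces are dropped.
no_sep_empty = "".split()
no_sep_spaces = "   ".split()

# With an explicit separator, empty pieces are kept.
sep_empty = "".split(" ")
sep_spaces = "  ".split(" ")
```

So the empty-list case only arises for the default `pat=None` call used in the report. An explicit separator always yields at least one element.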

@montanalow
Contributor Author

montanalow commented Mar 6, 2018

This is an undocumented behavior change between 0.20 and 0.22. Version 0.20 returns null-filled arrays, which is consistent with the standard library, as you demonstrate. Version 0.22 raises an exception, which is consistent with neither the standard library nor previous versions.
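
To make "null-filled" concrete, here is a rough pure-Python sketch of the padding that an expanded split implies. `expand_rows` is a hypothetical helper, not a pandas function, and it only approximates the tabular output quoted earlier in the thread.

```python
def expand_rows(values):
    # Split each string on whitespace, then pad every row with None
    # so that all rows have the width of the longest one.
    parts = [v.split() for v in values]
    width = max((len(p) for p in parts), default=0)
    return [p + [None] * (width - len(p)) for p in parts]


mixed = expand_rows(["", "a", "a b"])
only_blank = expand_rows([""])
```

Under this sketch a lone blank string gives one row with zero columns rather than an error, which corresponds to the "Empty DataFrame" shape with a single index entry shown above.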

@montanalow montanalow changed the title Splitting blank string causes exception 0.22.0 - Splitting blank string causes exception Mar 6, 2018
@gfyoung gfyoung added Regression Functionality that used to work in a prior pandas version Strings String extension data type and string data labels Mar 8, 2018
@gfyoung
Member

gfyoung commented Mar 8, 2018

@montanalow : Thanks for pointing that out. Investigation and patch are welcome!
