ENH: add Series.str iterator #3645

Merged
merged 1 commit into from
May 19, 2013
3 changes: 2 additions & 1 deletion RELEASE.rst
@@ -43,6 +43,7 @@ pandas 0.11.1
multi-index column.
Note: The default value will change in 0.12 to make the default *to* write and
read multi-index columns in the new format. (GH3571_, GH1651_, GH3141_)
- Add iterator to ``Series.str`` (GH3638_)

**Improvements to existing features**

@@ -199,7 +200,7 @@ pandas 0.11.1
.. _GH3571: https://github.com/pydata/pandas/issues/3571
.. _GH1651: https://github.com/pydata/pandas/issues/1651
.. _GH3141: https://github.com/pydata/pandas/issues/3141

.. _GH3638: https://github.com/pydata/pandas/issues/3638

pandas 0.11.0
=============
22 changes: 22 additions & 0 deletions doc/source/v0.11.1.txt
@@ -80,6 +80,27 @@ Enhancements
- ``DataFrame.replace()`` now allows regular expressions on contained
``Series`` with object dtype. See the examples section in the regular docs
:ref:`Replacing via String Expression <missing_data.replace_expression>`
- ``Series.str`` now supports iteration (GH3638_). You can iterate over the
individual elements of each string in the ``Series``. Each iteration yields
a ``Series`` with either a single character at each index of the
original ``Series`` or ``NaN``. For example,

.. ipython:: python

    strs = 'go', 'bow', 'joe', 'slow'
    ds = Series(strs)

    for s in ds.str:
        print s

    s
    s.dropna().values.item() == 'w'

The last element yielded by the iterator will be a ``Series`` containing
the last element of the longest string in the ``Series`` with all other
elements being ``NaN``. Here since ``'slow'`` is the longest string
and there are no other strings with the same length ``'w'`` is the only
non-null string in the yielded ``Series``.

- Multi-index column support for reading and writing csvs

@@ -133,3 +154,4 @@ on GitHub for a complete list.
.. _GH3571: https://github.com/pydata/pandas/issues/3571
.. _GH1651: https://github.com/pydata/pandas/issues/1651
.. _GH3141: https://github.com/pydata/pandas/issues/3141
.. _GH3638: https://github.com/pydata/pandas/issues/3638
8 changes: 8 additions & 0 deletions pandas/core/strings.py
@@ -661,6 +661,14 @@ def __getitem__(self, key):
        else:
            return self.get(key)

    def __iter__(self):
        i = 0
        g = self.get(i)
        while g.notnull().any():
            yield g
            i += 1
            g = self.get(i)

    def _wrap_result(self, result):
        return Series(result, index=self.series.index,
                      name=self.series.name)
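
For readers following the loop in ``__iter__``, here is a minimal pure-Python sketch of the same stopping rule. It is an illustration only, not the pandas code: ``iter_positions`` is a hypothetical helper, plain lists stand in for ``Series``, and ``None`` stands in for ``NaN``.

```python
def iter_positions(strings):
    """Yield one list per character position.

    Each yielded list holds the i-th character of every string, or None
    where the string is too short. Iteration stops before the first
    position at which every entry would be None.
    """
    i = 0
    while True:
        column = [s[i] if i < len(s) else None for s in strings]
        # Mirrors ``while g.notnull().any()``: stop once nothing is left.
        if all(c is None for c in column):
            return
        yield column
        i += 1


# Same strings as the documentation example in this PR.
columns = list(iter_positions(['go', 'bow', 'joe', 'slow']))
last = columns[-1]
```

With these inputs the final list contains only ``'w'``, because ``'slow'`` is the only string with a fourth character. An empty input yields nothing, since the very first position is already all-missing.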
78 changes: 78 additions & 0 deletions pandas/tests/test_strings.py
@@ -10,6 +10,8 @@

from numpy import nan as NA
import numpy as np
from numpy.testing import assert_array_equal
from numpy.random import randint

from pandas import (Index, Series, TimeSeries, DataFrame, isnull, notnull,
                    bdate_range, date_range)
@@ -25,6 +27,82 @@ class TestStringMethods(unittest.TestCase):

    _multiprocess_can_split_ = True

    def test_iter(self):
        # GH3638
        strs = 'google', 'wikimedia', 'wikipedia', 'wikitravel'
        ds = Series(strs)

        for s in ds.str:
            # iter must yield a Series
            self.assert_(isinstance(s, Series))

            # indices of each yielded Series should be equal to the index of
            # the original Series
            assert_array_equal(s.index, ds.index)

            for el in s:
                # each element of the series is either a basestring or nan
                self.assert_(isinstance(el, basestring) or isnull(el))

        # desired behavior is to iterate until everything would be nan on the
        # next iter so make sure the last element of the iterator was 'l' in
        # this case since 'wikitravel' is the longest string
        self.assertEqual(s.dropna().values.item(), 'l')

    def test_iter_empty(self):
        ds = Series([], dtype=object)

        i, s = 100, 1

        for i, s in enumerate(ds.str):
            pass

        # nothing to iterate over, so the predefined values should remain
        # unchanged
        self.assertEqual(i, 100)
        self.assertEqual(s, 1)

    def test_iter_single_element(self):
        ds = Series(['a'])

        for i, s in enumerate(ds.str):
            pass

        self.assertFalse(i)
        assert_series_equal(ds, s)

    def test_iter_numeric_try_string(self):
        # behavior identical to empty series
        dsi = Series(range(4))

        i, s = 100, 'h'

        for i, s in enumerate(dsi.str):
            pass

        self.assertEqual(i, 100)
        self.assertEqual(s, 'h')

        dsf = Series(np.arange(4.))

        for i, s in enumerate(dsf.str):
            pass

        self.assertEqual(i, 100)
        self.assertEqual(s, 'h')

    def test_iter_object_try_string(self):
        ds = Series([slice(None, randint(10), randint(10, 20))
                     for _ in xrange(4)])

        i, s = 100, 'h'

        for i, s in enumerate(ds.str):
            pass

        self.assertEqual(i, 100)
        self.assertEqual(s, 'h')

    def test_cat(self):
        one = ['a', 'a', 'b', 'b', 'c', NA]
        two = ['a', NA, 'b', 'd', 'foo', NA]
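
The empty, numeric, and object tests above rely on a plain Python rule: a ``for`` loop over an iterator that yields nothing never rebinds its loop variables. A minimal sketch of that rule follows. It uses an empty tuple as a stand-in for an exhausted ``ds.str`` iterator, which is an assumption made for illustration; no pandas is involved.

```python
# Sentinel values assigned before the loop, as in the tests above.
i, s = 100, 'h'

# The empty iterator stands in for iterating over an empty Series.str.
for i, s in enumerate(iter(())):
    pass

# The loop body never ran, so the sentinels are untouched.
unchanged = (i, s)

# With one item the loop runs once and both names are rebound.
j, t = 100, 'h'
for j, t in enumerate(['a']):
    pass

rebound = (j, t)
```

This is why the tests assign sentinel values before the loop: they detect an iterator that yields nothing without needing an explicit counter.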