ENH: add regex functionality to DataFrame.replace #3584

Merged 1 commit on May 17, 2013
2 changes: 1 addition & 1 deletion doc/source/api.rst
@@ -465,6 +465,7 @@ Missing data handling

DataFrame.dropna
DataFrame.fillna
DataFrame.replace

Reshaping, sorting, transposing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -492,7 +493,6 @@ Combining / joining / merging
DataFrame.append
DataFrame.join
DataFrame.merge
DataFrame.replace
DataFrame.update

Time series-related
127 changes: 127 additions & 0 deletions doc/source/missing_data.rst
@@ -334,6 +334,133 @@ missing and interpolate over them:

ser.replace([1, 2, 3], method='pad')

String/Regular Expression Replacement
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

   Python strings prefixed with the ``r`` character, such as ``r'hello world'``,
   are so-called "raw" strings. They treat backslashes differently from
   strings without this prefix: a backslash in a raw string is kept as a
   literal backslash rather than starting an escape sequence, e.g.,
   ``r'\n' == '\\n'``. You should `read about them
   <http://docs.python.org/2/reference/lexical_analysis.html#string-literals>`_
   if this is unclear.
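
As a minimal sketch of the raw-string note above, the following compares raw and regular string literals. The variable names are illustrative only and do not come from the pandas docs.

```python
# Illustrative names only; nothing here is pandas-specific.
raw_pattern = r'\s*\.\s*'            # backslashes are kept literally
escaped_pattern = '\\s*\\.\\s*'      # same text, with each backslash doubled
same_text = raw_pattern == escaped_pattern

newline_len = len('\n')        # 1: a single newline character
raw_newline_len = len(r'\n')   # 2: a backslash followed by the letter n
```

Writing patterns as raw strings avoids doubling every backslash, which is why the examples in this section use the ``r`` prefix.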

Replace the '.' with ``nan`` (str -> str)

.. ipython:: python

   from numpy.random import rand, randn
   from numpy import nan
   from pandas import DataFrame
   d = {'a': range(4), 'b': list('ab..'), 'c': ['a', 'b', nan, 'd']}
   df = DataFrame(d)
   df.replace('.', nan)

Now do it with a regular expression that also matches any whitespace
surrounding the ``'.'`` (regex -> regex)

.. ipython:: python

   df.replace(r'\s*\.\s*', nan, regex=True)
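
The difference between the plain-string and regex forms only shows up when cells carry extra whitespace. The sketch below uses an invented frame (``padded`` is a hypothetical name, and ``dtype=object`` is set explicitly as an assumption to match the object-dtype case this feature targets) to contrast the two calls.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: dots padded with whitespace.
padded = pd.DataFrame({'b': ['a', 'b', ' . ', '.  ']}, dtype=object)

# Without regex=True the string must equal the whole cell, so nothing matches.
exact = padded.replace('.', np.nan)

# With regex=True the pattern also absorbs the surrounding whitespace.
by_regex = padded.replace(r'\s*\.\s*', np.nan, regex=True)
```

Here ``exact`` is left unchanged, while ``by_regex`` has missing values in the last two rows.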

Replace a few different values (list -> list)

.. ipython:: python

   df.replace(['a', '.'], ['b', nan])

Replace using a list of regular expressions (list of regex -> list of regex)

.. ipython:: python

   df.replace([r'\.', r'(a)'], ['dot', r'\1stuff'], regex=True)
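
A group reference such as ``\1`` in the replacement needs the ``r`` prefix; otherwise Python reads ``'\1'`` as a single control character. A minimal sketch with invented data (``frame`` is a hypothetical name, and ``dtype=object`` is an assumption made to pin down the dtype):

```python
import pandas as pd

# Hypothetical sample data mirroring column 'b' of the example above.
frame = pd.DataFrame({'b': ['a', 'b', '.', '.']}, dtype=object)

# Each pattern is paired with the replacement at the same list position.
out = frame.replace([r'\.', r'(a)'], ['dot', r'\1stuff'], regex=True)

# Without the r prefix, '\1' is an octal escape for chr(1), not a group reference.
is_control_char = '\1' == chr(1)
```

With the raw replacement, the cell ``'a'`` becomes ``'astuff'`` and each ``'.'`` becomes ``'dot'``.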

Only search in column ``'b'`` (dict -> dict)

.. ipython:: python

   df.replace({'b': '.'}, {'b': nan})

Same as the previous example, but use a regular expression for
searching instead (dict of regex -> dict)

.. ipython:: python

   df.replace({'b': r'\s*\.\s*'}, {'b': nan}, regex=True)

You can pass nested dictionaries of regular expressions that use ``regex=True``

.. ipython:: python

   df.replace({'b': {'b': r''}}, regex=True)

or you can pass the nested dictionary through the ``regex`` keyword, where
each inner dictionary maps a pattern to its replacement

.. ipython:: python

   df.replace(regex={'b': {r'\s*\.\s*': nan}})
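
The two nested-dictionary spellings can be checked against each other. This is a sketch on invented data (``frame`` is a hypothetical name, with ``dtype=object`` set as an assumption), where the outer key selects the column and the inner dictionary maps a pattern to its replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: both columns contain a dot, only 'b' is targeted.
frame = pd.DataFrame({'a': ['.', 'x'], 'b': [' . ', 'y']}, dtype=object)

# Nested dict passed positionally, with regex=True.
via_flag = frame.replace({'b': {r'\s*\.\s*': np.nan}}, regex=True)

# The same nested dict passed through the regex keyword.
via_keyword = frame.replace(regex={'b': {r'\s*\.\s*': np.nan}})
```

Both calls should leave column ``'a'`` untouched and replace only the matching cell in column ``'b'``.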

You can also use the group of a regular expression match when replacing (dict
of regex -> dict of regex); this works for lists as well

.. ipython:: python

   df.replace({'b': r'\s*(\.)\s*'}, {'b': r'\1ty'}, regex=True)

You can pass a list of regular expressions; any value that matches one of
them will be replaced with a scalar (list of regex -> regex)

.. ipython:: python

   df.replace([r'\s*\.\s*', r'a|b'], nan, regex=True)

Any of the regular expression examples above can also pass the pattern
through the ``regex`` argument instead of ``to_replace``. In this case the
``value`` argument must be passed explicitly by name, or ``regex`` must be a
nested dictionary. The previous example would then be

.. ipython:: python

   df.replace(regex=[r'\s*\.\s*', r'a|b'], value=nan)

This can be convenient if you do not want to pass ``regex=True`` every time you
want to use a regular expression.
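
To illustrate the equivalence described above, here is a sketch on invented data (``frame`` is a hypothetical name, with ``dtype=object`` set as an assumption) that passes the same list of patterns both ways.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; 'c' matches neither pattern.
frame = pd.DataFrame({'b': ['a', 'b', ' . ', 'c']}, dtype=object)

# Patterns given as to_replace, with regex=True.
positional = frame.replace([r'\s*\.\s*', r'a|b'], np.nan, regex=True)

# Patterns given through regex; value must then be named explicitly.
keyword = frame.replace(regex=[r'\s*\.\s*', r'a|b'], value=np.nan)
```

Both results should be identical, with only the ``'c'`` cell surviving.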

.. note::

   Anywhere in the above ``replace`` examples that you see a regular
   expression, a compiled regular expression is valid as well.
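
A short sketch of the note above, using ``re.compile`` from the standard library on invented data (``frame`` and ``pattern`` are hypothetical names, and ``dtype=object`` is an assumption):

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data.
frame = pd.DataFrame({'b': ['a', 'b', ' . ', '.']}, dtype=object)

# The same pattern, once as a string and once precompiled.
pattern = re.compile(r'\s*\.\s*')
from_string = frame.replace(r'\s*\.\s*', np.nan, regex=True)
from_compiled = frame.replace(pattern, np.nan, regex=True)
```

Precompiling can be handy when the same pattern is reused across several calls or needs flags such as ``re.IGNORECASE``.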

Numeric Replacement
^^^^^^^^^^^^^^^^^^^

Numeric replacement is similar to ``DataFrame.fillna``

.. ipython:: python

   from numpy.random import rand, randn
   from numpy import nan
   from pandas import DataFrame
   df = DataFrame(randn(10, 2))
   df[rand(df.shape[0]) > 0.5] = 1.5
   df.replace(1.5, nan)

Replacing more than one value via lists works as well

.. ipython:: python

   df00 = df.values[0, 0]
   df.replace([1.5, df00], [nan, 'a'])
   df[1].dtype
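
Because the example above depends on random data, a deterministic sketch may be easier to follow. The frame below is invented (``frame``, ``only_nan`` and ``mixed`` are hypothetical names); it shows that replacing a number with a string forces the affected column to object dtype, while replacing with ``nan`` alone keeps it floating point.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data in place of the random frame used above.
frame = pd.DataFrame({0: [1.5, 2.0, 3.0], 1: [1.5, 4.0, 5.0]})

# Scalar -> nan keeps both columns floating point.
only_nan = frame.replace(1.5, np.nan)

# List -> list: 1.5 becomes nan and 2.0 becomes the string 'a'.
mixed = frame.replace([1.5, 2.0], [np.nan, 'a'])
column_dtype = mixed[0].dtype   # object, since the column now holds a string
```

The dtype of a column that never receives the string may differ between pandas versions, so it is not asserted here.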

You can also operate on the DataFrame in place

.. ipython:: python

   df.replace(1.5, nan, inplace=True)
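
The following sketch contrasts the default copying behaviour with ``inplace=True`` on an invented frame (all names are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
frame = pd.DataFrame({'x': [1.5, 2.0]})

# Default: a new frame is returned and the original is left alone.
replaced_copy = frame.replace(1.5, np.nan)
original_untouched = frame['x'].tolist() == [1.5, 2.0]

# inplace=True: the original frame itself is modified.
frame.replace(1.5, np.nan, inplace=True)
```

After the last call, ``frame`` holds the same values as ``replaced_copy``.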

Missing data casting rules and indexing
---------------------------------------
4 changes: 4 additions & 0 deletions doc/source/v0.11.1.txt
@@ -55,6 +55,9 @@ Enhancements
- ``fillna`` methods now raise a ``TypeError`` if the ``value`` parameter is
  a list or tuple.
- Added module for reading and writing Stata files: pandas.io.stata (GH1512_)
- ``DataFrame.replace()`` now allows regular expressions on contained
  ``Series`` with object dtype. See the examples section in the regular docs
  and the generated documentation for the method for more details.

See the `full release notes
<https://github.com/pydata/pandas/blob/master/RELEASE.rst>`__ or issue tracker
@@ -70,3 +73,4 @@ on GitHub for a complete list.
.. _GH3590: https://github.com/pydata/pandas/issues/3590
.. _GH3435: https://github.com/pydata/pandas/issues/3435
.. _GH1512: https://github.com/pydata/pandas/issues/1512
.. _GH2285: https://github.com/pydata/pandas/issues/2285