DOC: update the pandas.DataFrame.replace docstring #20271

math-and-data
Contributor

  • PR title is "DOC: update the docstring"
  • The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
  • The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • The html version looks good: python doc/make.py --single <your-function-or-method>
  • It has been proofread on language by another sprint participant

Note: Just did a minor improvement, not a full change!

Still a few verification errors:

  • Errors in parameters section
    • Parameter "to_replace" description should start with capital letter
    • Parameter "axis" description should finish with "."
  • Examples do not pass tests
################################################################################
##################### Docstring (pandas.DataFrame.replace) #####################
################################################################################

Replace values given in 'to_replace' with 'value'.

Values of the DataFrame or a Series are being replaced with
other values. One or several values can be replaced with one
or several values.

Parameters
----------
to_replace : str, regex, list, dict, Series, numeric, or None

    * numeric, str or regex:

        - numeric: numeric values equal to ``to_replace`` will be
          replaced with ``value``
        - str: string exactly matching ``to_replace`` will be replaced
          with ``value``
        - regex: regexes matching ``to_replace`` will be replaced with
          ``value``

    * list of str, regex, or numeric:

        - First, if ``to_replace`` and ``value`` are both lists, they
          **must** be the same length.
        - Second, if ``regex=True`` then all of the strings in **both**
          lists will be interpreted as regexes; otherwise they will match
          directly. This doesn't matter much for ``value`` since there
          are only a few possible substitution regexes you can use.
        - str, regex and numeric rules apply as above.

    * dict:

        - Dicts can be used to specify different replacement values
          for different existing values. For example,
          {'a': 'b', 'y': 'z'} replaces the value 'a' with 'b' and
          'y' with 'z'. To use a dict in this way the ``value``
          parameter should be ``None``.
        - For a DataFrame a dict can specify that different values
          should be replaced in different columns. For example,
          {'a': 1, 'b': 'z'} looks for the value 1 in column 'a' and
          the value 'z' in column 'b' and replaces these values with
          whatever is specified in ``value``. The ``value`` parameter
          should not be ``None`` in this case. You can treat this as a
          special case of passing two lists except that you are
          specifying the column to search in.
        - For a DataFrame nested dictionaries, e.g.,
          {'a': {'b': np.nan}}, are read as follows: look in column 'a'
          for the value 'b' and replace it with NaN. The ``value``
          parameter should be ``None`` to use a nested dict in this
          way. You can nest regular expressions as well. Note that
          column names (the top-level dictionary keys in a nested
          dictionary) **cannot** be regular expressions.

    * None:

        - This means that the ``regex`` argument must be a string,
          compiled regular expression, or list, dict, ndarray or Series
          of such elements. If ``value`` is also ``None`` then this
          **must** be a nested dictionary or ``Series``.

    See the examples section for examples of each of these.
value : scalar, dict, list, str, regex, default None
    Value to replace any values matching ``to_replace`` with.
    For a DataFrame a dict of values can be used to specify which
    value to use for each column (columns not in the dict will not be
    filled). Regular expressions, strings and lists or dicts of such
    objects are also allowed.
inplace : boolean, default False
    If True, in place. Note: this will modify any
    other views on this object (e.g. a column from a DataFrame).
    Returns the caller if this is True.
limit : int, default None
    Maximum size gap to forward or backward fill.
regex : bool or same types as ``to_replace``, default False
    Whether to interpret ``to_replace`` and/or ``value`` as regular
    expressions. If this is ``True`` then ``to_replace`` *must* be a
    string. Alternatively, this could be a regular expression or a
    list, dict, or array of regular expressions in which case
    ``to_replace`` must be ``None``.
method : string, optional, {'pad', 'ffill', 'bfill'}, default is 'pad'
    The method to use for replacement, when ``to_replace`` is a
    scalar, list or tuple and ``value`` is None.
axis : None
    Deprecated.

    .. versionchanged:: 0.23.0
        Added to DataFrame

See Also
--------
DataFrame.fillna : Fill NA/NaN values
DataFrame.where : Replace values based on boolean condition

Returns
-------
DataFrame
    Some values have been substituted for new values.

Raises
------
AssertionError
    * If ``regex`` is not a ``bool`` and ``to_replace`` is not
      ``None``.
TypeError
    * If ``to_replace`` is a ``dict`` and ``value`` is not a ``list``,
      ``dict``, ``ndarray``, or ``Series``
    * If ``to_replace`` is ``None`` and ``regex`` is not compilable
      into a regular expression or is a list, dict, ndarray, or
      Series.
    * When replacing multiple ``bool`` or ``datetime64`` objects and
      the arguments to ``to_replace`` do not match the type of the
      value being replaced
ValueError
    * If a ``list`` or an ``ndarray`` is passed to ``to_replace`` and
      `value` but they are not the same length.

Notes
-----
* Regex substitution is performed under the hood with ``re.sub``. The
  rules for substitution for ``re.sub`` are the same.
* Regular expressions will only substitute on strings, meaning you
  cannot provide, for example, a regular expression matching floating
  point numbers and expect the columns in your frame that have a
  numeric dtype to be matched. However, if those floating point
  numbers *are* strings, then you can do this.
* This method has *a lot* of options. You are encouraged to experiment
  and play with this method to gain intuition about how it works.

Examples
--------

>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e
>>> s.replace([1, 2], method='bfill')
0    0
1    3
2    3
3    3
4    4
dtype: int64

>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace(regex={r'^ba.$':'new', 'foo':'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz

Note that when replacing multiple ``bool`` or ``datetime64`` objects,
the data types in the ``to_replace`` parameter must match the data
type of the value being replaced:

>>> df = pd.DataFrame({'A': [True, False, True],
...                    'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False})  # raises
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'

This raises a ``TypeError`` because one of the ``dict`` keys is not of
the correct type for replacement.

Compare the behavior of
``s.replace('a', None)`` and ``s.replace({'a': None})`` to understand
the peculiarities of the ``to_replace`` parameter.
``s.replace('a', None)`` is actually equivalent to
``s.replace(to_replace='a', value=None, method='pad')``,
because when ``value=None`` and ``to_replace`` is a scalar, list or
tuple, ``replace`` uses the method parameter to do the replacement.
So this is why the 'a' values are being replaced by 30 in rows 3 and 4
and 'b' in row 6 in this case. However, this behaviour does not occur
when you use a dict as the ``to_replace`` value. In this case, it is
like the value(s) in the dict are equal to the value parameter.

>>> s = pd.Series([10, 20, 30, 'a', 'a', 'b', 'a'])
>>> print(s)
0    10
1    20
2    30
3     a
4     a
5     b
6     a
dtype: object
>>> print(s.replace('a', None))
0    10
1    20
2    30
3    30
4    30
5     b
6     b
dtype: object
>>> print(s.replace({'a': None}))
0      10
1      20
2      30
3    None
4    None
5       b
6    None
dtype: object

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
        Errors in parameters section
                Parameter "to_replace" description should start with capital letter
                Parameter "axis" description should finish with "."
        Examples do not pass tests

################################################################################
################################### Doctests ###################################
################################################################################

**********************************************************************
Line 229, in pandas.DataFrame.replace
Failed example:
    df.replace({'a string': 'new value', True: False})  # raises
Exception raised:
    Traceback (most recent call last):
      File "C:\Users\thisi\AppData\Local\conda\conda\envs\pandas_dev\lib\doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.DataFrame.replace[17]>", line 1, in <module>
        df.replace({'a string': 'new value', True: False})  # raises
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\frame.py", line 3136, in replace
        method=method, axis=axis)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\generic.py", line 5208, in replace
        limit=limit, regex=regex)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\frame.py", line 3136, in replace
        method=method, axis=axis)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\generic.py", line 5257, in replace
        regex=regex)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3696, in replace_list
        masks = [comp(s) for i, s in enumerate(src_list)]
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3696, in <listcomp>
        masks = [comp(s) for i, s in enumerate(src_list)]
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3694, in comp
        return _maybe_compare(values, getattr(s, 'asm8', s), operator.eq)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 5122, in _maybe_compare
        b=type_names[1]))
    TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'
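
The failure above is consistent with how the standard-library ``doctest`` module matches exceptions: it only treats expected output as an exception when it starts with a traceback header line. The sketch below illustrates this with a hypothetical stand-in function, ``compare``, rather than pandas itself (whose behaviour for this call may differ between versions). The names ``BARE``, ``WITH_HEADER`` and ``run`` are also made up for the illustration.

```python
import doctest


def compare(a, b):
    """Hypothetical stand-in that raises a TypeError on mismatched types."""
    if type(a) is not type(b):
        raise TypeError("Cannot compare types {!r} and {!r}".format(
            type(a).__name__, type(b).__name__))
    return a == b


# Form used in the docstring above: only the final message line.
BARE = """
>>> compare(True, 'a string')  # raises
TypeError: Cannot compare types 'bool' and 'str'
"""

# Form doctest recognises: traceback header, elided stack, then the message.
WITH_HEADER = """
>>> compare(True, 'a string')
Traceback (most recent call last):
    ...
TypeError: Cannot compare types 'bool' and 'str'
"""


def run(text):
    """Run one doctest snippet and return its TestResults counts."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, {"compare": compare}, "example", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts matter here.
    return runner.run(test, out=lambda s: None)


bare_result = run(BARE)
header_result = run(WITH_HEADER)
print(bare_result.failed, header_result.failed)
```

The bare form is reported as a failed example, while the form with the traceback header passes, since doctest ignores the stack lines between the header and the final message.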


.. versionchanged:: 0.23.0
Added to DataFrame
.. versionchanged:: 0.23.0
Member

This isn't what you want to do - make sure you keep the versionchanged directive below the method argument as that's what was added in v0.23

regex : bool or same types as ``to_replace``, default False
Whether to interpret ``to_replace`` and/or ``value`` as regular
expressions. If this is ``True`` then ``to_replace`` *must* be a
string. Alternatively, this could be a regular expression or a
list, dict, or array of regular expressions in which case
``to_replace`` must be ``None``.
method : string, optional, {'pad', 'ffill', 'bfill'}
method : string, optional, {'pad', 'ffill', 'bfill'}, default is 'pad'
Member

method : {'pad', 'ffill', 'bfill', `None`}

The method to use for replacement, when ``to_replace`` is a
scalar, list or tuple and ``value`` is None.
axis : None
Deprecated.
Member

Warning says this will be removed in v0.13? Woof...I guess OK to document for this change but should have a follow up change to actually go ahead and remove - care to take a stab at that?

Contributor

@WillAyd where is this warning?

Member

Contributor Author

I'm happy to take a stab at this - always nice when I can remove code too

Member

@math-and-data awesome thanks! Can you open a separate issue for this?

Contributor Author

will do.

Contributor Author

@WillAyd I was waiting for this PR to be approved, then I would open a new request where I change the relevant code (remove the 'axis' reference) and edit the documentation accordingly. Is there anything else I had missed in this PR (other than the suggestion of breaking out the DataFrame and Series examples)?

@@ -4869,6 +4869,10 @@ def bfill(self, axis=None, inplace=False, limit=None, downcast=None):
_shared_docs['replace'] = ("""
Replace values given in 'to_replace' with 'value'.

Values of the DataFrame or a Series are being replaced with
Member

Not sure this extended description is adding much. Better served to make mention of how this can replace values with a dynamic set of inputs like dicts

Contributor Author

done, thank you for the suggestion

Values of the DataFrame or a Series are being replaced with
other values. One or several values can be replaced with one
or several values.

Parameters
----------
to_replace : str, regex, list, dict, Series, numeric, or None
Member

Say int, float instead of numeric (if float is even valid?)

the peculiarities of the ``to_replace`` parameter.
``s.replace('a', None)`` is actually equivalent to
``s.replace(to_replace='a', value=None, method='pad')``,
because when ``value=None`` and ``to_replace`` is a scalar, list or
Member

This is interesting as I was not aware of this behavior. Certainly great to have it documented, though I would move the majority of the writing into the Notes section and shorten the blurb introducing the comparison here.

``s.replace(to_replace='a', value=None, method='pad')``,
because when ``value=None`` and ``to_replace`` is a scalar, list or
tuple, ``replace`` uses the method parameter to do the replacement.
So this is why the 'a' values are being replaced by 30 in rows 3 and 4
Member

Maybe just reinforce that it's the fill behavior that is really replacing values here
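
To make the point about fill behaviour concrete, here is a minimal pure-Python sketch of the two semantics being compared. It is an illustration only, not pandas' implementation: ``replace_with_pad`` and ``replace_with_mapping`` are hypothetical helpers, and leaving a leading match unchanged is an assumption of this sketch.

```python
def replace_with_pad(values, to_replace):
    """Forward-fill every element equal to ``to_replace``.

    Illustrative only: a leading match with no earlier value is left
    unchanged, which is an assumption of this sketch.
    """
    result = []
    for item in values:
        if item == to_replace and result:
            # Reuse the previous (possibly already filled) value.
            result.append(result[-1])
        else:
            result.append(item)
    return result


def replace_with_mapping(values, mapping):
    """Substitute each element found in ``mapping`` with its mapped value."""
    return [mapping.get(item, item) for item in values]


data = [10, 20, 30, 'a', 'a', 'b', 'a']
padded = replace_with_pad(data, 'a')
mapped = replace_with_mapping(data, {'a': None})
print(padded)  # [10, 20, 30, 30, 30, 'b', 'b']
print(mapped)  # [10, 20, 30, None, None, 'b', None]
```

The first result mirrors the ``s.replace('a', None)`` output in the docstring, where the neighbouring value does the replacing. The second mirrors ``s.replace({'a': None})``, where the dict value itself is used.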

like the value(s) in the dict are equal to the value parameter.

>>> s = pd.Series([10, 20, 30, 'a', 'a', 'b', 'a'])
>>> print(s)
Member

This Series is simple enough where you don't need to explicitly print it - the constructor shows you everything of interest

Contributor Author

I personally have found the visual of inspecting the changes before/after easier for such replacements (both in vertical positions). You have more experience and I'll rely on your suggestion and make the change.

5 b
6 b
dtype: object
>>> print(s.replace({'a': None}))
Member

I would put this example first as it is (from my perspective) the behavior most would expect. Having it first makes it a better segue into the nuance that you want to describe with the other example

when you use a dict as the ``to_replace`` value. In this case, it is
like the value(s) in the dict are equal to the value parameter.

>>> s = pd.Series([10, 20, 30, 'a', 'a', 'b', 'a'])
Member

Just to keep things concise why don't you get rid of 10 and 20 in this example? They don't serve any real purpose but make the documentation longer. Can also replace 30 with 1

Contributor Author

great suggestion of simplifying.

5 b
6 a
dtype: object
>>> print(s.replace('a', None))
Contributor

you don't need the prints, use a blank line between cases. Having an expl for each case is also nice.

Contributor Author

done.

@jreback added the Missing-data and Reshaping labels on Mar 11, 2018
@math-and-data
Contributor Author

  • Docstring validation not passing
################################################################################
##################### Docstring (pandas.DataFrame.replace) #####################
################################################################################

Replace values given in 'to_replace' with 'value'.

Values of the DataFrame or a Series are being replaced with
other values in a dynamic way. Instead of replacing values in a
specific cell (row/column combination), this method allows for more
flexibility with replacements. For instance, values can be replaced
by specifying lists of values and replacements separately or
with a dynamic set of inputs like dicts.

Parameters
----------
to_replace : str, regex, list, dict, Series, int, float, or None
    * numeric, str or regex:

        - numeric: numeric values equal to ``to_replace`` will be
          replaced with ``value``
        - str: string exactly matching ``to_replace`` will be replaced
          with ``value``
        - regex: regexes matching ``to_replace`` will be replaced with
          ``value``

    * list of str, regex, or numeric:

        - First, if ``to_replace`` and ``value`` are both lists, they
          **must** be the same length.
        - Second, if ``regex=True`` then all of the strings in **both**
          lists will be interpreted as regexes; otherwise they will match
          directly. This doesn't matter much for ``value`` since there
          are only a few possible substitution regexes you can use.
        - str, regex and numeric rules apply as above.

    * dict:

        - Dicts can be used to specify different replacement values
          for different existing values. For example,
          {'a': 'b', 'y': 'z'} replaces the value 'a' with 'b' and
          'y' with 'z'. To use a dict in this way the ``value``
          parameter should be ``None``.
        - For a DataFrame a dict can specify that different values
          should be replaced in different columns. For example,
          {'a': 1, 'b': 'z'} looks for the value 1 in column 'a' and
          the value 'z' in column 'b' and replaces these values with
          whatever is specified in ``value``. The ``value`` parameter
          should not be ``None`` in this case. You can treat this as a
          special case of passing two lists except that you are
          specifying the column to search in.
        - For a DataFrame nested dictionaries, e.g.,
          {'a': {'b': np.nan}}, are read as follows: look in column
          'a' for the value 'b' and replace it with NaN. The ``value``
          parameter should be ``None`` to use a nested dict in this
          way. You can nest regular expressions as well. Note that
          column names (the top-level dictionary keys in a nested
          dictionary) **cannot** be regular expressions.

    * None:

        - This means that the ``regex`` argument must be a string,
          compiled regular expression, or list, dict, ndarray or
          Series of such elements. If ``value`` is also ``None`` then
          this **must** be a nested dictionary or ``Series``.

    See the examples section for examples of each of these.
value : scalar, dict, list, str, regex, default None
    Value to replace any values matching ``to_replace`` with.
    For a DataFrame a dict of values can be used to specify which
    value to use for each column (columns not in the dict will not be
    filled). Regular expressions, strings and lists or dicts of such
    objects are also allowed.
inplace : boolean, default False
    If True, in place. Note: this will modify any
    other views on this object (e.g. a column from a DataFrame).
    Returns the caller if this is True.
limit : int, default None
    Maximum size gap to forward or backward fill.
regex : bool or same types as ``to_replace``, default False
    Whether to interpret ``to_replace`` and/or ``value`` as regular
    expressions. If this is ``True`` then ``to_replace`` *must* be a
    string. Alternatively, this could be a regular expression or a
    list, dict, or array of regular expressions in which case
    ``to_replace`` must be ``None``.
method : {'pad', 'ffill', 'bfill', `None`}
    The method to use for replacement, when ``to_replace`` is a
    scalar, list or tuple and ``value`` is `None`.
    .. versionchanged:: 0.23.0
        Added to DataFrame.
axis : None
    Deprecated.

See Also
--------
DataFrame.fillna : Fill `NaN` values
DataFrame.where : Replace values based on boolean condition

Returns
-------
DataFrame
    Object after replacement.

Raises
------
AssertionError
    * If ``regex`` is not a ``bool`` and ``to_replace`` is not
      ``None``.
TypeError
    * If ``to_replace`` is a ``dict`` and ``value`` is not a ``list``,
      ``dict``, ``ndarray``, or ``Series``
    * If ``to_replace`` is ``None`` and ``regex`` is not compilable
      into a regular expression or is a list, dict, ndarray, or
      Series.
    * When replacing multiple ``bool`` or ``datetime64`` objects and
      the arguments to ``to_replace`` do not match the type of the
      value being replaced
ValueError
    * If a ``list`` or an ``ndarray`` is passed to ``to_replace`` and
      `value` but they are not the same length.

Notes
-----
* Regex substitution is performed under the hood with ``re.sub``. The
  rules for substitution for ``re.sub`` are the same.
* Regular expressions will only substitute on strings, meaning you
  cannot provide, for example, a regular expression matching floating
  point numbers and expect the columns in your frame that have a
  numeric dtype to be matched. However, if those floating point
  numbers *are* strings, then you can do this.
* This method has *a lot* of options. You are encouraged to experiment
  and play with this method to gain intuition about how it works.
* When dict is used as the ``to_replace`` value, it is like
  key(s) in the dict are the to_replace part and
  value(s) in the dict are the value parameter.

Examples
--------

>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e
>>> s.replace([1, 2], method='bfill')
0    0
1    3
2    3
3    3
4    4
dtype: int64

>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace(regex={r'^ba.$':'new', 'foo':'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz

Note that when replacing multiple ``bool`` or ``datetime64`` objects,
the data types in the ``to_replace`` parameter must match the data
type of the value being replaced:

>>> df = pd.DataFrame({'A': [True, False, True],
...                    'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False})  # raises
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'

This raises a ``TypeError`` because one of the ``dict`` keys is not of
the correct type for replacement.

Compare the behavior of ``s.replace({'a': None})`` and
``s.replace('a', None)`` to understand the peculiarities
of the ``to_replace`` parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])

When one uses a dict as the ``to_replace`` value, it is like the
value(s) in the dict are equal to the value parameter.
``s.replace({'a': None})`` is equivalent to
``s.replace(to_replace={'a': None}, value=None, method=None)``:

>>> s.replace({'a': None})
0      10
1    None
2    None
3       b
4    None
dtype: object

When ``value=None`` and ``to_replace`` is a scalar, list or
tuple, ``replace`` uses the method parameter (default 'pad') to do the
replacement. So this is why the 'a' values are being replaced by 10
in rows 1 and 2 and 'b' in row 4 in this case.
The command ``s.replace('a', None)`` is actually equivalent to
``s.replace(to_replace='a', value=None, method='pad')``:

>>> s.replace('a', None)
0    10
1    10
2    10
3     b
4     b
dtype: object

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
        Errors in parameters section
                Parameter "to_replace" description should start with capital letter
        Examples do not pass tests

################################################################################
################################### Doctests ###################################
################################################################################

**********************************************************************
Line 233, in pandas.DataFrame.replace
Failed example:
    df.replace({'a string': 'new value', True: False})  # raises
Exception raised:
    Traceback (most recent call last):
      File "C:\Users\thisi\AppData\Local\conda\conda\envs\pandas_dev\lib\doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.DataFrame.replace[17]>", line 1, in <module>
        df.replace({'a string': 'new value', True: False})  # raises
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\frame.py", line 3136, in replace
        method=method, axis=axis)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\generic.py", line 5205, in replace
        limit=limit, regex=regex)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\frame.py", line 3136, in replace
        method=method, axis=axis)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\generic.py", line 5254, in replace
        regex=regex)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3696, in replace_list
        masks = [comp(s) for i, s in enumerate(src_list)]
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3696, in <listcomp>
        masks = [comp(s) for i, s in enumerate(src_list)]
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3694, in comp
        return _maybe_compare(values, getattr(s, 'asm8', s), operator.eq)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 5122, in _maybe_compare
        b=type_names[1]))
    TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'

Section headers.

Consistent quoting.

Formatting.

Traceback.
@codecov

codecov bot commented Mar 15, 2018

Codecov Report

Merging #20271 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20271      +/-   ##
==========================================
+ Coverage   91.82%   91.84%   +0.02%     
==========================================
  Files         152      153       +1     
  Lines       49248    49305      +57     
==========================================
+ Hits        45222    45286      +64     
+ Misses       4026     4019       -7

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | 90.24% <100%> (+0.02%) | ⬆️ |
| #single | 41.89% <53.84%> (ø) | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/generic.py | 95.94% <100%> (+0.08%) | ⬆️ |
| pandas/io/clipboard/clipboards.py | 30.58% <0%> (-1.6%) | ⬇️ |
| pandas/core/config_init.py | 99.24% <0%> (-0.76%) | ⬇️ |
| pandas/core/arrays/categorical.py | 95.78% <0%> (-0.41%) | ⬇️ |
| pandas/core/nanops.py | 96.3% <0%> (-0.4%) | ⬇️ |
| pandas/util/_decorators.py | 82.25% <0%> (-0.15%) | ⬇️ |
| pandas/plotting/_core.py | 82.39% <0%> (-0.12%) | ⬇️ |
| pandas/io/pytables.py | 92.41% <0%> (-0.05%) | ⬇️ |
| pandas/core/frame.py | 97.16% <0%> (-0.02%) | ⬇️ |
| pandas/tseries/offsets.py | 97% <0%> (-0.01%) | ⬇️ |

... and 27 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cdfce2b...58f6531.

@TomAugspurger
Contributor

Updated

################################################################################
##################### Docstring (pandas.DataFrame.replace) #####################
################################################################################

Replace values given in `to_replace` with `value`.

Values of the DataFrame are replaced with other values dynamically.
This differs from updating with ``.loc`` or ``.iloc``, which require
you to specify a location to update with some value.

Parameters
----------
to_replace : str, regex, list, dict, Series, int, float, or None
    How to find the values that will be replaced.

    * numeric, str or regex:

        - numeric: numeric values equal to `to_replace` will be
          replaced with `value`
        - str: string exactly matching `to_replace` will be replaced
          with `value`
        - regex: regexes matching `to_replace` will be replaced with
          `value`

    * list of str, regex, or numeric:

        - First, if `to_replace` and `value` are both lists, they
          **must** be the same length.
        - Second, if ``regex=True`` then all of the strings in **both**
          lists will be interpreted as regexes; otherwise they will match
          directly. This doesn't matter much for `value` since there
          are only a few possible substitution regexes you can use.
        - str, regex and numeric rules apply as above.

    * dict:

        - Dicts can be used to specify different replacement values
          for different existing values. For example,
          ``{'a': 'b', 'y': 'z'}`` replaces the value 'a' with 'b' and
          'y' with 'z'. To use a dict in this way the `value`
          parameter should be `None`.
        - For a DataFrame a dict can specify that different values
          should be replaced in different columns. For example,
          ``{'a': 1, 'b': 'z'}`` looks for the value 1 in column 'a'
          and the value 'z' in column 'b' and replaces these values
          with whatever is specified in `value`. The `value` parameter
          should not be ``None`` in this case. You can treat this as a
          special case of passing two lists except that you are
          specifying the column to search in.
        - For a DataFrame nested dictionaries, e.g.,
          ``{'a': {'b': np.nan}}``, are read as follows: look in column
          'a' for the value 'b' and replace it with NaN. The `value`
          parameter should be ``None`` to use a nested dict in this
          way. You can nest regular expressions as well. Note that
          column names (the top-level dictionary keys in a nested
          dictionary) **cannot** be regular expressions.

    * None:

        - This means that the `regex` argument must be a string,
          compiled regular expression, or list, dict, ndarray or
          Series of such elements. If `value` is also ``None`` then
          this **must** be a nested dictionary or Series.

    See the examples section for examples of each of these.
value : scalar, dict, list, str, regex, default None
    Value to replace any values matching `to_replace` with.
    For a DataFrame a dict of values can be used to specify which
    value to use for each column (columns not in the dict will not be
    filled). Regular expressions, strings and lists or dicts of such
    objects are also allowed.
inplace : boolean, default False
    If True, perform the replacement in place. Note: this will modify any
    other views on this object (e.g. a column from a DataFrame).
    Returns the caller if this is True.
limit : int, default None
    Maximum size gap to forward or backward fill.
regex : bool or same types as `to_replace`, default False
    Whether to interpret `to_replace` and/or `value` as regular
    expressions. If this is ``True`` then `to_replace` *must* be a
    string. Alternatively, this could be a regular expression or a
    list, dict, or array of regular expressions in which case
    `to_replace` must be ``None``.
method : {'pad', 'ffill', 'bfill', `None`}
    The method to use for replacement when `to_replace` is a
    scalar, list or tuple and `value` is ``None``.

    .. versionchanged:: 0.23.0
        Added to DataFrame.
axis : None
    Deprecated.

See Also
--------
DataFrame.fillna : Fill `NaN` values
DataFrame.where : Replace values based on boolean condition
Series.str.replace : Simple string replacement.

Returns
-------
DataFrame
    Object after replacement.

Raises
------
AssertionError
    * If `regex` is not a ``bool`` and `to_replace` is not
      ``None``.
TypeError
    * If `to_replace` is a ``dict`` and `value` is not a ``list``,
      ``dict``, ``ndarray``, or ``Series``
    * If `to_replace` is ``None`` and `regex` is not compilable
      into a regular expression or is a list, dict, ndarray, or
      Series.
    * When replacing multiple ``bool`` or ``datetime64`` objects and
      the arguments to `to_replace` do not match the type of the
      value being replaced
ValueError
    * If a ``list`` or an ``ndarray`` is passed to `to_replace` and
      `value` but they are not the same length.

Notes
-----
* Regex substitution is performed under the hood with ``re.sub``. The
  rules for substitution for ``re.sub`` are the same.
* Regular expressions will only substitute on strings, meaning you
  cannot provide, for example, a regular expression matching floating
  point numbers and expect the columns in your frame that have a
  numeric dtype to be matched. However, if those floating point
  numbers *are* strings, then you can do this.
* This method has *a lot* of options. You are encouraged to experiment
  and play with this method to gain intuition about how it works.
* When a dict is used as the `to_replace` value, the key(s) in the
  dict act as the `to_replace` part and the value(s) in the dict act
  as the `value` parameter.

Examples
--------

**Scalar `to_replace` and `value`**

>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64

>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

**List-like `to_replace`**

>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e

>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e

>>> s.replace([1, 2], method='bfill')
0    0
1    3
2    3
3    3
4    4
dtype: int64

**dict-like `to_replace`**

>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e

>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e

>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

**Regular expression `to_replace`**

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz

>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz

>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz

Note that when replacing multiple ``bool`` or ``datetime64`` objects,
the data types in the `to_replace` parameter must match the data
type of the value being replaced:

>>> df = pd.DataFrame({'A': [True, False, True],
...                    'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False})  # raises
Traceback (most recent call last):
    ...
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'

This raises a ``TypeError`` because one of the ``dict`` keys is not of
the correct type for replacement.

Compare the behavior of ``s.replace({'a': None})`` and
``s.replace('a', None)`` to understand the peculiarities
of the `to_replace` parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])

When a dict is used as the `to_replace` value, the value(s) in the
dict take the place of the `value` parameter.
``s.replace({'a': None})`` is equivalent to
``s.replace(to_replace={'a': None}, value=None, method=None)``:

>>> s.replace({'a': None})
0      10
1    None
2    None
3       b
4    None
dtype: object

When ``value=None`` and `to_replace` is a scalar, list or
tuple, `replace` uses the method parameter (default 'pad') to do the
replacement. This is why the 'a' values are replaced by 10
in rows 1 and 2 and by 'b' in row 4.
The command ``s.replace('a', None)`` is actually equivalent to
``s.replace(to_replace='a', value=None, method='pad')``:

>>> s.replace('a', None)
0    10
1    10
2    10
3     b
4     b
dtype: object

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.DataFrame.replace" correct. :)
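
The Notes in the docstring above say that regex substitution follows the rules of ``re.sub`` and only applies to strings. A rough standard-library sketch of that rule, reusing the values from the regular-expression examples, follows; the element-by-element loop is a plain-Python stand-in chosen for illustration and is not how pandas implements `replace`:

```python
import re

pattern = re.compile(r'^ba.$')

# Same values as columns 'A' and 'B' in the regex examples.
column_a = ['bat', 'foo', 'bait']
column_b = ['abc', 'bar', 'xyz']

# Apply the substitution to each string in turn.
new_a = [pattern.sub('new', s) for s in column_a]
new_b = [pattern.sub('new', s) for s in column_b]
print(new_a)  # ['new', 'foo', 'bait']
print(new_b)  # ['abc', 'new', 'xyz']

# re.sub only accepts strings (or bytes), so a float gives the pattern
# nothing to act on; the same number stored as a string is substituted.
try:
    pattern.sub('new', 1.5)
    float_accepted = True
except TypeError:
    float_accepted = False
print(float_accepted)  # False
print(re.sub(r'^1\.5$', 'new', '1.5'))  # new
```

The `TypeError` here is a property of calling ``re.sub`` directly. According to the Notes, `replace` itself simply does not match columns with a numeric dtype.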



@jorisvandenbossche
Member

I would personally split this docstring in separate ones for series and dataframe, it's becoming quite a monster :)

Member

@WillAyd WillAyd left a comment

One very minor edit but otherwise lgtm


See Also
--------
- %(klass)s.fillna : Fill NA/NaN values
+ %(klass)s.fillna : Fill `NaN` values

This would be better as Fill NA values since it is talking about the concept of missing data and not necessarily the NaN value itself
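
The `%(klass)s` placeholder in the snippet under review suggests a shared docstring template that is filled in once per class. A small sketch of that idea, assuming plain %-style mapping substitution, follows; the `template` string is hypothetical and uses the wording suggested in the comment above:

```python
# Hypothetical one-line template; real shared docstrings are much longer.
template = "%(klass)s.fillna : Fill NA values"

# Render the same See Also entry for each class that shares the docstring.
rendered = [template % {"klass": klass} for klass in ("DataFrame", "Series")]
for line in rendered:
    print(line)
```

Under that assumption, the DataFrame rendering matches the `DataFrame.fillna` entry shown in the validated docstring earlier in the thread.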

Contributor

@TomAugspurger TomAugspurger left a comment

Fixed the linting failure. Let's get this merged when that passes.

@TomAugspurger TomAugspurger merged commit 4de2e9b into pandas-dev:master Apr 22, 2018
@TomAugspurger
Contributor

Thanks @math-and-data!

Labels
Docs Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode

5 participants