Commit bae38fc

datajanko authored and jreback committed
ENH: df.assign accepting dependent **kwargs (#14207) (#18852)
1 parent 5c76f33 commit bae38fc

4 files changed

+163
-37
lines changed

doc/source/dsintro.rst

+65-20
Original file line number | Diff line number | Diff line change
@@ -95,7 +95,7 @@ constructed from the sorted keys of the dict, if possible.
9595

9696
NaN (not a number) is the standard missing data marker used in pandas.
9797

98-
**From scalar value**
98+
**From scalar value**
9999

100100
If ``data`` is a scalar value, an index must be
101101
provided. The value will be repeated to match the length of **index**.
@@ -154,7 +154,7 @@ See also the :ref:`section on attribute access<indexing.attribute_access>`.
154154
Vectorized operations and label alignment with Series
155155
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
156156

157-
When working with raw NumPy arrays, looping through value-by-value is usually
157+
When working with raw NumPy arrays, looping through value-by-value is usually
158158
not necessary. The same is true when working with Series in pandas.
159159
Series can also be passed into most NumPy methods expecting an ndarray.
160160

@@ -324,7 +324,7 @@ From a list of dicts
324324
From a dict of tuples
325325
~~~~~~~~~~~~~~~~~~~~~
326326

327-
You can automatically create a multi-indexed frame by passing a tuples
327+
You can automatically create a multi-indexed frame by passing a tuples
328328
dictionary.
329329

330330
.. ipython:: python
@@ -347,7 +347,7 @@ column name provided).
347347
**Missing Data**
348348

349349
Much more will be said on this topic in the :ref:`Missing data <missing_data>`
350-
section. To construct a DataFrame with missing data, we use ``np.nan`` to
350+
section. To construct a DataFrame with missing data, we use ``np.nan`` to
351351
represent missing values. Alternatively, you may pass a ``numpy.MaskedArray``
352352
as the data argument to the DataFrame constructor, and its masked entries will
353353
be considered missing.
@@ -370,7 +370,7 @@ set to ``'index'`` in order to use the dict keys as row labels.
370370

371371
``DataFrame.from_records`` takes a list of tuples or an ndarray with structured
372372
dtype. It works analogously to the normal ``DataFrame`` constructor, except that
373-
the resulting DataFrame index may be a specific field of the structured
373+
the resulting DataFrame index may be a specific field of the structured
374374
dtype. For example:
375375

376376
.. ipython:: python
@@ -506,25 +506,70 @@ to be inserted (for example, a ``Series`` or NumPy array), or a function
506506
of one argument to be called on the ``DataFrame``. A *copy* of the original
507507
DataFrame is returned, with the new values inserted.
508508

509+
.. versionchanged:: 0.23.0
510+
511+
Starting with Python 3.6 the order of ``**kwargs`` is preserved. This allows
512+
for *dependent* assignment, where an expression later in ``**kwargs`` can refer
513+
to a column created earlier in the same :meth:`~DataFrame.assign`.
514+
515+
.. ipython:: python
516+
517+
dfa = pd.DataFrame({"A": [1, 2, 3],
518+
"B": [4, 5, 6]})
519+
dfa.assign(C=lambda x: x['A'] + x['B'],
520+
D=lambda x: x['A'] + x['C'])
521+
522+
In the second expression, ``x['C']`` refers to the newly created column,
523+
which is equal to ``dfa['A'] + dfa['B']``.
524+
525+
To write code compatible with all versions of Python, split the assignment in two.
526+
527+
.. ipython:: python
528+
529+
dependent = pd.DataFrame({"A": [1, 1, 1]})
530+
(dependent.assign(A=lambda x: x['A'] + 1)
531+
.assign(B=lambda x: x['A'] + 2))
532+
509533
.. warning::
510534

511-
Since the function signature of ``assign`` is ``**kwargs``, a dictionary,
512-
the order of the new columns in the resulting DataFrame cannot be guaranteed
513-
to match the order you pass in. To make things predictable, items are inserted
514-
alphabetically (by key) at the end of the DataFrame.
535+
Dependent assignment may subtly change the behavior of your code between
536+
Python 3.6 and older versions of Python.
537+
538+
If you wish to write code that supports versions of Python before and after 3.6,
539+
you'll need to take care when passing ``assign`` expressions that
540+
541+
* Update an existing column
542+
* Refer to the newly updated column in the same ``assign``
543+
544+
For example, we'll update column "A" and then refer to it when creating "B".
545+
546+
.. code-block:: python
547+
548+
>>> dependent = pd.DataFrame({"A": [1, 1, 1]})
549+
>>> dependent.assign(A=lambda x: x["A"] + 1,
550+
B=lambda x: x["A"] + 2)
551+
552+
For Python 3.5 and earlier the expression creating ``B`` refers to the
553+
"old" value of ``A``, ``[1, 1, 1]``. The output is then
554+
555+
.. code-block:: python
556+
557+
A B
558+
0 2 3
559+
1 2 3
560+
2 2 3
561+
562+
For Python 3.6 and later, the expression creating ``B`` refers to the
563+
"new" value of ``A``, ``[2, 2, 2]``, which results in
564+
565+
.. code-block:: python
515566
516-
All expressions are computed first, and then assigned. So you can't refer
517-
to another column being assigned in the same call to ``assign``. For example:
567+
A B
568+
0 2 4
569+
1 2 4
570+
2 2 4
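
As a minimal sketch of the portable form described above (assuming pandas is importable as ``pd``; the names ``portable`` and ``single_call`` are illustrative only), splitting the update and the dependent column into two ``assign`` calls gives the "new" result on any Python version. On Python 3.6 or later with pandas 0.23.0 or later it should match the single-call form:

```python
import pandas as pd

dependent = pd.DataFrame({"A": [1, 1, 1]})

# Two chained calls: the second assign always sees the updated "A".
portable = (dependent.assign(A=lambda x: x["A"] + 1)
                     .assign(B=lambda x: x["A"] + 2))

# Single call: on Python 3.6+ with pandas 0.23.0+, "B" also sees the updated "A".
single_call = dependent.assign(A=lambda x: x["A"] + 1,
                               B=lambda x: x["A"] + 2)

# assign returns a copy, so the original frame still holds the old "A".
print(portable)
print(dependent)
```

The chained form costs one extra copy of the frame, which is usually an acceptable price for behavior that does not depend on the interpreter version.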
518571
519-
.. ipython::
520-
:verbatim:
521572
522-
In [1]: # Don't do this, bad reference to `C`
523-
df.assign(C = lambda x: x['A'] + x['B'],
524-
D = lambda x: x['A'] + x['C'])
525-
In [2]: # Instead, break it into two assigns
526-
(df.assign(C = lambda x: x['A'] + x['B'])
527-
.assign(D = lambda x: x['A'] + x['C']))
528573
529574
Indexing / Selection
530575
~~~~~~~~~~~~~~~~~~~~
@@ -914,7 +959,7 @@ For example, using the earlier example data, we could do:
914959
Squeezing
915960
~~~~~~~~~
916961

917-
Another way to change the dimensionality of an object is to ``squeeze`` a 1-len
962+
Another way to change the dimensionality of an object is to ``squeeze`` a 1-len
918963
object, similar to ``wp['Item1']``.
919964

920965
.. ipython:: python

doc/source/whatsnew/v0.23.0.txt

+40
@@ -248,6 +248,46 @@ Current Behavior:
248248

249249
pd.RangeIndex(1, 5) / 0
250250

251+
.. _whatsnew_0230.enhancements.assign_dependent:
252+
253+
``.assign()`` accepts dependent arguments
254+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
255+
256+
:func:`DataFrame.assign` now accepts dependent keyword arguments on Python 3.6 and later (see also `PEP 468
257+
<https://www.python.org/dev/peps/pep-0468/>`_). Later keyword arguments may now refer to earlier ones if the argument is a callable. See the
258+
:ref:`documentation here <dsintro.chained_assignment>` (:issue:`14207`)
259+
260+
.. ipython:: python
261+
262+
df = pd.DataFrame({'A': [1, 2, 3]})
263+
df
264+
df.assign(B=df.A, C=lambda x: x['A'] + x['B'])
265+
266+
.. warning::
267+
268+
This may subtly change the behavior of your code when you're
269+
using ``.assign()`` to update an existing column. Previously, callables
270+
referring to other columns being updated would get the "old" values.
271+
272+
Previous Behaviour:
273+
274+
.. code-block:: ipython
275+
276+
In [2]: df = pd.DataFrame({"A": [1, 2, 3]})
277+
278+
In [3]: df.assign(A=lambda df: df.A + 1, C=lambda df: df.A * -1)
279+
Out[3]:
280+
A C
281+
0 2 -1
282+
1 3 -2
283+
2 4 -3
284+
285+
New Behaviour:
286+
287+
.. ipython:: python
288+
289+
df.assign(A=df.A + 1, C=lambda df: df.A * -1)
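
If the previous result is what you actually want, one option (a sketch, not part of this change; the names ``new_style`` and ``old_style`` are illustrative) is to compute the dependent value from the original frame instead of from the callable's argument. A non-callable value is evaluated before ``assign`` runs, so it cannot see the updated column:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# Callable: evaluated inside assign, after "A" has been updated
# (Python 3.6+ with pandas 0.23.0+).
new_style = df.assign(A=lambda d: d.A + 1, C=lambda d: d.A * -1)

# Plain value: evaluated eagerly against the original df,
# so "C" is built from the old "A".
old_style = df.assign(A=lambda d: d.A + 1, C=df.A * -1)

print(new_style)
print(old_style)
```

Being explicit about which frame a value is derived from also makes the intent clear to readers who do not know which pandas version the code targets.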
290+
251291
.. _whatsnew_0230.enhancements.other:
252292

253293
Other Enhancements

pandas/core/frame.py

+33-16
@@ -2687,12 +2687,17 @@ def assign(self, **kwargs):
26872687
26882688
Notes
26892689
-----
2690-
For python 3.6 and above, the columns are inserted in the order of
2691-
\*\*kwargs. For python 3.5 and earlier, since \*\*kwargs is unordered,
2692-
the columns are inserted in alphabetical order at the end of your
2693-
DataFrame. Assigning multiple columns within the same ``assign``
2694-
is possible, but you cannot reference other columns created within
2695-
the same ``assign`` call.
2690+
Assigning multiple columns within the same ``assign`` is possible.
2691+
For Python 3.6 and above, later items in '\*\*kwargs' may refer to
2692+
newly created or modified columns in 'df'; items are computed and
2693+
assigned into 'df' in order. For Python 3.5 and below, the order of
2694+
keyword arguments is not specified, so you cannot refer to newly created
2695+
or modified columns. All items are computed first, and then assigned
2696+
in alphabetical order.
2697+
2698+
.. versionchanged:: 0.23.0
2699+
2700+
Keyword argument order is maintained for Python 3.6 and later.
26962701
26972702
Examples
26982703
--------
@@ -2728,22 +2733,34 @@ def assign(self, **kwargs):
27282733
7 8 -1.495604 2.079442
27292734
8 9 0.549296 2.197225
27302735
9 10 -0.758542 2.302585
2736+
2737+
Where the keyword arguments depend on each other
2738+
2739+
>>> df = pd.DataFrame({'A': [1, 2, 3]})
2740+
2741+
>>> df.assign(B=df.A, C=lambda x: x['A'] + x['B'])
2742+
A B C
2743+
0 1 1 2
2744+
1 2 2 4
2745+
2 3 3 6
27312746
"""
27322747
data = self.copy()
27332748

2734-
# do all calculations first...
2735-
results = OrderedDict()
2736-
for k, v in kwargs.items():
2737-
results[k] = com._apply_if_callable(v, data)
2738-
2739-
# preserve order for 3.6 and later, but sort by key for 3.5 and earlier
2749+
# >= 3.6 preserve order of kwargs
27402750
if PY36:
2741-
results = results.items()
2751+
for k, v in kwargs.items():
2752+
data[k] = com._apply_if_callable(v, data)
27422753
else:
2754+
# <= 3.5: do all calculations first...
2755+
results = OrderedDict()
2756+
for k, v in kwargs.items():
2757+
results[k] = com._apply_if_callable(v, data)
2758+
2759+
# <= 3.5: sort by key, since kwargs order is not preserved
27432760
results = sorted(results.items())
2744-
# ... and then assign
2745-
for k, v in results:
2746-
data[k] = v
2761+
# ... and then assign
2762+
for k, v in results:
2763+
data[k] = v
27472764
return data
27482765

27492766
def _sanitize_column(self, key, value, broadcast=True):
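
The difference between the two branches of ``assign`` in this hunk can be sketched without pandas. The following is an illustration only: plain dicts of lists stand in for a DataFrame, and ``assign_in_order``, ``assign_compute_first``, ``update_a`` and ``negate_a`` are hypothetical names, not pandas functions.

```python
def assign_in_order(columns, **kwargs):
    """Hypothetical helper: evaluate and insert each item in keyword order."""
    data = dict(columns)  # work on a copy, as assign does
    for key, value in kwargs.items():
        # Each callable sees every column assigned before it.
        data[key] = value(data) if callable(value) else value
    return data


def assign_compute_first(columns, **kwargs):
    """Hypothetical helper: evaluate everything first, then insert by sorted key."""
    data = dict(columns)
    # Every callable sees only the original columns.
    results = {key: (value(data) if callable(value) else value)
               for key, value in kwargs.items()}
    for key, value in sorted(results.items()):
        data[key] = value
    return data


def update_a(d):
    return [a + 1 for a in d["A"]]


def negate_a(d):
    return [-a for a in d["A"]]


frame = {"A": [1, 2, 3]}
in_order = assign_in_order(frame, A=update_a, C=negate_a)
compute_first = assign_compute_first(frame, A=update_a, C=negate_a)

print(in_order)       # "C" is built from the updated "A"
print(compute_first)  # "C" is built from the original "A"
```

With the in-order strategy a later callable may also reference a column that did not exist before the call, whereas the compute-first strategy would raise ``KeyError`` in that case, which is what ``test_assign_dependent_old_python`` checks.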

pandas/tests/frame/test_mutate_columns.py

+25-1
@@ -89,11 +89,35 @@ def test_assign_bad(self):
8989
df.assign(lambda x: x.A)
9090
with pytest.raises(AttributeError):
9191
df.assign(C=df.A, D=df.A + df.C)
92+
93+
@pytest.mark.skipif(PY36, reason="""Issue #14207: valid for python
94+
3.6 and above""")
95+
def test_assign_dependent_old_python(self):
96+
df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
97+
98+
# Key C does not exist at definition time of df
9299
with pytest.raises(KeyError):
93-
df.assign(C=lambda df: df.A, D=lambda df: df['A'] + df['C'])
100+
df.assign(C=lambda df: df.A,
101+
D=lambda df: df['A'] + df['C'])
94102
with pytest.raises(KeyError):
95103
df.assign(C=df.A, D=lambda x: x['A'] + x['C'])
96104

105+
@pytest.mark.skipif(not PY36, reason="""Issue #14207: not valid for
106+
python 3.5 and below""")
107+
def test_assign_dependent(self):
108+
df = DataFrame({'A': [1, 2], 'B': [3, 4]})
109+
110+
result = df.assign(C=df.A, D=lambda x: x['A'] + x['C'])
111+
expected = DataFrame([[1, 3, 1, 2], [2, 4, 2, 4]],
112+
columns=list('ABCD'))
113+
assert_frame_equal(result, expected)
114+
115+
result = df.assign(C=lambda df: df.A,
116+
D=lambda df: df['A'] + df['C'])
117+
expected = DataFrame([[1, 3, 1, 2], [2, 4, 2, 4]],
118+
columns=list('ABCD'))
119+
assert_frame_equal(result, expected)
120+
97121
def test_insert_error_msmgs(self):
98122

99123
# GH 7432
