Commit c88b0ba

author
Tom Augspurger
committed
Merge pull request pandas-dev#9239 from TomAugspurger/dfTransform
API: Add DataFrame.assign method
2 parents fc843d3 + 6a5bd89 commit c88b0ba

File tree

6 files changed, +255 -0 lines changed


doc/source/basics.rst

+2
@@ -11,6 +11,7 @@
    from pandas.compat import lrange
    options.display.max_rows=15
 
+
 ==============================
 Essential Basic Functionality
 ==============================
@@ -793,6 +794,7 @@ This is equivalent to the following
    result
    result.loc[:,:,'ItemA']
 
+
 .. _basics.reindexing:
 
 

doc/source/dsintro.rst

+76
@@ -450,6 +450,82 @@ available to insert at a particular location in the columns:
    df.insert(1, 'bar', df['one'])
    df
 
+.. _dsintro.chained_assignment:
+
+Assigning New Columns in Method Chains
+--------------------------------------
+
+.. versionadded:: 0.16.0
+
+Inspired by `dplyr's
+<http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html#mutate>`__
+``mutate`` verb, DataFrame has an :meth:`~pandas.DataFrame.assign`
+method that allows you to easily create new columns that are potentially
+derived from existing columns.
+
+.. ipython:: python
+
+   iris = read_csv('data/iris.data')
+   iris.head()
+
+   (iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])
+        .head())
+
+Above was an example of inserting a precomputed value. We can also pass in
+a function of one argument to be evaluated on the DataFrame being assigned to.
+
+.. ipython:: python
+
+   iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] /
+                                        x['SepalLength'])).head()
+
+``assign`` **always** returns a copy of the data, leaving the original
+DataFrame untouched.
+
+Passing a callable, as opposed to an actual value to be inserted, is
+useful when you don't have a reference to the DataFrame at hand. This is
+common when using ``assign`` in chains of operations. For example,
+we can limit the DataFrame to just those observations with a Sepal Length
+greater than 5, calculate the ratio, and plot:
+
+.. ipython:: python
+
+   @savefig basics_assign.png
+   (iris.query('SepalLength > 5')
+        .assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
+                PetalRatio = lambda x: x.PetalWidth / x.PetalLength)
+        .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
+
+Since a function is passed in, the function is computed on the DataFrame
+being assigned to. Importantly, this is the DataFrame that's been filtered
+to those rows with sepal length greater than 5. The filtering happens first,
+and then the ratio calculations. This is an example where we didn't
+have a reference to the *filtered* DataFrame available.
+
+The function signature for ``assign`` is simply ``**kwargs``. The keys
+are the column names for the new fields, and the values are either a value
+to be inserted (for example, a ``Series`` or NumPy array), or a function
+of one argument to be called on the ``DataFrame``. A *copy* of the original
+DataFrame is returned, with the new values inserted.
+
+.. warning::
+
+   Since the function signature of ``assign`` is ``**kwargs``, a dictionary,
+   the order of the new columns in the resulting DataFrame cannot be guaranteed.
+
+   All expressions are computed first, and then assigned. So you can't refer
+   to another column being assigned in the same call to ``assign``. For example:
+
+   .. ipython::
+      :verbatim:
+
+      In [1]: # Don't do this, bad reference to `C`
+              df.assign(C = lambda x: x['A'] + x['B'],
+                        D = lambda x: x['A'] + x['C'])
+      In [2]: # Instead, break it into two assigns
+              (df.assign(C = lambda x: x['A'] + x['B'])
+                 .assign(D = lambda x: x['A'] + x['C']))
+
 Indexing / Selection
 ~~~~~~~~~~~~~~~~~~~~
 The basics of indexing are as follows:
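
As a small illustration of the behaviour documented in this hunk, the sketch below uses a made-up four-row frame. The name `flowers` and its values are hypothetical stand-ins for the iris data, which is not reproduced here. It only aims to show that a callable passed to `assign` receives the already-filtered frame, and that the original frame is left unchanged.

```python
import pandas as pd

# Hypothetical toy data standing in for two of the iris columns.
flowers = pd.DataFrame({
    "SepalLength": [4.9, 5.1, 6.3, 7.0],
    "SepalWidth": [3.0, 3.5, 3.3, 3.2],
})

# Filter first, then let the callable compute on the filtered frame.
# Inside the lambda, `x` is the 3-row result of query(), not `flowers`.
result = (flowers.query("SepalLength > 5")
                 .assign(SepalRatio=lambda x: x["SepalWidth"] / x["SepalLength"]))

print(result)

# assign returned a new object, so the original has no SepalRatio column.
print("SepalRatio" in flowers.columns)
```

Because the lambda is only called once the chain reaches `assign`, there is no need to bind the intermediate filtered frame to a name.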

doc/source/whatsnew/v0.16.0.txt

+41
@@ -29,6 +29,47 @@ New features
 
   This method is also exposed by the lower level ``Index.get_indexer`` and ``Index.get_loc`` methods.
 
+- DataFrame assign method
+
+  Inspired by `dplyr's
+  <http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html#mutate>`__ ``mutate`` verb, DataFrame has a new
+  :meth:`~pandas.DataFrame.assign` method.
+  The function signature for ``assign`` is simply ``**kwargs``. The keys
+  are the column names for the new fields, and the values are either a value
+  to be inserted (for example, a ``Series`` or NumPy array), or a function
+  of one argument to be called on the ``DataFrame``. The new values are inserted,
+  and the entire DataFrame (with all original and new columns) is returned.
+
+  .. ipython :: python
+
+     iris = read_csv('data/iris.data')
+     iris.head()
+
+     iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()
+
+  Above was an example of inserting a precomputed value. We can also pass in
+  a function to be evaluated.
+
+  .. ipython :: python
+
+     iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] /
+                                          x['SepalLength'])).head()
+
+  The power of ``assign`` comes when used in chains of operations. For example,
+  we can limit the DataFrame to just those observations with a Sepal Length greater than 5,
+  calculate the ratio, and plot:
+
+  .. ipython:: python
+
+     (iris.query('SepalLength > 5')
+          .assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
+                  PetalRatio = lambda x: x.PetalWidth / x.PetalLength)
+          .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
+
+  .. image:: _static/whatsnew_assign.png
+
+  See the :ref:`documentation <dsintro.chained_assignment>` for more. (:issue:`9229`)
+
 .. _whatsnew_0160.api:
 
 .. _whatsnew_0160.api_breaking:
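
The release note says each keyword value is either something to insert or a one-argument function. The sketch below, on a made-up two-column frame, tries the kinds of value the note and the docstring mention (scalar, list, NumPy array, `Series`, callable). The column names are illustrative only. The dict-unpacking call at the end is not in the commit; it follows from the `**kwargs` signature and is shown as one way to pass a column name that is not a valid Python identifier.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

out = df.assign(
    const=0,                               # scalar, repeated on every row
    from_list=[7, 8, 9],                   # list of matching length
    from_array=np.array([1.0, 2.0, 3.0]),  # NumPy array
    from_series=df["A"] * 2,               # Series, aligned on the index
    from_func=lambda x: x["A"] + x["B"],   # callable, receives the frame
)

# Keyword names must be identifiers, but an unpacked dict may use any string.
out2 = df.assign(**{"A plus B": lambda x: x["A"] + x["B"]})

print(sorted(out.columns))
print(out2)
```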

pandas/core/frame.py

+82
@@ -2220,6 +2220,88 @@ def insert(self, loc, column, value, allow_duplicates=False):
         self._data.insert(
             loc, column, value, allow_duplicates=allow_duplicates)
 
+    def assign(self, **kwargs):
+        """
+        Assign new columns to a DataFrame, returning a new object
+        (a copy) with all the original columns in addition to the new ones.
+
+        .. versionadded:: 0.16.0
+
+        Parameters
+        ----------
+        kwargs : keyword, value pairs
+            keywords are the column names. If the values are
+            callable, they are computed on the DataFrame and
+            assigned to the new columns. If the values are
+            not callable, (e.g. a Series, scalar, or array),
+            they are simply assigned.
+
+        Returns
+        -------
+        df : DataFrame
+            A new DataFrame with the new columns in addition to
+            all the existing columns.
+
+        Notes
+        -----
+        Since ``kwargs`` is a dictionary, the order of your
+        arguments may not be preserved, and so the order of the
+        new columns is not well defined. Assigning multiple
+        columns within the same ``assign`` is possible, but you cannot
+        reference other columns created within the same ``assign`` call.
+
+        Examples
+        --------
+        >>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
+
+        Where the value is a callable, evaluated on `df`:
+
+        >>> df.assign(ln_A = lambda x: np.log(x.A))
+            A         B      ln_A
+        0   1  0.426905  0.000000
+        1   2 -0.780949  0.693147
+        2   3 -0.418711  1.098612
+        3   4 -0.269708  1.386294
+        4   5 -0.274002  1.609438
+        5   6 -0.500792  1.791759
+        6   7  1.649697  1.945910
+        7   8 -1.495604  2.079442
+        8   9  0.549296  2.197225
+        9  10 -0.758542  2.302585
+
+        Where the value already exists and is inserted:
+
+        >>> newcol = np.log(df['A'])
+        >>> df.assign(ln_A=newcol)
+            A         B      ln_A
+        0   1  0.426905  0.000000
+        1   2 -0.780949  0.693147
+        2   3 -0.418711  1.098612
+        3   4 -0.269708  1.386294
+        4   5 -0.274002  1.609438
+        5   6 -0.500792  1.791759
+        6   7  1.649697  1.945910
+        7   8 -1.495604  2.079442
+        8   9  0.549296  2.197225
+        9  10 -0.758542  2.302585
+        """
+        data = self.copy()
+
+        # do all calculations first...
+        results = {}
+        for k, v in kwargs.items():
+
+            if callable(v):
+                results[k] = v(data)
+            else:
+                results[k] = v
+
+        # ... and then assign
+        for k, v in results.items():
+            data[k] = v
+
+        return data
+
     def _sanitize_column(self, key, value):
         # Need to make sure new columns (which go into the BlockManager as new
         # blocks) are always copied
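
The method body above evaluates every value first and only then inserts the results. A minimal sketch of that two-phase structure, using a plain dict of lists instead of a DataFrame, may make the consequence easier to see. `assign_sketch` is a hypothetical helper written for this note, not part of pandas; it mirrors the shape of the loop in this hunk and nothing more.

```python
def assign_sketch(table, **kwargs):
    """Two-phase assign over a dict of column lists (hypothetical helper)."""
    # Work on a copy so the caller's table is never modified.
    data = {name: list(values) for name, values in table.items()}

    # Phase 1: evaluate every value against the copy, before any insertion.
    results = {}
    for name, value in kwargs.items():
        results[name] = value(data) if callable(value) else value

    # Phase 2: insert the computed columns.
    for name, value in results.items():
        data[name] = value
    return data


table = {"A": [1, 2, 3], "B": [4, 5, 6]}

ok = assign_sketch(table, C=lambda t: [a + b for a, b in zip(t["A"], t["B"])])

# D refers to C, which has not been inserted while phase 1 is still running.
try:
    assign_sketch(
        table,
        C=lambda t: [a + b for a, b in zip(t["A"], t["B"])],
        D=lambda t: [a + c for a, c in zip(t["A"], t["C"])],
    )
    sibling_reference_failed = False
except KeyError:
    sibling_reference_failed = True

print(ok["C"], sibling_reference_failed)
```

In the sketch, the lookup of `C` fails with `KeyError` because phase 1 never sees columns produced in the same call, which is the same reason the docstring advises splitting dependent columns across two `assign` calls.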

pandas/tests/test_frame.py

+54
@@ -13965,6 +13965,60 @@ def test_select_dtypes_bad_arg_raises(self):
         with tm.assertRaisesRegexp(TypeError, 'data type.*not understood'):
             df.select_dtypes(['blargy, blarg, blarg'])
 
+    def test_assign(self):
+        df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
+        original = df.copy()
+        result = df.assign(C=df.B / df.A)
+        expected = df.copy()
+        expected['C'] = [4, 2.5, 2]
+        assert_frame_equal(result, expected)
+
+        # lambda syntax
+        result = df.assign(C=lambda x: x.B / x.A)
+        assert_frame_equal(result, expected)
+
+        # original is unmodified
+        assert_frame_equal(df, original)
+
+        # Non-Series array-like
+        result = df.assign(C=[4, 2.5, 2])
+        assert_frame_equal(result, expected)
+        # original is unmodified
+        assert_frame_equal(df, original)
+
+        result = df.assign(B=df.B / df.A)
+        expected = expected.drop('B', axis=1).rename(columns={'C': 'B'})
+        assert_frame_equal(result, expected)
+
+        # overwrite
+        result = df.assign(A=df.A + df.B)
+        expected = df.copy()
+        expected['A'] = [5, 7, 9]
+        assert_frame_equal(result, expected)
+
+        # lambda
+        result = df.assign(A=lambda x: x.A + x.B)
+        assert_frame_equal(result, expected)
+
+    def test_assign_multiple(self):
+        df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
+        result = df.assign(C=[7, 8, 9], D=df.A, E=lambda x: x.B)
+        expected = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
+                              'D': [1, 2, 3], 'E': [4, 5, 6]})
+        # column order isn't preserved
+        assert_frame_equal(result.reindex_like(expected), expected)
+
+    def test_assign_bad(self):
+        df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
+        # non-keyword argument
+        with tm.assertRaises(TypeError):
+            df.assign(lambda x: x.A)
+        with tm.assertRaises(AttributeError):
+            df.assign(C=df.A, D=df.A + df.C)
+        with tm.assertRaises(KeyError):
+            df.assign(C=lambda df: df.A, D=lambda df: df['A'] + df['C'])
+        with tm.assertRaises(KeyError):
+            df.assign(C=df.A, D=lambda x: x['A'] + x['C'])
 
 def skip_if_no_ne(engine='numexpr'):
     if engine == 'numexpr':
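
For readers who want to try the order-insensitive check from `test_assign_multiple` outside the pandas test suite, here is a standalone sketch. It assumes a pandas version that provides the public `pandas.testing` module (the commit itself uses the internal `tm` helpers), and it replaces `tm.assertRaises` with a plain `try`/`except`.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
result = df.assign(C=[7, 8, 9], D=df.A, E=lambda x: x.B)

expected = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9],
                         "D": [1, 2, 3], "E": [4, 5, 6]})

# reindex_like puts result's columns in expected's order, so the check
# passes whatever order assign happened to insert C, D and E in.
assert_frame_equal(result.reindex_like(expected), expected)

# assign accepts keyword arguments only; a positional callable is rejected.
try:
    df.assign(lambda x: x.A)
    positional_rejected = False
except TypeError:
    positional_rejected = True

print(positional_rejected)
```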
