Commit ed4dfd2

phofl authored and meeseeksmachine committed
Backport PR pandas-dev#51454: DOC: Add user guide section about copy on write
1 parent fb35381 commit ed4dfd2

4 files changed

+215
-53
lines changed

doc/source/development/copy_on_write.rst

+3-2
@@ -1,4 +1,4 @@
1-
.. _copy_on_write:
1+
.. _copy_on_write_dev:
22

33
{{ header }}
44

@@ -9,7 +9,8 @@ Copy on write
99
Copy on Write is a mechanism to simplify the indexing API and improve
1010
performance through avoiding copies if possible.
1111
CoW means that any DataFrame or Series derived from another in any way always
12-
behaves as a copy.
12+
behaves as a copy. An explanation on how to use Copy on Write efficiently can be
13+
found :ref:`here <copy_on_write>`.
1314

1415
Reference tracking
1516
------------------
doc/source/user_guide/copy_on_write.rst

+209
@@ -0,0 +1,209 @@
1+
.. _copy_on_write:
2+
3+
{{ header }}
4+
5+
*******************
6+
Copy-on-Write (CoW)
7+
*******************
8+
9+
.. ipython:: python
10+
:suppress:
11+
12+
pd.options.mode.copy_on_write = True
13+
14+
Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the
15+
optimizations that become possible through CoW are implemented and supported. A complete list
16+
can be found at :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`.
17+
18+
We expect that CoW will be enabled by default in version 3.0.
19+
20+
CoW will lead to more predictable behavior since it is not possible to update more than
21+
one object with one statement, e.g. indexing operations or methods won't have side-effects. Additionally, through
22+
delaying copies as long as possible, the average performance and memory usage will improve.
23+
24+
Description
25+
-----------
26+
27+
CoW means that any DataFrame or Series derived from another in any way always
28+
behaves as a copy. As a consequence, we can only change the values of an object
29+
through modifying the object itself. CoW disallows updating a DataFrame or a Series
30+
that shares data with another DataFrame or Series object inplace.
31+
32+
This avoids side-effects when modifying values and hence, most methods can avoid
33+
actually copying the data and only trigger a copy when necessary.
34+
35+
The following example will operate inplace with CoW:
36+
37+
.. ipython:: python
38+
39+
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
40+
df.iloc[0, 0] = 100
41+
df
42+
43+
The object ``df`` does not share any data with any other object and hence no
44+
copy is triggered when updating the values. In contrast, the following operation
45+
triggers a copy of the data under CoW:
46+
47+
48+
.. ipython:: python
49+
50+
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
51+
df2 = df.reset_index(drop=True)
52+
df2.iloc[0, 0] = 100
53+
54+
df
55+
df2
56+
57+
``reset_index`` returns a lazy copy with CoW while it copies the data without CoW.
58+
Since both objects, ``df`` and ``df2``, share the same data, a copy is triggered
59+
when modifying ``df2``. The object ``df`` still has the same values as initially
60+
while ``df2`` was modified.
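
The lazy copy described above can be observed directly. The following is a minimal sketch, assuming pandas 2.0 or later and using ``numpy.shares_memory`` (not part of the original text) to inspect the underlying buffers. Whether two objects share memory is an implementation detail, so treat this as an illustration rather than a guaranteed contract:

```python
import numpy as np
import pandas as pd

# Enable CoW only for this block; the option name is the one used in this guide.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    df2 = df.reset_index(drop=True)

    # Right after reset_index, both objects are expected to point at the same data.
    shared_before = np.shares_memory(df["foo"].to_numpy(), df2["foo"].to_numpy())

    # The first write to df2 triggers the deferred copy.
    df2.iloc[0, 0] = 100
    shared_after = np.shares_memory(df["foo"].to_numpy(), df2["foo"].to_numpy())

print(shared_before, shared_after)
```

If the lazy copy works as described, the data is shared before the write and no longer shared afterwards, while ``df`` keeps its original values.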
61+
62+
If the object ``df`` isn't needed anymore after performing the ``reset_index`` operation,
63+
you can emulate an inplace-like operation through assigning the output of ``reset_index``
64+
to the same variable:
65+
66+
.. ipython:: python
67+
68+
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
69+
df = df.reset_index(drop=True)
70+
df.iloc[0, 0] = 100
71+
df
72+
73+
The initial object goes out of scope as soon as the result of ``reset_index`` is
74+
reassigned and hence ``df`` does not share data with any other object. No copy
75+
is necessary when modifying the object. This is generally true for all methods
76+
listed in :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`.
77+
78+
Previously, when operating on views, both the view and the parent object were modified:
79+
80+
.. ipython:: python
81+
82+
with pd.option_context("mode.copy_on_write", False):
83+
    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
84+
    view = df[:]
85+
    df.iloc[0, 0] = 100
86+
87+
df
88+
view
89+
90+
CoW triggers a copy when ``df`` is changed to avoid mutating ``view`` as well:
91+
92+
.. ipython:: python
93+
94+
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
95+
view = df[:]
96+
df.iloc[0, 0] = 100
97+
98+
df
99+
view
100+
101+
Chained Assignment
102+
------------------
103+
104+
Chained assignment refers to a technique where an object is updated through
105+
two subsequent indexing operations, e.g.
106+
107+
.. ipython:: python
108+
109+
with pd.option_context("mode.copy_on_write", False):
110+
    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
111+
    df["foo"][df["bar"] > 5] = 100
112+
df
113+
114+
The column ``foo`` is updated where the column ``bar`` is greater than 5.
115+
This violates the CoW principles though, because it would have to modify the
116+
view ``df["foo"]`` and ``df`` in one step. Hence, chained assignment will
117+
consistently never work and raise a ``ChainedAssignmentError`` with CoW enabled:
118+
119+
.. ipython:: python
120+
:okexcept:
121+
122+
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
123+
df["foo"][df["bar"] > 5] = 100
124+
125+
With Copy-on-Write, the same update can be done in a single indexing step by using ``loc``:
126+
127+
.. ipython:: python
128+
129+
df.loc[df["bar"] > 5, "foo"] = 100
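
As a rough illustration of why the single ``loc`` step fits the CoW rules, the sketch below keeps a previously derived Series around and checks that it is left untouched. The name ``foo_before`` is hypothetical and only serves the example; pandas 2.0 or later is assumed:

```python
import pandas as pd

with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

    # A Series derived from df; under CoW it behaves as a copy.
    foo_before = df["foo"]

    # One statement, one object updated: only df changes.
    df.loc[df["bar"] > 5, "foo"] = 100

print(df["foo"].tolist())
print(foo_before.tolist())
```

Only ``df`` is modified by the ``loc`` assignment; ``foo_before`` keeps the values it had when it was created.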
130+
131+
.. _copy_on_write.optimizations:
132+
133+
Copy-on-Write optimizations
134+
---------------------------
135+
136+
A new lazy copy mechanism defers the copy until the object in question is modified
137+
and only if this object shares data with another object. This mechanism was added to
138+
the following methods:
139+
140+
- :meth:`DataFrame.reset_index` / :meth:`Series.reset_index`
141+
- :meth:`DataFrame.set_index`
142+
- :meth:`DataFrame.set_axis` / :meth:`Series.set_axis`
143+
- :meth:`DataFrame.set_flags` / :meth:`Series.set_flags`
144+
- :meth:`DataFrame.rename_axis` / :meth:`Series.rename_axis`
145+
- :meth:`DataFrame.reindex` / :meth:`Series.reindex`
146+
- :meth:`DataFrame.reindex_like` / :meth:`Series.reindex_like`
147+
- :meth:`DataFrame.assign`
148+
- :meth:`DataFrame.drop`
149+
- :meth:`DataFrame.dropna` / :meth:`Series.dropna`
150+
- :meth:`DataFrame.select_dtypes`
151+
- :meth:`DataFrame.align` / :meth:`Series.align`
152+
- :meth:`Series.to_frame`
153+
- :meth:`DataFrame.rename` / :meth:`Series.rename`
154+
- :meth:`DataFrame.add_prefix` / :meth:`Series.add_prefix`
155+
- :meth:`DataFrame.add_suffix` / :meth:`Series.add_suffix`
156+
- :meth:`DataFrame.drop_duplicates` / :meth:`Series.drop_duplicates`
157+
- :meth:`DataFrame.droplevel` / :meth:`Series.droplevel`
158+
- :meth:`DataFrame.reorder_levels` / :meth:`Series.reorder_levels`
159+
- :meth:`DataFrame.between_time` / :meth:`Series.between_time`
160+
- :meth:`DataFrame.filter` / :meth:`Series.filter`
161+
- :meth:`DataFrame.head` / :meth:`Series.head`
162+
- :meth:`DataFrame.tail` / :meth:`Series.tail`
163+
- :meth:`DataFrame.isetitem`
164+
- :meth:`DataFrame.pipe` / :meth:`Series.pipe`
165+
- :meth:`DataFrame.pop` / :meth:`Series.pop`
166+
- :meth:`DataFrame.replace` / :meth:`Series.replace`
167+
- :meth:`DataFrame.shift` / :meth:`Series.shift`
168+
- :meth:`DataFrame.sort_index` / :meth:`Series.sort_index`
169+
- :meth:`DataFrame.sort_values` / :meth:`Series.sort_values`
170+
- :meth:`DataFrame.squeeze` / :meth:`Series.squeeze`
171+
- :meth:`DataFrame.swapaxes`
172+
- :meth:`DataFrame.swaplevel` / :meth:`Series.swaplevel`
173+
- :meth:`DataFrame.take` / :meth:`Series.take`
174+
- :meth:`DataFrame.to_timestamp` / :meth:`Series.to_timestamp`
175+
- :meth:`DataFrame.to_period` / :meth:`Series.to_period`
176+
- :meth:`DataFrame.truncate`
177+
- :meth:`DataFrame.iterrows`
178+
- :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize`
179+
- :meth:`DataFrame.fillna` / :meth:`Series.fillna`
180+
- :meth:`DataFrame.interpolate` / :meth:`Series.interpolate`
181+
- :meth:`DataFrame.ffill` / :meth:`Series.ffill`
182+
- :meth:`DataFrame.bfill` / :meth:`Series.bfill`
183+
- :meth:`DataFrame.where` / :meth:`Series.where`
184+
- :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects`
185+
- :meth:`DataFrame.astype` / :meth:`Series.astype`
186+
- :meth:`DataFrame.convert_dtypes` / :meth:`Series.convert_dtypes`
187+
- :meth:`DataFrame.join`
188+
- :func:`concat`
189+
- :func:`merge`
190+
191+
These methods return views when Copy-on-Write is enabled, which provides a significant
192+
performance improvement compared to the regular execution.
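
To get a feel for this, the sketch below checks a few of the listed methods for shared memory with their source. It assumes pandas 2.0 or later; the helper ``shares_first_column`` is hypothetical and written only for this example, and ``numpy.shares_memory`` inspects an implementation detail rather than documented behavior:

```python
import numpy as np
import pandas as pd


def shares_first_column(left, right):
    """Return True if the first columns of both frames use the same buffer."""
    return bool(
        np.shares_memory(left.iloc[:, 0].to_numpy(), right.iloc[:, 0].to_numpy())
    )


with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

    # A small sample of the methods from the list above.
    candidates = {
        "reset_index": df.reset_index(drop=True),
        "rename": df.rename(columns=str.upper),
        "add_prefix": df.add_prefix("x_"),
    }
    sharing = {
        name: shares_first_column(df, result) for name, result in candidates.items()
    }

print(sharing)
```

Each result is expected to share its data with ``df`` until one of the two objects is written to.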
193+
194+
How to enable CoW
195+
-----------------
196+
197+
Copy-on-Write can be enabled through the configuration option ``mode.copy_on_write``. The option can
198+
be turned on **globally** through either of the following:
199+
200+
.. ipython:: python
201+
202+
pd.set_option("mode.copy_on_write", True)
203+
204+
pd.options.mode.copy_on_write = True
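
The option can also be set for a limited scope with ``pd.option_context``, the same context manager the examples in this guide use to switch CoW off. A minimal sketch, assuming a pandas version in which the ``mode.copy_on_write`` option is available:

```python
import pandas as pd

# Remember the global setting so it can be compared afterwards.
before = pd.get_option("mode.copy_on_write")

with pd.option_context("mode.copy_on_write", True):
    # Inside the block the option is switched on.
    inside = pd.get_option("mode.copy_on_write")

# Leaving the block restores the previous global value.
after = pd.get_option("mode.copy_on_write")

print(before, inside, after)
```

This keeps the global setting untouched, which can be convenient when only part of a program should run with CoW enabled.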
205+
206+
.. ipython:: python
207+
:suppress:
208+
209+
pd.options.mode.copy_on_write = False

doc/source/user_guide/index.rst

+1
@@ -67,6 +67,7 @@ Guides
6767
pyarrow
6868
indexing
6969
advanced
70+
copy_on_write
7071
merging
7172
reshaping
7273
text

doc/source/whatsnew/v2.0.0.rst

+2-51
@@ -183,57 +183,8 @@ Copy-on-Write improvements
183183
^^^^^^^^^^^^^^^^^^^^^^^^^^
184184

185185
- A new lazy copy mechanism that defers the copy until the object in question is modified
186-
was added to the following methods:
187-
188-
- :meth:`DataFrame.reset_index` / :meth:`Series.reset_index`
189-
- :meth:`DataFrame.set_index`
190-
- :meth:`DataFrame.set_axis` / :meth:`Series.set_axis`
191-
- :meth:`DataFrame.set_flags` / :meth:`Series.set_flags`
192-
- :meth:`DataFrame.rename_axis` / :meth:`Series.rename_axis`
193-
- :meth:`DataFrame.reindex` / :meth:`Series.reindex`
194-
- :meth:`DataFrame.reindex_like` / :meth:`Series.reindex_like`
195-
- :meth:`DataFrame.assign`
196-
- :meth:`DataFrame.drop`
197-
- :meth:`DataFrame.dropna` / :meth:`Series.dropna`
198-
- :meth:`DataFrame.select_dtypes`
199-
- :meth:`DataFrame.align` / :meth:`Series.align`
200-
- :meth:`Series.to_frame`
201-
- :meth:`DataFrame.rename` / :meth:`Series.rename`
202-
- :meth:`DataFrame.add_prefix` / :meth:`Series.add_prefix`
203-
- :meth:`DataFrame.add_suffix` / :meth:`Series.add_suffix`
204-
- :meth:`DataFrame.drop_duplicates` / :meth:`Series.drop_duplicates`
205-
- :meth:`DataFrame.droplevel` / :meth:`Series.droplevel`
206-
- :meth:`DataFrame.reorder_levels` / :meth:`Series.reorder_levels`
207-
- :meth:`DataFrame.between_time` / :meth:`Series.between_time`
208-
- :meth:`DataFrame.filter` / :meth:`Series.filter`
209-
- :meth:`DataFrame.head` / :meth:`Series.head`
210-
- :meth:`DataFrame.tail` / :meth:`Series.tail`
211-
- :meth:`DataFrame.isetitem`
212-
- :meth:`DataFrame.pipe` / :meth:`Series.pipe`
213-
- :meth:`DataFrame.pop` / :meth:`Series.pop`
214-
- :meth:`DataFrame.replace` / :meth:`Series.replace`
215-
- :meth:`DataFrame.shift` / :meth:`Series.shift`
216-
- :meth:`DataFrame.sort_index` / :meth:`Series.sort_index`
217-
- :meth:`DataFrame.sort_values` / :meth:`Series.sort_values`
218-
- :meth:`DataFrame.squeeze` / :meth:`Series.squeeze`
219-
- :meth:`DataFrame.swapaxes`
220-
- :meth:`DataFrame.swaplevel` / :meth:`Series.swaplevel`
221-
- :meth:`DataFrame.take` / :meth:`Series.take`
222-
- :meth:`DataFrame.to_timestamp` / :meth:`Series.to_timestamp`
223-
- :meth:`DataFrame.to_period` / :meth:`Series.to_period`
224-
- :meth:`DataFrame.truncate`
225-
- :meth:`DataFrame.iterrows`
226-
- :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize`
227-
- :meth:`DataFrame.fillna` / :meth:`Series.fillna`
228-
- :meth:`DataFrame.interpolate` / :meth:`Series.interpolate`
229-
- :meth:`DataFrame.ffill` / :meth:`Series.ffill`
230-
- :meth:`DataFrame.bfill` / :meth:`Series.bfill`
231-
- :meth:`DataFrame.where` / :meth:`Series.where`
232-
- :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects`
233-
- :meth:`DataFrame.astype` / :meth:`Series.astype`
234-
- :meth:`DataFrame.convert_dtypes` / :meth:`Series.convert_dtypes`
235-
- :func:`concat`
236-
186+
was added to the methods listed in
187+
:ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`.
237188
These methods return views when Copy-on-Write is enabled, which provides a significant
238189
performance improvement compared to the regular execution (:issue:`49473`).
239190
