|
| 1 | +.. _copy_on_write: |
| 2 | + |
| 3 | +{{ header }} |
| 4 | + |
| 5 | +******************* |
| 6 | +Copy-on-Write (CoW) |
| 7 | +******************* |
| 8 | + |
| 9 | +.. ipython:: python |
| 10 | + :suppress: |
| 11 | +
|
| 12 | + pd.options.mode.copy_on_write = True |
| 13 | +
|
| 14 | +Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the |
| 15 | +optimizations that become possible through CoW are implemented and supported. A complete list |
| 16 | +can be found at :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`. |
| 17 | + |
| 18 | +We expect that CoW will be enabled by default in version 3.0. |
| 19 | + |
| 20 | +CoW will lead to more predictable behavior since it is not possible to update more than |
| 21 | +one object with one statement; for example, indexing operations or methods won't have side-effects. Additionally, by |
| 22 | +delaying copies as long as possible, the average performance and memory usage will improve. |
| 23 | + |
| 24 | +Description |
| 25 | +----------- |
| 26 | + |
| 27 | +CoW means that any DataFrame or Series derived from another in any way always |
| 28 | +behaves as a copy. As a consequence, we can only change the values of an object |
| 29 | +through modifying the object itself. CoW disallows updating a DataFrame or a Series |
| 30 | +that shares data with another DataFrame or Series object inplace. |
| 31 | + |
| 32 | +This avoids side-effects when modifying values and hence, most methods can avoid |
| 33 | +actually copying the data and only trigger a copy when necessary. |
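
As a minimal sketch of this rule, the snippet below selects a column and then writes to the selection. It assumes a pandas version in which the ``mode.copy_on_write`` option is available (2.x is assumed here); the names ``subset``, ``parent_value`` and ``subset_value`` are only illustrative.

```python
import pandas as pd

# Assumption: the "mode.copy_on_write" option exists in the installed pandas.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

    # A derived object: under CoW it behaves as a copy of the column.
    subset = df["foo"]

    # Writing to the derived object must not propagate back to df.
    subset.iloc[0] = 100

    parent_value = df.loc[0, "foo"]
    subset_value = subset.iloc[0]
```

Under these assumptions only ``subset`` sees the new value, while ``df`` keeps its original data.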
| 34 | + |
| 35 | +The following example will operate inplace with CoW: |
| 36 | + |
| 37 | +.. ipython:: python |
| 38 | +
|
| 39 | + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) |
| 40 | + df.iloc[0, 0] = 100 |
| 41 | + df |
| 42 | +
|
| 43 | +The object ``df`` does not share any data with any other object and hence no |
| 44 | +copy is triggered when updating the values. In contrast, the following operation |
| 45 | +triggers a copy of the data under CoW: |
| 46 | + |
| 47 | + |
| 48 | +.. ipython:: python |
| 49 | +
|
| 50 | + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) |
| 51 | + df2 = df.reset_index(drop=True) |
| 52 | + df2.iloc[0, 0] = 100 |
| 53 | +
|
| 54 | + df |
| 55 | + df2 |
| 56 | +
|
| 57 | +``reset_index`` returns a lazy copy with CoW while it copies the data without CoW. |
| 58 | +Since both objects, ``df`` and ``df2`` share the same data, a copy is triggered |
| 59 | +when modifying ``df2``. The object ``df`` still has the same values as initially |
| 60 | +while ``df2`` was modified. |
| 61 | + |
| 62 | +If the object ``df`` isn't needed anymore after performing the ``reset_index`` operation, |
| 63 | +you can emulate an inplace-like operation through assigning the output of ``reset_index`` |
| 64 | +to the same variable: |
| 65 | + |
| 66 | +.. ipython:: python |
| 67 | +
|
| 68 | + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) |
| 69 | + df = df.reset_index(drop=True) |
| 70 | + df.iloc[0, 0] = 100 |
| 71 | + df |
| 72 | +
|
| 73 | +The initial object goes out of scope as soon as the result of ``reset_index`` is |
| 74 | +reassigned and hence ``df`` does not share data with any other object. No copy |
| 75 | +is necessary when modifying the object. This is generally true for all methods |
| 76 | +listed in :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`. |
| 77 | + |
| 78 | +Previously, when operating on views, both the view and the parent object were modified: |
| 79 | + |
| 80 | +.. ipython:: python |
| 81 | +
|
| 82 | + with pd.option_context("mode.copy_on_write", False): |
| 83 | + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) |
| 84 | + view = df[:] |
| 85 | + df.iloc[0, 0] = 100 |
| 86 | +
|
| 87 | + df |
| 88 | + view |
| 89 | +
|
| 90 | +CoW triggers a copy when ``df`` is changed to avoid mutating ``view`` as well: |
| 91 | + |
| 92 | +.. ipython:: python |
| 93 | +
|
| 94 | + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) |
| 95 | + view = df[:] |
| 96 | + df.iloc[0, 0] = 100 |
| 97 | +
|
| 98 | + df |
| 99 | + view |
| 100 | +
|
| 101 | +Chained Assignment |
| 102 | +------------------ |
| 103 | + |
| 104 | +Chained assignment refers to a technique where an object is updated through |
| 105 | +two consecutive indexing operations, e.g. |
| 106 | + |
| 107 | +.. ipython:: python |
| 108 | +
|
| 109 | + with pd.option_context("mode.copy_on_write", False): |
| 110 | + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) |
| 111 | + df["foo"][df["bar"] > 5] = 100 |
| 112 | + df |
| 113 | +
|
| 114 | +The column ``foo`` is updated where the column ``bar`` is greater than 5. |
| 115 | +This violates the CoW principles, though, because it would have to modify the |
| 116 | +view ``df["foo"]`` and ``df`` in one step. Hence, chained assignment never works |
| 117 | +with CoW enabled and raises a ``ChainedAssignmentError``: |
| 118 | + |
| 119 | +.. ipython:: python |
| 120 | + :okexcept: |
| 121 | +
|
| 122 | + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) |
| 123 | + df["foo"][df["bar"] > 5] = 100 |
| 124 | +
|
| 125 | +With Copy-on-Write, the same update can be done in a single step by using ``loc``: |
| 126 | + |
| 127 | +.. ipython:: python |
| 128 | +
|
| 129 | + df.loc[df["bar"] > 5, "foo"] = 100 |
| 130 | +
|
| 131 | +.. _copy_on_write.optimizations: |
| 132 | + |
| 133 | +Copy-on-Write optimizations |
| 134 | +--------------------------- |
| 135 | + |
| 136 | +A new lazy copy mechanism defers the copy until the object in question is modified, |
| 137 | +and only if this object shares data with another object. This mechanism was added to |
| 138 | +the following methods: |
| 139 | + |
| 140 | + - :meth:`DataFrame.reset_index` / :meth:`Series.reset_index` |
| 141 | + - :meth:`DataFrame.set_index` |
| 142 | + - :meth:`DataFrame.set_axis` / :meth:`Series.set_axis` |
| 143 | + - :meth:`DataFrame.set_flags` / :meth:`Series.set_flags` |
| 144 | + - :meth:`DataFrame.rename_axis` / :meth:`Series.rename_axis` |
| 145 | + - :meth:`DataFrame.reindex` / :meth:`Series.reindex` |
| 146 | + - :meth:`DataFrame.reindex_like` / :meth:`Series.reindex_like` |
| 147 | + - :meth:`DataFrame.assign` |
| 148 | + - :meth:`DataFrame.drop` |
| 149 | + - :meth:`DataFrame.dropna` / :meth:`Series.dropna` |
| 150 | + - :meth:`DataFrame.select_dtypes` |
| 151 | + - :meth:`DataFrame.align` / :meth:`Series.align` |
| 152 | + - :meth:`Series.to_frame` |
| 153 | + - :meth:`DataFrame.rename` / :meth:`Series.rename` |
| 154 | + - :meth:`DataFrame.add_prefix` / :meth:`Series.add_prefix` |
| 155 | + - :meth:`DataFrame.add_suffix` / :meth:`Series.add_suffix` |
| 156 | + - :meth:`DataFrame.drop_duplicates` / :meth:`Series.drop_duplicates` |
| 157 | + - :meth:`DataFrame.droplevel` / :meth:`Series.droplevel` |
| 158 | + - :meth:`DataFrame.reorder_levels` / :meth:`Series.reorder_levels` |
| 159 | + - :meth:`DataFrame.between_time` / :meth:`Series.between_time` |
| 160 | + - :meth:`DataFrame.filter` / :meth:`Series.filter` |
| 161 | + - :meth:`DataFrame.head` / :meth:`Series.head` |
| 162 | + - :meth:`DataFrame.tail` / :meth:`Series.tail` |
| 163 | + - :meth:`DataFrame.isetitem` |
| 164 | + - :meth:`DataFrame.pipe` / :meth:`Series.pipe` |
| 165 | + - :meth:`DataFrame.pop` / :meth:`Series.pop` |
| 166 | + - :meth:`DataFrame.replace` / :meth:`Series.replace` |
| 167 | + - :meth:`DataFrame.shift` / :meth:`Series.shift` |
| 168 | + - :meth:`DataFrame.sort_index` / :meth:`Series.sort_index` |
| 169 | + - :meth:`DataFrame.sort_values` / :meth:`Series.sort_values` |
| 170 | + - :meth:`DataFrame.squeeze` / :meth:`Series.squeeze` |
| 171 | + - :meth:`DataFrame.swapaxes` |
| 172 | + - :meth:`DataFrame.swaplevel` / :meth:`Series.swaplevel` |
| 173 | + - :meth:`DataFrame.take` / :meth:`Series.take` |
| 174 | + - :meth:`DataFrame.to_timestamp` / :meth:`Series.to_timestamp` |
| 175 | + - :meth:`DataFrame.to_period` / :meth:`Series.to_period` |
| 176 | + - :meth:`DataFrame.truncate` |
| 177 | + - :meth:`DataFrame.iterrows` |
| 178 | + - :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize` |
| 179 | + - :meth:`DataFrame.fillna` / :meth:`Series.fillna` |
| 180 | + - :meth:`DataFrame.interpolate` / :meth:`Series.interpolate` |
| 181 | + - :meth:`DataFrame.ffill` / :meth:`Series.ffill` |
| 182 | + - :meth:`DataFrame.bfill` / :meth:`Series.bfill` |
| 183 | + - :meth:`DataFrame.where` / :meth:`Series.where` |
| 184 | + - :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects` |
| 185 | + - :meth:`DataFrame.astype` / :meth:`Series.astype` |
| 186 | + - :meth:`DataFrame.convert_dtypes` / :meth:`Series.convert_dtypes` |
| 187 | + - :meth:`DataFrame.join` |
| 188 | + - :func:`concat` |
| 189 | + - :func:`merge` |
| 190 | + |
| 191 | +These methods return views when Copy-on-Write is enabled, which provides a significant |
| 192 | +performance improvement compared to copying the data eagerly, as happens without CoW. |
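
The lazy copy can be observed, at least roughly, by checking whether two objects still reference the same memory. The following is a sketch rather than a definitive test: it assumes NumPy is installed and that the ``mode.copy_on_write`` option is available, and the names ``shared_before`` and ``shared_after`` are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: the "mode.copy_on_write" option exists in the installed pandas.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    df2 = df.reset_index(drop=True)  # lazy copy under CoW

    # Before any write, both objects are expected to reference the same data.
    shared_before = np.shares_memory(
        df["foo"].to_numpy(), df2["foo"].to_numpy()
    )

    # The write triggers the deferred copy, so df2 gets its own data.
    df2.iloc[0, 0] = 100
    shared_after = np.shares_memory(
        df["foo"].to_numpy(), df2["foo"].to_numpy()
    )

    original_value = df.iloc[0, 0]
    modified_value = df2.iloc[0, 0]
```

``shared_before`` is expected to be ``True`` and ``shared_after`` ``False``: the copy happens only once ``df2`` is modified, and ``df`` is left unchanged.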
| 193 | + |
| 194 | +How to enable CoW |
| 195 | +----------------- |
| 196 | + |
| 197 | +Copy-on-Write can be enabled through the configuration option ``mode.copy_on_write``. The option can |
| 198 | +be turned on **globally** through either of the following: |
| 199 | + |
| 200 | +.. ipython:: python |
| 201 | +
|
| 202 | + pd.set_option("mode.copy_on_write", True) |
| 203 | +
|
| 204 | + pd.options.mode.copy_on_write = True |
| 205 | +
|
| 206 | +.. ipython:: python |
| 207 | + :suppress: |
| 208 | +
|
| 209 | + pd.options.mode.copy_on_write = False |
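
The option can also be switched on for a limited block of code with ``pd.option_context``, the same context manager that the examples above use to turn CoW off. The sketch below assumes the ``mode.copy_on_write`` option is available; the names ``df_value`` and ``view_value`` are only illustrative. All objects are created inside the block so that CoW and non-CoW objects are not mixed.

```python
import pandas as pd

# Assumption: the "mode.copy_on_write" option exists in the installed pandas.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    view = df[:]

    # With CoW active inside the block, this write does not reach view.
    df.iloc[0, 0] = 100

    df_value = df.iloc[0, 0]
    view_value = view.iloc[0, 0]
```

When the block exits, the option returns to whatever value it had before.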