diff --git a/doc/source/development/copy_on_write.rst b/doc/source/development/copy_on_write.rst index 34625ed645615..087e45fbb965b 100644 --- a/doc/source/development/copy_on_write.rst +++ b/doc/source/development/copy_on_write.rst @@ -1,4 +1,4 @@ -.. _copy_on_write: +.. _copy_on_write_dev: {{ header }} @@ -9,7 +9,8 @@ Copy on write Copy on Write is a mechanism to simplify the indexing API and improve performance through avoiding copies if possible. CoW means that any DataFrame or Series derived from another in any way always -behaves as a copy. +behaves as a copy. An explanation on how to use Copy on Write efficiently can be +found :ref:`here `. Reference tracking ------------------ diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst new file mode 100644 index 0000000000000..94dde9a6ffd70 --- /dev/null +++ b/doc/source/user_guide/copy_on_write.rst @@ -0,0 +1,209 @@ +.. _copy_on_write: + +{{ header }} + +******************* +Copy-on-Write (CoW) +******************* + +.. ipython:: python + :suppress: + + pd.options.mode.copy_on_write = True + +Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the +optimizations that become possible through CoW are implemented and supported. A complete list +can be found at :ref:`Copy-on-Write optimizations `. + +We expect that CoW will be enabled by default in version 3.0. + +CoW will lead to more predictable behavior since it is not possible to update more than +one object with one statement, e.g. indexing operations or methods won't have side-effects. Additionally, through +delaying copies as long as possible, the average performance and memory usage will improve. + +Description +----------- + +CoW means that any DataFrame or Series derived from another in any way always +behaves as a copy. As a consequence, we can only change the values of an object +through modifying the object itself. CoW disallows updating a DataFrame or a Series +that shares data with another DataFrame or Series object inplace. + +This avoids side-effects when modifying values and hence, most methods can avoid +actually copying the data and only trigger a copy when necessary. + +The following example will operate inplace with CoW: + +.. ipython:: python + + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) + df.iloc[0, 0] = 100 + df + +The object ``df`` does not share any data with any other object and hence no +copy is triggered when updating the values. In contrast, the following operation +triggers a copy of the data under CoW: + + +.. ipython:: python + + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) + df2 = df.reset_index(drop=True) + df2.iloc[0, 0] = 100 + + df + df2 + +``reset_index`` returns a lazy copy with CoW while it copies the data without CoW. +Since both objects, ``df`` and ``df2`` share the same data, a copy is triggered +when modifying ``df2``. The object ``df`` still has the same values as initially +while ``df2`` was modified. + +If the object ``df`` isn't needed anymore after performing the ``reset_index`` operation, +you can emulate an inplace-like operation through assigning the output of ``reset_index`` +to the same variable: + +.. ipython:: python + + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) + df = df.reset_index(drop=True) + df.iloc[0, 0] = 100 + df + +The initial object gets out of scope as soon as the result of ``reset_index`` is +reassigned and hence ``df`` does not share data with any other object. No copy +is necessary when modifying the object. This is generally true for all methods +listed in :ref:`Copy-on-Write optimizations `. + +Previously, when operating on views, the view and the parent object was modified: + +.. ipython:: python + + with pd.option_context("mode.copy_on_write", False): + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) + view = df[:] + df.iloc[0, 0] = 100 + + df + view + +CoW triggers a copy when ``df`` is changed to avoid mutating ``view`` as well: + +.. ipython:: python + + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) + view = df[:] + df.iloc[0, 0] = 100 + + df + view + +Chained Assignment +------------------ + +Chained assignment references a technique where an object is updated through +two subsequent indexing operations, e.g. + +.. ipython:: python + + with pd.option_context("mode.copy_on_write", False): + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) + df["foo"][df["bar"] > 5] = 100 + df + +The column ``foo`` is updated where the column ``bar`` is greater than 5. +This violates the CoW principles though, because it would have to modify the +view ``df["foo"]`` and ``df`` in one step. Hence, chained assignment will +consistently never work and raise a ``ChainedAssignmentError`` with CoW enabled: + +.. ipython:: python + :okexcept: + + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) + df["foo"][df["bar"] > 5] = 100 + +With copy on write this can be done by using ``loc``. + +.. ipython:: python + + df.loc[df["bar"] > 5, "foo"] = 100 + +.. _copy_on_write.optimizations: + +Copy-on-Write optimizations +--------------------------- + +A new lazy copy mechanism that defers the copy until the object in question is modified +and only if this object shares data with another object. This mechanism was added to +following methods: + + - :meth:`DataFrame.reset_index` / :meth:`Series.reset_index` + - :meth:`DataFrame.set_index` + - :meth:`DataFrame.set_axis` / :meth:`Series.set_axis` + - :meth:`DataFrame.set_flags` / :meth:`Series.set_flags` + - :meth:`DataFrame.rename_axis` / :meth:`Series.rename_axis` + - :meth:`DataFrame.reindex` / :meth:`Series.reindex` + - :meth:`DataFrame.reindex_like` / :meth:`Series.reindex_like` + - :meth:`DataFrame.assign` + - :meth:`DataFrame.drop` + - :meth:`DataFrame.dropna` / :meth:`Series.dropna` + - :meth:`DataFrame.select_dtypes` + - :meth:`DataFrame.align` / :meth:`Series.align` + - :meth:`Series.to_frame` + - :meth:`DataFrame.rename` / :meth:`Series.rename` + - :meth:`DataFrame.add_prefix` / :meth:`Series.add_prefix` + - :meth:`DataFrame.add_suffix` / :meth:`Series.add_suffix` + - :meth:`DataFrame.drop_duplicates` / :meth:`Series.drop_duplicates` + - :meth:`DataFrame.droplevel` / :meth:`Series.droplevel` + - :meth:`DataFrame.reorder_levels` / :meth:`Series.reorder_levels` + - :meth:`DataFrame.between_time` / :meth:`Series.between_time` + - :meth:`DataFrame.filter` / :meth:`Series.filter` + - :meth:`DataFrame.head` / :meth:`Series.head` + - :meth:`DataFrame.tail` / :meth:`Series.tail` + - :meth:`DataFrame.isetitem` + - :meth:`DataFrame.pipe` / :meth:`Series.pipe` + - :meth:`DataFrame.pop` / :meth:`Series.pop` + - :meth:`DataFrame.replace` / :meth:`Series.replace` + - :meth:`DataFrame.shift` / :meth:`Series.shift` + - :meth:`DataFrame.sort_index` / :meth:`Series.sort_index` + - :meth:`DataFrame.sort_values` / :meth:`Series.sort_values` + - :meth:`DataFrame.squeeze` / :meth:`Series.squeeze` + - :meth:`DataFrame.swapaxes` + - :meth:`DataFrame.swaplevel` / :meth:`Series.swaplevel` + - :meth:`DataFrame.take` / :meth:`Series.take` + - :meth:`DataFrame.to_timestamp` / :meth:`Series.to_timestamp` + - :meth:`DataFrame.to_period` / :meth:`Series.to_period` + - :meth:`DataFrame.truncate` + - :meth:`DataFrame.iterrows` + - :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize` + - :meth:`DataFrame.fillna` / :meth:`Series.fillna` + - :meth:`DataFrame.interpolate` / :meth:`Series.interpolate` + - :meth:`DataFrame.ffill` / :meth:`Series.ffill` + - :meth:`DataFrame.bfill` / :meth:`Series.bfill` + - :meth:`DataFrame.where` / :meth:`Series.where` + - :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects` + - :meth:`DataFrame.astype` / :meth:`Series.astype` + - :meth:`DataFrame.convert_dtypes` / :meth:`Series.convert_dtypes` + - :meth:`DataFrame.join` + - :func:`concat` + - :func:`merge` + +These methods return views when Copy-on-Write is enabled, which provides a significant +performance improvement compared to the regular execution. + +How to enable CoW +----------------- + +Copy-on-Write can be enabled through the configuration option ``copy_on_write``. The option can +be turned on __globally__ through either of the following: + +.. ipython:: python + + pd.set_option("mode.copy_on_write", True) + + pd.options.mode.copy_on_write = True + +.. ipython:: python + :suppress: + + pd.options.mode.copy_on_write = False diff --git a/doc/source/user_guide/index.rst b/doc/source/user_guide/index.rst index e23396c9e5fd4..f0d6a76f0de5b 100644 --- a/doc/source/user_guide/index.rst +++ b/doc/source/user_guide/index.rst @@ -67,6 +67,7 @@ Guides pyarrow indexing advanced + copy_on_write merging reshaping text diff --git a/doc/source/whatsnew/v2.0.0.rst b/doc/source/whatsnew/v2.0.0.rst index 78422ec686da8..f0e7821c32bed 100644 --- a/doc/source/whatsnew/v2.0.0.rst +++ b/doc/source/whatsnew/v2.0.0.rst @@ -183,57 +183,8 @@ Copy-on-Write improvements ^^^^^^^^^^^^^^^^^^^^^^^^^^ - A new lazy copy mechanism that defers the copy until the object in question is modified - was added to the following methods: - - - :meth:`DataFrame.reset_index` / :meth:`Series.reset_index` - - :meth:`DataFrame.set_index` - - :meth:`DataFrame.set_axis` / :meth:`Series.set_axis` - - :meth:`DataFrame.set_flags` / :meth:`Series.set_flags` - - :meth:`DataFrame.rename_axis` / :meth:`Series.rename_axis` - - :meth:`DataFrame.reindex` / :meth:`Series.reindex` - - :meth:`DataFrame.reindex_like` / :meth:`Series.reindex_like` - - :meth:`DataFrame.assign` - - :meth:`DataFrame.drop` - - :meth:`DataFrame.dropna` / :meth:`Series.dropna` - - :meth:`DataFrame.select_dtypes` - - :meth:`DataFrame.align` / :meth:`Series.align` - - :meth:`Series.to_frame` - - :meth:`DataFrame.rename` / :meth:`Series.rename` - - :meth:`DataFrame.add_prefix` / :meth:`Series.add_prefix` - - :meth:`DataFrame.add_suffix` / :meth:`Series.add_suffix` - - :meth:`DataFrame.drop_duplicates` / :meth:`Series.drop_duplicates` - - :meth:`DataFrame.droplevel` / :meth:`Series.droplevel` - - :meth:`DataFrame.reorder_levels` / :meth:`Series.reorder_levels` - - :meth:`DataFrame.between_time` / :meth:`Series.between_time` - - :meth:`DataFrame.filter` / :meth:`Series.filter` - - :meth:`DataFrame.head` / :meth:`Series.head` - - :meth:`DataFrame.tail` / :meth:`Series.tail` - - :meth:`DataFrame.isetitem` - - :meth:`DataFrame.pipe` / :meth:`Series.pipe` - - :meth:`DataFrame.pop` / :meth:`Series.pop` - - :meth:`DataFrame.replace` / :meth:`Series.replace` - - :meth:`DataFrame.shift` / :meth:`Series.shift` - - :meth:`DataFrame.sort_index` / :meth:`Series.sort_index` - - :meth:`DataFrame.sort_values` / :meth:`Series.sort_values` - - :meth:`DataFrame.squeeze` / :meth:`Series.squeeze` - - :meth:`DataFrame.swapaxes` - - :meth:`DataFrame.swaplevel` / :meth:`Series.swaplevel` - - :meth:`DataFrame.take` / :meth:`Series.take` - - :meth:`DataFrame.to_timestamp` / :meth:`Series.to_timestamp` - - :meth:`DataFrame.to_period` / :meth:`Series.to_period` - - :meth:`DataFrame.truncate` - - :meth:`DataFrame.iterrows` - - :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize` - - :meth:`DataFrame.fillna` / :meth:`Series.fillna` - - :meth:`DataFrame.interpolate` / :meth:`Series.interpolate` - - :meth:`DataFrame.ffill` / :meth:`Series.ffill` - - :meth:`DataFrame.bfill` / :meth:`Series.bfill` - - :meth:`DataFrame.where` / :meth:`Series.where` - - :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects` - - :meth:`DataFrame.astype` / :meth:`Series.astype` - - :meth:`DataFrame.convert_dtypes` / :meth:`Series.convert_dtypes` - - :func:`concat` - + was added to the methods listed in + :ref:`Copy-on-Write optimizations `. These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution (:issue:`49473`).