From cce223fbb881368e01a7877e9728b9f2da04e67a Mon Sep 17 00:00:00 2001
From: Patrick Hoefler
Date: Fri, 17 Feb 2023 16:49:09 +0100
Subject: [PATCH 01/10] DOC: Add user guide section about copy on write

---
 doc/source/user_guide/copy_on_write.rst | 183 ++++++++++++++++++++++++
 doc/source/user_guide/index.rst         |   1 +
 doc/source/whatsnew/v2.0.0.rst          |  53 +------
 3 files changed, 186 insertions(+), 51 deletions(-)
 create mode 100644 doc/source/user_guide/copy_on_write.rst

diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
new file mode 100644
index 0000000000000..0ba2f6517cbcb
--- /dev/null
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -0,0 +1,183 @@
+.. _copy_on_write:
+
+{{ header }}
+
+*******************
+Copy-on-Write (CoW)
+*******************
+
+.. ipython:: python
+    :suppress:
+
+    pd.options.mode.copy_on_write = True
+
+Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the
+optimizations that become possible through CoW are implemented and supported. A complete list
+can be found at TODO
+
+We expect that CoW will be enabled per default in version 3.0
+
+Description
+-----------
+
+CoW means that any DataFrame or Series derived from another in any way always
+behaves as a copy. As a consequence, we can only change the values of an object
+through modifying the object itself. CoW disallows updating a DataFrame or a Series
+that shares data with another DataFrame or Series object inplace.
+
+This avoids side-effects when modifying values and hence, most methods can avoid
+actually copying the data and only trigger a copy when necessary.
+
+The following example will operate inplace with CoW:
+
+.. ipython:: python
+
+    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+    df.iloc[0, 0] = 100
+    df
+
+The object ``df`` does not share any data with any other object and hence no
+copy is triggered when updating the values. In contrast, the following operation
+triggers a copy of the data under CoW:
+
+
+.. ipython:: python
+
+    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+    df2 = df.reset_index(drop=True)
+    df2.iloc[0, 0] = 100
+
+    df
+    df2
+
+``reset_index`` returns a lazy copy with CoW while it copies the data without CoW.
+Since both objects, ``df`` and ``df2`` share the same data, a copy is triggered
+when modifying ``df2``. The object ``df`` still has the same values as initially
+while ``df2`` was modified.
+
+If the object ``df`` isn't needed anymore after performing the ``reset_index`` operation,
+you can emulate an inplace-like operation through assigning the output of ``reset_index``
+to the same variable:
+
+.. ipython:: python
+
+    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+    df = df.reset_index(drop=True)
+    df.iloc[0, 0] = 100
+    df
+
+The initial object gets out of scope as soon as the result of ``reset_index`` is
+reassigned and hence ``df`` does not share data with any other object. No copy
+is necessary when modifying the object. This is generally true for all methods
+listed in :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`.
+
+Chained Assignment
+------------------
+
+Chained assignment references a technique where an object is updated through
+two subsequent indexing operations, e.g.
+
+.. ipython:: python
+
+    with pd.option_context("mode.copy_on_write", False):
+        df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+        df["foo"][df["bar"] > 5] = 100
+        df
+
+The column ``foo`` is updated where the column ``bar`` is greater than 5.
+This violates the CoW principles though, because it would habe to modify the
+view ``df["foo"]`` and ``df`` in one step. Hence, chained assignment will
+consistently never work and raise a ``ChainedAssignmentError`` with CoW enabled:
+
+.. ipython:: python
+    :okexcept:
+
+    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+    df["foo"][df["bar"] > 5] = 100
+
+.. _copy_on_write.optimizations:
+
+Copy-on-Write optimizations
+---------------------------
+
+A new lazy copy mechanism that defers the copy until the object in question is modified
+and only if this object shares data with another object. This mechanism was added to
+following methods:
+
+  - :meth:`DataFrame.reset_index` / :meth:`Series.reset_index`
+  - :meth:`DataFrame.set_index`
+  - :meth:`DataFrame.set_axis` / :meth:`Series.set_axis`
+  - :meth:`DataFrame.set_flags` / :meth:`Series.set_flags`
+  - :meth:`DataFrame.rename_axis` / :meth:`Series.rename_axis`
+  - :meth:`DataFrame.reindex` / :meth:`Series.reindex`
+  - :meth:`DataFrame.reindex_like` / :meth:`Series.reindex_like`
+  - :meth:`DataFrame.assign`
+  - :meth:`DataFrame.drop`
+  - :meth:`DataFrame.dropna` / :meth:`Series.dropna`
+  - :meth:`DataFrame.select_dtypes`
+  - :meth:`DataFrame.align` / :meth:`Series.align`
+  - :meth:`Series.to_frame`
+  - :meth:`DataFrame.rename` / :meth:`Series.rename`
+  - :meth:`DataFrame.add_prefix` / :meth:`Series.add_prefix`
+  - :meth:`DataFrame.add_suffix` / :meth:`Series.add_suffix`
+  - :meth:`DataFrame.drop_duplicates` / :meth:`Series.drop_duplicates`
+  - :meth:`DataFrame.droplevel` / :meth:`Series.droplevel`
+  - :meth:`DataFrame.reorder_levels` / :meth:`Series.reorder_levels`
+  - :meth:`DataFrame.between_time` / :meth:`Series.between_time`
+  - :meth:`DataFrame.filter` / :meth:`Series.filter`
+  - :meth:`DataFrame.head` / :meth:`Series.head`
+  - :meth:`DataFrame.tail` / :meth:`Series.tail`
+  - :meth:`DataFrame.isetitem`
+  - :meth:`DataFrame.pipe` / :meth:`Series.pipe`
+  - :meth:`DataFrame.pop` / :meth:`Series.pop`
+  - :meth:`DataFrame.replace` / :meth:`Series.replace`
+  - :meth:`DataFrame.shift` / :meth:`Series.shift`
+  - :meth:`DataFrame.sort_index` / :meth:`Series.sort_index`
+  - :meth:`DataFrame.sort_values` / :meth:`Series.sort_values`
+  - :meth:`DataFrame.squeeze` / :meth:`Series.squeeze`
+  - :meth:`DataFrame.swapaxes`
+  - :meth:`DataFrame.swaplevel` / :meth:`Series.swaplevel`
+  - :meth:`DataFrame.take` / :meth:`Series.take`
+  - :meth:`DataFrame.to_timestamp` / :meth:`Series.to_timestamp`
+  - :meth:`DataFrame.to_period` / :meth:`Series.to_period`
+  - :meth:`DataFrame.truncate`
+  - :meth:`DataFrame.iterrows`
+  - :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize`
+  - :meth:`DataFrame.fillna` / :meth:`Series.fillna`
+  - :meth:`DataFrame.interpolate` / :meth:`Series.interpolate`
+  - :meth:`DataFrame.ffill` / :meth:`Series.ffill`
+  - :meth:`DataFrame.bfill` / :meth:`Series.bfill`
+  - :meth:`DataFrame.where` / :meth:`Series.where`
+  - :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects`
+  - :meth:`DataFrame.astype` / :meth:`Series.astype`
+  - :meth:`DataFrame.convert_dtypes` / :meth:`Series.convert_dtypes`
+  - :meth:`DataFrame.join`
+  - :func:`concat`
+  - :func:`merge`
+
+These methods return views when Copy-on-Write is enabled, which provides a significant
+performance improvement compared to the regular execution.
+
+How to enable CoW
+-----------------
+
+Copy-on-Write can be enabled through the configuration option ``copy_on_write``. The option can
+be turned on __globally__ through either of the following:
+
+.. ipython:: python
+
+    pd.set_option("mode.copy_on_write", True)
+
+    pd.options.mode.copy_on_write = True
+
+Alternatively, CoW can be enabled locally for testing purposes through:
+
+.. ipython:: python
+
+    with pd.option_context("mode.copy_on_write", True):
+        ...
+
+.. ipython:: python
+    :suppress:
+
+    pd.options.mode.copy_on_write = False
diff --git a/doc/source/user_guide/index.rst b/doc/source/user_guide/index.rst
index e23396c9e5fd4..f0d6a76f0de5b 100644
--- a/doc/source/user_guide/index.rst
+++ b/doc/source/user_guide/index.rst
@@ -67,6 +67,7 @@ Guides
     pyarrow
     indexing
     advanced
+    copy_on_write
     merging
     reshaping
     text
diff --git a/doc/source/whatsnew/v2.0.0.rst b/doc/source/whatsnew/v2.0.0.rst
index 78422ec686da8..f0e7821c32bed 100644
--- a/doc/source/whatsnew/v2.0.0.rst
+++ b/doc/source/whatsnew/v2.0.0.rst
@@ -183,57 +183,8 @@ Copy-on-Write improvements
 ^^^^^^^^^^^^^^^^^^^^^^^^^^

 - A new lazy copy mechanism that defers the copy until the object in question is modified
-  was added to the following methods:
-
-  - :meth:`DataFrame.reset_index` / :meth:`Series.reset_index`
-  - :meth:`DataFrame.set_index`
-  - :meth:`DataFrame.set_axis` / :meth:`Series.set_axis`
-  - :meth:`DataFrame.set_flags` / :meth:`Series.set_flags`
-  - :meth:`DataFrame.rename_axis` / :meth:`Series.rename_axis`
-  - :meth:`DataFrame.reindex` / :meth:`Series.reindex`
-  - :meth:`DataFrame.reindex_like` / :meth:`Series.reindex_like`
-  - :meth:`DataFrame.assign`
-  - :meth:`DataFrame.drop`
-  - :meth:`DataFrame.dropna` / :meth:`Series.dropna`
-  - :meth:`DataFrame.select_dtypes`
-  - :meth:`DataFrame.align` / :meth:`Series.align`
-  - :meth:`Series.to_frame`
-  - :meth:`DataFrame.rename` / :meth:`Series.rename`
-  - :meth:`DataFrame.add_prefix` / :meth:`Series.add_prefix`
-  - :meth:`DataFrame.add_suffix` / :meth:`Series.add_suffix`
-  - :meth:`DataFrame.drop_duplicates` / :meth:`Series.drop_duplicates`
-  - :meth:`DataFrame.droplevel` / :meth:`Series.droplevel`
-  - :meth:`DataFrame.reorder_levels` / :meth:`Series.reorder_levels`
-  - :meth:`DataFrame.between_time` / :meth:`Series.between_time`
-  - :meth:`DataFrame.filter` / :meth:`Series.filter`
-  - :meth:`DataFrame.head` / :meth:`Series.head`
-  - :meth:`DataFrame.tail` / :meth:`Series.tail`
-  - :meth:`DataFrame.isetitem`
-  - :meth:`DataFrame.pipe` / :meth:`Series.pipe`
-  - :meth:`DataFrame.pop` / :meth:`Series.pop`
-  - :meth:`DataFrame.replace` / :meth:`Series.replace`
-  - :meth:`DataFrame.shift` / :meth:`Series.shift`
-  - :meth:`DataFrame.sort_index` / :meth:`Series.sort_index`
-  - :meth:`DataFrame.sort_values` / :meth:`Series.sort_values`
-  - :meth:`DataFrame.squeeze` / :meth:`Series.squeeze`
-  - :meth:`DataFrame.swapaxes`
-  - :meth:`DataFrame.swaplevel` / :meth:`Series.swaplevel`
-  - :meth:`DataFrame.take` / :meth:`Series.take`
-  - :meth:`DataFrame.to_timestamp` / :meth:`Series.to_timestamp`
-  - :meth:`DataFrame.to_period` / :meth:`Series.to_period`
-  - :meth:`DataFrame.truncate`
-  - :meth:`DataFrame.iterrows`
-  - :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize`
-  - :meth:`DataFrame.fillna` / :meth:`Series.fillna`
-  - :meth:`DataFrame.interpolate` / :meth:`Series.interpolate`
-  - :meth:`DataFrame.ffill` / :meth:`Series.ffill`
-  - :meth:`DataFrame.bfill` / :meth:`Series.bfill`
-  - :meth:`DataFrame.where` / :meth:`Series.where`
-  - :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects`
-  - :meth:`DataFrame.astype` / :meth:`Series.astype`
-  - :meth:`DataFrame.convert_dtypes` / :meth:`Series.convert_dtypes`
-  - :func:`concat`
-
+  was added to the methods listed in
+  :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`.
   These methods return views when Copy-on-Write is enabled, which provides a significant
   performance improvement compared to the regular execution (:issue:`49473`).

From fb1fcab98c64d43786714e1b69e022fb256d0cd0 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler
Date: Fri, 17 Feb 2023 19:10:09 +0100
Subject: [PATCH 02/10] Adjust label

---
 doc/source/development/copy_on_write.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/development/copy_on_write.rst b/doc/source/development/copy_on_write.rst
index 34625ed645615..0e7def11d9ce2 100644
--- a/doc/source/development/copy_on_write.rst
+++ b/doc/source/development/copy_on_write.rst
@@ -1,4 +1,4 @@
-.. _copy_on_write:
+.. _copy_on_write_dev:

 {{ header }}

From 5ff4f92ab20631d653193dbb27294be4e5b36a04 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler
Date: Fri, 17 Feb 2023 19:42:27 +0100
Subject: [PATCH 03/10] Replace todo

---
 doc/source/user_guide/copy_on_write.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
index 0ba2f6517cbcb..6b90bd548c851 100644
--- a/doc/source/user_guide/copy_on_write.rst
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -13,7 +13,7 @@ Copy-on-Write (CoW)

 Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the
 optimizations that become possible through CoW are implemented and supported. A complete list
-can be found at TODO
+can be found at :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`.

 We expect that CoW will be enabled per default in version 3.0

From b8d3906436e0d068e92312db233cd97bfdb4c9c7 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Fri, 17 Feb 2023 19:59:11 +0100
Subject: [PATCH 04/10] Update doc/source/user_guide/copy_on_write.rst

Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
---
 doc/source/user_guide/copy_on_write.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
index 6b90bd548c851..c671ab28abad9 100644
--- a/doc/source/user_guide/copy_on_write.rst
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -85,7 +85,7 @@ two subsequent indexing operations, e.g.
         df

 The column ``foo`` is updated where the column ``bar`` is greater than 5.
-This violates the CoW principles though, because it would habe to modify the
+This violates the CoW principles though, because it would have to modify the
 view ``df["foo"]`` and ``df`` in one step. Hence, chained assignment will
 consistently never work and raise a ``ChainedAssignmentError`` with CoW enabled:

From b9e863290df6b910169d32005e2961c174d83c26 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Fri, 17 Feb 2023 19:59:39 +0100
Subject: [PATCH 05/10] Update doc/source/user_guide/copy_on_write.rst

Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
---
 doc/source/user_guide/copy_on_write.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
index c671ab28abad9..42b03e92a03d2 100644
--- a/doc/source/user_guide/copy_on_write.rst
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -15,7 +15,7 @@ Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 m
 optimizations that become possible through CoW are implemented and supported. A complete list
 can be found at :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`.

-We expect that CoW will be enabled per default in version 3.0
+We expect that CoW will be enabled by default in version 3.0

 Description
 -----------

From 8fee2a92fcdd6f8e619b5fa8bea8bdcbe1ed3502 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler
Date: Fri, 17 Feb 2023 20:05:37 +0100
Subject: [PATCH 06/10] Update

---
 doc/source/development/copy_on_write.rst |  3 ++-
 doc/source/user_guide/copy_on_write.rst  | 32 +++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/doc/source/development/copy_on_write.rst b/doc/source/development/copy_on_write.rst
index 0e7def11d9ce2..087e45fbb965b 100644
--- a/doc/source/development/copy_on_write.rst
+++ b/doc/source/development/copy_on_write.rst
@@ -9,7 +9,8 @@ Copy on write
 Copy on Write is a mechanism to simplify the indexing API and improve
 performance through avoiding copies if possible.
 CoW means that any DataFrame or Series derived from another in any way always
-behaves as a copy.
+behaves as a copy. An explanation on how to use Copy on Write efficiently can be
+found :ref:`here <copy_on_write>`.

 Reference tracking
 ------------------
diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
index 42b03e92a03d2..5090f5b5545bb 100644
--- a/doc/source/user_guide/copy_on_write.rst
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -15,7 +15,11 @@ Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 m
 optimizations that become possible through CoW are implemented and supported. A complete list
 can be found at :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`.

-We expect that CoW will be enabled by default in version 3.0
+We expect that CoW will be enabled by default in version 3.0.
+
+CoW will lead to more predictable behavior since it is not possible to update more than
+one object with one statement, e.g. methods won't have side-effects. Additionally, through
+delaying copies as long as possible, the average performance will improve.

 Description
 -----------
@@ -71,6 +75,29 @@ reassigned and hence ``df`` does not share data with any other object. No copy
 is necessary when modifying the object. This is generally true for all methods
 listed in :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`.

+Previously, when operating on views, the view and the parent object was modified:
+
+.. ipython:: python
+
+    with pd.option_context("mode.copy_on_write", False):
+        df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+        view = df[:]
+        df.iloc[0, 0] = 100
+
+        df
+        view
+
+CoW triggers a copy when ``df`` is changed to avoid mutating ``view`` as well:
+
+.. ipython:: python
+
+    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+    view = df[:]
+    df.iloc[0, 0] = 100
+
+    df
+    view
+
 Chained Assignment
 ------------------

@@ -95,6 +122,9 @@ consistently never work and raise a ``ChainedAssignmentError`` with CoW enabled:
     df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
     df["foo"][df["bar"] > 5] = 100

+With copy on write this can either be done by using ``loc`` or doing this
+in multiple steps.
+
 .. _copy_on_write.optimizations:

 Copy-on-Write optimizations

From 8697ab8bce9a061ca9da5b4c9da38e50db1d65da Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Mon, 20 Feb 2023 11:28:08 +0000
Subject: [PATCH 07/10] Update doc/source/user_guide/copy_on_write.rst

Co-authored-by: Joris Van den Bossche
---
 doc/source/user_guide/copy_on_write.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
index 5090f5b5545bb..98da6fe9ea7aa 100644
--- a/doc/source/user_guide/copy_on_write.rst
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -18,7 +18,7 @@ can be found at :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`
 We expect that CoW will be enabled by default in version 3.0.

 CoW will lead to more predictable behavior since it is not possible to update more than
-one object with one statement, e.g. methods won't have side-effects. Additionally, through
+one object with one statement, e.g. indexing operations or methods won't have side-effects. Additionally, through
 delaying copies as long as possible, the average performance will improve.

 Description

From 3a3889046965803a0567cc782bd2abf29b47ee3c Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Mon, 20 Feb 2023 11:28:15 +0000
Subject: [PATCH 08/10] Update doc/source/user_guide/copy_on_write.rst

Co-authored-by: Joris Van den Bossche
---
 doc/source/user_guide/copy_on_write.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
index 98da6fe9ea7aa..cde8cd1537aea 100644
--- a/doc/source/user_guide/copy_on_write.rst
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -19,7 +19,7 @@ We expect that CoW will be enabled by default in version 3.0.

 CoW will lead to more predictable behavior since it is not possible to update more than
 one object with one statement, e.g. indexing operations or methods won't have side-effects. Additionally, through
-delaying copies as long as possible, the average performance will improve.
+delaying copies as long as possible, the average performance and memory usage will improve.

 Description
 -----------

From 019b2d86cd7e1b7bc4779b8037f4e76506ba10b7 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler
Date: Mon, 20 Feb 2023 11:30:49 +0000
Subject: [PATCH 09/10] Remove temporary enabling

---
 doc/source/user_guide/copy_on_write.rst | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
index cde8cd1537aea..b3842734c78bc 100644
--- a/doc/source/user_guide/copy_on_write.rst
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -200,13 +200,6 @@ be turned on __globally__ through either of the following:

     pd.options.mode.copy_on_write = True

-Alternatively, CoW can be enabled locally for testing purposes through:
-
-.. ipython:: python
-
-    with pd.option_context("mode.copy_on_write", True):
-        ...
-
 .. ipython:: python
     :suppress:

From ac0f959987ded2115d895acdecda4dd4527a4445 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler
Date: Mon, 20 Feb 2023 11:33:02 +0000
Subject: [PATCH 10/10] Add loc

---
 doc/source/user_guide/copy_on_write.rst | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
index b3842734c78bc..94dde9a6ffd70 100644
--- a/doc/source/user_guide/copy_on_write.rst
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -122,8 +122,11 @@ consistently never work and raise a ``ChainedAssignmentError`` with CoW enabled:
     df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
     df["foo"][df["bar"] > 5] = 100

-With copy on write this can either be done by using ``loc`` or doing this
-in multiple steps.
+With copy on write this can be done by using ``loc``.
+
+.. ipython:: python
+
+    df.loc[df["bar"] > 5, "foo"] = 100

 .. _copy_on_write.optimizations:

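Reviewer note, not part of the patch series: the lazy-copy behaviour the new user guide describes for ``reset_index`` can be checked with a short standalone script. This is a minimal sketch, assuming pandas 2.0 or later and NumPy are installed. ``shares_foo`` is a hypothetical helper introduced only for this illustration, and ``numpy.shares_memory`` is one possible way to observe buffer sharing, not something the patches themselves use.

```python
import numpy as np
import pandas as pd

# Enable Copy-on-Write globally, the same way the patched user guide does.
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df2 = df.reset_index(drop=True)  # described in the guide as a lazy copy under CoW


def shares_foo(left, right):
    """Hypothetical helper: report whether the "foo" columns overlap in memory."""
    return bool(np.shares_memory(left["foo"].to_numpy(), right["foo"].to_numpy()))


# Before any write, both frames are expected to reference the same data.
shared_before = shares_foo(df, df2)

# The first write to df2 should trigger the deferred copy and leave df untouched.
df2.iloc[0, 0] = 100
shared_after = shares_foo(df, df2)

print("shared before write:", shared_before)
print("shared after write:", shared_after)
print("df.iloc[0, 0] =", df.iloc[0, 0], "| df2.iloc[0, 0] =", df2.iloc[0, 0])
```

If the guide's description holds, the two frames share memory only until ``df2`` is modified, and ``df`` keeps its original value.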