From 5810dbbe5181721ec70193dff6e20a7f96cf54f3 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Sat, 2 Dec 2023 19:43:59 +0100 Subject: [PATCH 01/11] DOC: Start migration guide for Copy-on-Write --- doc/source/user_guide/copy_on_write.rst | 74 ++++++++++++++++++++++++- 1 file changed, 73 insertions(+), 1 deletion(-) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index fb0da70a0ea07..07def5ab7538a 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -16,7 +16,7 @@ Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 m optimizations that become possible through CoW are implemented and supported. All possible optimizations are supported starting from pandas 2.1. -We expect that CoW will be enabled by default in version 3.0. +CoW will be enabled by default in version 3.0. CoW will lead to more predictable behavior since it is not possible to update more than one object with one statement, e.g. indexing operations or methods won't have side-effects. Additionally, through @@ -52,6 +52,78 @@ it explicitly disallows this. With CoW enabled, ``df`` is unchanged: The following sections will explain what this means and how it impacts existing applications. +Migrating to Copy-on-Write +-------------------------- + +Copy-on-Write will be the default and only mode in pandas 3.0. This means that users +need to migrate their code to be compliant with CoW rules. + +The default mode in pandas will raises warnings for certain cases that will actively +change behavior and thus change user intended behavior. + +We added another mode, e.g. + +``` +pd.options.mode.copy_on_write = "warn" +``` + +that will warn for every operation that will change behavior with CoW. We expect this mode +to be very noisy, since many cases that we don't expect that they will influence users will +also emit a warning. 
We recommend checking this mode and analyzing the warnings, but it is +not necessary to address all of these warning. The first two items of the following lists +are the only cases that need to be addressed to make existing code work with CoW. + +The following few items describe the user visible changes: + +**ChainedAssignment will never work** + +``loc`` should be used as an alternative. Check the +:ref:`chained assignment section ` for more details. + +**Accessing the underlying array of a pandas object will return a read-only view** + + +.. ipython:: python + + ser = pd.Series([1, 2, 3]) + ser.to_numpy() + +This example returns a NumPy array that is a view of the Series object. This view can +be modified and thus also modify the pandas object. This is not compliant with CoW +rules. The returned array is set to non-writeable to protect against this behavior. +Creating a copy of this array allows modification. You can also make the array +writeable again if you don't care about the pandas object anymore. + +**Only one pandas object is updated at once** + +The following code snippet updates both ``df`` and ``view`` without CoW: + +.. ipython:: python + + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) + subset = df["foo"] + subset.iloc[0] = 100 + dfr + +This won't be possible anymore with CoW, since the CoW rules explicitly forbid this. +This statement can be rewritten into a single statement with ``loc`` or ``iloc`` if +this behavior is necessary. :meth:`DataFrame.where` is another suitable alternative +for this case. + +**Constructor now copy NumPy arrays by default** + +The Series and DataFrame constructors will now copy NumPy array by default when not +otherwise specified. This was changed to avoid mutating a pandas object when the +NumPy array is changed inplace outside of pandas. You can set ``copy=False`` to +avoid this copy. + +.. 
ipython:: python + + arr = np.array([1, 2, 3]) + ser = pd.Series(arr, copy=False) + ser + + Description ----------- From bbd682416fe5c3163e34d1f9a385dc3e784d92da Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Sat, 2 Dec 2023 20:11:05 +0100 Subject: [PATCH 02/11] Adjust --- doc/source/user_guide/copy_on_write.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index 07def5ab7538a..30e42b7fb6120 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -119,6 +119,8 @@ avoid this copy. .. ipython:: python + import numpy as np + arr = np.array([1, 2, 3]) ser = pd.Series(arr, copy=False) ser From b91d676e75755a1576a072bbf9cef3226603c3e6 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Sun, 3 Dec 2023 00:46:00 +0100 Subject: [PATCH 03/11] Update copy_on_write.rst --- doc/source/user_guide/copy_on_write.rst | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index 30e42b7fb6120..d686cccd0c046 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -110,21 +110,15 @@ This statement can be rewritten into a single statement with ``loc`` or ``iloc`` this behavior is necessary. :meth:`DataFrame.where` is another suitable alternative for this case. -**Constructor now copy NumPy arrays by default** +**Constructors now copy NumPy arrays by default** The Series and DataFrame constructors will now copy NumPy array by default when not otherwise specified. This was changed to avoid mutating a pandas object when the NumPy array is changed inplace outside of pandas. You can set ``copy=False`` to avoid this copy. -.. 
ipython:: python - - import numpy as np - - arr = np.array([1, 2, 3]) - ser = pd.Series(arr, copy=False) - ser - +See the section about :ref:`read-only NumPy arrays <copy_on_write_read_only_na>` +for more details. Description ----------- @@ -237,6 +231,8 @@ With copy on write this can be done by using ``loc``. df.loc[df["bar"] > 5, "foo"] = 100 +.. _copy_on_write_read_only_na: + Read-only NumPy arrays ---------------------- From b47bfd2b65ff9a30159ed4738283451aa671d3dd Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Fri, 8 Dec 2023 23:14:27 +0100 Subject: [PATCH 04/11] Update doc/source/user_guide/copy_on_write.rst Co-authored-by: Joris Van den Bossche --- doc/source/user_guide/copy_on_write.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index d686cccd0c046..cdf9706c42e07 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -58,7 +58,7 @@ Migrating to Copy-on-Write Copy-on-Write will be the default and only mode in pandas 3.0. This means that users need to migrate their code to be compliant with CoW rules. -The default mode in pandas will raises warnings for certain cases that will actively +The default mode in pandas will raise warnings for certain cases that will actively change behavior and thus change user intended behavior. We added another mode, e.g. 
From 8d2ea1e0b2b311f5c6116b118b4ec9209a2048e8 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Fri, 8 Dec 2023 23:14:32 +0100 Subject: [PATCH 05/11] Update doc/source/user_guide/copy_on_write.rst Co-authored-by: Joris Van den Bossche --- doc/source/user_guide/copy_on_write.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index cdf9706c42e07..d2ad5681fca77 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -63,9 +63,9 @@ change behavior and thus change user intended behavior. We added another mode, e.g. -``` -pd.options.mode.copy_on_write = "warn" -``` +.. code-block:: python + + pd.options.mode.copy_on_write = "warn" that will warn for every operation that will change behavior with CoW. We expect this mode to be very noisy, since many cases that we don't expect that they will influence users will From 10ea0eb2ecf090b4bbf78356774dcac37324958d Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Fri, 8 Dec 2023 23:14:38 +0100 Subject: [PATCH 06/11] Update doc/source/user_guide/copy_on_write.rst Co-authored-by: Joris Van den Bossche --- doc/source/user_guide/copy_on_write.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index d2ad5681fca77..47166e08f2ec9 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -75,7 +75,7 @@ are the only cases that need to be addressed to make existing code work with CoW The following few items describe the user visible changes: -**ChainedAssignment will never work** +**Chained assignment will never work** ``loc`` should be used as an alternative. Check the :ref:`chained assignment section ` for more details. 
From 5ef8b68ba7bd9fa3c5a5f1d8bd02206a286bdfac Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Fri, 8 Dec 2023 23:14:44 +0100 Subject: [PATCH 07/11] Update doc/source/user_guide/copy_on_write.rst Co-authored-by: Joris Van den Bossche --- doc/source/user_guide/copy_on_write.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index 47166e08f2ec9..2c9216c2afffc 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -103,7 +103,7 @@ The following code snippet updates both ``df`` and ``view`` without CoW: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) subset = df["foo"] subset.iloc[0] = 100 - dfr + df This won't be possible anymore with CoW, since the CoW rules explicitly forbid this. This statement can be rewritten into a single statement with ``loc`` or ``iloc`` if From ae8ec25c3bb07cdfb2cdb5d48d7edfad26985b0b Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Fri, 8 Dec 2023 23:14:51 +0100 Subject: [PATCH 08/11] Update doc/source/user_guide/copy_on_write.rst Co-authored-by: Joris Van den Bossche --- doc/source/user_guide/copy_on_write.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index 2c9216c2afffc..6400acbe1da31 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -96,7 +96,7 @@ writeable again if you don't care about the pandas object anymore. **Only one pandas object is updated at once** -The following code snippet updates both ``df`` and ``view`` without CoW: +The following code snippet updates both ``df`` and ``subset`` without CoW: .. 
ipython:: python From 8581b1144927409c1a11327a20d99579a42bf25a Mon Sep 17 00:00:00 2001 From: Patrick Hoefler Date: Fri, 8 Dec 2023 23:43:08 +0100 Subject: [PATCH 09/11] Update --- doc/source/user_guide/copy_on_write.rst | 34 ++++++++++++++++++++++--- 1 file changed, 31 insertions(+), 3 deletions(-) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index 6400acbe1da31..8c7b427e9d687 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -94,6 +94,9 @@ rules. The returned array is set to non-writeable to protect against this behavi Creating a copy of this array allows modification. You can also make the array writeable again if you don't care about the pandas object anymore. +See the section about :ref:`read-only NumPy arrays <copy_on_write_read_only_na>` +for more details. + **Only one pandas object is updated at once** The following code snippet updates both ``df`` and ``subset`` without CoW: @@ -106,10 +109,38 @@ The following code snippet updates both ``df`` and ``subset`` without CoW: df This won't be possible anymore with CoW, since the CoW rules explicitly forbid this. +This includes updating a single column as a :class:`Series` and relying on the change +propagating back to the parent :class:`DataFrame`. This statement can be rewritten into a single statement with ``loc`` or ``iloc`` if this behavior is necessary. :meth:`DataFrame.where` is another suitable alternative for this case. +Updating a column selected from a :class:`DataFrame` with an inplace method will +also not work anymore. + +.. ipython:: python + + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) + df["foo"].replace(1, 5, inplace=True) + df + +This is another form of chained assignment. This can generally be rewritten in 2 +different forms: + +.. 
ipython:: python + + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) + df.replace({"foo": 1}, {"foo": 5}, inplace=True) + df + +A different alternative would be to not use ``inplace``: + +.. ipython:: python + + df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) + df["foo"] = df["foo"].replace(1, 5) + df + **Constructors now copy NumPy arrays by default** The Series and DataFrame constructors will now copy NumPy array by default when not @@ -117,9 +148,6 @@ otherwise specified. This was changed to avoid mutating a pandas object when the NumPy array is changed inplace outside of pandas. You can set ``copy=False`` to avoid this copy. -See the section about :ref:`read-only NumPy arrays ` -for more details. - Description ----------- From ab8c2b9ffeb06b7c6a03c49175ea718ec833aeea Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Sun, 10 Dec 2023 22:15:22 +0100 Subject: [PATCH 10/11] Update copy_on_write.rst --- doc/source/user_guide/copy_on_write.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index 8c7b427e9d687..839e3b2a21bb9 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -119,6 +119,7 @@ Updating a column selected from a :class:`DataFrame` with an inplace method will also not work anymore. .. 
ipython:: python + :okwarning: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) df["foo"].replace(1, 5, inplace=True) From 4d391709e6e28ebd69a30ffb4e3b48394942b364 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Thu, 14 Dec 2023 22:16:06 +0100 Subject: [PATCH 11/11] Update doc/source/user_guide/copy_on_write.rst Co-authored-by: Joris Van den Bossche --- doc/source/user_guide/copy_on_write.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index 839e3b2a21bb9..bc233f4323e2a 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -131,7 +131,7 @@ different forms: .. ipython:: python df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) - df.replace({"foo": 1}, {"foo": 5}, inplace=True) + df.replace({"foo": {1: 5}}, inplace=True) df A different alternative would be to not use ``inplace``:
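
Reviewer note on the series as a whole: the migration patterns recommended by the added guide text can be exercised together in one short script. The sketch below is not part of the patch. It assumes pandas is installed, and the frame contents and replacement values are purely illustrative. It uses only the rewrites named in the added text (a single ``loc`` statement, ``replace`` without ``inplace`` on a selected column, the nested-dict ``DataFrame.replace`` form from PATCH 11, and copying the array returned by ``to_numpy``), so it should behave the same whether or not CoW is enabled.

```python
import pandas as pd

# Illustrative frame, same shape as the one used in the guide's examples.
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# Instead of chained assignment (df["foo"][df["bar"] > 5] = 100),
# update the parent frame in a single statement with loc.
df.loc[df["bar"] > 5, "foo"] = 100

# Instead of df["foo"].replace(1, 5, inplace=True), which acts on a
# temporary Series under CoW, assign the result back to the column.
df["foo"] = df["foo"].replace(1, 5)

# Or call replace on the frame itself with a nested dict:
# {column: {old_value: new_value}}.
df.replace({"foo": {2: 7}}, inplace=True)

# to_numpy() may return a read-only view under CoW; take an explicit
# copy when a writable array is needed. The frame is left untouched.
arr = df["bar"].to_numpy().copy()
arr[0] = -1

print(df)
print(arr)
```

Each statement modifies exactly one pandas object, which is the property the CoW rules described in the guide are meant to guarantee.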