DOC: Start migration guide for Copy-on-Write #56298

Merged
merged 11 commits into from
Dec 14, 2023
72 changes: 71 additions & 1 deletion doc/source/user_guide/copy_on_write.rst
@@ -16,7 +16,7 @@ Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 m
optimizations that become possible through CoW are implemented and supported. All possible
optimizations are supported starting from pandas 2.1.

We expect that CoW will be enabled by default in version 3.0.
CoW will be enabled by default in version 3.0.

CoW will lead to more predictable behavior since it is not possible to update more than
one object with one statement, e.g. indexing operations or methods won't have side-effects. Additionally, through
@@ -52,6 +52,74 @@ it explicitly disallows this. With CoW enabled, ``df`` is unchanged:
The following sections will explain what this means and how it impacts existing
applications.

Migrating to Copy-on-Write
--------------------------
Comment on lines 52 to +56
Member

I wonder if we should move the migration guide further down in the file, as right now this section assumes some knowledge of what CoW is, but that is only explained in the "Description" section that now comes afterwards

Member Author

I had similar thoughts, but I wanted to put it as prominent as possible


Copy-on-Write will be the default and only mode in pandas 3.0. This means that users
need to migrate their code to be compliant with CoW rules.

The default mode in pandas will raise warnings for certain cases that will actively
change behavior and thus change user-intended behavior.

We added another mode, enabled with

```
pd.options.mode.copy_on_write = "warn"
```

that will warn for every operation that will change behavior with CoW. We expect this mode
to be very noisy, since it also emits warnings for many cases that we don't expect to
affect users. We recommend checking this mode and analyzing the warnings, but it is
not necessary to address all of them. The first two items of the following list
are the only cases that need to be addressed to make existing code work with CoW.

The following few items describe the user-visible changes:

**Chained assignment will never work**

``loc`` should be used as an alternative. Check the
:ref:`chained assignment section <copy_on_write_chained_assignment>` for more details.
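As a minimal sketch of the rewrite (the DataFrame and the condition are illustrative, borrowed from the other examples in this guide), a chained assignment can be collapsed into a single ``loc`` call that selects rows and column together:

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# Chained assignment uses two indexing steps and will not update df with CoW:
# df["foo"][df["bar"] > 5] = 100

# Single-step equivalent: select the rows and the column in one loc call.
df.loc[df["bar"] > 5, "foo"] = 100
print(df)
```

Because the selection and the assignment happen in one statement, pandas updates ``df`` itself rather than a temporary object.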

**Accessing the underlying array of a pandas object will return a read-only view**


.. ipython:: python

    ser = pd.Series([1, 2, 3])
    ser.to_numpy()

This example returns a NumPy array that is a view of the Series object. If this view
were writeable, modifying it would also modify the pandas object, which is not
compliant with CoW rules. The returned array is therefore set to non-writeable.
Creating a copy of this array allows modification. You can also make the array
writeable again if you no longer care about the pandas object.
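A short sketch of the copy-based approach, assuming you want to modify the values without touching the Series (the variable names are illustrative):

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

# An explicit copy owns its data, so it is writeable and independent of ser.
arr = ser.to_numpy().copy()
arr[0] = 100

print(arr)
print(ser)  # ser still holds its original values
```

The copy costs extra memory but behaves the same whether or not CoW is enabled. Resetting ``arr.flags.writeable`` on the returned view avoids the copy, but should only be done if the pandas object is no longer needed.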

**Only one pandas object is updated at once**
Member

The code example you use is already a practical example of it, but I wonder if we should explicitly call out "modifying a column as a Series no longer works"? As I think this will be one of the main use cases where right now the user intended the propagation of the modification (while with dataframe row slices, I think that is much less the case)

Member Author

Yep added


The following code snippet updates both ``df`` and ``subset`` without CoW:

.. ipython:: python

    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    subset = df["foo"]
    subset.iloc[0] = 100
    df

This is no longer possible with CoW, since the CoW rules explicitly forbid it.
If this behavior is necessary, the update can be rewritten as a single statement
with ``loc`` or ``iloc``. :meth:`DataFrame.where` is another suitable alternative
for this case.
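One possible single-statement rewrite, sketched here with illustrative values, assigns directly on the DataFrame instead of going through an intermediate Series:

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# Label-based: update row 0 of column "foo" directly on df.
df.loc[0, "foo"] = 100

# Position-based: look up the column position, then assign with iloc.
df.iloc[1, df.columns.get_loc("foo")] = 200
print(df)
```

Both forms modify ``df`` itself, so no second object is expected to change as a side effect.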

**Constructors now copy NumPy arrays by default**

The Series and DataFrame constructors now copy a NumPy array by default unless
otherwise specified. This was changed to avoid mutating a pandas object when the
NumPy array is modified in place outside of pandas. You can set ``copy=False`` to
avoid this copy.

See the section about :ref:`read-only NumPy arrays <copy_on_write_read_only_na>`
for more details.
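To illustrate the effect of the copy, the sketch below passes ``copy=True`` explicitly so that it behaves the same regardless of the pandas version or CoW setting. With CoW this is the default for NumPy input, as described above.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])

# The Series gets its own copy of the data, decoupled from arr.
ser = pd.Series(arr, copy=True)

# Modifying the original array in place does not affect the Series.
arr[0] = 100
print(ser)
```

Passing ``copy=False`` instead skips the copy, at the cost of the Series and the array possibly sharing memory.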
Member

This "see also" probably belongs in the section above "Accessing the underlying array of a pandas object will return a read-only view"

Member Author

yeah good point


Description
-----------

@@ -163,6 +231,8 @@ With copy on write this can be done by using ``loc``.

df.loc[df["bar"] > 5, "foo"] = 100

.. _copy_on_write_read_only_na:

Read-only NumPy arrays
----------------------
