-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DOC: Add user guide section about copy on write #51454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
cce223f
fb1fcab
5ff4f92
b8d3906
b9e8632
8fee2a9
8697ab8
3a38890
019b2d8
ac0f959
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,209 @@ | ||
.. _copy_on_write: | ||
|
||
{{ header }} | ||
|
||
******************* | ||
Copy-on-Write (CoW) | ||
******************* | ||
|
||
.. ipython:: python | ||
:suppress: | ||
|
||
pd.options.mode.copy_on_write = True | ||
|
||
Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the | ||
optimizations that become possible through CoW are implemented and supported. A complete list | ||
can be found at :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`. | ||
|
||
We expect that CoW will be enabled by default in version 3.0. | ||
|
||
CoW will lead to more predictable behavior since it is not possible to update more than | ||
one object with one statement, e.g. indexing operations or methods won't have side-effects. Additionally, through | ||
delaying copies as long as possible, the average performance and memory usage will improve. | ||
|
||
Description | ||
----------- | ||
|
||
CoW means that any DataFrame or Series derived from another in any way always | ||
behaves as a copy. As a consequence, we can only change the values of an object | ||
through modifying the object itself. CoW disallows updating a DataFrame or a Series | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it would be valuable to show a before/after code example of CoW where modifying a derived result modifies the parent/doesn't modify the parent respectively There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll make a specific section for this. Since reset_index copies right now, this does not fit in here very well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think "disallows" could be a bit confusing, as strictly speaking we don't "disallow" it in a sense of raising an error to the user if one would try to do that, but CoW "avoids"(?) updating inplace by triggering a copy under the hood. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, ok this is somewhat explained in the next sentence I see now. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah I think we need something stronger than avoids here |
||
that shares data with another DataFrame or Series object inplace. | ||
|
||
This avoids side-effects when modifying values and hence, most methods can avoid | ||
actually copying the data and only trigger a copy when necessary. | ||
|
||
The following example will operate inplace with CoW: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not necessarily for this PR, but general comment: I think it might be easier to first show an example of the general idea of "not updating other dataframes" (i.e. the no-side-effects part), like you have a dataframe, select a subset, modify that subset -> with CoW that doesn't update the parent. I think that is something more common (and more essential to understand by out users), and easier to understand than the concept of whether the operation actually happened in place or not (depending on whether there are references). I would consider this as a more advanced part of the explanation, as you already need to understand the distinction between updating the object inplace (which still happens in either case) and updating the underlying values inplace. (And this distinction is currently also not really explained, eg the "will operate inplace with CoW" in the sentence above is actually ambiguous, as this operation is always "inplace" for the object) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah good point. Should do as a follow up? |
||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) | ||
df.iloc[0, 0] = 100 | ||
df | ||
|
||
The object ``df`` does not share any data with any other object and hence no | ||
copy is triggered when updating the values. In contrast, the following operation | ||
triggers a copy of the data under CoW: | ||
|
||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) | ||
df2 = df.reset_index(drop=True) | ||
df2.iloc[0, 0] = 100 | ||
|
||
df | ||
df2 | ||
|
||
``reset_index`` returns a lazy copy with CoW while it copies the data without CoW. | ||
Since both objects, ``df`` and ``df2`` share the same data, a copy is triggered | ||
when modifying ``df2``. The object ``df`` still has the same values as initially | ||
while ``df2`` was modified. | ||
|
||
If the object ``df`` isn't needed anymore after performing the ``reset_index`` operation, | ||
you can emulate an inplace-like operation through assigning the output of ``reset_index`` | ||
to the same variable: | ||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) | ||
df = df.reset_index(drop=True) | ||
df.iloc[0, 0] = 100 | ||
df | ||
|
||
The initial object gets out of scope as soon as the result of ``reset_index`` is | ||
reassigned and hence ``df`` does not share data with any other object. No copy | ||
is necessary when modifying the object. This is generally true for all methods | ||
listed in :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`. | ||
|
||
Previously, when operating on views, the view and the parent object was modified: | ||
|
||
.. ipython:: python | ||
|
||
with pd.option_context("mode.copy_on_write", False): | ||
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) | ||
view = df[:] | ||
df.iloc[0, 0] = 100 | ||
|
||
df | ||
view | ||
|
||
CoW triggers a copy when ``df`` is changed to avoid mutating ``view`` as well: | ||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) | ||
view = df[:] | ||
df.iloc[0, 0] = 100 | ||
|
||
df | ||
view | ||
|
||
Chained Assignment | ||
------------------ | ||
|
||
Chained assignment references a technique where an object is updated through | ||
two subsequent indexing operations, e.g. | ||
|
||
.. ipython:: python | ||
|
||
with pd.option_context("mode.copy_on_write", False): | ||
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) | ||
df["foo"][df["bar"] > 5] = 100 | ||
df | ||
|
||
The column ``foo`` is updated where the column ``bar`` is greater than 5. | ||
This violates the CoW principles though, because it would have to modify the | ||
view ``df["foo"]`` and ``df`` in one step. Hence, chained assignment will | ||
consistently never work and raise a ``ChainedAssignmentError`` with CoW enabled: | ||
|
||
.. ipython:: python | ||
:okexcept: | ||
|
||
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) | ||
df["foo"][df["bar"] > 5] = 100 | ||
mroeschke marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
With copy on write this can be done by using ``loc``. | ||
|
||
.. ipython:: python | ||
|
||
df.loc[df["bar"] > 5, "foo"] = 100 | ||
|
||
.. _copy_on_write.optimizations: | ||
|
||
Copy-on-Write optimizations | ||
--------------------------- | ||
|
||
A new lazy copy mechanism that defers the copy until the object in question is modified | ||
and only if this object shares data with another object. This mechanism was added to | ||
following methods: | ||
|
||
- :meth:`DataFrame.reset_index` / :meth:`Series.reset_index` | ||
- :meth:`DataFrame.set_index` | ||
- :meth:`DataFrame.set_axis` / :meth:`Series.set_axis` | ||
- :meth:`DataFrame.set_flags` / :meth:`Series.set_flags` | ||
- :meth:`DataFrame.rename_axis` / :meth:`Series.rename_axis` | ||
- :meth:`DataFrame.reindex` / :meth:`Series.reindex` | ||
- :meth:`DataFrame.reindex_like` / :meth:`Series.reindex_like` | ||
- :meth:`DataFrame.assign` | ||
- :meth:`DataFrame.drop` | ||
- :meth:`DataFrame.dropna` / :meth:`Series.dropna` | ||
- :meth:`DataFrame.select_dtypes` | ||
- :meth:`DataFrame.align` / :meth:`Series.align` | ||
- :meth:`Series.to_frame` | ||
- :meth:`DataFrame.rename` / :meth:`Series.rename` | ||
- :meth:`DataFrame.add_prefix` / :meth:`Series.add_prefix` | ||
- :meth:`DataFrame.add_suffix` / :meth:`Series.add_suffix` | ||
- :meth:`DataFrame.drop_duplicates` / :meth:`Series.drop_duplicates` | ||
- :meth:`DataFrame.droplevel` / :meth:`Series.droplevel` | ||
- :meth:`DataFrame.reorder_levels` / :meth:`Series.reorder_levels` | ||
- :meth:`DataFrame.between_time` / :meth:`Series.between_time` | ||
- :meth:`DataFrame.filter` / :meth:`Series.filter` | ||
- :meth:`DataFrame.head` / :meth:`Series.head` | ||
- :meth:`DataFrame.tail` / :meth:`Series.tail` | ||
- :meth:`DataFrame.isetitem` | ||
- :meth:`DataFrame.pipe` / :meth:`Series.pipe` | ||
- :meth:`DataFrame.pop` / :meth:`Series.pop` | ||
- :meth:`DataFrame.replace` / :meth:`Series.replace` | ||
- :meth:`DataFrame.shift` / :meth:`Series.shift` | ||
- :meth:`DataFrame.sort_index` / :meth:`Series.sort_index` | ||
- :meth:`DataFrame.sort_values` / :meth:`Series.sort_values` | ||
- :meth:`DataFrame.squeeze` / :meth:`Series.squeeze` | ||
- :meth:`DataFrame.swapaxes` | ||
- :meth:`DataFrame.swaplevel` / :meth:`Series.swaplevel` | ||
- :meth:`DataFrame.take` / :meth:`Series.take` | ||
- :meth:`DataFrame.to_timestamp` / :meth:`Series.to_timestamp` | ||
- :meth:`DataFrame.to_period` / :meth:`Series.to_period` | ||
- :meth:`DataFrame.truncate` | ||
- :meth:`DataFrame.iterrows` | ||
- :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize` | ||
- :meth:`DataFrame.fillna` / :meth:`Series.fillna` | ||
- :meth:`DataFrame.interpolate` / :meth:`Series.interpolate` | ||
- :meth:`DataFrame.ffill` / :meth:`Series.ffill` | ||
- :meth:`DataFrame.bfill` / :meth:`Series.bfill` | ||
- :meth:`DataFrame.where` / :meth:`Series.where` | ||
- :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects` | ||
- :meth:`DataFrame.astype` / :meth:`Series.astype` | ||
- :meth:`DataFrame.convert_dtypes` / :meth:`Series.convert_dtypes` | ||
- :meth:`DataFrame.join` | ||
- :func:`concat` | ||
- :func:`merge` | ||
|
||
These methods return views when Copy-on-Write is enabled, which provides a significant | ||
performance improvement compared to the regular execution. | ||
|
||
How to enable CoW | ||
----------------- | ||
|
||
Copy-on-Write can be enabled through the configuration option ``copy_on_write``. The option can | ||
be turned on __globally__ through either of the following: | ||
|
||
.. ipython:: python | ||
|
||
pd.set_option("mode.copy_on_write", True) | ||
|
||
pd.options.mode.copy_on_write = True | ||
|
||
.. ipython:: python | ||
:suppress: | ||
|
||
pd.options.mode.copy_on_write = False |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -67,6 +67,7 @@ Guides | |
pyarrow | ||
indexing | ||
advanced | ||
copy_on_write | ||
merging | ||
reshaping | ||
text | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO I think the information on this page is value enough to be in the user guide
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm not sure, I think this should be considered an implementation detail? Especially since we have to change aspects of it to better accommodate for block splitting logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these two guides be cross linked at least?
(I just generally have a preference for consolidating documentation.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a link from development to user guide. As long as we aren't done I'd like to keep casual readers away from it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with Patrick that this are implementation details and users don't have to care about, but explain the inner details of how it works that is useful for people working on the pandas code base.
So I would also keep this separate, but of course can always put a link from the user to here for people that want to know more.