-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DOC: Add user guide section about copy on write #51454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -1,4 +1,4 @@ | |||
.. _copy_on_write: | |||
.. _copy_on_write_dev: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO I think the information on this page is value enough to be in the user guide
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm not sure, I think this should be considered an implementation detail? Especially since we have to change aspects of it to better accommodate for block splitting logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these two guides be cross linked at least?
(I just generally have a preference for consolidating documentation.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a link from development to user guide. As long as we aren't done I'd like to keep casual readers away from it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with Patrick that this are implementation details and users don't have to care about, but explain the inner details of how it works that is useful for people working on the pandas code base.
So I would also keep this separate, but of course can always put a link from the user to here for people that want to know more.
|
||
Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the | ||
optimizations that become possible through CoW are implemented and supported. A complete list | ||
can be found at TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guessing TODO needs replacing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh thx
optimizations that become possible through CoW are implemented and supported. A complete list | ||
can be found at TODO | ||
|
||
We expect that CoW will be enabled per default in version 3.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We expect that CoW will be enabled per default in version 3.0 | |
We expect that CoW will be enabled by default in version 3.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also in this section would be good to highlight the quick benefits of CoW
- More predictable behavior
- Improved performance through deferred copying
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
|
||
CoW means that any DataFrame or Series derived from another in any way always | ||
behaves as a copy. As a consequence, we can only change the values of an object | ||
through modifying the object itself. CoW disallows updating a DataFrame or a Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be valuable to show a before/after code example of CoW where modifying a derived result modifies the parent/doesn't modify the parent respectively
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make a specific section for this. Since reset_index copies right now, this does not fit in here very well.
Co-authored-by: Matthew Roeschke <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>
Thx for reviewing. We will have do explain this more extensively in follow-ups I guess, but I wanted to move the optimization out of the whatsnew before the rc is due |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for starting this!
|
||
CoW means that any DataFrame or Series derived from another in any way always | ||
behaves as a copy. As a consequence, we can only change the values of an object | ||
through modifying the object itself. CoW disallows updating a DataFrame or a Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "disallows" could be a bit confusing, as strictly speaking we don't "disallow" it in a sense of raising an error to the user if one would try to do that, but CoW "avoids"(?) updating inplace by triggering a copy under the hood.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, ok this is somewhat explained in the next sentence I see now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think we need something stronger than avoids here
This avoids side-effects when modifying values and hence, most methods can avoid | ||
actually copying the data and only trigger a copy when necessary. | ||
|
||
The following example will operate inplace with CoW: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily for this PR, but general comment: I think it might be easier to first show an example of the general idea of "not updating other dataframes" (i.e. the no-side-effects part), like you have a dataframe, select a subset, modify that subset -> with CoW that doesn't update the parent.
I think that is something more common (and more essential to understand by out users), and easier to understand than the concept of whether the operation actually happened in place or not (depending on whether there are references). I would consider this as a more advanced part of the explanation, as you already need to understand the distinction between updating the object inplace (which still happens in either case) and updating the underlying values inplace.
(And this distinction is currently also not really explained, eg the "will operate inplace with CoW" in the sentence above is actually ambiguous, as this operation is always "inplace" for the object)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good point. Should do as a follow up?
df["foo"][df["bar"] > 5] = 100 | ||
|
||
With copy on write this can either be done by using ``loc`` or doing this | ||
in multiple steps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"doing this in multiple steps" actually will also never work?
At least if you mean doing the above literally in multiple steps like:
sub = df["foo"]
sub[df["bar"] > 5] = 100
(which is how I would interpret it)
Maybe showing the actual code for loc (df.loc[df["bar"] > 5, "foo"] = 100
) could also help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add the loc part.
But this would work in the sense of CoW? Updating sub but not df?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I get what you are referring to. I removed the comment.
with pd.option_context("mode.copy_on_write", True): | ||
... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we generally track references always (and so enabling this after data has been created should generally work), but I think there are a few exceptions? (eg the mgr.add_references
in #51430, or the constructors, ..) So I would either not mention this here, or if we keep it add a warning that for reliable results all data should be created within the context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was already on the fence about including it, lets leave it out then
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comments have been addressed
merging, will add another part about the side-effect things in a follow up |
… copy on write) (#51719) Backport PR #51454: DOC: Add user guide section about copy on write Co-authored-by: Patrick Hoefler <[email protected]>
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.