Skip to content

DOC: Add user guide section about copy on write #51454

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Mar 1, 2023
Merged

Conversation

phofl
Copy link
Member

@phofl phofl commented Feb 17, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@@ -1,4 +1,4 @@
.. _copy_on_write:
.. _copy_on_write_dev:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO I think the information on this page is value enough to be in the user guide

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm not sure, I think this should be considered an implementation detail? Especially since we have to change aspects of it to better accommodate for block splitting logic

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these two guides be cross linked at least?

(I just generally have a preference for consolidating documentation.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a link from development to user guide. As long as we aren't done I'd like to keep casual readers away from it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with Patrick that this are implementation details and users don't have to care about, but explain the inner details of how it works that is useful for people working on the pandas code base.

So I would also keep this separate, but of course can always put a link from the user to here for people that want to know more.


Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the
optimizations that become possible through CoW are implemented and supported. A complete list
can be found at TODO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guessing TODO needs replacing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thx

optimizations that become possible through CoW are implemented and supported. A complete list
can be found at TODO

We expect that CoW will be enabled per default in version 3.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
We expect that CoW will be enabled per default in version 3.0
We expect that CoW will be enabled by default in version 3.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in this section would be good to highlight the quick benefits of CoW

  1. More predictable behavior
  2. Improved performance through deferred copying

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added


CoW means that any DataFrame or Series derived from another in any way always
behaves as a copy. As a consequence, we can only change the values of an object
through modifying the object itself. CoW disallows updating a DataFrame or a Series
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be valuable to show a before/after code example of CoW where modifying a derived result modifies the parent/doesn't modify the parent respectively

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a specific section for this. Since reset_index copies right now, this does not fit in here very well.

@phofl
Copy link
Member Author

phofl commented Feb 17, 2023

Thx for reviewing. We will have do explain this more extensively in follow-ups I guess, but I wanted to move the optimization out of the whatsnew before the rc is due

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting this!


CoW means that any DataFrame or Series derived from another in any way always
behaves as a copy. As a consequence, we can only change the values of an object
through modifying the object itself. CoW disallows updating a DataFrame or a Series
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "disallows" could be a bit confusing, as strictly speaking we don't "disallow" it in a sense of raising an error to the user if one would try to do that, but CoW "avoids"(?) updating inplace by triggering a copy under the hood.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, ok this is somewhat explained in the next sentence I see now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think we need something stronger than avoids here

This avoids side-effects when modifying values and hence, most methods can avoid
actually copying the data and only trigger a copy when necessary.

The following example will operate inplace with CoW:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily for this PR, but general comment: I think it might be easier to first show an example of the general idea of "not updating other dataframes" (i.e. the no-side-effects part), like you have a dataframe, select a subset, modify that subset -> with CoW that doesn't update the parent.

I think that is something more common (and more essential to understand by out users), and easier to understand than the concept of whether the operation actually happened in place or not (depending on whether there are references). I would consider this as a more advanced part of the explanation, as you already need to understand the distinction between updating the object inplace (which still happens in either case) and updating the underlying values inplace.

(And this distinction is currently also not really explained, eg the "will operate inplace with CoW" in the sentence above is actually ambiguous, as this operation is always "inplace" for the object)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point. Should do as a follow up?

df["foo"][df["bar"] > 5] = 100

With copy on write this can either be done by using ``loc`` or doing this
in multiple steps.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"doing this in multiple steps" actually will also never work?
At least if you mean doing the above literally in multiple steps like:

sub = df["foo"]
sub[df["bar"] > 5] = 100

(which is how I would interpret it)

Maybe showing the actual code for loc (df.loc[df["bar"] > 5, "foo"] = 100) could also help.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add the loc part.

But this would work in the sense of CoW? Updating sub but not df?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I get what you are referring to. I removed the comment.

Comment on lines 207 to 208
with pd.option_context("mode.copy_on_write", True):
...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we generally track references always (and so enabling this after data has been created should generally work), but I think there are a few exceptions? (eg the mgr.add_references in #51430, or the constructors, ..) So I would either not mention this here, or if we keep it add a warning that for reliable results all data should be created within the context.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was already on the fence about including it, lets leave it out then

@phofl phofl added this to the 2.0 milestone Feb 20, 2023
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comments have been addressed

@phofl
Copy link
Member Author

phofl commented Mar 1, 2023

merging, will add another part about the side-effect things in a follow up

@phofl phofl merged commit d89f162 into pandas-dev:main Mar 1, 2023
@phofl phofl deleted the cow_docs branch March 1, 2023 16:23
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 1, 2023
phofl added a commit that referenced this pull request Mar 2, 2023
… copy on write) (#51719)

Backport PR #51454: DOC: Add user guide section about copy on write

Co-authored-by: Patrick Hoefler <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants