DOC: Add user guide section about copy on write #51454

Merged

phofl merged 10 commits into pandas-dev:main from phofl:cow_docs

Mar 1, 2023

Member

phofl commented Feb 17, 2023

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.


          DOC: Add user guide section about copy on write

cce223f

phofl requested review from jorisvandenbossche and lithomas1

February 17, 2023 15:49


          Adjust label

fb1fcab

mroeschke reviewed

doc/source/development/copy_on_write.rst

		@@ -1,4 +1,4 @@
		-.. _copy_on_write:
		+.. _copy_on_write_dev:

Member

mroeschke Feb 17, 2023

IMO the information on this page is valuable enough to be in the user guide

Member Author

phofl Feb 17, 2023

Hm, not sure. I think this should be considered an implementation detail? Especially since we have to change aspects of it to better accommodate block splitting logic

Member

mroeschke Feb 17, 2023

Could these two guides be cross linked at least?

(I just generally have a preference for consolidating documentation.)

Member Author

phofl Feb 17, 2023

I added a link from development to user guide. As long as we aren't done I'd like to keep casual readers away from it

Member

jorisvandenbossche Feb 17, 2023

Agreed with Patrick that these are implementation details that users don't have to care about. The page explains the inner details of how it works, which is useful for people working on the pandas code base.

So I would also keep this separate, but of course we can always put a link from the user guide to here for people who want to know more.

mroeschke reviewed

doc/source/user_guide/copy_on_write.rst Outdated

+              Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the
+              optimizations that become possible through CoW are implemented and supported. A complete list
+              can be found at TODO

Member

mroeschke Feb 17, 2023

Guessing TODO needs replacing

Member Author

phofl Feb 17, 2023

Oh thx

mroeschke reviewed

doc/source/user_guide/copy_on_write.rst Outdated

+              optimizations that become possible through CoW are implemented and supported. A complete list
+              can be found at TODO
+              We expect that CoW will be enabled per default in version 3.0

Member

mroeschke Feb 17, 2023

Suggested change

      
-              We expect that CoW will be enabled per default in version 3.0
+              We expect that CoW will be enabled by default in version 3.0

Member

mroeschke Feb 17, 2023

Also in this section would be good to highlight the quick benefits of CoW

More predictable behavior
Improved performance through deferred copying

Member Author

phofl Feb 17, 2023

Added


          Replace todo

5ff4f92

mroeschke reviewed

doc/source/user_guide/copy_on_write.rst

+              CoW means that any DataFrame or Series derived from another in any way always
+              behaves as a copy. As a consequence, we can only change the values of an object
+              through modifying the object itself. CoW disallows updating a DataFrame or a Series

Member

mroeschke Feb 17, 2023

I think it would be valuable to show a before/after code example of CoW where modifying a derived result modifies the parent/doesn't modify the parent respectively

Member Author

phofl Feb 17, 2023

I'll make a specific section for this. Since reset_index copies right now, this does not fit in here very well.
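
The behaviour requested in this thread can be sketched as follows. This is a minimal illustration, not the example that ended up in the user guide; the column names and values are made up, and the `set_option` call is assumed to be needed on pandas 2.x only (the PR expects CoW to become the default in 3.0, where the line may be unnecessary).

```python
import pandas as pd

# Assumption: pandas 2.x, where CoW is opt-in through this option.
pd.set_option("mode.copy_on_write", True)

# Hypothetical data, chosen only for illustration.
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# Derive a Series from the parent, then modify the derived object.
subset = df["foo"]
subset.iloc[0] = 100

print(subset.tolist())     # the derived object reflects the change
print(df["foo"].tolist())  # the parent keeps its original values
```

With CoW enabled, `subset` behaves as a copy: the write lands in `subset` only and `df` is left untouched.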

mroeschke reviewed

doc/source/user_guide/copy_on_write.rst Outdated

mroeschke reviewed

doc/source/user_guide/copy_on_write.rst

mroeschke added Docs Copy / view semantics labels

phofl and others added 3 commits

February 17, 2023 19:59


          Update doc/source/user_guide/copy_on_write.rst

b8d3906

Co-authored-by: Matthew Roeschke <[email protected]>


          Update doc/source/user_guide/copy_on_write.rst

b9e8632

Co-authored-by: Matthew Roeschke <[email protected]>


          Update

8fee2a9

Member Author

phofl commented Feb 17, 2023

Thx for reviewing. We will have to explain this more extensively in follow-ups I guess, but I wanted to move the optimization out of the whatsnew before the rc is due

jorisvandenbossche reviewed

Member

jorisvandenbossche left a comment

Thanks for starting this!

doc/source/user_guide/copy_on_write.rst Outdated

doc/source/user_guide/copy_on_write.rst Outdated

doc/source/user_guide/copy_on_write.rst

+              CoW means that any DataFrame or Series derived from another in any way always
+              behaves as a copy. As a consequence, we can only change the values of an object
+              through modifying the object itself. CoW disallows updating a DataFrame or a Series

Member

jorisvandenbossche Feb 20, 2023

I think "disallows" could be a bit confusing. Strictly speaking we don't "disallow" it in the sense of raising an error when the user tries to do that; rather, CoW "avoids"(?) updating inplace by triggering a copy under the hood.

Member

jorisvandenbossche Feb 20, 2023

Hmm, ok this is somewhat explained in the next sentence I see now.

Member Author

phofl Feb 20, 2023

Yeah I think we need something stronger than avoids here

doc/source/user_guide/copy_on_write.rst

+              This avoids side-effects when modifying values and hence, most methods can avoid
+              actually copying the data and only trigger a copy when necessary.
+              The following example will operate inplace with CoW:

Member

jorisvandenbossche Feb 20, 2023

Not necessarily for this PR, but general comment: I think it might be easier to first show an example of the general idea of "not updating other dataframes" (i.e. the no-side-effects part), like you have a dataframe, select a subset, modify that subset -> with CoW that doesn't update the parent.

I think that is something more common (and more essential for our users to understand), and easier to understand than the concept of whether the operation actually happened in place or not (depending on whether there are references). I would consider this a more advanced part of the explanation, as you already need to understand the distinction between updating the object inplace (which still happens in either case) and updating the underlying values inplace.

(And this distinction is currently also not really explained, eg the "will operate inplace with CoW" in the sentence above is actually ambiguous, as this operation is always "inplace" for the object)

Member Author

phofl Feb 20, 2023

Yeah good point. Should do as a follow up?
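
The distinction drawn in this thread, between updating an object and copying its underlying values only when they are shared, can be illustrated with a toy model. Everything below is hypothetical: `SharedBuffer`, `LazyColumn` and the owner counting are assumptions made for the sketch and are not a description of how pandas tracks references internally.

```python
class SharedBuffer:
    """Hypothetical list-backed buffer that counts how many owners share it."""

    def __init__(self, values):
        self.values = list(values)
        self.owners = 0


class LazyColumn:
    """Toy column: derived objects share data until one of them is written to."""

    def __init__(self, values=(), _buffer=None):
        self._buffer = _buffer if _buffer is not None else SharedBuffer(values)
        self._buffer.owners += 1

    def derive(self):
        # Stand-in for any method that returns a new object; no data is copied.
        return LazyColumn(_buffer=self._buffer)

    def set(self, index, value):
        if self._buffer.owners > 1:
            # Another object still uses this buffer: copy before writing.
            self._buffer.owners -= 1
            self._buffer = SharedBuffer(self._buffer.values)
            self._buffer.owners = 1
        self._buffer.values[index] = value

    def shares_data_with(self, other):
        return self._buffer is other._buffer

    def to_list(self):
        return list(self._buffer.values)


parent = LazyColumn([1, 2, 3])
child = parent.derive()
shared_before = child.shares_data_with(parent)  # data is shared, nothing copied yet

child.set(0, 100)  # the write triggers the deferred copy
shared_after = child.shares_data_with(parent)

parent.set(1, 50)  # parent is the sole owner again, so this writes in place
```

In both `set` calls the object itself is updated; only the first one has to copy the values, because at that point the buffer is still shared with `parent`.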

doc/source/user_guide/copy_on_write.rst Outdated

+                  df["foo"][df["bar"] > 5] = 100
+              With copy on write this can either be done by using ``loc`` or doing this
+              in multiple steps.

Member

jorisvandenbossche Feb 20, 2023

"doing this in multiple steps" actually will also never work?
At least if you mean doing the above literally in multiple steps like:

sub = df["foo"]
sub[df["bar"] > 5] = 100

(which is how I would interpret it)

Maybe showing the actual code for loc (df.loc[df["bar"] > 5, "foo"] = 100) could also help.

Member Author

phofl Feb 20, 2023

I'll add the loc part.

But this would work in the sense of CoW? Updating sub but not df?

Member Author

phofl Feb 20, 2023

Ah I get what you are referring to. I removed the comment.
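
For reference, the `loc` form suggested above can be sketched like this. The data is hypothetical; the chained form is kept as a comment only, since the thread notes that it does not update `df` under CoW.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})

# Chained form discussed in the review (does not update df under CoW):
#     df["foo"][df["bar"] > 5] = 100

# Single-step form: the row mask and the column label go into one .loc call.
df.loc[df["bar"] > 5, "foo"] = 100

print(df["foo"].tolist())  # [1, 100, 100]
```

Because the assignment goes through `df` itself in a single indexing step, no intermediate object is involved and the result does not depend on whether CoW is enabled.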

doc/source/user_guide/copy_on_write.rst Outdated

Comment on lines 207 to 208

		with pd.option_context("mode.copy_on_write", True):
		    ...

Member

jorisvandenbossche Feb 20, 2023

So we generally track references always (and so enabling this after data has been created should generally work), but I think there are a few exceptions? (eg the mgr.add_references in #51430, or the constructors, ..) So I would either not mention this here, or if we keep it add a warning that for reliable results all data should be created within the context.

Member Author

phofl Feb 20, 2023

I was already on the fence about including it, let's leave it out then

phofl and others added 4 commits

February 20, 2023 11:28


          Update doc/source/user_guide/copy_on_write.rst

8697ab8

Co-authored-by: Joris Van den Bossche <[email protected]>


          Update doc/source/user_guide/copy_on_write.rst

3a38890

Co-authored-by: Joris Van den Bossche <[email protected]>


          Remove temporary enabling

019b2d8


          Add loc

ac0f959

phofl added this to the 2.0 milestone

mroeschke approved these changes

Member

mroeschke left a comment

My comments have been addressed

Member Author

phofl commented Mar 1, 2023

merging, will add another part about the side-effect things in a follow up

phofl merged commit d89f162 into pandas-dev:main

phofl deleted the cow_docs branch

March 1, 2023 16:23

meeseeksmachine mentioned this pull request

Backport PR #51454 on branch 2.0.x (DOC: Add user guide section about copy on write) #51719

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request


          Backport PR pandas-dev#51454: DOC: Add user guide section about copy …

ed4dfd2

…on write

phofl added a commit that referenced this pull request


          Backport PR #51454 on branch 2.0.x (DOC: Add user guide section about…

984390d

… copy on write) (#51719)

Backport PR #51454: DOC: Add user guide section about copy on write

Co-authored-by: Patrick Hoefler <[email protected]>

Labels

Copy / view semantics Docs