CoW: Push reference tracking down to the block level #51144
Changes from all commits
d968cb3
6a41cdb
b4779d2
c7fe560
47269ce
6034053
4d4f856
7101322
189c7ea
b796869
4712c0b
f3f425c
1c65bd8
896b6a8
1122d0c
cc1ad65
8c656a8
f44d9f2
9694d33
a6fbcb9
fc71add
a45a70a
f0c3b32
b05d3aa
@@ -0,0 +1,41 @@
.. _copy_on_write:

{{ header }}

*************
Copy on write
*************

Copy on Write is a mechanism to simplify the indexing API and improve
performance through avoiding copies if possible.
CoW means that any DataFrame or Series derived from another in any way always
behaves as a copy.

Reference tracking
------------------

To be able to determine, if we have to make a copy when writing into a DataFrame,
we have to be aware, if the values are shared with another DataFrame. pandas
keeps track of all ``Blocks`` that share values with another block internally to
be able to tell when a copy needs to be triggered. The reference tracking
mechanism is implemented on the Block level.

We use a custom reference tracker object, ``BlockValuesRefs``, that keeps
track of every block, whose values share memory with each other. The reference
is held through a weak-reference. Every two blocks that share some memory should
Review comment: two block -> pair of blocks
Reply: Thanks for the comments. Opened #51552.
point to the same ``BlockValuesRefs`` object. If one block goes out of
scope, the reference to this block dies. As a consequence, the reference tracker
object always knows how many blocks are alive and share memory.

Whenever a :class:`DataFrame` or :class:`Series` object is sharing data with another
object, it is required that each of those objects have its own BlockManager and Block
objects. Thus, in other words, one Block instance (that is held by a DataFrame, not
necessarily for intermediate objects) should always be uniquely used for only
a single DataFrame/Series object. For example, when you want to use the same
Block for another object, you can create a shallow copy of the Block instance
with ``block.copy(deep=False)`` (which will create a new Block instance with
the same underlying values and which will correctly set up the references).

We can ask the reference tracking object if there is another block alive that shares
data with us before writing into the values. We can trigger a copy before
writing if there is in fact another block alive.
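The weak-reference bookkeeping described in the documentation above can be sketched with the standard library alone. This is a minimal illustration, not the pandas implementation: `ToyBlock` and `ToyValuesRefs` are hypothetical stand-ins, and only the method names `add_reference` and `has_reference` and the attribute `referenced_blocks` are borrowed from the stub in this PR. The pruning of dead references inside `has_reference` is an assumption about how such a tracker could work.

```python
import gc
import weakref


class ToyBlock:
    """Hypothetical stand-in for a Block; it only holds values."""

    def __init__(self, values):
        self.values = values


class ToyValuesRefs:
    """Toy tracker holding weak references to all blocks sharing the same values."""

    def __init__(self, blk):
        self.referenced_blocks = [weakref.ref(blk)]

    def add_reference(self, blk):
        self.referenced_blocks.append(weakref.ref(blk))

    def has_reference(self):
        # Drop weak references whose block has already been collected.
        self.referenced_blocks = [
            ref for ref in self.referenced_blocks if ref() is not None
        ]
        # More than one live block means the values are shared.
        return len(self.referenced_blocks) > 1


values = [1, 2, 3]
first = ToyBlock(values)
tracker = ToyValuesRefs(first)
shared_before_view = tracker.has_reference()  # only one block so far

second = ToyBlock(values)  # a second block over the same list object
tracker.add_reference(second)
shared_with_view = tracker.has_reference()

del second
gc.collect()  # makes the sketch independent of reference-counting details
shared_after_del = tracker.has_reference()
```

Because the tracker holds only weak references, it never keeps a block alive by itself; once `second` is gone, the tracker reports that `first` is the sole owner of the values again.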
@@ -4,6 +4,7 @@ from typing import (
    final,
    overload,
)
import weakref

import numpy as np

@@ -59,8 +60,13 @@ class SharedBlock:
    _mgr_locs: BlockPlacement
    ndim: int
    values: ArrayLike
    refs: BlockValuesRefs
Review comment: Minor naming nit: if we want to simplify this (so it "speaks" better), could also be "BlockRefs", since referencing the block vs the values is kind of equivalent?
Review comment: Actually, that's maybe a good question: is it equivalent?
Reply: Theoretically this is possible (having two references to the same block). I'd prefer using shallow copies though. The way it is implemented, tracking the new refs works automatically when you create a shallow copy, while you would have to explicitly update the references with your own block again. I think this would look a bit hacky.
Reply: My thinking was, we actually want to reference the values, e.g. same values living in different blocks, hence this name. But not too happy with it either, so not opposed to renaming. BlockRefs did not capture everything to me, that's why I ended up with this name.
Review comment: Yes, agreed that we certainly intend to always use shallow copies (new blocks). Makes sense. Then another alternative is to leave out the Block and Values altogether, and do something with "Refs" (like the RefTracker). But that is very much name bikeshedding at this point and not too important ;)
    def __init__(
        self, values: ArrayLike, placement: BlockPlacement, ndim: int
        self,
        values: ArrayLike,
        placement: BlockPlacement,
        ndim: int,
        refs: BlockValuesRefs | None = ...,
    ) -> None: ...

class NumpyBlock(SharedBlock):
@@ -87,3 +93,9 @@ class BlockManager:
    ) -> None: ...
    def get_slice(self: T, slobj: slice, axis: int = ...) -> T: ...
    def _rebuild_blknos_and_blklocs(self) -> None: ...

class BlockValuesRefs:
    referenced_blocks: list[weakref.ref]
    def __init__(self, blk: SharedBlock) -> None: ...
    def add_reference(self, blk: SharedBlock) -> None: ...
    def has_reference(self) -> bool: ...
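To illustrate how the new `refs` constructor argument and the tracker interface in this stub could fit together on the write path, here is a rough sketch under stated assumptions. `ToyValuesRefs`, `ToyBlock`, and `ToyFrame` (with its `derive` and `set_value` methods) are hypothetical names for illustration only and do not exist in pandas; plain lists stand in for array values, and the copy-before-write policy shown is one possible reading of the documentation in this PR, not the actual implementation.

```python
import weakref


class ToyValuesRefs:
    """Compact toy tracker mirroring the method names in the stub."""

    def __init__(self, blk):
        self.referenced_blocks = [weakref.ref(blk)]

    def add_reference(self, blk):
        self.referenced_blocks.append(weakref.ref(blk))

    def has_reference(self):
        self.referenced_blocks = [
            ref for ref in self.referenced_blocks if ref() is not None
        ]
        return len(self.referenced_blocks) > 1


class ToyBlock:
    """Hypothetical block that accepts an optional shared tracker."""

    def __init__(self, values, refs=None):
        self.values = values
        if refs is None:
            # Fresh values: start a new tracker with this block as sole entry.
            self.refs = ToyValuesRefs(self)
        else:
            # Shared values: register this block with the existing tracker.
            refs.add_reference(self)
            self.refs = refs

    def copy(self, deep=True):
        if deep:
            return ToyBlock(list(self.values))
        # Shallow copy: new block, same values, same tracker.
        return ToyBlock(self.values, refs=self.refs)


class ToyFrame:
    """Hypothetical owner object; each frame holds its own block instance."""

    def __init__(self, block):
        self._block = block

    def derive(self):
        return ToyFrame(self._block.copy(deep=False))

    def set_value(self, index, value):
        if self._block.refs.has_reference():
            # Another live block shares these values: copy before writing.
            self._block = self._block.copy(deep=True)
        self._block.values[index] = value


parent = ToyFrame(ToyBlock([1, 2, 3]))
child = parent.derive()
child.set_value(0, 99)  # values are shared, so child copies first
parent_values = list(parent._block.values)
child_values = list(child._block.values)
```

The write through `child` leaves `parent` untouched, which is the "always behaves as a copy" guarantee from the documentation. Replacing the block on write, rather than mutating the old one, lets the old block's weak reference die so the tracker's count stays accurate.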
Review comment: comma after "aware" is unnecessary. also after "determine" on L17