-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
CoW: Push reference tracking down to the block level #51144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 20 commits
d968cb3
6a41cdb
b4779d2
c7fe560
47269ce
6034053
4d4f856
7101322
189c7ea
b796869
4712c0b
f3f425c
1c65bd8
896b6a8
1122d0c
cc1ad65
8c656a8
f44d9f2
9694d33
a6fbcb9
fc71add
a45a70a
f0c3b32
b05d3aa
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
.. _copy_on_write: | ||
|
||
{{ header }} | ||
|
||
************* | ||
Copy on write | ||
************* | ||
|
||
Copy on Write is a mechanism to simplify the indexing API and improve | ||
performance through avoiding copies if possible. | ||
CoW means that any DataFrame or Series derived from another in any way always | ||
behaves as a copy. | ||
|
||
Reference tracking | ||
------------------ | ||
|
||
To be able to determine, if we have to make a copy when writing into a DataFrame, | ||
we have to be aware, if the values are shared with another DataFrame. pandas | ||
keeps track of all ``Blocks`` that share values with another block internally to | ||
be able to tell when a copy needs to be triggered. The reference tracking | ||
mechanism is implemented on the Block level. | ||
|
||
We use a custom reference tracker object, ``BlockValuesRefs``, that keeps | ||
track of every block, whose values share memory with each other. The reference | ||
is held through a weak-reference. Every two blocks that share some memory should | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. two block -> pair of blocks There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thx for the comments. opened #51552 |
||
point to the same ``BlockValuesRefs`` object. If one block goes out of | ||
scope, the reference to this block dies. As a consequence, the reference tracker | ||
object always knows how many blocks are alive and share memory. | ||
|
||
We can ask the reference tracking object if there is another block alive that shares | ||
data with us before writing into the values. We can trigger a copy before | ||
writing if there is in fact another block alive. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,6 +4,7 @@ from typing import ( | |
final, | ||
overload, | ||
) | ||
import weakref | ||
|
||
import numpy as np | ||
|
||
|
@@ -59,8 +60,13 @@ class SharedBlock: | |
_mgr_locs: BlockPlacement | ||
ndim: int | ||
values: ArrayLike | ||
refs: BlockValuesRefs | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Minor naming nit: if we want to simplify this (so it "speaks" better), could also be "BlockRefs", since referencing the block vs the values is kind of equivalent? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually, that's maybe a good question: is it equivalent? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Theoretically this is possible (having two references to the same block). I'd prefer using shallow copies though, the way it is implemented tracking the new refs works automatically when you create a shallow copy, while you would have to explicitly update the references with your own block again. I think this would look a bit hacky There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My thinking was, we actually want to reference the values, e.g. same values living in different blocks, hence this name. But not too happy with it either, so not opposed to renaming. BlockRefs did not capture everything to me, that's why I ended up with this name There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, agreed that we certainly intent to always use shallow copies (new blocks)
Makes sense. Then another alternative is to leave out the Block and Values altogether, and do something with "Refs" (like the RefTracker). But that is very much name bikeshedding at this point and not too important ;) |
||
def __init__( | ||
self, values: ArrayLike, placement: BlockPlacement, ndim: int | ||
self, | ||
values: ArrayLike, | ||
placement: BlockPlacement, | ||
ndim: int, | ||
refs: BlockValuesRefs | None = ..., | ||
) -> None: ... | ||
|
||
class NumpyBlock(SharedBlock): | ||
|
@@ -87,3 +93,9 @@ class BlockManager: | |
) -> None: ... | ||
def get_slice(self: T, slobj: slice, axis: int = ...) -> T: ... | ||
def _rebuild_blknos_and_blklocs(self) -> None: ... | ||
|
||
class BlockValuesRefs: | ||
referenced_blocks: list[weakref.ref] | ||
def __init__(self, blk: SharedBlock) -> None: ... | ||
def add_reference(self, blk: SharedBlock) -> None: ... | ||
def has_reference(self) -> bool: ... |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -580,9 +580,16 @@ cdef class SharedBlock: | |
""" | ||
cdef: | ||
public BlockPlacement _mgr_locs | ||
public BlockValuesRefs refs | ||
readonly int ndim | ||
|
||
def __cinit__(self, values, placement: BlockPlacement, ndim: int): | ||
def __cinit__( | ||
self, | ||
values, | ||
placement: BlockPlacement, | ||
ndim: int, | ||
refs: BlockValuesRefs | None = None, | ||
): | ||
""" | ||
Parameters | ||
---------- | ||
|
@@ -591,9 +598,22 @@ cdef class SharedBlock: | |
placement : BlockPlacement | ||
ndim : int | ||
1 for SingleBlockManager/Series, 2 for BlockManager/DataFrame | ||
refs: BlockValuesRefs, optional | ||
Ref tracking object or None if block does not have any refs. | ||
""" | ||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||
self._mgr_locs = placement | ||
self.ndim = ndim | ||
if refs is None: | ||
phofl marked this conversation as resolved.
Show resolved
Hide resolved
|
||
# if no refs are passed, that means we are creating a Block from | ||
# new values that it uniquely owns -> start a new BlockValuesRefs | ||
# object that only references this block | ||
self.refs = BlockValuesRefs(self) | ||
else: | ||
phofl marked this conversation as resolved.
Show resolved
Hide resolved
|
||
# if refs are passed, this is the BlockValuesRefs object that is shared | ||
# with the parent blocks which share the values, and a reference to this | ||
# new block is added | ||
refs.add_reference(self) | ||
self.refs = refs | ||
|
||
cpdef __reduce__(self): | ||
args = (self.values, self.mgr_locs.indexer, self.ndim) | ||
|
@@ -619,9 +639,15 @@ cdef class NumpyBlock(SharedBlock): | |
cdef: | ||
public ndarray values | ||
|
||
def __cinit__(self, ndarray values, BlockPlacement placement, int ndim): | ||
def __cinit__( | ||
self, | ||
ndarray values, | ||
BlockPlacement placement, | ||
int ndim, | ||
refs: BlockValuesRefs | None = None, | ||
): | ||
# set values here; the (implicit) call to SharedBlock.__cinit__ will | ||
# set placement and ndim | ||
# set placement, ndim and refs | ||
self.values = values | ||
|
||
cpdef NumpyBlock getitem_block_index(self, slice slicer): | ||
|
@@ -631,7 +657,7 @@ cdef class NumpyBlock(SharedBlock): | |
Assumes self.ndim == 2 | ||
""" | ||
new_values = self.values[..., slicer] | ||
return type(self)(new_values, self._mgr_locs, ndim=self.ndim) | ||
return type(self)(new_values, self._mgr_locs, ndim=self.ndim, refs=self.refs) | ||
|
||
|
||
cdef class NDArrayBackedBlock(SharedBlock): | ||
|
@@ -641,9 +667,15 @@ cdef class NDArrayBackedBlock(SharedBlock): | |
cdef public: | ||
NDArrayBacked values | ||
|
||
def __cinit__(self, NDArrayBacked values, BlockPlacement placement, int ndim): | ||
def __cinit__( | ||
self, | ||
NDArrayBacked values, | ||
BlockPlacement placement, | ||
int ndim, | ||
refs: BlockValuesRefs | None = None, | ||
): | ||
# set values here; the (implicit) call to SharedBlock.__cinit__ will | ||
# set placement and ndim | ||
# set placement, ndim and refs | ||
self.values = values | ||
|
||
cpdef NDArrayBackedBlock getitem_block_index(self, slice slicer): | ||
|
@@ -653,16 +685,22 @@ cdef class NDArrayBackedBlock(SharedBlock): | |
Assumes self.ndim == 2 | ||
""" | ||
new_values = self.values[..., slicer] | ||
return type(self)(new_values, self._mgr_locs, ndim=self.ndim) | ||
return type(self)(new_values, self._mgr_locs, ndim=self.ndim, refs=self.refs) | ||
|
||
|
||
cdef class Block(SharedBlock): | ||
cdef: | ||
public object values | ||
|
||
def __cinit__(self, object values, BlockPlacement placement, int ndim): | ||
def __cinit__( | ||
self, | ||
object values, | ||
BlockPlacement placement, | ||
int ndim, | ||
refs: BlockValuesRefs | None = None, | ||
): | ||
# set values here; the (implicit) call to SharedBlock.__cinit__ will | ||
# set placement and ndim | ||
# set placement, ndim and refs | ||
self.values = values | ||
|
||
|
||
|
@@ -673,15 +711,11 @@ cdef class BlockManager: | |
public list axes | ||
public bint _known_consolidated, _is_consolidated | ||
public ndarray _blknos, _blklocs | ||
public list refs | ||
public object parent | ||
|
||
def __cinit__( | ||
self, | ||
blocks=None, | ||
axes=None, | ||
refs=None, | ||
parent=None, | ||
verify_integrity=True, | ||
): | ||
# None as defaults for unpickling GH#42345 | ||
|
@@ -695,8 +729,6 @@ cdef class BlockManager: | |
|
||
self.blocks = blocks | ||
self.axes = axes.copy() # copy to make sure we are not remotely-mutable | ||
self.refs = refs | ||
self.parent = parent | ||
|
||
# Populate known_consolidate, blknos, and blklocs lazily | ||
self._known_consolidated = False | ||
|
@@ -805,16 +837,12 @@ cdef class BlockManager: | |
ndarray blknos, blklocs | ||
|
||
nbs = [] | ||
nrefs = [] | ||
for blk in self.blocks: | ||
nb = blk.getitem_block_index(slobj) | ||
nbs.append(nb) | ||
nrefs.append(weakref.ref(blk)) | ||
|
||
new_axes = [self.axes[0], self.axes[1]._getitem_slice(slobj)] | ||
mgr = type(self)( | ||
tuple(nbs), new_axes, nrefs, parent=self, verify_integrity=False | ||
) | ||
mgr = type(self)(tuple(nbs), new_axes, verify_integrity=False) | ||
|
||
# We can avoid having to rebuild blklocs/blknos | ||
blklocs = self._blklocs | ||
|
@@ -827,7 +855,7 @@ cdef class BlockManager: | |
def get_slice(self, slobj: slice, axis: int = 0) -> BlockManager: | ||
|
||
if axis == 0: | ||
new_blocks, new_refs = self._slice_take_blocks_ax0(slobj) | ||
new_blocks = self._slice_take_blocks_ax0(slobj) | ||
elif axis == 1: | ||
return self._get_index_slice(slobj) | ||
else: | ||
|
@@ -836,6 +864,43 @@ cdef class BlockManager: | |
new_axes = list(self.axes) | ||
new_axes[axis] = new_axes[axis]._getitem_slice(slobj) | ||
|
||
return type(self)( | ||
tuple(new_blocks), new_axes, new_refs, parent=self, verify_integrity=False | ||
) | ||
return type(self)(tuple(new_blocks), new_axes, verify_integrity=False) | ||
|
||
|
||
cdef class BlockValuesRefs: | ||
"""Tracks all references to a given array. | ||
|
||
Keeps track of all blocks (through weak references) that reference the same | ||
data. | ||
""" | ||
cdef: | ||
public list referenced_blocks | ||
|
||
def __cinit__(self, blk: SharedBlock) -> None: | ||
self.referenced_blocks = [weakref.ref(blk)] | ||
|
||
def add_reference(self, blk: SharedBlock) -> None: | ||
"""Adds a new reference to our reference collection. | ||
|
||
Parameters | ||
---------- | ||
blk: SharedBlock | ||
The block that the new references should point to. | ||
""" | ||
self.referenced_blocks.append(weakref.ref(blk)) | ||
|
||
def has_reference(self) -> bool: | ||
"""Checks if block has foreign references. | ||
|
||
A reference is only relevant if it is still alive. The reference to | ||
ourselves does not count. | ||
|
||
Returns | ||
------- | ||
bool | ||
""" | ||
self.referenced_blocks = [ | ||
obj for obj in self.referenced_blocks if obj() is not None | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe a naive implementation question: Could a
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should work, yes. When I implemented it I wasn't sure if we potentially want to use the same block in 2 different objects, which would have made a list necessary. But we agreed above that this is not a good idea in general. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
While we agree we should ideally always use separate Block objects, changing this makes that a requirement, right? (because before you could have two identical Blocks in that list, although which means they would always cause the data to be referenced, even if the Manager that references one of either blocks would go out of scope. But at least it would ensure to not incorrectly not trigger CoW in such cases) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, it would be a requirement. But right now we are doing this in all cases where the CoW optimisations are enabled. If we decide this is no longer a good idea, we could change the implementation back to a list since this has no impact outside of the class. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep, and to be clear I am fine with it being a requirement (I think that makes it clearer), but then we should maybe add this expectation to the docs you started. Maybe something like:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah got you, that's a good idea. Added your suggestion to the docs |
||
] | ||
# Checking for more references than block pointing to itself | ||
return len(self.referenced_blocks) > 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comma after "aware" is unnecessary. also after "determine" on L17