CoW: Push reference tracking down to the block level #51144

Merged · 24 commits · Feb 8, 2023
41 changes: 41 additions & 0 deletions doc/source/development/copy_on_write.rst
@@ -0,0 +1,41 @@
.. _copy_on_write:

{{ header }}

*************
Copy on write
*************

Copy on Write is a mechanism to simplify the indexing API and improve
performance by avoiding copies where possible.
CoW means that any DataFrame or Series derived from another in any way always
behaves as a copy.

Reference tracking
------------------

To be able to determine, if we have to make a copy when writing into a DataFrame,
we have to be aware, if the values are shared with another DataFrame. pandas
Review comment (Member):

comma after "aware" is unnecessary. also after "determine" on L17

internally keeps track of all ``Blocks`` that share values with another block, so
that it can tell when a copy needs to be triggered. The reference tracking
mechanism is implemented at the Block level.

We use a custom reference tracker object, ``BlockValuesRefs``, that keeps
track of every block whose values share memory with each other. Each reference
is held through a weak reference. Every two blocks that share some memory should
Review comment (Member):

two block -> pair of blocks

Reply (Member Author):

thx for the comments. opened #51552

point to the same ``BlockValuesRefs`` object. If one block goes out of
scope, the reference to this block dies. As a consequence, the reference tracker
object always knows how many blocks are alive and share memory.

Whenever a :class:`DataFrame` or :class:`Series` object shares data with another
object, each of those objects must have its own BlockManager and Block
objects. In other words, one Block instance (that is held by a DataFrame, not
necessarily for intermediate objects) should only ever be used for a single
DataFrame/Series object. For example, when you want to use the same
Block for another object, you can create a shallow copy of the Block instance
with ``block.copy(deep=False)``, which will create a new Block instance with
the same underlying values and correctly set up the references.

We can ask the reference tracking object if there is another block alive that shares
data with us before writing into the values. We can trigger a copy before
writing if there is in fact another block alive.
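
As a rough illustration of the mechanism described above, here is a minimal, hypothetical Python sketch (plain stand-ins, not the actual Cython ``Block``/``BlockValuesRefs`` classes added in this PR) showing how weak references let the tracker know how many sharing blocks are still alive:

    import weakref


    class RefTracker:
        """Hypothetical, simplified stand-in for the BlockValuesRefs idea."""

        def __init__(self, blk) -> None:
            # Weak references only: a block that goes out of scope drops out
            # of the set automatically (in CPython, as soon as it is collected).
            self.referenced_blocks = weakref.WeakSet([blk])

        def add_reference(self, blk) -> None:
            self.referenced_blocks.add(blk)

        def has_reference(self) -> bool:
            # More than the one block pointing to itself -> values are shared.
            return len(self.referenced_blocks) > 1


    class ToyBlock:
        """Hypothetical stand-in for a Block; only wires up reference tracking."""

        def __init__(self, refs=None) -> None:
            if refs is None:
                # New values that this block uniquely owns: start a new tracker.
                self.refs = RefTracker(self)
            else:
                # Shared values: register this block with the existing tracker.
                refs.add_reference(self)
                self.refs = refs


    parent = ToyBlock()
    assert not parent.refs.has_reference()  # uniquely owned, safe to write

    child = ToyBlock(refs=parent.refs)      # e.g. a zero-copy slice of parent
    assert parent.refs.has_reference()      # two live blocks share the values

    del child                               # the derived block goes out of scope
    assert not parent.refs.has_reference()  # writing no longer requires a copy

The real implementation lives in ``pandas/_libs/internals.pyx`` (see the diff further down).
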
1 change: 1 addition & 0 deletions doc/source/development/index.rst
@@ -18,6 +18,7 @@ Development
contributing_codebase
maintaining
internals
copy_on_write
debugging_extensions
extending
developer
14 changes: 13 additions & 1 deletion pandas/_libs/internals.pyi
@@ -4,6 +4,7 @@ from typing import (
final,
overload,
)
import weakref

import numpy as np

@@ -59,8 +60,13 @@ class SharedBlock:
_mgr_locs: BlockPlacement
ndim: int
values: ArrayLike
refs: BlockValuesRefs
Review comment (Member):
Minor naming nit: if we want to simplify this (so it "speaks" better), could also be "BlockRefs", since referencing the block vs the values is kind of equivalent?

Reply (@jorisvandenbossche, Member, Feb 3, 2023):
Actually, that's maybe a good question: is it equivalent?
With the BlockManager tracking, the rule was that each DataFrame had its own unique manager, so one manager was never shared between different dataframes. I assume similarly here, we always create new Blocks for new Managers/DataFrames (potentially just shallow copies). So if certain values are shared (referenced) by 3 DataFrames, are there 3 unique Block objects that hold the values? Or stated differently, does the referenced_blocks list always contain weakrefs to unique blocks, or can there ever be two weakrefs to the same Block instance? Now that this is done at the Block level, that could actually be OK to have?

Reply (Member Author):
Theoretically this is possible (having two references to the same block). I'd prefer using shallow copies though: the way it is implemented, tracking the new refs works automatically when you create a shallow copy, while otherwise you would have to explicitly update the references with your own block again. I think this would look a bit hacky.

Reply (Member Author):
My thinking was, we actually want to reference the values, e.g. same values living in different blocks, hence this name. But not too happy with it either, so not opposed to renaming. BlockRefs did not capture everything to me, that's why I ended up with this name

Reply (@jorisvandenbossche, Member, Feb 6, 2023):
> Theoretically this is possible (having two references to the same block). I'd prefer using shallow copies though,

Yes, agreed that we certainly intend to always use shallow copies (new blocks)

> My thinking was, we actually want to reference the values, e.g. same values living in different blocks, hence this name. But not too happy with it either, so not opposed to renaming

Makes sense. Then another alternative is to leave out the Block and Values altogether, and do something with "Refs" (like the RefTracker). But that is very much name bikeshedding at this point and not too important ;)
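
A small hypothetical sketch of this point, using a bare ``weakref.WeakSet`` rather than the real Block API: with the WeakSet-based tracker added in this diff, registering the same block instance twice is a no-op, so sharing only becomes visible to the tracker when a new (e.g. shallowly copied) block object is created.

    import weakref


    class ToyBlock:  # hypothetical stand-in, not the real pandas Block
        pass


    a = ToyBlock()
    referenced_blocks = weakref.WeakSet([a])

    # Re-registering the same block instance is deduplicated by the WeakSet,
    # so a "shared values" check like len(referenced_blocks) > 1 stays False.
    referenced_blocks.add(a)
    assert len(referenced_blocks) == 1

    # A new block object (standing in for a.copy(deep=False)) registers as a
    # second live reference, so the tracker can now see the values are shared.
    b = ToyBlock()
    referenced_blocks.add(b)
    assert len(referenced_blocks) == 2
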

def __init__(
self, values: ArrayLike, placement: BlockPlacement, ndim: int
self,
values: ArrayLike,
placement: BlockPlacement,
ndim: int,
refs: BlockValuesRefs | None = ...,
) -> None: ...

class NumpyBlock(SharedBlock):
@@ -87,3 +93,9 @@ class BlockManager:
) -> None: ...
def get_slice(self: T, slobj: slice, axis: int = ...) -> T: ...
def _rebuild_blknos_and_blklocs(self) -> None: ...

class BlockValuesRefs:
referenced_blocks: list[weakref.ref]
def __init__(self, blk: SharedBlock) -> None: ...
def add_reference(self, blk: SharedBlock) -> None: ...
def has_reference(self) -> bool: ...
110 changes: 86 additions & 24 deletions pandas/_libs/internals.pyx
@@ -580,9 +580,16 @@ cdef class SharedBlock:
"""
cdef:
public BlockPlacement _mgr_locs
public BlockValuesRefs refs
readonly int ndim

def __cinit__(self, values, placement: BlockPlacement, ndim: int):
def __cinit__(
self,
values,
placement: BlockPlacement,
ndim: int,
refs: BlockValuesRefs | None = None,
):
"""
Parameters
----------
@@ -591,9 +598,22 @@
placement : BlockPlacement
ndim : int
1 for SingleBlockManager/Series, 2 for BlockManager/DataFrame
refs: BlockValuesRefs, optional
Ref tracking object or None if block does not have any refs.
"""
self._mgr_locs = placement
self.ndim = ndim
if refs is None:
# if no refs are passed, that means we are creating a Block from
# new values that it uniquely owns -> start a new BlockValuesRefs
# object that only references this block
self.refs = BlockValuesRefs(self)
else:
# if refs are passed, this is the BlockValuesRefs object that is shared
# with the parent blocks which share the values, and a reference to this
# new block is added
refs.add_reference(self)
self.refs = refs

cpdef __reduce__(self):
args = (self.values, self.mgr_locs.indexer, self.ndim)
@@ -619,9 +639,15 @@ cdef class NumpyBlock(SharedBlock):
cdef:
public ndarray values

def __cinit__(self, ndarray values, BlockPlacement placement, int ndim):
def __cinit__(
self,
ndarray values,
BlockPlacement placement,
int ndim,
refs: BlockValuesRefs | None = None,
):
# set values here; the (implicit) call to SharedBlock.__cinit__ will
# set placement and ndim
# set placement, ndim and refs
self.values = values

cpdef NumpyBlock getitem_block_index(self, slice slicer):
@@ -631,7 +657,7 @@
Assumes self.ndim == 2
"""
new_values = self.values[..., slicer]
return type(self)(new_values, self._mgr_locs, ndim=self.ndim)
return type(self)(new_values, self._mgr_locs, ndim=self.ndim, refs=self.refs)


cdef class NDArrayBackedBlock(SharedBlock):
@@ -641,9 +667,15 @@ cdef class NDArrayBackedBlock(SharedBlock):
cdef public:
NDArrayBacked values

def __cinit__(self, NDArrayBacked values, BlockPlacement placement, int ndim):
def __cinit__(
self,
NDArrayBacked values,
BlockPlacement placement,
int ndim,
refs: BlockValuesRefs | None = None,
):
# set values here; the (implicit) call to SharedBlock.__cinit__ will
# set placement and ndim
# set placement, ndim and refs
self.values = values

cpdef NDArrayBackedBlock getitem_block_index(self, slice slicer):
@@ -653,16 +685,22 @@
Assumes self.ndim == 2
"""
new_values = self.values[..., slicer]
return type(self)(new_values, self._mgr_locs, ndim=self.ndim)
return type(self)(new_values, self._mgr_locs, ndim=self.ndim, refs=self.refs)


cdef class Block(SharedBlock):
cdef:
public object values

def __cinit__(self, object values, BlockPlacement placement, int ndim):
def __cinit__(
self,
object values,
BlockPlacement placement,
int ndim,
refs: BlockValuesRefs | None = None,
):
# set values here; the (implicit) call to SharedBlock.__cinit__ will
# set placement and ndim
# set placement, ndim and refs
self.values = values


@@ -673,15 +711,11 @@ cdef class BlockManager:
public list axes
public bint _known_consolidated, _is_consolidated
public ndarray _blknos, _blklocs
public list refs
public object parent

def __cinit__(
self,
blocks=None,
axes=None,
refs=None,
parent=None,
verify_integrity=True,
):
# None as defaults for unpickling GH#42345
@@ -695,8 +729,6 @@

self.blocks = blocks
self.axes = axes.copy() # copy to make sure we are not remotely-mutable
self.refs = refs
self.parent = parent

# Populate known_consolidate, blknos, and blklocs lazily
self._known_consolidated = False
@@ -805,16 +837,12 @@
ndarray blknos, blklocs

nbs = []
nrefs = []
for blk in self.blocks:
nb = blk.getitem_block_index(slobj)
nbs.append(nb)
nrefs.append(weakref.ref(blk))

new_axes = [self.axes[0], self.axes[1]._getitem_slice(slobj)]
mgr = type(self)(
tuple(nbs), new_axes, nrefs, parent=self, verify_integrity=False
)
mgr = type(self)(tuple(nbs), new_axes, verify_integrity=False)

# We can avoid having to rebuild blklocs/blknos
blklocs = self._blklocs
@@ -827,7 +855,7 @@
def get_slice(self, slobj: slice, axis: int = 0) -> BlockManager:

if axis == 0:
new_blocks, new_refs = self._slice_take_blocks_ax0(slobj)
new_blocks = self._slice_take_blocks_ax0(slobj)
elif axis == 1:
return self._get_index_slice(slobj)
else:
@@ -836,6 +864,40 @@
new_axes = list(self.axes)
new_axes[axis] = new_axes[axis]._getitem_slice(slobj)

return type(self)(
tuple(new_blocks), new_axes, new_refs, parent=self, verify_integrity=False
)
return type(self)(tuple(new_blocks), new_axes, verify_integrity=False)


cdef class BlockValuesRefs:
"""Tracks all references to a given array.

Keeps track of all blocks (through weak references) that reference the same
data.
"""
cdef:
public object referenced_blocks

def __cinit__(self, blk: SharedBlock) -> None:
self.referenced_blocks = weakref.WeakSet([blk])

def add_reference(self, blk: SharedBlock) -> None:
"""Adds a new reference to our reference collection.

Parameters
----------
blk: SharedBlock
The block that the new references should point to.
"""
self.referenced_blocks.add(blk)

def has_reference(self) -> bool:
"""Checks if block has foreign references.

A reference is only relevant if it is still alive. The reference to
ourselves does not count.

Returns
-------
bool
"""
# Checking for more references than block pointing to itself
return len(self.referenced_blocks) > 1
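
Taken together, the changes above make sharing observable through the new ``refs`` attribute. A hedged end-to-end illustration (it relies on internal attributes added in this diff and on CoW behaviour as implemented here; exact behaviour and attribute names may differ between pandas versions):

    import pandas as pd

    pd.set_option("mode.copy_on_write", True)  # enable CoW mode

    df = pd.DataFrame({"a": [1, 2, 3]})
    blk = df._mgr.blocks[0]                  # internal API, illustration only
    print(blk.refs.has_reference())          # expected: False, values uniquely owned

    view = df[:]                             # row slice: new Block, shared values
    print(blk.refs.has_reference())          # expected: True, two live blocks

    del view                                 # the derived object goes out of scope
    print(blk.refs.has_reference())          # expected: False, writing needs no copy
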
2 changes: 1 addition & 1 deletion pandas/compat/pickle_compat.py
@@ -168,7 +168,7 @@ def load_newobj(self) -> None:
arr = np.array([], dtype="m8[ns]")
obj = cls.__new__(cls, arr, arr.dtype)
elif cls is BlockManager and not args:
obj = cls.__new__(cls, (), [], None, False)
obj = cls.__new__(cls, (), [], False)
else:
obj = cls.__new__(cls, *args)

6 changes: 4 additions & 2 deletions pandas/core/generic.py
@@ -768,8 +768,10 @@ def swapaxes(
)
assert isinstance(new_mgr, BlockManager)
assert isinstance(self._mgr, BlockManager)
new_mgr.parent = self._mgr
new_mgr.refs = [weakref.ref(self._mgr.blocks[0])]
new_mgr.blocks[0].refs = self._mgr.blocks[0].refs
new_mgr.blocks[0].refs.add_reference(
new_mgr.blocks[0] # type: ignore[arg-type]
)
return self._constructor(new_mgr).__finalize__(self, method="swapaxes")

elif (copy or copy is None) and self._mgr.is_single_block: