
CoW: Push reference tracking down to the block level #51144


Merged: 24 commits, Feb 8, 2023
Changes from 9 commits
32 changes: 32 additions & 0 deletions doc/source/development/copy_on_write.rst
@@ -0,0 +1,32 @@
.. _copy_on_write:

{{ header }}

*************
Copy on write
*************

Copy on Write is a mechanism to simplify the indexing API and to improve
performance by avoiding copies where possible. CoW means that any DataFrame
or Series derived from another in any way always behaves as a copy.
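For example (this assumes a pandas version where the ``mode.copy_on_write`` option exists; in pandas 3.x CoW is always enabled and the option may be gone, hence the ``try``/``except``):

```python
import pandas as pd

# Opt in to Copy-on-Write on pandas 1.5-2.x; on pandas 3.x it is always on.
try:
    pd.set_option("mode.copy_on_write", True)
except Exception:
    pass

df = pd.DataFrame({"a": [1, 2, 3]})
view = df[:]           # derived object: behaves as a copy under CoW
view.iloc[0, 0] = 100  # the write triggers a copy first

print(df["a"].tolist())    # parent is untouched: [1, 2, 3]
print(view["a"].tolist())  # [100, 2, 3]
```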

Reference tracking
------------------

To be able to determine if we have to make a copy when writing into a DataFrame,
we have to be aware if the values are shared with another DataFrame. pandas
keeps track of all ``Blocks`` that share values with another block internally to
be able to tell when a copy needs to be triggered. The reference tracking
mechanism is implemented on the Block level.

We use a custom reference tracker object, ``BlockValuesRefs``, that keeps
track of every block whose values share memory with each other. The reference
is held through a weak reference. Every pair of blocks that share some memory
should point to the same ``BlockValuesRefs`` object. If one block goes out of
scope, the reference to this block dies. As a consequence, the reference tracker
object always knows how many blocks are alive and share memory.

We can ask the reference tracking object if there is another block alive that
shares data with us before writing into the values. If there is in fact another
block alive, we can trigger a copy before writing.

[Reply — Member Author]
Thanks for the comments; opened #51552 for the wording nits.
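A rough, pure-Python sketch of this tracking scheme (the names are illustrative only; the real implementation is the Cython ``BlockValuesRefs`` in ``pandas/_libs/internals.pyx``, and the immediate cleanup after ``del`` relies on CPython reference counting):

```python
import weakref


class BlockValuesRefsSketch:
    """Plain-Python sketch of the reference tracker; illustrative only."""

    def __init__(self, blk):
        self.referenced_blocks = [weakref.ref(blk)]

    def add_reference(self, blk):
        self.referenced_blocks.append(weakref.ref(blk))

    def has_reference(self):
        # Prune references to blocks that have gone out of scope.
        self.referenced_blocks = [
            ref for ref in self.referenced_blocks if ref() is not None
        ]
        # One live reference is the asking block itself; more means shared memory.
        return len(self.referenced_blocks) > 1


class Block:
    def __init__(self, values, refs=None):
        self.values = values
        if refs is None:
            refs = BlockValuesRefsSketch(self)
        else:
            refs.add_reference(self)
        self.refs = refs


parent = Block([1, 2, 3])
child = Block(parent.values, refs=parent.refs)  # shallow copy, shares memory

print(child.refs.has_reference())  # True: parent is still alive
del parent
print(child.refs.has_reference())  # False: only the child is left
```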
1 change: 1 addition & 0 deletions doc/source/development/index.rst
@@ -18,6 +18,7 @@ Development
contributing_codebase
maintaining
internals
copy_on_write
debugging_extensions
extending
developer
14 changes: 13 additions & 1 deletion pandas/_libs/internals.pyi
@@ -4,6 +4,7 @@ from typing (
final,
overload,
)
import weakref

import numpy as np

@@ -59,8 +60,13 @@ class SharedBlock:
_mgr_locs: BlockPlacement
ndim: int
values: ArrayLike
refs: BlockValuesRefs
[Review comment — Member]
Minor naming nit: if we want to simplify this (so it "speaks" better), it could also be "BlockRefs", since referencing the block vs. the values is kind of equivalent?

[Reply — jorisvandenbossche (Member), Feb 3, 2023]
Actually, that's maybe a good question: is it equivalent? With the BlockManager tracking, the rule was that each DataFrame had its own unique manager, so one manager was never shared between different DataFrames. I assume similarly here we always create new Blocks for new Managers/DataFrames (potentially just shallow copies). So if certain values are shared (referenced) by 3 DataFrames, there are 3 unique Block objects that hold the values? Stated differently: does the referenced_blocks list always contain weakrefs to unique blocks, or can there ever be two weakrefs to the same Block instance? Now that this is done at the Block level, that could actually be OK to have?

[Reply — Member Author]
Theoretically this is possible (having two references to the same block). I'd prefer using shallow copies though: as implemented, tracking the new refs works automatically when you create a shallow copy, while otherwise you would have to explicitly update the references with your own block again. I think that would look a bit hacky.

[Reply — Member Author]
My thinking was that we actually want to reference the values, e.g. the same values living in different blocks, hence this name. But I'm not too happy with it either, so not opposed to renaming. "BlockRefs" did not capture everything to me; that's why I ended up with this name.

[Reply — jorisvandenbossche (Member), Feb 6, 2023]
Yes, agreed that we certainly intend to always use shallow copies (new blocks). The naming rationale makes sense; another alternative is to leave out "Block" and "Values" altogether and do something with "Refs" (like RefTracker). But that is very much name bikeshedding at this point and not too important ;)

def __init__(
-    self, values: ArrayLike, placement: BlockPlacement, ndim: int
+    self,
+    values: ArrayLike,
+    placement: BlockPlacement,
+    ndim: int,
+    refs: BlockValuesRefs | None = ...,
) -> None: ...

class NumpyBlock(SharedBlock):
@@ -87,3 +93,9 @@ class BlockManager:
) -> None: ...
def get_slice(self: T, slobj: slice, axis: int = ...) -> T: ...
def _rebuild_blknos_and_blklocs(self) -> None: ...

class BlockValuesRefs:
referenced_blocks: list[weakref.ref]
def __init__(self, blk: SharedBlock) -> None: ...
def add_reference(self, blk: SharedBlock) -> None: ...
def has_reference(self) -> bool: ...
105 changes: 81 additions & 24 deletions pandas/_libs/internals.pyx
@@ -580,9 +580,16 @@ cdef class SharedBlock:
"""
cdef:
public BlockPlacement _mgr_locs
public BlockValuesRefs refs
readonly int ndim

-def __cinit__(self, values, placement: BlockPlacement, ndim: int):
+def __cinit__(
+    self,
+    values,
+    placement: BlockPlacement,
+    ndim: int,
+    refs: BlockValuesRefs | None = None,
+):
"""
Parameters
----------
@@ -594,6 +601,11 @@
"""
self._mgr_locs = placement
self.ndim = ndim
if refs is None:
self.refs = BlockValuesRefs(self)
else:
refs.add_reference(self)
self.refs = refs

cpdef __reduce__(self):
args = (self.values, self.mgr_locs.indexer, self.ndim)
@@ -619,9 +631,15 @@ cdef class NumpyBlock(SharedBlock):
cdef:
public ndarray values

-def __cinit__(self, ndarray values, BlockPlacement placement, int ndim):
+def __cinit__(
+    self,
+    ndarray values,
+    BlockPlacement placement,
+    int ndim,
+    refs: BlockValuesRefs | None = None,
+):
# set values here; the (implicit) call to SharedBlock.__cinit__ will
-# set placement and ndim
+# set placement, ndim and refs
self.values = values

cpdef NumpyBlock getitem_block_index(self, slice slicer):
@@ -631,7 +649,7 @@
Assumes self.ndim == 2
"""
new_values = self.values[..., slicer]
-return type(self)(new_values, self._mgr_locs, ndim=self.ndim)
+return type(self)(new_values, self._mgr_locs, ndim=self.ndim, refs=self.refs)


cdef class NDArrayBackedBlock(SharedBlock):
@@ -641,9 +659,15 @@
cdef public:
NDArrayBacked values

-def __cinit__(self, NDArrayBacked values, BlockPlacement placement, int ndim):
+def __cinit__(
+    self,
+    NDArrayBacked values,
+    BlockPlacement placement,
+    int ndim,
+    refs: BlockValuesRefs | None = None,
+):
# set values here; the (implicit) call to SharedBlock.__cinit__ will
-# set placement and ndim
+# set placement, ndim and refs
self.values = values

cpdef NDArrayBackedBlock getitem_block_index(self, slice slicer):
@@ -653,16 +677,22 @@
Assumes self.ndim == 2
"""
new_values = self.values[..., slicer]
-return type(self)(new_values, self._mgr_locs, ndim=self.ndim)
+return type(self)(new_values, self._mgr_locs, ndim=self.ndim, refs=self.refs)


cdef class Block(SharedBlock):
cdef:
public object values

-def __cinit__(self, object values, BlockPlacement placement, int ndim):
+def __cinit__(
+    self,
+    object values,
+    BlockPlacement placement,
+    int ndim,
+    refs: BlockValuesRefs | None = None,
+):
# set values here; the (implicit) call to SharedBlock.__cinit__ will
-# set placement and ndim
+# set placement, ndim and refs
self.values = values


@@ -673,15 +703,11 @@ cdef class BlockManager:
public list axes
public bint _known_consolidated, _is_consolidated
public ndarray _blknos, _blklocs
-public list refs
-public object parent

def __cinit__(
self,
blocks=None,
axes=None,
-refs=None,
-parent=None,
verify_integrity=True,
):
# None as defaults for unpickling GH#42345
@@ -695,8 +721,6 @@

self.blocks = blocks
self.axes = axes.copy() # copy to make sure we are not remotely-mutable
-self.refs = refs
-self.parent = parent

# Populate known_consolidate, blknos, and blklocs lazily
self._known_consolidated = False
@@ -805,16 +829,12 @@
ndarray blknos, blklocs

nbs = []
-nrefs = []
for blk in self.blocks:
nb = blk.getitem_block_index(slobj)
nbs.append(nb)
-nrefs.append(weakref.ref(blk))

new_axes = [self.axes[0], self.axes[1]._getitem_slice(slobj)]
-mgr = type(self)(
-    tuple(nbs), new_axes, nrefs, parent=self, verify_integrity=False
-)
+mgr = type(self)(tuple(nbs), new_axes, verify_integrity=False)

# We can avoid having to rebuild blklocs/blknos
blklocs = self._blklocs
@@ -827,7 +847,7 @@
def get_slice(self, slobj: slice, axis: int = 0) -> BlockManager:

if axis == 0:
-new_blocks, new_refs = self._slice_take_blocks_ax0(slobj)
+new_blocks = self._slice_take_blocks_ax0(slobj)
elif axis == 1:
return self._get_index_slice(slobj)
else:
@@ -836,6 +856,43 @@
new_axes = list(self.axes)
new_axes[axis] = new_axes[axis]._getitem_slice(slobj)

-return type(self)(
-    tuple(new_blocks), new_axes, new_refs, parent=self, verify_integrity=False
-)
+return type(self)(tuple(new_blocks), new_axes, verify_integrity=False)


cdef class BlockValuesRefs:
"""Tracks all references to a given array.

Keeps track of all blocks (through weak references) that reference the same
data.
"""
cdef:
public list referenced_blocks

def __cinit__(self, blk: SharedBlock) -> None:
self.referenced_blocks = [weakref.ref(blk)]

def add_reference(self, blk: SharedBlock) -> None:
"""Adds a new reference to our reference collection.

Parameters
----------
blk: SharedBlock
The block that the new references should point to.
"""
self.referenced_blocks.append(weakref.ref(blk))

def has_reference(self) -> bool:
"""Checks if block has foreign references.

A reference is only relevant if it is still alive. The reference to
ourselves does not count.

Returns
-------
bool
"""
self.referenced_blocks = [
obj for obj in self.referenced_blocks if obj() is not None
]
# Checking for more references than the block pointing to itself
return len(self.referenced_blocks) > 1

[Review comment — Member]
Maybe a naive implementation question: could a weakref.WeakSet be used to track the references, such that

1. we don't unnecessarily add more than one weakref to the same block, and
2. the set prunes itself of refs that no longer exist?

[Reply — Member Author]
This should work, yes. When I implemented it I wasn't sure if we would potentially want to use the same block in two different objects, which would have made a list necessary. But we agreed above that this is not a good idea in general.

[Reply — Member Author]
Updated

[Reply — Member]
While we agree we should ideally always use separate Block objects, changing this makes it a requirement, right? (Before, you could have two identical Blocks in that list, which meant the data would always be considered referenced, even if the Manager referencing one of the blocks went out of scope. But at least that ensured CoW was never incorrectly skipped in such cases.)

[Reply — Member Author]
Yes, it would be a requirement. But right now we are doing this in all cases where the CoW optimisations are enabled. If we decide this is no longer a good idea, we can change the implementation back to a list, since this has no impact outside of the class.

[Reply — Member]
Yep, and to be clear I am fine with it being a requirement (I think that makes it clearer), but then we should maybe add this expectation to the docs you started. Maybe something like:

    Whenever a DataFrame or Series object shares data with another object, it is required that each of those objects has its own BlockManager and Block objects. In other words, one Block instance (that is held by a DataFrame, not necessarily for intermediate objects) should always be used for only a single DataFrame/Series object. For example, when you want to use the same Block for another object, you can create a shallow copy of the Block instance with block.copy(deep=False), which creates a new Block instance with the same underlying values and correctly sets up the references.

[Reply — Member Author]
Ah, got you, that's a good idea. Added your suggestion to the docs.
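The copy trigger that ``has_reference`` enables can be sketched with a hypothetical, pure-Python write path (the class and method names below are illustrative stand-ins, not pandas API):

```python
import weakref


class Refs:
    """Minimal stand-in for BlockValuesRefs (illustrative only)."""

    def __init__(self, blk):
        self.referenced_blocks = [weakref.ref(blk)]

    def add_reference(self, blk):
        self.referenced_blocks.append(weakref.ref(blk))

    def has_reference(self):
        # Prune dead weakrefs before counting.
        self.referenced_blocks = [
            ref for ref in self.referenced_blocks if ref() is not None
        ]
        return len(self.referenced_blocks) > 1


class Block:
    def __init__(self, values, refs=None):
        self.values = values
        if refs is None:
            refs = Refs(self)
        else:
            refs.add_reference(self)
        self.refs = refs

    def setitem(self, i, value):
        # The CoW gate: copy before writing if another live block shares memory.
        if self.refs.has_reference():
            self.values = list(self.values)  # copy the data
            self.refs = Refs(self)           # start a fresh reference chain
        self.values[i] = value


a = Block([1, 2, 3])
b = Block(a.values, refs=a.refs)  # shares memory with a
b.setitem(0, 99)

print(a.values)  # [1, 2, 3] -- untouched
print(b.values)  # [99, 2, 3] -- copied before the write
```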
2 changes: 1 addition & 1 deletion pandas/compat/pickle_compat.py
@@ -168,7 +168,7 @@ def load_newobj(self) -> None:
arr = np.array([], dtype="m8[ns]")
obj = cls.__new__(cls, arr, arr.dtype)
elif cls is BlockManager and not args:
-obj = cls.__new__(cls, (), [], None, False)
+obj = cls.__new__(cls, (), [], False)
else:
obj = cls.__new__(cls, *args)

6 changes: 4 additions & 2 deletions pandas/core/generic.py
@@ -768,8 +768,10 @@ def swapaxes(
)
assert isinstance(new_mgr, BlockManager)
assert isinstance(self._mgr, BlockManager)
-new_mgr.parent = self._mgr
-new_mgr.refs = [weakref.ref(self._mgr.blocks[0])]
+new_mgr.blocks[0].refs = self._mgr.blocks[0].refs
+new_mgr.blocks[0].refs.add_reference(
+    new_mgr.blocks[0]  # type: ignore[arg-type]
+)
return self._constructor(new_mgr).__finalize__(self, method="swapaxes")

elif (copy or copy is None) and self._mgr.is_single_block: