CoW: Push reference tracking down to the block level #51144
Conversation
This is looking great! And a nice simplification ;)
I will only have time on Monday (during the day) for a thorough review, just a few quick comments
One thing I am wondering: because it is now at the Block level, it might be more dtype / Block-subclass specific, and we should maybe parametrize our basic copy_view tests more with different categories of dtypes (eg at least numpy-based vs extension/nullable). (Something we should have done anyway; it can also be done in a follow-up/precursor, happy to take a look at that.)
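To make that suggestion concrete, a rough sketch of what such a dtype parametrization could look like (the fixture name, dtype list, and test body are illustrative, not pandas' actual fixtures or tests):

```python
import pandas as pd
import pytest


@pytest.fixture(params=["int64", "Int64", "float64", "Float64"])
def any_tracked_dtype(request):
    # spans both numpy-based and nullable extension dtypes
    return request.param


def test_copy_is_independent(any_tracked_dtype):
    df = pd.DataFrame({"a": [1, 2, 3]}, dtype=any_tracked_dtype)
    result = df.copy()
    result.iloc[0, 0] = 100
    # mutating the copy must never leak back into the original,
    # whichever dtype backend the underlying block uses
    assert df.iloc[0, 0] == 1
```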
@@ -59,8 +60,13 @@ class SharedBlock:
    _mgr_locs: BlockPlacement
    ndim: int
    values: ArrayLike
    refs: BlockValuesRefs
Minor naming nit: if we want to simplify this (so it "speaks" better), could also be "BlockRefs", since referencing the block vs the values is kind of equivalent?
Actually, that's maybe a good question: is it equivalent?
With the BlockManager tracking, the rule was that each DataFrame had its own unique manager, so one manager was never shared between different DataFrames. I assume similarly here, we always create new Blocks for new Managers/DataFrames (potentially just shallow copies). So if a certain values array is shared (referenced) by 3 DataFrames, there are 3 unique Block objects that hold the values? Or stated differently, is the referenced_blocks list always a list of weakrefs to unique blocks? Or can there ever be two weakrefs to the same Block instance? Now that this is done at the Block level, that could actually be OK to have?
Theoretically this is possible (having two references to the same block). I'd prefer using shallow copies though: the way it is implemented, tracking the new refs works automatically when you create a shallow copy, while otherwise you would have to explicitly register your own block with the references again. I think this would look a bit hacky.
My thinking was, we actually want to reference the values, e.g. same values living in different blocks, hence this name. But not too happy with it either, so not opposed to renaming. BlockRefs did not capture everything to me, that's why I ended up with this name
Theoretically this is possible (having two references to the same block). I'd prefer using shallow copies though,
Yes, agreed that we certainly intend to always use shallow copies (new blocks)
My thinking was, we actually want to reference the values, e.g. same values living in different blocks, hence this name. But not too happy with it either, so not opposed to renaming
Makes sense. Then another alternative is to leave out the Block and Values altogether, and do something with "Refs" (like the RefTracker). But that is very much name bikeshedding at this point and not too important ;)
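For readers following along, here is a minimal, self-contained sketch of the idea being discussed: one refs object per set of shared values, holding weakrefs to every Block that views them. This is an illustration only, not pandas' actual Cython implementation.

```python
import weakref


class Block:
    """Stand-in for a pandas Block; only here so we have something to weakref."""


class ValuesRefsSketch:
    """One instance per set of shared values; every Block viewing those
    values registers itself here through a weak reference."""

    def __init__(self, blk: Block) -> None:
        self.referenced_blocks = [weakref.ref(blk)]

    def add_reference(self, blk: Block) -> None:
        # called when another (shallow-copied) Block starts viewing the values
        self.referenced_blocks.append(weakref.ref(blk))

    def has_reference(self) -> bool:
        # prune dead weakrefs, then check whether any other block is still alive
        self.referenced_blocks = [
            ref for ref in self.referenced_blocks if ref() is not None
        ]
        return len(self.referenced_blocks) > 1


# Three "DataFrames" sharing one values array -> three distinct Block objects,
# all registered with the same refs object.
b1 = Block()
refs = ValuesRefsSketch(b1)
b2, b3 = Block(), Block()
refs.add_reference(b2)
refs.add_reference(b3)
print(refs.has_reference())  # True: other blocks view the same values
del b2, b3
print(refs.has_reference())  # False: only b1 itself is left
```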
# TODO(CoW): This is tricky, if parent block goes out of scope
# all split blocks are referencing each other even though they
# don't share data
refs = self.refs if self.refs.has_reference() else None
Can we create kind of a "clone" of the refs (duplicate it)? I.e. create a new refs object that holds new weakrefs to the same objects. That would ensure that both lists are independent, and the two split blocks don't reference each other.
This is doable (in theory), but not in this PR (it would make things way more complicated). The thing that is not clear to me with this solution is how we would update all parent blocks that share this reference object.
You could have:
a -> b -> c -> split happens -> d1, d2, d3, ...
When you clone the reference object, you would have to make sure that a, b and c are all aware of the new object. The simplest solution would probably be that a reference object can point to another reference object, which would live with the split blocks. These chains can get arbitrarily deep though, so we have to think the exact approach through.
The good thing about this class is that we can hide the implementation inside it without having to worry about it in other areas.
The thing that is not clear to me with this solution is how we would update all parent blocks that share this reference object
Ah, yes, that's an aspect that I didn't consider when writing the above comment, and that indeed makes it quite a bit more complicated ..
Shorter term, we could maybe implement a small optimization for the case where the original DataFrame doesn't have any references: then each of the sliced blocks will still have no foreign references?
In [1]: df = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
In [3]: del df['b']
In [4]: df._mgr
Out[4]:
BlockManager
Items: Index(['a', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
NumericBlock: slice(0, 1, 1), 1 x 10, dtype: float64
NumericBlock: slice(1, 2, 1), 1 x 10, dtype: float64
NumericBlock: slice(2, 3, 1), 1 x 10, dtype: float64
In [9]: df._mgr.blocks[0].refs.referenced_blocks
Out[9]:
[<weakref at 0x7fe48cc6ec20; dead>,
<weakref at 0x7fe48caaa0e0; to 'NumericBlock' at 0x7fe48cb8d9a0>,
<weakref at 0x7fe48cb82040; to 'NumericBlock' at 0x7fe48cb87160>,
<weakref at 0x7fe48cb82680; to 'NumericBlock' at 0x7fe48cb6f100>]
I'd prefer doing this in a follow-up. We are calling _slice_take_blocks_ax0 under the hood, which would have to avoid tracking references via a keyword. Not super difficult, but it would make the PR messier than necessary. But I will get a PR up before 2.0.
Yep, to be clear my "shorter term" was meant as "after this PR and if we don't directly have a better general solution for 2.0" ;)
if unfit_val_locs:
    unfit_idxr = np.concatenate(unfit_mgr_locs)
    unfit_count = len(unfit_idxr)

    new_blocks: list[Block] = []
    # TODO(CoW) is this always correct to assume that the new_blocks
    # are not referencing anything else?
There are some tests in copy_view/test_setitem.py that check we do copy the array-like values we are setting (not sure how complete they are though, given this is a beast of a method). And so because we copy upfront, that is currently fine regarding CoW (for setting full columns, which I think is what iset is used for?). Although a future optimization we can add is to actually delay the copy for a Series/DataFrame being set as new column(s). And then we will have to pass through its refs somehow to this method.
EDIT: Ah, I see it was my own comment that was moved ;)
Yep :) This is on my todo list as well, just moved it to a better place after the modification
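As an illustration of the behaviour discussed above (an assumed example; whether the arrays share memory depends on the pandas version and on whether the delayed-copy optimization has landed, so it is printed rather than asserted):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
df = pd.DataFrame({"a": [4.0, 5.0, 6.0]})
df["b"] = s  # at the time of this PR, the values are copied up front

# False with an up-front copy; the delayed-copy optimization discussed above
# would instead share s's values and register a reference, deferring the copy
print(np.shares_memory(df["b"].to_numpy(), s.to_numpy()))
```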
df["b"] = 100
arr = get_array(df, "a")
df.iloc[0, 0] = 100  # Check that we correctly track reference
The comment is a bit confusing, because it's testing that we are not tracking a reference.
Adjusted the comment
Some additional test ideas:
Agree with parametrising the tests, this was on my todo list as well. Would prefer to do it in a separate PR though.
Added tests for your two cases. The del right now sets up references such that we think there is another reference alive, because we are creating shallow copies under the hood. Is there a way that delete would actually require us to keep references alive? Otherwise we could fix this in idelete on the manager level.
All my comments are relatively minor things; I think we can certainly try to merge this shortly so other CoW PRs don't need to be held up.
Some comments I couldn't make inline:
- Did you check for all occurrences of Block.make_block() that the values are newly created and no refs need to be tracked? For example, _maybe_downcast doesn't necessarily copy the data (but I didn't check in detail where _maybe_downcast itself is being used; it might do that only in cases where we already have a copy). It's also used in astype, so this might be something to handle in an updated version of the PR that handles astype. From a quick look the rest seems OK.
- Block.convert() can currently return [self]. Should that be [self.copy(deep=False)]? Again, it might depend on where this is actually called. For example it is called from infer_objects, but that's also something for which we have a follow-up PR. Other cases where this is called from within other Block methods might already have copied beforehand.
- Can you add a comment to Block.iget() that it returns a view and that the caller needs to take care of keeping a reference if needed? (I did something similar for BlockManager.iget_values, can copy from there; a possible wording is sketched just below this comment.)
- Just a question: do we (eventually, later, not for this PR) move some of the CoW triggers from the manager to the block level as well?

Before merging: do we want to do the ref tracking only if CoW is enabled? I think that should be fine to keep as is (also before this PR we were creating the refs list of weakrefs on the BlockManager in all cases, while only using it when CoW is enabled; parent was only set if CoW was actually enabled, but that's because it was a hard reference. In this PR we only have weakrefs, so that should be fine).
And thanks for starting some developer-oriented documentation!
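For the Block.iget() item in the list above, the kind of note being asked for could look roughly like the docstring below. This is a toy stand-in class, not pandas' Block, and the wording is assumed (modeled on the existing BlockManager.iget_values comment):

```python
import numpy as np


class BlockSketch:
    """Toy stand-in for a pandas Block, only to illustrate the requested note."""

    def __init__(self, values: np.ndarray) -> None:
        self.values = values

    def iget(self, i: int) -> np.ndarray:
        """
        Return the values at position i.

        Warning: the returned array is a view on the block's values, no copy
        is made. The caller needs to take care of keeping a reference to the
        block (or copying) if the data must not be shared or mutated later.
        """
        return self.values[i]


blk = BlockSketch(np.arange(6.0).reshape(2, 3))
row = blk.iget(0)
row[0] = 99.0
print(blk.values[0, 0])  # 99.0 -- iget returned a view, not a copy
```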
@@ -663,7 +626,7 @@ def consolidate(self: T) -> T:
     if self.is_consolidated():
         return self

-    bm = type(self)(self.blocks, self.axes, self.refs, verify_integrity=False)
+    bm = type(self)(self.blocks, self.axes, verify_integrity=False)
Should we pass shallow copies of the blocks in self.blocks here to keep track of references?
_consolidate_inplace will not update all blocks, so this might create a new BlockManager with blocks that share data with the original BlockManager.
There isn't a call where this would matter right now. It's either coming from copy or from a function where we did a copy of the blocks anyway. At least I couldn't find any. Could add it though if you prefer.
def test_assigning_to_same_variable_removes_references(using_copy_on_write):
    df = DataFrame({"a": [1, 2, 3]})
    df = df.reset_index()
    arr = get_array(df, "a")
-    arr = get_array(df, "a")
+    assert df._mgr.blocks[0].refs.has_reference() is False
+    arr = get_array(df, "a")
That's more explicit, but maybe redundant with the mutation check? (and also dependent on the order of the blocks in the result)
(and also dependent on the order of the blocks in the result)
Could also be assert df._mgr._has_no_reference(1) to be robust against that.
I like the last one, will add that
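With that change, the adjusted test could look roughly as follows. This is a sketch only: the block/column position passed to _has_no_reference and the tail of the test (the mutation check mentioned above) are assumed, since they are not shown in the quoted snippet; DataFrame, get_array, np and the using_copy_on_write fixture come from the surrounding test module.

```python
def test_assigning_to_same_variable_removes_references(using_copy_on_write):
    df = DataFrame({"a": [1, 2, 3]})
    df = df.reset_index()
    # the previous df object went out of scope, so nothing should be referenced
    assert df._mgr._has_no_reference(1)
    arr = get_array(df, "a")

    df.iloc[0, 1] = 100  # mutate; with CoW this should not need a copy
    assert np.shares_memory(arr, get_array(df, "a"))
```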
df["b"] = 100 | ||
arr = get_array(df, "a") | ||
df.iloc[0, 0] = 100 # Check that we correctly removed the reference | ||
assert np.shares_memory(arr, get_array(df, "a")) |
It's actually not super clear to me what this test is targeting. Does the implementation of df.__setitem__
create an intermediary Block that might incorrectly have a reference? Or what's the trigger to explicitly test this?
(not that anything about the test is "wrong", just wondering ;))
Ah, setitem splits the blocks under the hood to avoid calling np.delete(). This is basically what we are also testing in the del df["a"] test above.
I want to make sure that the split blocks don't have any references on them if the parent block did not have any reference. The parent block went out of scope, so the new blocks should be clear of references as well.
Ah, OK, thanks for the explanation, wasn't thinking about the side effect of the non-inplace __setitem__
of splitting the block
df.iloc[0, 0] = 100  # Check that we correctly release the reference
if using_copy_on_write:
    mark = pytest.mark.xfail(
        reason="blk.delete does not track references correctly"
How is blk.delete involved here?
Same explanation as above: blk.delete splits the block during setitem -> all split blocks and the view are referencing each other -> the view goes out of scope -> theoretically the split blocks shouldn't have references anymore, because they don't share data with another block.
This is what we will have to optimise later.
I'll adjust the comment to better explain it.
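A rough reconstruction of that scenario (the exact setup of the xfail test is not shown above, so the view creation here is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
view = df[:]         # view shares the consolidated block with df
df["b"] = 100        # setitem splits the block via blk.delete
del view             # the only other owner of the data is gone ...
df.iloc[0, 0] = 100  # ... but the split pieces still see each other as
                     # references, so this write currently copies even though
                     # no data is actually shared anymore (the follow-up above)
```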
Quick comments on your top-comment:
Yeah, I mentioned the tracking in my top post as well. Would make things unnecessarily complex (and slower) imo if we only do this for CoW. Accessing it is what could cause a slowdown, and that is not done without CoW enabled.
Co-authored-by: Joris Van den Bossche <[email protected]>
pandas/_libs/internals.pyx
    bool
    """
    self.referenced_blocks = [
        obj for obj in self.referenced_blocks if obj() is not None
Maybe a naive implementation question: could a weakref.WeakSet be used to track the references, such that
- We don't unnecessarily add more than one weakref to the same block
- The set will prune itself of refs that don't exist anymore?
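A quick stdlib-only illustration of both points (the Block class here is just a placeholder so there is something to reference):

```python
import weakref


class Block:
    """Placeholder; weakref.WeakSet needs weak-referenceable items."""


refs = weakref.WeakSet()
b1, b2 = Block(), Block()
refs.add(b1)
refs.add(b1)      # adding the same block twice keeps a single entry
refs.add(b2)
print(len(refs))  # 2
del b2            # once the block is garbage collected ...
print(len(refs))  # 1 ... the set has pruned the dead reference by itself
```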
This should work, yes. When I implemented it I wasn't sure if we potentially want to use the same block in 2 different objects, which would have made a list necessary. But we agreed above that this is not a good idea in general.
Updated
But we agreed above that this is not a good idea in general.
While we agree we should ideally always use separate Block objects, changing this makes that a requirement, right? (Because before, you could have two identical Blocks in that list, which means they would always cause the data to be considered referenced, even if the Manager that references one of the two blocks goes out of scope. But at least it would ensure we don't incorrectly fail to trigger CoW in such cases.)
Yes, it would be a requirement.
But right now we are doing this in all cases where the CoW optimisations are enabled. If we decide this is no longer a good idea, we could change the implementation back to a list since this has no impact outside of the class.
Yep, and to be clear I am fine with it being a requirement (I think that makes it clearer), but then we should maybe add this expectation to the docs you started. Maybe something like:
Whenever a DataFrame or Series object is sharing data with another object, it is required that each of those objects has its own BlockManager and Block objects. In other words, one Block instance (that is held by a DataFrame, not necessarily for intermediate objects) should always be uniquely used for only a single DataFrame/Series object. For example, when you want to use the same Block for another object, you can create a shallow copy of the Block instance with block.copy(deep=False) (which will create a new Block instance with the same underlying values and which will correctly set up the references).
Ah got you, that's a good idea. Added your suggestion to the docs
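A small illustration of that doc note, poking at private internals (so this is only a sketch and may not hold across pandas versions):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
blk = df._mgr.blocks[0]

blk2 = blk.copy(deep=False)       # new Block instance, same underlying values
print(blk2.values is blk.values)  # True: the array itself is shared
print(blk2.refs is blk.refs)      # True: both blocks are registered in the
                                  # same BlockValuesRefs object (after this PR)
```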
…_for_cow # Conflicts: # pandas/core/internals/blocks.py
Merging now to unblock other stuff, can still optimise later if necessary
------------------

To be able to determine, if we have to make a copy when writing into a DataFrame,
we have to be aware, if the values are shared with another DataFrame. pandas
The comma after "aware" is unnecessary. Also after "determine" on L17.
We use a custom reference tracker object, ``BlockValuesRefs``, that keeps
track of every block, whose values share memory with each other. The reference
is held through a weak-reference. Every two blocks that share some memory should
two block -> pair of blocks
thx for the comments. opened #51552
This PR removes the reference tracking from the manager level and adds it to the block level. Tests are passing locally.
Right now we are tracking the references also in non-CoW mode, but we are only accessing the object in CoW mode, so it should not be harmful. Can also disable this, but wanted to avoid accessing using_copy_on_write() constantly.
Two things that the previous mechanism couldn't handle:
This triggered a copy, because the reference to the original df was kept alive. Same example: we were not able to free the memory of the old object if the result of an operation was assigned to the same variable.
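The "assigned to the same variable" case corresponds to test_assigning_to_same_variable_removes_references quoted earlier in this thread; a hedged reconstruction of the scenario:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df = df.reset_index()  # the original df object is no longer reachable

# With the old manager-level tracking, the result kept a hard "parent"
# reference to the original data, so that memory could not be freed and this
# write still triggered a copy; with block-level weakrefs it no longer does.
df.iloc[0, 1] = 100
```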