CoW: Push reference tracking down to the block level #51144
Conversation
This is looking great! And a nice simplification ;)
I will only have time on Monday (during the day) for a thorough review, just a few quick comments
One thing I am wondering: because it is now at the Block level, it might be more dtype / Block-subclass specific, and we should maybe parametrize our basic copy_view tests more with different categories of dtypes (eg at least numpy-based vs extension/nullable). (Something we should have done anyway; it can also be done in a follow-up/precursor, happy to take a look at that.)
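To make that suggestion concrete, a rough sketch of what such a dtype parametrization could look like (the fixture name, dtype list, and test body are illustrative, not pandas' actual fixtures or tests):

```python
import pandas as pd
import pytest


@pytest.fixture(params=["int64", "Int64", "float64", "Float64"])
def any_tracked_dtype(request):
    # spans both numpy-based and nullable extension dtypes
    return request.param


def test_copy_is_independent(any_tracked_dtype):
    df = pd.DataFrame({"a": [1, 2, 3]}, dtype=any_tracked_dtype)
    result = df.copy()
    result.iloc[0, 0] = 100
    # mutating the copy must never leak back into the original,
    # whichever dtype backend the underlying block uses
    assert df.iloc[0, 0] == 1
```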
@@ -59,8 +60,13 @@ class SharedBlock:
    _mgr_locs: BlockPlacement
    ndim: int
    values: ArrayLike
    refs: BlockValuesRefs
Minor naming nit: if we want to simplify this (so it "speaks" better), could also be "BlockRefs", since referencing the block vs the values is kind of equivalent?
Actually, that's maybe a good question: is it equivalent?
With the BlockManager tracking, the rule was that each DataFrame had its own unique manager, so one manager was never shared between different DataFrames. I assume similarly here, we always create new Blocks for new Managers/DataFrames (potentially just shallow copies). So if a certain values array is shared (referenced) by 3 DataFrames, there are 3 unique Block objects that hold the values? Or stated differently, is the referenced_blocks list always a list of weakrefs to unique blocks? Or can there ever be two weakrefs to the same Block instance? Now that this is done at the Block level, that could actually be OK to have?
Theoretically this is possible (having two references to the same block). I'd prefer using shallow copies though: the way it is implemented, tracking the new refs works automatically when you create a shallow copy, while otherwise you would have to explicitly register your own block with the references again. I think this would look a bit hacky.
My thinking was, we actually want to reference the values, e.g. same values living in different blocks, hence this name. But not too happy with it either, so not opposed to renaming. BlockRefs did not capture everything to me, that's why I ended up with this name
Theoretically this is possible (having two references to the same block). I'd prefer using shallow copies though,
Yes, agreed that we certainly intend to always use shallow copies (new blocks)
My thinking was, we actually want to reference the values, e.g. same values living in different blocks, hence this name. But not too happy with it either, so not opposed to renaming
Makes sense. Then another alternative is to leave out the Block and Values altogether, and do something with "Refs" (like the RefTracker). But that is very much name bikeshedding at this point and not too important ;)
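For readers following along, here is a minimal, self-contained sketch of the idea being discussed: one refs object per set of shared values, holding weakrefs to every Block that views them. This is an illustration only, not pandas' actual Cython implementation.

```python
import weakref


class Block:
    """Stand-in for a pandas Block; only here so we have something to weakref."""


class ValuesRefsSketch:
    """One instance per set of shared values; every Block viewing those
    values registers itself here through a weak reference."""

    def __init__(self, blk: Block) -> None:
        self.referenced_blocks = [weakref.ref(blk)]

    def add_reference(self, blk: Block) -> None:
        # called when another (shallow-copied) Block starts viewing the values
        self.referenced_blocks.append(weakref.ref(blk))

    def has_reference(self) -> bool:
        # prune dead weakrefs, then check whether any other block is still alive
        self.referenced_blocks = [
            ref for ref in self.referenced_blocks if ref() is not None
        ]
        return len(self.referenced_blocks) > 1


# Three "DataFrames" sharing one values array -> three distinct Block objects,
# all registered with the same refs object.
b1 = Block()
refs = ValuesRefsSketch(b1)
b2, b3 = Block(), Block()
refs.add_reference(b2)
refs.add_reference(b3)
print(refs.has_reference())  # True: other blocks view the same values
del b2, b3
print(refs.has_reference())  # False: only b1 itself is left
```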
# TODO(CoW): This is tricky, if parent block goes out of scope
# all split blocks are referencing each other even though they
# don't share data
refs = self.refs if self.refs.has_reference() else None
Can we create kind of a "clone" of the refs (duplicate it)? I.e. create a new refs object that holds new weakrefs to the same objects. That would ensure that both lists are independent, and the two split blocks don't reference each other.
This is doable (in theory), but not in this PR (it would make things way more complicated). The thing that is not clear to me with this solution is how we would update all parent blocks that share this reference object.
You could have:
a -> b -> c -> split happens -> d1, d2, d3, ...
When you clone the reference object, you would have to make sure that a, b and c are all aware of the new object. The simplest solution would probably be that a reference object can point to another reference object, which would live with the split blocks. These chains can get arbitrarily deep though, so we have to think the exact approach through.
The good thing about this class is that we can hide the implementation inside it without having to worry about it in other areas.
The thing that is not clear to me with this solution is how we would update all parent blocks that share this reference object
Ah, yes, that's an aspect that I didn't consider when writing the above comment, and that indeed makes it quite a bit more complicated ..
Shorter term, we could maybe implement a small optimization for the case where the original DataFrame doesn't have any references: then each of the sliced blocks will still have no foreign references?
In [1]: df = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
In [3]: del df['b']
In [4]: df._mgr
Out[4]:
BlockManager
Items: Index(['a', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
NumericBlock: slice(0, 1, 1), 1 x 10, dtype: float64
NumericBlock: slice(1, 2, 1), 1 x 10, dtype: float64
NumericBlock: slice(2, 3, 1), 1 x 10, dtype: float64
In [9]: df._mgr.blocks[0].refs.referenced_blocks
Out[9]:
[<weakref at 0x7fe48cc6ec20; dead>,
<weakref at 0x7fe48caaa0e0; to 'NumericBlock' at 0x7fe48cb8d9a0>,
<weakref at 0x7fe48cb82040; to 'NumericBlock' at 0x7fe48cb87160>,
<weakref at 0x7fe48cb82680; to 'NumericBlock' at 0x7fe48cb6f100>]
I'd prefer doing this in a follow-up. We are calling _slice_take_blocks_ax0 under the hood, which would have to avoid tracking references via a keyword. Not super difficult, but it would make the PR messier than necessary. But I will get a PR up before 2.0.
Yep, to be clear my "shorter term" was meant as "after this PR and if we don't directly have a better general solution for 2.0" ;)
if unfit_val_locs:
    unfit_idxr = np.concatenate(unfit_mgr_locs)
    unfit_count = len(unfit_idxr)

    new_blocks: list[Block] = []
    # TODO(CoW) is this always correct to assume that the new_blocks
    # are not referencing anything else?
There are some tests in copy_view/test_setitem.py that check we do copy the array-like values we are setting (not sure how complete they are though, given this is a beast of a method). And so because we copy upfront, that is currently fine regarding CoW (for setting full columns, which I think is what iset is used for?). Although a future optimization we can add is to actually delay the copy for a Series/DataFrame being set as new column(s). And then we will have to pass through its refs somehow to this method.
EDIT: Ah, I see it was my own comment that was moved ;)
Yep :) This is on my todo list as well, just moved it to a better place after the modification
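As an illustration of the behaviour discussed above (an assumed example; whether the arrays share memory depends on the pandas version and on whether the delayed-copy optimization has landed, so it is printed rather than asserted):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
df = pd.DataFrame({"a": [4.0, 5.0, 6.0]})
df["b"] = s  # at the time of this PR, the values are copied up front

# False with an up-front copy; the delayed-copy optimization discussed above
# would instead share s's values and register a reference, deferring the copy
print(np.shares_memory(df["b"].to_numpy(), s.to_numpy()))
```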
df["b"] = 100
arr = get_array(df, "a")
df.iloc[0, 0] = 100  # Check that we correctly track reference
The comment is a bit confusing, because it's testing that we are not tracking a reference.
Adjusted the comment
Some additional test ideas:
Agree with parametrising the tests, this was on my todo list as well. Would prefer to do it in a separate PR though.
Added tests for your two cases. The del right now sets up references such that we think there is another reference alive, because we are creating shallow copies under the hood. Is there a way that delete would actually require us to keep references alive? Otherwise we could fix this in idelete on the manager level.
All my comments are relatively minor things; I think we can certainly try to merge this shortly so other CoW PRs don't need to be held up.
Some comments I couldn't make inline:
- Did you check for all occurrences of Block.make_block() that the values are newly created and no refs need to be tracked? For example, _maybe_downcast doesn't necessarily copy the data (but I didn't check in detail where _maybe_downcast itself is being used; it might do that only in cases where we already have a copy). It's also used in astype, so this might be something to handle in an updated version of the PR that handles astype. From a quick look the rest seems OK.
- Block.convert() can currently return [self]. Should that be [self.copy(deep=False)]? Again, it might depend on where this is actually called. For example it is called from infer_objects, but that's also something for which we have a follow-up PR. Other cases where this is called from within other Block methods might already have copied beforehand.
- Can you add a comment to Block.iget() that it returns a view and that the caller needs to take care of keeping a reference if needed? (I did something similar for BlockManager.iget_values, can copy from there; a possible wording is sketched just below this comment.)
- Just a question: do we (eventually, later, not for this PR) move some of the CoW triggers from the manager to the block level as well?

Before merging: do we want to do the ref tracking only if CoW is enabled? I think that should be fine to keep as is (also before this PR we were creating the refs list of weakrefs on the BlockManager in all cases, while only using it when CoW is enabled; parent was only set if CoW was actually enabled, but that's because it was a hard reference. In this PR we only have weakrefs, so that should be fine).
And thanks for starting some developer-oriented documentation!
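For the Block.iget() item in the list above, the kind of note being asked for could look roughly like the docstring below. This is a toy stand-in class, not pandas' Block, and the wording is assumed (modeled on the existing BlockManager.iget_values comment):

```python
import numpy as np


class BlockSketch:
    """Toy stand-in for a pandas Block, only to illustrate the requested note."""

    def __init__(self, values: np.ndarray) -> None:
        self.values = values

    def iget(self, i: int) -> np.ndarray:
        """
        Return the values at position i.

        Warning: the returned array is a view on the block's values, no copy
        is made. The caller needs to take care of keeping a reference to the
        block (or copying) if the data must not be shared or mutated later.
        """
        return self.values[i]


blk = BlockSketch(np.arange(6.0).reshape(2, 3))
row = blk.iget(0)
row[0] = 99.0
print(blk.values[0, 0])  # 99.0 -- iget returned a view, not a copy
```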
@@ -663,7 +626,7 @@ def consolidate(self: T) -> T:
     if self.is_consolidated():
         return self

-    bm = type(self)(self.blocks, self.axes, self.refs, verify_integrity=False)
+    bm = type(self)(self.blocks, self.axes, verify_integrity=False)
Should we pass shallow copies of the blocks in self.blocks here to keep track of references?
_consolidate_inplace will not update all blocks, so this might create a new BlockManager with blocks that share data with the original BlockManager.
There isn't a call where this would matter right now. It's either coming from copy or from a function where we did a copy of the blocks anyway. At least I couldn't find any. Could add it though if you prefer.
def test_assigning_to_same_variable_removes_references(using_copy_on_write):
    df = DataFrame({"a": [1, 2, 3]})
    df = df.reset_index()
    arr = get_array(df, "a")
-    arr = get_array(df, "a")
+    assert df._mgr.blocks[0].refs.has_reference() is False
+    arr = get_array(df, "a")
That's more explicit, but maybe redundant with the mutation check? (and also dependent on the order of the blocks in the result)
(and also dependent on the order of the blocks in the result)
Could also be assert df._mgr._has_no_reference(1) to be robust against that.
I like the last one, will add that
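With that change, the adjusted test could look roughly as follows. This is a sketch only: the block/column position passed to _has_no_reference and the tail of the test (the mutation check mentioned above) are assumed, since they are not shown in the quoted snippet; DataFrame, get_array, np and the using_copy_on_write fixture come from the surrounding test module.

```python
def test_assigning_to_same_variable_removes_references(using_copy_on_write):
    df = DataFrame({"a": [1, 2, 3]})
    df = df.reset_index()
    # the previous df object went out of scope, so nothing should be referenced
    assert df._mgr._has_no_reference(1)
    arr = get_array(df, "a")

    df.iloc[0, 1] = 100  # mutate; with CoW this should not need a copy
    assert np.shares_memory(arr, get_array(df, "a"))
```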
df["b"] = 100 | ||
arr = get_array(df, "a") | ||
df.iloc[0, 0] = 100 # Check that we correctly removed the reference | ||
assert np.shares_memory(arr, get_array(df, "a")) |
It's actually not super clear to me what this test is targeting. Does the implementation of df.__setitem__
create an intermediary Block that might incorrectly have a reference? Or what's the trigger to explicitly test this?
(not that anything about the test is "wrong", just wondering ;))
Ah, setitem splits the blocks under the hood to avoid calling np.delete(). This is basically what we are also testing in the del df["a"] test above.
I want to make sure that the split blocks don't have any references on them if the parent block did not have any reference. The parent block went out of scope, so the new blocks should be clear of references as well.
Ah, OK, thanks for the explanation, wasn't thinking about the side effect of the non-inplace __setitem__
of splitting the block
df.iloc[0, 0] = 100  # Check that we correctly release the reference
if using_copy_on_write:
    mark = pytest.mark.xfail(
        reason="blk.delete does not track references correctly"
How is blk.delete involved here?
Same explanation as above: blk.delete splits the block during setitem -> all split blocks and the view are referencing each other -> the view goes out of scope -> theoretically the split blocks shouldn't have references anymore, because they don't share data with another block.
This is what we will have to optimise later.
I'll adjust the comment to better explain it.
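A rough reconstruction of that scenario (the exact setup of the xfail test is not shown above, so the view creation here is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
view = df[:]         # view shares the consolidated block with df
df["b"] = 100        # setitem splits the block via blk.delete
del view             # the only other owner of the data is gone ...
df.iloc[0, 0] = 100  # ... but the split pieces still see each other as
                     # references, so this write currently copies even though
                     # no data is actually shared anymore (the follow-up above)
```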
Quick comments on your top-comment:
Yeah, I mentioned the tracking in my top post as well. Would make things unnecessarily complex (and slower) imo if we only do this for CoW. Accessing it is what could cause a slowdown, and that is not done without CoW enabled.
Co-authored-by: Joris Van den Bossche <[email protected]>
pandas/_libs/internals.pyx
    bool
    """
    self.referenced_blocks = [
        obj for obj in self.referenced_blocks if obj() is not None
Maybe a naive implementation question: could a weakref.WeakSet be used to track the references, such that
- We don't unnecessarily add more than one weakref to the same block
- The set will prune itself of refs that don't exist anymore?
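A quick stdlib-only illustration of both points (the Block class here is just a placeholder so there is something to reference):

```python
import weakref


class Block:
    """Placeholder; weakref.WeakSet needs weak-referenceable items."""


refs = weakref.WeakSet()
b1, b2 = Block(), Block()
refs.add(b1)
refs.add(b1)      # adding the same block twice keeps a single entry
refs.add(b2)
print(len(refs))  # 2
del b2            # once the block is garbage collected ...
print(len(refs))  # 1 ... the set has pruned the dead reference by itself
```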
This should work, yes. When I implemented it I wasn't sure if we potentially want to use the same block in 2 different objects, which would have made a list necessary. But we agreed above that this is not a good idea in general.
Updated
But we agreed above that this is not a good idea in general.
While we agree we should ideally always use separate Block objects, changing this makes that a requirement, right? (Because before, you could have two identical Blocks in that list, which means they would always cause the data to be considered referenced, even if the Manager that references one of the two blocks goes out of scope. But at least it would ensure we don't incorrectly fail to trigger CoW in such cases.)
Yes, it would be a requirement.
But right now we are doing this in all cases where the CoW optimisations are enabled. If we decide this is no longer a good idea, we could change the implementation back to a list since this has no impact outside of the class.
Yep, and to be clear I am fine with it being a requirement (I think that makes it clearer), but then we should maybe add this expectation to the docs you started. Maybe something like:
Whenever a DataFrame or Series object is sharing data with another object, it is required that each of those objects has its own BlockManager and Block objects. In other words, one Block instance (that is held by a DataFrame, not necessarily for intermediate objects) should always be uniquely used for only a single DataFrame/Series object. For example, when you want to use the same Block for another object, you can create a shallow copy of the Block instance with block.copy(deep=False) (which will create a new Block instance with the same underlying values and which will correctly set up the references).
Ah got you, that's a good idea. Added your suggestion to the docs
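A small illustration of that doc note, poking at private internals (so this is only a sketch and may not hold across pandas versions):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
blk = df._mgr.blocks[0]

blk2 = blk.copy(deep=False)       # new Block instance, same underlying values
print(blk2.values is blk.values)  # True: the array itself is shared
print(blk2.refs is blk.refs)      # True: both blocks are registered in the
                                  # same BlockValuesRefs object (after this PR)
```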
…_for_cow # Conflicts: # pandas/core/internals/blocks.py
Merging now to unblock other stuff, can still optimise later if necessary
------------------

To be able to determine, if we have to make a copy when writing into a DataFrame,
we have to be aware, if the values are shared with another DataFrame. pandas
The comma after "aware" is unnecessary. Also after "determine" on L17.
We use a custom reference tracker object, ``BlockValuesRefs``, that keeps
track of every block, whose values share memory with each other. The reference
is held through a weak-reference. Every two blocks that share some memory should
two block -> pair of blocks
thx for the comments. opened #51552
This PR removes the reference tracking from the manager level and adds it to the block level. Tests are passing locally.
Right now we are tracking the references also in non-CoW mode, but we are only accessing the object in CoW mode, so it should not be harmful. Can also disable this, but wanted to avoid accessing using_copy_on_write() constantly.
Two things that the previous mechanism couldn't handle:
This triggered a copy, because the reference to the original df was kept alive. Same example: we were not able to free the memory of the old object if the result of an operation was assigned to the same variable.
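The "assigned to the same variable" case corresponds to test_assigning_to_same_variable_removes_references quoted earlier in this thread; a hedged reconstruction of the scenario:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df = df.reset_index()  # the original df object is no longer reachable

# With the old manager-level tracking, the result kept a hard "parent"
# reference to the original data, so that memory could not be freed and this
# write still triggered a copy; with block-level weakrefs it no longer does.
df.iloc[0, 1] = 100
```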