Skip to content

ENH: Use lazy copy in infer objects #50428

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
Feb 8, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Dec 24, 2022

@phofl phofl marked this pull request as draft December 24, 2022 17:33
@phofl phofl marked this pull request as ready for review January 8, 2023 10:42
@phofl
Copy link
Member Author

phofl commented Jan 8, 2023

@jorisvandenbossche This is somehow tricky to set up because we are splitting the blocks under the hood so there is no option to check if they are the same as the initial blocks. I pushed the reference tracking down to the block level, to make it work for now

Comment on lines 445 to 448
if not copy and using_copy_on_write():
copy = False
elif copy is None:
copy = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not copy and using_copy_on_write():
copy = False
elif copy is None:
copy = True
if copy is None:
if using_copy_on_write():
copy = False
else:
copy = True

I know this does exactly the same in the end and is not shorter, but personally I find it more readable (it's clearer that only copy=None case is handled to translate it to True/False depending on the setting, while in the current code you are basically also overwriting copy=False with copy=False).

(and this is also the same pattern as used elsewhere, so ideally keep things consistent either way or the other)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

copy = True

if self.is_single_block:
original_blocks = [self.blocks[0]] * self.shape[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the * self.shape[0] needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are splitting blocks so we need the original blocks references * the number of columns

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only in case of object dtype, right? (although I suppose this list multiplication is not necessarily expensive, so maybe not worth to avoid doing when not needed)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it's not that easy unfortunately,

Non object blocks are never copied, hence we need the references especially in this case. While object blocks might get copied (depends on the inference), so we might need the references there.

if using_copy_on_write:
result = self.copy(deep=False)
result._ref = weakref.ref( # type: ignore[attr-defined]
original_blocks[self.mgr_locs.as_array[0]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to say: "isn't original_blocks[..] just self?". But that's not the case because of maybe_split?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the block splitting is the problem here. self is a temporary block that is dead the moment we get back to the manager level.

@phofl phofl force-pushed the cow_infer_objects branch from c44d222 to 2e2ed0f Compare January 13, 2023 14:10
@@ -266,6 +268,7 @@ def getitem_block(self, slicer: slice | npt.NDArray[np.intp]) -> Block:

new_values = self._slice(slicer)
refs = self.refs if isinstance(slicer, slice) else None
refs = self.refs if isinstance(slicer, slice) else None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small leftover from the merge

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, thx, I am already going over them. There seem to be a couple of things off

@@ -431,7 +433,9 @@ def _maybe_downcast(self, blocks: list[Block], downcast=None) -> list[Block]:
if downcast is None:
return blocks

return extend_blocks([b._downcast_2d(downcast) for b in blocks])
return extend_blocks(
[b._downcast_2d(downcast) for b in blocks] # type: ignore[call-arg]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did something change here, or also a merge leftover?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same yes, Updated everything now, should be good to go.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice ;)

@@ -748,6 +748,57 @@ def test_head_tail(method, using_copy_on_write):
tm.assert_frame_equal(df, df_orig)


def test_infer_objects(using_copy_on_write):
df = DataFrame({"a": [1, 2], "b": "c", "c": 1, "d": "x"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth adding a test where one of the object dtype columns actually gets converted? (eg change d to "d": np.array([0, 1], dtype=object)) To ensure this column owns its memory (and cover that part of the code additions)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, adjusted d in the two tests below to cover both cases.

Good to go apart from that? convert needs cow for a couple of other things

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jorisvandenbossche jorisvandenbossche merged commit 7e88122 into pandas-dev:main Feb 8, 2023
@phofl phofl deleted the cow_infer_objects branch February 8, 2023 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants