-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Use lazy copy in infer objects #50428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# Conflicts: # pandas/tests/copy_view/test_methods.py
@jorisvandenbossche This is somehow tricky to set up because we are splitting the blocks under the hood so there is no option to check if they are the same as the initial blocks. I pushed the reference tracking down to the block level, to make it work for now |
pandas/core/internals/managers.py
Outdated
if not copy and using_copy_on_write(): | ||
copy = False | ||
elif copy is None: | ||
copy = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not copy and using_copy_on_write(): | |
copy = False | |
elif copy is None: | |
copy = True | |
if copy is None: | |
if using_copy_on_write(): | |
copy = False | |
else: | |
copy = True |
I know this does exactly the same in the end and is not shorter, but personally I find it more readable (it's clearer that only copy=None case is handled to translate it to True/False depending on the setting, while in the current code you are basically also overwriting copy=False with copy=False).
(and this is also the same pattern as used elsewhere, so ideally keep things consistent either way or the other)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
pandas/core/internals/managers.py
Outdated
copy = True | ||
|
||
if self.is_single_block: | ||
original_blocks = [self.blocks[0]] * self.shape[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the * self.shape[0]
needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are splitting blocks so we need the original blocks references * the number of columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only in case of object dtype, right? (although I suppose this list multiplication is not necessarily expensive, so maybe not worth to avoid doing when not needed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it's not that easy unfortunately,
Non object blocks are never copied, hence we need the references especially in this case. While object blocks might get copied (depends on the inference), so we might need the references there.
pandas/core/internals/blocks.py
Outdated
if using_copy_on_write: | ||
result = self.copy(deep=False) | ||
result._ref = weakref.ref( # type: ignore[attr-defined] | ||
original_blocks[self.mgr_locs.as_array[0]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to say: "isn't original_blocks[..]
just self
?". But that's not the case because of maybe_split
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the block splitting is the problem here. self is a temporary block that is dead the moment we get back to the manager level.
c44d222
to
2e2ed0f
Compare
# Conflicts: # doc/source/development/copy_on_write.rst # pandas/_libs/internals.pyx # pandas/core/internals/blocks.py
pandas/core/internals/blocks.py
Outdated
@@ -266,6 +268,7 @@ def getitem_block(self, slicer: slice | npt.NDArray[np.intp]) -> Block: | |||
|
|||
new_values = self._slice(slicer) | |||
refs = self.refs if isinstance(slicer, slice) else None | |||
refs = self.refs if isinstance(slicer, slice) else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small leftover from the merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, thx, I am already going over them. There seem to be a couple of things off
pandas/core/internals/blocks.py
Outdated
@@ -431,7 +433,9 @@ def _maybe_downcast(self, blocks: list[Block], downcast=None) -> list[Block]: | |||
if downcast is None: | |||
return blocks | |||
|
|||
return extend_blocks([b._downcast_2d(downcast) for b in blocks]) | |||
return extend_blocks( | |||
[b._downcast_2d(downcast) for b in blocks] # type: ignore[call-arg] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did something change here, or also a merge leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same yes, Updated everything now, should be good to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice ;)
@@ -748,6 +748,57 @@ def test_head_tail(method, using_copy_on_write): | |||
tm.assert_frame_equal(df, df_orig) | |||
|
|||
|
|||
def test_infer_objects(using_copy_on_write): | |||
df = DataFrame({"a": [1, 2], "b": "c", "c": 1, "d": "x"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth adding a test where one of the object dtype columns actually gets converted? (eg change d to "d": np.array([0, 1], dtype=object)
) To ensure this column owns its memory (and cover that part of the code additions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, adjusted d in the two tests below to cover both cases.
Good to go apart from that? convert
needs cow for a couple of other things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.