ENH: Use lazy copy in infer objects #50428

phofl · 2022-12-24T14:19:28Z

xref ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

# Conflicts: # pandas/tests/copy_view/test_methods.py

phofl · 2023-01-08T11:54:48Z

@jorisvandenbossche This is somewhat tricky to set up because we split the blocks under the hood, so there is no way to check whether they are the same as the initial blocks. I pushed the reference tracking down to the block level to make it work for now.

pandas/core/internals/array_manager.py

jorisvandenbossche · 2023-01-11T09:54:42Z

pandas/core/internals/managers.py

+        if not copy and using_copy_on_write():
+            copy = False
+        elif copy is None:
+            copy = True


Suggested change

-        if not copy and using_copy_on_write():
-            copy = False
-        elif copy is None:
-            copy = True
+        if copy is None:
+            if using_copy_on_write():
+                copy = False
+            else:
+                copy = True

I know this does exactly the same in the end and is not shorter, but personally I find it more readable: it is clearer that only the copy=None case is handled and translated to True/False depending on the setting, while the current code also overwrites copy=False with copy=False.

(This is also the same pattern as used elsewhere, so ideally keep things consistent one way or the other.)
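As a minimal sketch of the point made in this comment, the two ways of resolving `copy` can be written as standalone functions and compared over every input. The function names and the `cow_enabled` parameter are hypothetical; in the PR the setting is read through `using_copy_on_write()` rather than passed in.

```python
from itertools import product
from typing import Optional


def resolve_copy_original(copy: Optional[bool], cow_enabled: bool) -> Optional[bool]:
    """Form from the diff: the first branch also rewrites copy=False to False."""
    if not copy and cow_enabled:
        copy = False
    elif copy is None:
        copy = True
    return copy


def resolve_copy_suggested(copy: Optional[bool], cow_enabled: bool) -> Optional[bool]:
    """Suggested form: only the copy=None default is translated."""
    if copy is None:
        if cow_enabled:
            copy = False
        else:
            copy = True
    return copy


# Compare both forms for every combination of copy and the CoW setting.
results = {
    (copy, cow): (
        resolve_copy_original(copy, cow),
        resolve_copy_suggested(copy, cow),
    )
    for copy, cow in product([None, False, True], [False, True])
}
forms_agree = all(old == new for old, new in results.values())
```

Both functions return the same value for all six inputs, which matches the remark that the change is about readability rather than behavior.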

jorisvandenbossche · 2023-01-11T11:10:52Z

pandas/core/internals/managers.py

+            copy = True
+
+        if self.is_single_block:
+            original_blocks = [self.blocks[0]] * self.shape[0]


Why is the * self.shape[0] needed?

We are splitting the blocks, so we need a reference to the original block for each column, hence the multiplication by the number of columns.

Only in case of object dtype, right? (Although I suppose this list multiplication is not necessarily expensive, so it may not be worth avoiding when it isn't needed.)

No, it's not that easy, unfortunately. Non-object blocks are never copied, so we need the references especially in that case. Object blocks might get copied (depending on the inference), so we might need the references there as well.
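The list multiplication discussed in this thread can be illustrated with a small sketch. `ToyBlock` is a hypothetical stand-in, not the pandas `Block` class, and the per-column lookup is an assumption about how the list is consumed after splitting.

```python
class ToyBlock:
    """Hypothetical stand-in for a pandas Block; not the real class."""

    def __init__(self, dtype: str, n_columns: int) -> None:
        self.dtype = dtype
        self.n_columns = n_columns


# One consolidated block that holds three columns.
single_block = ToyBlock("float64", n_columns=3)

# List multiplication repeats the reference; it does not copy the block.
original_blocks = [single_block] * single_block.n_columns

# After a (hypothetical) split into one piece per column, the piece at
# position i can look up the block it came from by that same position.
column_position = 1
parent = original_blocks[column_position]

all_same_object = all(block is single_block for block in original_blocks)
```

Because every entry is the same object, the list costs only one pointer per column, which is consistent with the observation that the multiplication is cheap.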

jorisvandenbossche · 2023-01-11T11:10:57Z

pandas/core/internals/blocks.py

+            if using_copy_on_write:
+                result = self.copy(deep=False)
+                result._ref = weakref.ref(  # type: ignore[attr-defined]
+                    original_blocks[self.mgr_locs.as_array[0]]


I was going to say: "isn't original_blocks[..] just self?". But that's not the case because of maybe_split?

Yeah, the block splitting is the problem here: self is a temporary block that is dead the moment we get back to the manager level.
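A rough sketch of why a weak reference to a temporary object is of no use once that object goes out of scope, while a weak reference to a longer-lived original stays valid. `ToyBlock` and both helper functions are hypothetical and only mimic the situation described; they are not pandas internals.

```python
import gc
import weakref


class ToyBlock:
    """Hypothetical stand-in for a pandas Block; not the real class."""

    def __init__(self, label: str) -> None:
        self.label = label


def ref_to_temporary() -> "weakref.ref[ToyBlock]":
    # The temporary exists only inside this function, like a split piece
    # that is gone once control returns to the caller.
    temporary = ToyBlock("split piece")
    return weakref.ref(temporary)


def ref_to_original(original: ToyBlock) -> "weakref.ref[ToyBlock]":
    # The caller keeps the original alive, so the weak reference stays valid.
    return weakref.ref(original)


original = ToyBlock("original")
dead_ref = ref_to_temporary()
live_ref = ref_to_original(original)

gc.collect()  # make collection explicit on interpreters without refcounting

temporary_is_dead = dead_ref() is None
original_is_alive = live_ref() is original
```

This is the reason, under the assumptions above, for pointing the reference at an entry of `original_blocks` rather than at `self`.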

# Conflicts: # doc/source/development/copy_on_write.rst # pandas/_libs/internals.pyx # pandas/core/internals/blocks.py

jorisvandenbossche · 2023-02-08T09:07:09Z

pandas/core/internals/blocks.py

@@ -266,6 +268,7 @@ def getitem_block(self, slicer: slice | npt.NDArray[np.intp]) -> Block:

        new_values = self._slice(slicer)
        refs = self.refs if isinstance(slicer, slice) else None
+        refs = self.refs if isinstance(slicer, slice) else None


small leftover from the merge

Yep, thanks, I am already going over them. There seem to be a couple of things off.

jorisvandenbossche · 2023-02-08T09:10:20Z

pandas/core/internals/blocks.py

@@ -431,7 +433,9 @@ def _maybe_downcast(self, blocks: list[Block], downcast=None) -> list[Block]:
        if downcast is None:
            return blocks

-        return extend_blocks([b._downcast_2d(downcast) for b in blocks])
+        return extend_blocks(
+            [b._downcast_2d(downcast) for b in blocks]  # type: ignore[call-arg]


Did something change here, or also a merge leftover?

Same, yes. Updated everything now; it should be good to go.

jorisvandenbossche

This is nice ;)

jorisvandenbossche · 2023-02-08T09:19:58Z

pandas/tests/copy_view/test_methods.py

@@ -748,6 +748,57 @@ def test_head_tail(method, using_copy_on_write):
    tm.assert_frame_equal(df, df_orig)


+def test_infer_objects(using_copy_on_write):
+    df = DataFrame({"a": [1, 2], "b": "c", "c": 1, "d": "x"})


It might be worth adding a test where one of the object dtype columns actually gets converted (e.g. change d to "d": np.array([0, 1], dtype=object)), to ensure this column owns its memory (and to cover that part of the code additions).

Good point, adjusted d in the two tests below to cover both cases.

Good to go apart from that? convert needs CoW for a couple of other things.

jorisvandenbossche

Thanks!

phofl and others added 2 commits December 24, 2022 15:18

ENH: Use lazy copy in infer objects

3c4cee3

Update test_methods.py

db5dba1

phofl marked this pull request as draft December 24, 2022 17:33

jorisvandenbossche approved these changes Jan 3, 2023

jorisvandenbossche added the Copy / view semantics label Jan 3, 2023

jorisvandenbossche added this to the 2.0 milestone Jan 3, 2023

jorisvandenbossche mentioned this pull request Jan 3, 2023

ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473

Closed

73 tasks

phofl added 2 commits January 8, 2023 11:35

Merge remote-tracking branch 'upstream/main' into cow_infer_objects

9eb4d85

# Conflicts: # pandas/tests/copy_view/test_methods.py

Convert copy

081dd29

phofl marked this pull request as ready for review January 8, 2023 10:42

phofl added 2 commits January 8, 2023 12:51

Convert copy

76c443b

Remove setting

f7724ff

phofl added 6 commits January 8, 2023 13:38

Fix

a7b4e27

Fix typing

1d4f726

Revert unrelated change

018cfe6

Fix

1696d8a

Fix array manager

1782fbd

Fix type

8cb6355

jorisvandenbossche reviewed Jan 11, 2023

phofl and others added 4 commits January 12, 2023 17:00

Merge branch 'main' into cow_infer_objects

cead228

Fix copy array manager

47d85b3

Fix manager

a3d0a2b

Merge remote-tracking branch 'upstream/main' into cow_inf_obj

2e2ed0f

phofl force-pushed the cow_infer_objects branch from c44d222 to 2e2ed0f Compare January 13, 2023 14:10

Merge branch 'main' into cow_infer_objects

3a84382

jorisvandenbossche mentioned this pull request Jan 23, 2023

Handle CoW in BlockManager.apply #50948

Closed

phofl added 2 commits January 26, 2023 19:39

Merge remote-tracking branch 'upstream/main' into cow_infer_objects

64d550b

Refactor

716cef8

phofl mentioned this pull request Jan 31, 2023

ENH: Optimize replace to avoid copying when not necessary #50918

Merged

5 tasks

phofl added 7 commits January 31, 2023 21:09

Fix mypy

f693829

Move to cython

97fa214

Merge remote-tracking branch 'upstream/main' into cow_infer_objects

c3e4e66

Merge CoW ref tracking

5d687dd

Refactor for new ref tracking logic

f5bf65f

Merge remote-tracking branch 'upstream/main' into cow_infer_objects

af8af95

# Conflicts: # doc/source/development/copy_on_write.rst # pandas/_libs/internals.pyx # pandas/core/internals/blocks.py

Fix array manager diff

f47f6dd

jorisvandenbossche reviewed Feb 8, 2023

Fix merge conflicts

e357b64

jorisvandenbossche reviewed Feb 8, 2023

phofl added 2 commits February 8, 2023 10:10

Add todo

bf1bb3b

Add test

5624983

jorisvandenbossche reviewed Feb 8, 2023

phofl added 4 commits February 8, 2023 10:22

Adjust test

9a6d516

Adjust test

8b0e2b0

Adjust test

3f832ba

Fixup

9f11f0a

jorisvandenbossche approved these changes Feb 8, 2023

jorisvandenbossche merged commit 7e88122 into pandas-dev:main Feb 8, 2023

phofl deleted the cow_infer_objects branch February 8, 2023 14:58
