ENH: Avoid copying unnecessary columns in setitem by splitting blocks for CoW #51031

phofl · 2023-01-27T23:34:02Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

phofl · 2023-02-09T14:29:55Z

I think this is ready; I will look into the single-block case separately. The split allows us to copy only the columns that are modified.
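To illustrate the behavior described above, here is a minimal sketch that uses only public pandas and NumPy calls. It assumes pandas 2.0 or later with Copy-on-Write enabled; the `shares` helper and the `shared` dict are hypothetical names introduced for illustration and are not part of this PR.

```python
import numpy as np
import pandas as pd

# Assumption: pandas >= 2.0. Enabling the option makes CoW explicit on 2.x.
pd.options.mode.copy_on_write = True

# "a" and "b" are both int64, "c" is float64 (a multi-block frame).
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})
view = df[:]  # shares all data with df until one of the two is written to

df.loc[0, "a"] = 100  # write into a single column of df


def shares(col):
    # Hypothetical helper: do df and view still point at the same buffer?
    return np.shares_memory(df[col].to_numpy(), view[col].to_numpy())


shared = {col: shares(col) for col in df.columns}
```

With the block split in place, only column "a" should stop sharing memory with `view`; "b" and "c" should still be shared, and `view` keeps its original values.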

jorisvandenbossche

I haven't looked (previously) at the _iset_split_block logic in detail, but a few minor comments/questions

jorisvandenbossche · 2023-02-09T23:07:31Z

pandas/tests/copy_view/test_indexing.py

+)
+def test_set_value_loc_copy_only_necessary_column(using_copy_on_write, method, val):
+    # Only copy column that is modified, multi-block only for now
+    df = DataFrame({"a": [1, 2, 3], "b": 1, "c": 1.5})


Suggested change

-    df = DataFrame({"a": [1, 2, 3], "b": 1, "c": 1.5})
+    df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})

(small nitpick, but almost every test in this file is already using that)

jorisvandenbossche · 2023-02-10T09:37:38Z

pandas/core/internals/managers.py

+                values = values.copy()
+            else:
+                values = values[[blk_loc]]
+            self._iset_split_block(blkno, [blk_loc], values)


Do we still need to call this for the values.ndim == 1 case that already copied the values, or does this also just do the "wrapping in block and updating self.blocks" for that case?

And _iset_split_block assumes that values is always already copied, right?

Yes and Yes

Not calling it would duplicate the logic
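One NumPy detail behind the `else` branch in the snippet above: indexing with a list, as in `values[[blk_loc]]`, is advanced indexing, which always returns a copy and keeps two dimensions. That is presumably why no explicit `.copy()` is needed there. The sketch below uses a toy array in place of real block values; `block_values`, `picked`, and `rest` are illustrative names, not pandas internals.

```python
import numpy as np

# Toy stand-in for the values of a 2D block: each row holds one column.
block_values = np.array([[1, 2, 3], [4, 5, 6]])
blk_loc = 0  # position of the column to modify inside the block

# A list indexer triggers advanced indexing: the result is a fresh copy
# and keeps ndim == 2.
picked = block_values[[blk_loc]]

# A slice uses basic indexing: the result is a view on the same memory.
rest = block_values[blk_loc + 1:]

# Writing to the copy leaves the original block untouched.
picked[0, 0] = 100
```

The copied row can then be modified freely, while the remaining rows can stay as views on the original data.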

pandas/core/internals/managers.py

jorisvandenbossche · 2023-02-10T09:43:13Z

pandas/tests/copy_view/test_indexing.py

+def test_set_value_loc_copy_only_necessary_column(using_copy_on_write, method, val):
+    # Only copy column that is modified, multi-block only for now


Suggested change

-def test_set_value_loc_copy_only_necessary_column(using_copy_on_write, method, val):
-    # Only copy column that is modified, multi-block only for now
+def test_set_value_copy_only_necessary_column(using_copy_on_write, method, val):
+    # When setting inplace, only copy column that is modified instead of the whole
+    # block (by splitting the block)
+    # TODO multi-block only for now

jorisvandenbossche · 2023-02-10T09:46:32Z

pandas/tests/copy_view/test_indexing.py

+    df_orig = df.copy()
+    view = df[:]
+
+    df.loc[0, "a"] = val


Should this use method? Hmm, but that doesn't work for setitem. So you will have to parametrize with tuples of (method, indexer), so that you can do method(df)[indexer] = val?

Yeah that's probably why I missed it. That's a pita...
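A rough sketch of the (method, indexer) pairing suggested above, written as a plain loop so it runs without a test runner; in the test suite the same list would be passed to `pytest.mark.parametrize`. The names `CASES` and `run_case` and the chosen indexers are assumptions for illustration, not the parametrization that was merged. It assumes pandas 2.0 or later with Copy-on-Write enabled.

```python
import pandas as pd

# Assumption: pandas >= 2.0. Enabling the option makes CoW explicit on 2.x.
pd.options.mode.copy_on_write = True

# Each case pairs an accessor ("method") with an indexer it understands.
CASES = [
    (lambda df: df.loc, (0, "a")),  # label-based scalar set
    (lambda df: df.iloc, (0, 0)),   # position-based scalar set
    (lambda df: df, "a"),           # plain setitem, replaces column "a"
]


def run_case(method, indexer, val=100):
    # Hypothetical helper: build a fresh frame, take a view, then set.
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})
    df_orig = df.copy()
    view = df[:]
    method(df)[indexer] = val
    return df, view, df_orig


results = [run_case(method, indexer) for method, indexer in CASES]
```

Pairing the accessor with its indexer keeps one test body for loc, iloc, and setitem, since each of them needs a differently shaped key.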

pandas/core/internals/managers.py

pandas/tests/indexing/test_iloc.py

jorisvandenbossche

Apart from the 2 inline suggestions, looks good to me!

phofl added 4 commits January 20, 2023 15:39

CLN: Refactor block splitting into its own method

b0812c2

Add docs

0ee8783

ENH: Avoid copying unnecessary columns in indexing for CoW

3944655

Merge remote-tracking branch 'upstream/main' into loc_avoid_copy

23b5fbe

# Conflicts: # pandas/core/internals/managers.py # pandas/tests/internals/test_internals.py

phofl added the Copy / view semantics label Jan 27, 2023

phofl added 6 commits January 27, 2023 20:22

Fix mypy

33eb278

Merge remote-tracking branch 'upstream/main' into loc_avoid_copy

297c08f

# Conflicts: # pandas/core/internals/managers.py

Merge new ref tracking

331232e

Remove elif

cae0170

Merge remote-tracking branch 'upstream/main' into loc_avoid_copy

24b58ab

Fix

26f1322

phofl requested a review from jorisvandenbossche February 9, 2023 14:29

jorisvandenbossche changed the title ~~ENH: Avoid copying unnecessary columns in indexing for CoW~~ ENH: Avoid copying unnecessary columns in setitem by splitting blocks for CoW Feb 10, 2023

jorisvandenbossche reviewed Feb 10, 2023

jorisvandenbossche mentioned this pull request Feb 10, 2023

Copy-on-Write (PDEP-7) follow-up overview issue #48998

Open

38 tasks

jorisvandenbossche added this to the 2.0 milestone Feb 10, 2023

jorisvandenbossche reviewed Feb 10, 2023

phofl added 2 commits February 10, 2023 12:21

Clean up

697ddb8

Clean

6b6eb63

jorisvandenbossche reviewed Feb 10, 2023

pandas/core/internals/managers.py (outdated, resolved)

jorisvandenbossche reviewed Feb 10, 2023

pandas/tests/indexing/test_iloc.py (outdated, resolved)

jorisvandenbossche approved these changes Feb 10, 2023

Fix

fb623cc

phofl merged commit c74d057 into pandas-dev:main Feb 10, 2023

phofl deleted the loc_avoid_copy branch February 10, 2023 17:55

phofl added a commit to phofl/pandas that referenced this pull request Feb 10, 2023

ENH: Avoid copying unnecessary columns in setitem by splitting blocks…

f5125c9

… for CoW (pandas-dev#51031)


