CoW: Delay copy when setting Series into DataFrame #51698

phofl · 2023-02-28T22:41:58Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

# Conflicts: # pandas/core/frame.py

phofl · 2023-03-15T15:18:24Z

This tackles one case for now; I will add the other cases step by step, based on the strategy here.

I would not backport this, since it's unlikely that we'll finish everything before 2.0 comes out.
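
For context, a minimal sketch of the user-visible behaviour this PR targets, assuming a pandas version where the mode.copy_on_write option can be set (the frame and series contents are made up for illustration). Whether the two objects share memory right after the assignment depends on the pandas version and on whether Copy-on-Write is active, so the sketch prints that result instead of asserting it.

```python
import warnings

import numpy as np
import pandas as pd

# Try to enable Copy-on-Write. Failures and warnings are ignored because the
# option may be missing in older versions or deprecated in newer ones.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        pd.set_option("mode.copy_on_write", True)
    except Exception:
        pass

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
ser = pd.Series([7, 8, 9])

# Adding a new column from an existing Series.
df["c"] = ser

# With a delayed copy, the column may still point at the Series' buffer here.
print(
    "shares memory after setitem:",
    np.shares_memory(df["c"].to_numpy(), ser.to_numpy()),
)

# Writing to the Series must never leak into the DataFrame.
ser.iloc[0] = 100
print(df["c"].tolist())
```

Only the timing of the copy is meant to change: the DataFrame column keeps its original values either way.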

jorisvandenbossche

Looks good to me!
(This passes the refs object through in quite a few places, but I don't think there is any way around that.)

jorisvandenbossche · 2023-03-15T21:31:38Z

pandas/tests/copy_view/test_indexing.py

@@ -977,7 +977,7 @@ def test_column_as_series_no_item_cache(
 # TODO add tests for other indexing methods on the Series


-def test_dataframe_add_column_from_series(backend):
+def test_dataframe_add_column_from_series(backend, using_copy_on_write):
    # Case: adding a new column to a DataFrame from an existing column/series
    # -> always already takes a copy on assignment
    # (no change in behaviour here)


You can update the TODO on the line below.

jorisvandenbossche · 2023-03-15T21:33:08Z

pandas/tests/copy_view/test_setitem.py

    else:
        # the data is copied
        assert not np.shares_memory(df["c"].values, df2["c"].values)

    # and modifying the set DataFrame does not modify the original DataFrame
    df2.iloc[0, 0] = 0
    tm.assert_series_equal(df["c"], Series([7, 8, 9], name="c"))
+
+
+def test_setitem_series_no_copy(using_copy_on_write):


How is this test different from the other ones above that you edited?

I missed that, but there is actually a difference: the existing test is a bit buggy. I will copy this one over and modify the other one as appropriate.

Ah no, I'll keep it here and modify the other test.

Regarding the other test: as soon as you update s, a copy is triggered. Updating the df afterwards can then never have an effect, provided the initial copy worked, because the two no longer share data. The df update is now handled by the new test.

Regarding the different cases: overwriting an existing column takes a different code path than adding a new one, so we have to test both.
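
The two points above can be sketched as a small check using only the public setitem API. The helper name set_then_modify_frame and the data are hypothetical, not taken from the test suite. Each case starts from fresh objects, so the write to the DataFrame is the first write after the assignment, and both code paths (new column, existing column) are exercised. The assertions should hold with or without Copy-on-Write enabled, since only the timing of the copy differs.

```python
import pandas as pd


def set_then_modify_frame(column):
    """Set a Series as `column`, then write to the DataFrame only."""
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    ser = pd.Series([7, 8, 9])
    df[column] = ser         # "c" adds a new column, "a" overwrites one
    df.loc[0, column] = 100  # first write after the assignment
    return df, ser


for column in ["c", "a"]:
    df, ser = set_then_modify_frame(column)
    # The Series must be unchanged whichever code path was taken.
    assert ser.tolist() == [7, 8, 9]
    assert df[column].tolist() == [100, 8, 9]
```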

github-actions · 2023-04-15T00:05:34Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

jbrockmendel · 2023-04-21T17:55:46Z

pandas/core/frame.py

@@ -3944,11 +3945,11 @@ def isetitem(self, loc, value) -> None:
                )

            for i, idx in enumerate(loc):
-                arraylike = self._sanitize_column(value.iloc[:, i])
+                arraylike, _ = self._sanitize_column(value.iloc[:, i])


Abstraction-wise, I've been thinking of refs as being handled at the Manager/Block level. Is this wrong? Is there a viable way to keep this at that level?

Not unless we also want to move _sanitize_column to that level. We have to keep track of the refs when we extract the array from the value being set.
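
To illustrate why the refs have to travel together with the extracted array, here is a toy model in plain Python. It is not the pandas implementation: ToyRefs, ToyBlock and toy_sanitize_column are hypothetical names, and lists stand in for arrays. The point it sketches is that the function extracting the values returns the refs alongside them, so the receiving block knows the buffer is shared and can copy before its first write.

```python
import weakref


class ToyRefs:
    """Toy stand-in for a refs object: remembers which blocks share a buffer."""

    def __init__(self):
        self._blocks = []

    def add(self, block):
        self._blocks.append(weakref.ref(block))

    def release(self, block):
        # Forget one block (and any blocks that were garbage collected).
        self._blocks = [
            r for r in self._blocks if r() is not None and r() is not block
        ]

    def has_reference(self):
        # More than one live block means the buffer is shared.
        return sum(r() is not None for r in self._blocks) > 1


class ToyBlock:
    def __init__(self, values, refs=None):
        self.values = values
        self.refs = refs if refs is not None else ToyRefs()
        self.refs.add(self)

    def setitem(self, pos, new_value):
        if self.refs.has_reference():
            # Delayed copy: detach from the shared buffer before writing.
            self.refs.release(self)
            self.values = list(self.values)
            self.refs = ToyRefs()
            self.refs.add(self)
        self.values[pos] = new_value


def toy_sanitize_column(source_block):
    # Hand back the values *and* their refs, without copying.
    return source_block.values, source_block.refs


series_block = ToyBlock([7, 8, 9])
values, refs = toy_sanitize_column(series_block)
frame_block = ToyBlock(values, refs)  # the frame now shares the buffer
shared_before_write = frame_block.values is series_block.values

frame_block.setitem(0, 100)  # triggers the delayed copy
print(shared_before_write, series_block.values, frame_block.values)
# True [7, 8, 9] [100, 8, 9]
```

If toy_sanitize_column returned only the values, frame_block would start with an empty refs object and the write would silently modify the series' data.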

jorisvandenbossche

Looks good to me!

I started to add some minor doc comments, but decided to just push them instead; hope you don't mind.

jorisvandenbossche · 2023-05-05T11:01:08Z

pandas/core/frame.py

@@ -11917,4 +11933,4 @@ def _reindex_for_setitem(value: DataFrame | Series, index: Index) -> ArrayLike:
        raise TypeError(
            "incompatible index of inserted column with frame index"
        ) from err
-    return reindexed_value
+    return reindexed_value, None


Should we return value.reindex(index)._references here, since in theory reindex can now return a view in certain corner cases?

jorisvandenbossche · 2023-05-05T11:05:14Z

pandas/core/frame.py

@@ -4166,8 +4173,9 @@ def _set_item(self, key, value) -> None:
                existing_piece = self[key]
                if isinstance(existing_piece, DataFrame):
                    value = np.tile(value, (len(existing_piece.columns), 1)).T
+                    refs = None


I was wondering if this would still be a view in case we are only setting a single column (i.e. len(existing_piece.columns) is equal to 1). But it seems that also in that case np.tile consistently returns a copy, so that should be fine.
(it might still be an opportunity for an optimization, although for a corner case)
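
A quick check of the single-column case, as a sketch with made-up values (n_columns stands in for len(existing_piece.columns)):

```python
import numpy as np

value = np.array([1, 2, 3])
n_columns = 1  # only one existing column under the duplicated key

# Same expression as in _set_item, with a single repetition.
tiled = np.tile(value, (n_columns, 1)).T

print(tiled.shape)                     # (3, 1)
print(np.shares_memory(value, tiled))  # False
```

Since the tiled array does not share memory with the input, setting refs = None for this branch is consistent.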

jorisvandenbossche · 2023-05-05T11:18:40Z

pandas/tests/copy_view/test_setitem.py

+
+def test_setitem_series_no_copy_split_block(using_copy_on_write):
+    # Overwriting an existing column that is part of a larger block
+    df = DataFrame({"a": [1, 2, 3], "b": 1})


I was just thinking about the other issue, about no longer using copy=True as the default for dict input (#52967): test cases like the one above could stop testing what they are actually meant to test.
I know that here it's not a dict of Series, so this example might not change, but writing it as DataFrame(np.array(..), columns=["a", "b"]) might be more robust?

Or set copy=True explicitly?

jorisvandenbossche · 2023-05-05T11:20:38Z

Best to get this one in first, before looking at #52062 / #52051 / #52061, right?

phofl · 2023-05-05T11:22:24Z

Yep

jorisvandenbossche · 2023-05-05T11:22:42Z

pandas/tests/copy_view/test_indexing.py

-    # editing column in frame -> doesn't modify series
-    df.loc[2, "new"] = 100
-    expected_s = Series([0, 11, 12])
-    tm.assert_series_equal(s, expected_s)


Was there a reason to remove this? Or just that testing the modification in both ways ideally is done by recreating the original df, because the second modification / CoW-trigger is already influenced by the first?

Yeah, the copy was already triggered in the step before, so this part of the test was useless, since the objects didn't share memory anyway.

phofl · 2023-05-05T15:25:15Z

Rebased all the other PRs.

phofl added 3 commits February 26, 2023 01:35

CoW: PoC for avoiding copy when setting values for rhs

b5d4f84

Merge remote-tracking branch 'upstream/main' into setitem_series_no_copy

93a5d6c

# Conflicts: # pandas/core/frame.py

Update tests

e3a50dd

phofl marked this pull request as draft February 28, 2023 22:42

phofl added 9 commits March 1, 2023 00:50

Fix array manager

290facd

Merge remote-tracking branch 'upstream/main' into setitem_series_no_copy

03ca70b

# Conflicts: # pandas/core/frame.py

Add type hint

d75f46a

Remove unnecessary

c61dffb

Fix typing

7bdff6b

Add tests

90777f5

Add tests

46256fe

Move tests

2bc836b

Add midx test

14d77da

phofl changed the title ~~POC: Avoid copying rhs when setting series into DataFrame~~ CoW: Delay copy when setting Series into DataFrame Mar 15, 2023

phofl marked this pull request as ready for review March 15, 2023 15:17

phofl requested a review from jorisvandenbossche March 15, 2023 15:18

Fix missing arg

fcb0bb4

jorisvandenbossche added the Copy / view semantics label Mar 15, 2023

jorisvandenbossche reviewed Mar 15, 2023


Remove

d6ed1cc

This was referenced Mar 17, 2023

CoW: Delay copy when setting Series or Frame with isetitem #52051

Merged

CoW: Delay copy when inserting Series into DataFrame #52061

Merged

github-actions bot added the Stale label Apr 15, 2023

jbrockmendel reviewed Apr 21, 2023


phofl added 2 commits April 22, 2023 01:20

Merge branch 'main' into setitem_series_no_copy

eca3446

Merge branch 'main' into setitem_series_no_copy

f86f077

phofl removed the Stale label May 5, 2023

phofl requested a review from jorisvandenbossche May 5, 2023 10:12

jorisvandenbossche added 2 commits May 5, 2023 12:52

add some more comments to the tests

1997e3f

update some docstrings

0e3d3f7

jorisvandenbossche reviewed May 5, 2023


jorisvandenbossche approved these changes May 5, 2023


jorisvandenbossche reviewed May 5, 2023


phofl merged commit 0dc6fdf into pandas-dev:main May 5, 2023

phofl deleted the setitem_series_no_copy branch May 5, 2023 15:19

phofl added this to the 2.1 milestone May 5, 2023

topper-123 pushed a commit to topper-123/pandas that referenced this pull request May 7, 2023

CoW: Delay copy when setting Series into DataFrame (pandas-dev#51698)

5bdd17c

lithomas1 mentioned this pull request May 14, 2023

Backport PR #53189 on branch 2.0.x (BUG/CoW: Series.rename not making a lazy copy when passed a scalar) #53221

Merged

Rylie-W pushed a commit to Rylie-W/pandas that referenced this pull request May 19, 2023

CoW: Delay copy when setting Series into DataFrame (pandas-dev#51698)

aa5a01f

jorisvandenbossche mentioned this pull request Jun 9, 2023

CLN: Cleanup after CoW setitem PRs #53142

Merged

5 tasks

Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023

CoW: Delay copy when setting Series into DataFrame (pandas-dev#51698)

e85792f
