Skip to content

CoW: Delay copy when setting Series into DataFrame #51698

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
May 5, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Feb 28, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@phofl phofl marked this pull request as draft February 28, 2023 22:42
@phofl phofl changed the title POC: Avoid copying rhs when setting series into DataFrame CoW: Delay copy when setting Series into DataFrame Mar 15, 2023
@phofl phofl marked this pull request as ready for review March 15, 2023 15:17
@phofl
Copy link
Member Author

phofl commented Mar 15, 2023

This tackles one case for now, will add other cases step by step based on the strategy here.

Would not backport this, since it's unlikely that we'll finish everything till 2.0 comes out

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!
(it will be passing through this refs object in quite some places, but I don't think there is any way around that)

@@ -977,7 +977,7 @@ def test_column_as_series_no_item_cache(
# TODO add tests for other indexing methods on the Series


def test_dataframe_add_column_from_series(backend):
def test_dataframe_add_column_from_series(backend, using_copy_on_write):
# Case: adding a new column to a DataFrame from an existing column/series
# -> always already takes a copy on assignment
# (no change in behaviour here)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can update the TODO on the line below here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

else:
# the data is copied
assert not np.shares_memory(df["c"].values, df2["c"].values)

# and modifying the set DataFrame does not modify the original DataFrame
df2.iloc[0, 0] = 0
tm.assert_series_equal(df["c"], Series([7, 8, 9], name="c"))


def test_setitem_series_no_copy(using_copy_on_write):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this test different from the other ones above that you edited?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed that, but there is actually a difference, the existing test is a bit buggy, I will copy this one over and modify the other one as appropriate

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah no I keep it here and modify the other test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the other test: As soon as you update s, a copy is triggered and updating the df afterwards never has an effect if the initial copy worked because they don't share data anymore. The df-update is now handled by the new test.

Regarding the different cases: overwriting an existing column takes a different code path than adding a new one, so we have to test both

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 15, 2023
@@ -3944,11 +3945,11 @@ def isetitem(self, loc, value) -> None:
)

for i, idx in enumerate(loc):
arraylike = self._sanitize_column(value.iloc[:, i])
arraylike, _ = self._sanitize_column(value.iloc[:, i])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

abstraction-wise ive been thinking of refs as being handled at the Manager/Block level. is this wrong? is there a viable way to keep this at that level?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not if we don't want to move sanitise column to the same level. We have to keep track of the refs when we extract the array from values to set

@phofl phofl removed the Stale label May 5, 2023
@phofl phofl requested a review from jorisvandenbossche May 5, 2023 10:12
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

I started to add some minor doc comments, but decided to just push them instead, hope you don't mind

@@ -11917,4 +11933,4 @@ def _reindex_for_setitem(value: DataFrame | Series, index: Index) -> ArrayLike:
raise TypeError(
"incompatible index of inserted column with frame index"
) from err
return reindexed_value
return reindexed_value, None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return here the result of value.reindex(index)._references, since in theory reindex can now return a view in certain corner cases?

@@ -4166,8 +4173,9 @@ def _set_item(self, key, value) -> None:
existing_piece = self[key]
if isinstance(existing_piece, DataFrame):
value = np.tile(value, (len(existing_piece.columns), 1)).T
refs = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if this would still be a view in case we are only setting a single column (i.e. len(existing_piece.columns) is equal to 1). But it seems that also in that case np.tile consistently returns a copy, so that should be fine.
(it might still be an opportunity for an optimization, although for a corner case)


def test_setitem_series_no_copy_split_block(using_copy_on_write):
# Overwriting an existing column that is part of a larger block
df = DataFrame({"a": [1, 2, 3], "b": 1})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just thinking about the other issue to no longer use copy=True as default for dict (#52967), that such test cases as the above could start to no longer test what it is actually meant to test ..
I know that here it's not a dict of Series, so this example might not change, but writing it as DataFrame(np.array(..), columns=["a", "b"]) might be more robust?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or set copy true explicitly?

@jorisvandenbossche
Copy link
Member

Best to get this one in first, before looking at #52062 / #52051 / #52061, right?

@phofl
Copy link
Member Author

phofl commented May 5, 2023

Yep

# editing column in frame -> doesn't modify series
df.loc[2, "new"] = 100
expected_s = Series([0, 11, 12])
tm.assert_series_equal(s, expected_s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a reason to remove this? Or just that testing the modification in both ways ideally is done by recreating the original df, because the second modification / CoW-trigger is already influenced by the first?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the copy was already triggered in the step before, so this part of the test was useless since the objects didn't share memory anyway

@phofl phofl merged commit 0dc6fdf into pandas-dev:main May 5, 2023
@phofl phofl deleted the setitem_series_no_copy branch May 5, 2023 15:19
@phofl
Copy link
Member Author

phofl commented May 5, 2023

Rebased all the other prs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants