-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
CoW: Delay copy when setting Series into DataFrame #51698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# Conflicts: # pandas/core/frame.py
This tackles one case for now, will add other cases step by step based on the strategy here. Would not backport this, since it's unlikely that we'll finish everything till 2.0 comes out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
(it will be passing through this refs object in quite some places, but I don't think there is any way around that)
@@ -977,7 +977,7 @@ def test_column_as_series_no_item_cache( | |||
# TODO add tests for other indexing methods on the Series | |||
|
|||
|
|||
def test_dataframe_add_column_from_series(backend): | |||
def test_dataframe_add_column_from_series(backend, using_copy_on_write): | |||
# Case: adding a new column to a DataFrame from an existing column/series | |||
# -> always already takes a copy on assignment | |||
# (no change in behaviour here) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can update the TODO on the line below here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
else: | ||
# the data is copied | ||
assert not np.shares_memory(df["c"].values, df2["c"].values) | ||
|
||
# and modifying the set DataFrame does not modify the original DataFrame | ||
df2.iloc[0, 0] = 0 | ||
tm.assert_series_equal(df["c"], Series([7, 8, 9], name="c")) | ||
|
||
|
||
def test_setitem_series_no_copy(using_copy_on_write): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this test different from the other ones above that you edited?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed that, but there is actually a difference, the existing test is a bit buggy, I will copy this one over and modify the other one as appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah no I keep it here and modify the other test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the other test: As soon as you update s
, a copy is triggered and updating the df afterwards never has an effect if the initial copy worked because they don't share data anymore. The df-update is now handled by the new test.
Regarding the different cases: overwriting an existing column takes a different code path than adding a new one, so we have to test both
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
@@ -3944,11 +3945,11 @@ def isetitem(self, loc, value) -> None: | |||
) | |||
|
|||
for i, idx in enumerate(loc): | |||
arraylike = self._sanitize_column(value.iloc[:, i]) | |||
arraylike, _ = self._sanitize_column(value.iloc[:, i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
abstraction-wise ive been thinking of refs as being handled at the Manager/Block level. is this wrong? is there a viable way to keep this at that level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not if we don't want to move sanitise column to the same level. We have to keep track of the refs when we extract the array from values to set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
I started to add some minor doc comments, but decided to just push them instead, hope you don't mind
@@ -11917,4 +11933,4 @@ def _reindex_for_setitem(value: DataFrame | Series, index: Index) -> ArrayLike: | |||
raise TypeError( | |||
"incompatible index of inserted column with frame index" | |||
) from err | |||
return reindexed_value | |||
return reindexed_value, None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return here the result of value.reindex(index)._references
, since in theory reindex
can now return a view in certain corner cases?
@@ -4166,8 +4173,9 @@ def _set_item(self, key, value) -> None: | |||
existing_piece = self[key] | |||
if isinstance(existing_piece, DataFrame): | |||
value = np.tile(value, (len(existing_piece.columns), 1)).T | |||
refs = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if this would still be a view in case we are only setting a single column (i.e. len(existing_piece.columns)
is equal to 1). But it seems that also in that case np.tile
consistently returns a copy, so that should be fine.
(it might still be an opportunity for an optimization, although for a corner case)
|
||
def test_setitem_series_no_copy_split_block(using_copy_on_write): | ||
# Overwriting an existing column that is part of a larger block | ||
df = DataFrame({"a": [1, 2, 3], "b": 1}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just thinking about the other issue to no longer use copy=True as default for dict (#52967), that such test cases as the above could start to no longer test what it is actually meant to test ..
I know that here it's not a dict of Series, so this example might not change, but writing it as DataFrame(np.array(..), columns=["a", "b"])
might be more robust?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or set copy true explicitly?
Yep |
# editing column in frame -> doesn't modify series | ||
df.loc[2, "new"] = 100 | ||
expected_s = Series([0, 11, 12]) | ||
tm.assert_series_equal(s, expected_s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there a reason to remove this? Or just that testing the modification in both ways ideally is done by recreating the original df
, because the second modification / CoW-trigger is already influenced by the first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the copy was already triggered in the step before, so this part of the test was useless since the objects didn't share memory anyway
Rebased all the other prs |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.