CoW: Add warning for replace with inplace #56060

Merged
phofl merged 5 commits into pandas-dev:main on Nov 20, 2023

Conversation

Member

@phofl phofl commented Nov 19, 2023

xref #56019

@phofl phofl added the Warnings (Warnings that appear or should be added to pandas) and Copy / view semantics labels Nov 19, 2023
Comment on lines 7780 to 7781
if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):
    ref_count += 1

Member

The problem might be that in warning mode this is not the case? (So we might need to add a not warn_copy_on_write() condition to this if block that updates the ref_count.)

Although the tests are not failing...

Member Author

I don't understand your comment. Why would this not be the case?

I double-checked and it seems to work in warning mode.

Member

Related to #55838 (comment): I thought I had disabled the item cache for the warning mode, which I would expect to affect the count.

Member Author

I double-checked this locally; the cache is still populated.

Member

OK, will take a closer look

Member

@jorisvandenbossche jorisvandenbossche Nov 20, 2023

Ah, no, the _cacher points to the DataFrame (in a case like s = df["col"]), so it doesn't increase the ref count for s (which is what is relevant for chained setitem detection). It also uses a weakref, so it shouldn't actually increase the ref count at all?

The hasattr(self, "_cacher") is kind of a check whether the Series is derived from a DataFrame?
So I think that the reason we need to increase the ref count is still _item_cache, not actually _cacher. The presence of _cacher just turns out to be an equivalent way of checking that we are a Series in non-CoW mode (only in that mode would _item_cache be populated). So maybe a more "correct" check (although they will give the same result) would be:

Suggested change
- if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):
-     ref_count += 1
+ if isinstance(self, ABCSeries) and not (using_copy_on_write() or warn_copy_on_write()):
+     ref_count += 1
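
To illustrate the weakref point in the comment above, here is a minimal pandas-free sketch. Parent and Child are hypothetical stand-ins for DataFrame and Series, not pandas code, and the attribute name _cacher is reused only for illustration. In CPython, a weak back-reference stored on the child should leave both ref counts unchanged, while an ordinary reference adds one.

```python
import sys
import weakref


class Parent:
    """Hypothetical stand-in for a DataFrame."""


class Child:
    """Hypothetical stand-in for a Series derived from a Parent."""


parent = Parent()
child = Child()

parent_before = sys.getrefcount(parent)
child_before = sys.getrefcount(child)

# A weak back-reference from child to parent, similar in spirit to `_cacher`.
child._cacher = weakref.ref(parent)
parent_after_weak = sys.getrefcount(parent)
child_after_weak = sys.getrefcount(child)

# An ordinary (strong) reference, for contrast.
strong_holder = [parent]
parent_after_strong = sys.getrefcount(parent)

print(parent_after_weak - parent_before)    # weak reference: no change expected
print(child_after_weak - child_before)      # attribute set on child: no change expected
print(parent_after_strong - parent_before)  # strong reference: one more expected
```

Only differences between counts are compared here, because the absolute value returned by sys.getrefcount depends on the interpreter.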

Member

Hmm, that suggestion gives a bunch of failures (false positives) in pandas/tests/series/methods/test_replace.py

Member

@jorisvandenbossche jorisvandenbossche Nov 20, 2023

Which is logical, because that's the whole point with _item_cache: in the non-CoW case you don't know whether it is populated. So you can't check against a single fixed REF_COUNT value; the right threshold depends on the circumstances.

And so what you have (checking _cacher) is a way to check if the Series object is derived from a DataFrame with simple indexing (and has populated _item_cache). So that's probably fine? Or are there ways to get a Series that does not populate the cacher?
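
A small pandas-free sketch of why one fixed threshold cannot work. Frame and Column are hypothetical stand-ins, and the use_item_cache flag is an assumption made only so that both situations can be shown side by side. The same chained call sees one extra reference when the parent has cached the returned object.

```python
import sys


class Column:
    """Hypothetical stand-in for a Series."""

    def refs_seen_inside_method(self):
        # What an inplace method would observe when inspecting its own ref count.
        return sys.getrefcount(self)


class Frame:
    """Hypothetical stand-in for a DataFrame with an optional item cache."""

    def __init__(self, use_item_cache):
        self.use_item_cache = use_item_cache
        self._item_cache = {}

    def __getitem__(self, key):
        col = Column()
        if self.use_item_cache:
            # The cache keeps a strong reference to the returned object.
            self._item_cache[key] = col
        return col


# Keep both frames alive in variables so the cache is not discarded early.
plain = Frame(use_item_cache=False)
caching = Frame(use_item_cache=True)

uncached = plain["a"].refs_seen_inside_method()
cached = caching["a"].refs_seen_inside_method()

print(cached - uncached)  # the cached case should see one extra reference
```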

Member

> Or are there ways to get a Series that does not populate the cacher?

Getting a single column with ["col"], .loc[:, "col"], .iloc[:, 0], .get("col"), .xs("col", axis=1) all populate the cacher.

In theory you can get a Series as the result of a calculation that doesn't do this (e.g. df.mean().replace(..)), but of course that never could have done something useful, so we don't need to care about that. We only need to care about getting a Series that is a view.

So to summarize: your fix is probably fully correct, I just didn't understand why ;) You might want to add a comment about it.

Member Author

Yes, your conclusion is correct as far as I can tell. That was the reason why I added the hasattr check. Sorry for omitting this information.

Added a comment
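
The check this thread converges on can be sketched without pandas. Everything below is a hypothetical stand-in (Frame, Column, replace_inplace, warned) and not the pandas implementation. BASELINE plays the role of REF_COUNT and is measured at runtime instead of being hard-coded, on the assumption that the exact value can differ between interpreter versions.

```python
import sys
import warnings
import weakref


class Column:
    """Hypothetical stand-in for a Series; not pandas code."""

    def _probe(self):
        return sys.getrefcount(self)

    def replace_inplace(self):
        ref_count = BASELINE
        if hasattr(self, "_cacher"):
            # Derived from a Frame: its item cache holds one extra reference.
            ref_count += 1
        if sys.getrefcount(self) <= ref_count:
            warnings.warn(
                "inplace method called on a temporary object",
                FutureWarning,
                stacklevel=2,
            )


# Ref count seen inside a method when called on a pure temporary.
BASELINE = Column()._probe()


class Frame:
    """Hypothetical stand-in for a DataFrame that caches returned columns."""

    def __init__(self):
        self._item_cache = {}

    def __getitem__(self, key):
        col = Column()
        col._cacher = weakref.ref(self)  # weak back-reference to the parent
        self._item_cache[key] = col      # strong reference held by the cache
        return col


def warned(action):
    """Run action and report whether it emitted a FutureWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        action()
    return any(issubclass(w.category, FutureWarning) for w in caught)


frame = Frame()

# Chained call: only the cache and the call itself reference the column.
chained_warned = warned(lambda: frame["a"].replace_inplace())

# Named call: the variable adds a reference, so the threshold is exceeded.
col = frame["a"]
named_warned = warned(lambda: col.replace_inplace())

print(chained_warned, named_warned)
```

Without the hasattr adjustment, the chained call would sit one above the baseline because of the cache, and the warning would be missed.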

@@ -395,6 +396,17 @@ def test_replace_chained_assignment(using_copy_on_write):
         with tm.raises_chained_assignment_error():
             df[["a"]].replace(1, 100, inplace=True)
         tm.assert_frame_equal(df, df_orig)
     else:
         with tm.assert_produces_warning(FutureWarning, match="inplace method"):
             with option_context("mode.chained_assignment", None):

Member

Does it otherwise also raise a SettingWithCopyWarning?

Member Author

Yes, [["a"]] currently copies.

elif not PYPY and not using_copy_on_write():
    ctr = sys.getrefcount(self)
    ref_count = REF_COUNT
    if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):

Member

Suggested change
- if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):
+ # in non-CoW mode, chained Series access will populate the
+ # `_item_cache`, which results in an increased ref count not below
+ # the threshold, while we still need to warn. We detect this case
+ # of a Series derived from a DataFrame through the presence of
+ # `_cacher`
+ if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):

(maybe a bit long to put in every place that has this check (after the other PRs), but a comment like this would have explained it to me)

Member Author

I am fine with adding this here. Maybe add a link to this comment in the other PRs?

Member Author

phofl commented Nov 20, 2023

Are you ok with merging this so that I can adjust the others?

Member

jorisvandenbossche commented Nov 20, 2023 via email

@phofl phofl added this to the 2.2 milestone Nov 20, 2023
@phofl phofl merged commit 5625236 into pandas-dev:main Nov 20, 2023
@phofl phofl deleted the warn_cow_mode_replace branch November 20, 2023 22:23
phofl added a commit to phofl/pandas that referenced this pull request Nov 21, 2023