CoW: Add warning for replace with inplace #56060
pandas/core/generic.py
Outdated
    if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):
        ref_count += 1
The problem might be that in the warning mode, this is not the case? (So we might need to add a `not warn_copy_on_write()` to this `if` block that updates the `ref_count`.)
Although the tests are not failing ...
I don't understand your comment. Why would this not be the case?
I double-checked, and it seems to work in warning mode.
Related to #55838 (comment): I thought I had disabled the item cache for the warning mode, which I would expect to affect the count.
I double-checked this locally: the cache is still populated.
OK, will take a closer look
Ah, no, the `_cacher` points to the DataFrame (in a case like `s = df["col"]`), so it doesn't increase the ref count for `s` (which is what is relevant for chained setitem detection). And also it uses a weakref, so shouldn't actually increase the ref count?
The `hasattr(self, "_cacher")` is kind of a check whether the Series is derived from a DataFrame?
So I think that the reason we need to increase the ref count is still because of `_item_cache`, not actually `_cacher`. The presence of `_cacher` just turns out to be an equivalent check for checking that we are a Series in non-CoW mode (only in that mode `_item_cache` would be populated). So maybe a more "correct" check (although they will give the same result) would be:
- if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):
-     ref_count += 1
+ if isinstance(self, ABCSeries) and not (using_copy_on_write() or warn_copy_on_write()):
+     ref_count += 1
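To make the weakref versus cache distinction above concrete, here is a minimal sketch using only the standard library. `Parent` and `Child` are hypothetical stand-ins for a DataFrame and a derived Series, not pandas classes, and the `(key, weakref)` layout of `_cacher` is an assumption made for illustration. It relies on CPython's `sys.getrefcount`.

```python
import sys
import weakref


class Child:
    """Hypothetical stand-in for a Series derived from a DataFrame."""


class Parent:
    """Hypothetical stand-in for a DataFrame with an item cache."""

    def __init__(self):
        self._item_cache = {}

    def get_child(self, key):
        child = Child()
        # Weak back-reference from child to parent (plays the role of `_cacher`).
        child._cacher = (key, weakref.ref(self))
        # Strong reference from parent to child (plays the role of `_item_cache`).
        self._item_cache[key] = child
        return child


parent = Parent()
parent_before = sys.getrefcount(parent)
child = parent.get_child("col")
parent_after = sys.getrefcount(parent)

child_with_cache = sys.getrefcount(child)
parent._item_cache.clear()  # drop the strong reference held by the cache
child_without_cache = sys.getrefcount(child)

# The weak back-reference leaves the parent's count unchanged ...
weakref_delta = parent_after - parent_before
# ... while the cache entry accounts for one extra reference to the child.
cache_delta = child_with_cache - child_without_cache
print(weakref_delta, cache_delta)
```

In this toy model the extra reference comes from the cache entry, not from the weak back-reference, which matches the reasoning in the comment above.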
Hmm, that suggestion gives a bunch of failures (false positives) in pandas/tests/series/methods/test_replace.py
Which is logical, because that's the whole point with `_item_cache`: you don't know when it is populated or when not in the non-CoW case. So you can't use a single fixed `REF_COUNT` value to check, this actually depends on the circumstances.
And so what you have (checking `_cacher`) is a way to check if the Series object is derived from a DataFrame with simple indexing (and has populated `_item_cache`). So that's probably fine? Or are there ways to get a Series that does not populate the cacher?
> Or are there ways to get a Series that does not populate the cacher?
Getting a single column with `["col"]`, `.loc[:, "col"]`, `.iloc[:, 0]`, `.get("col")`, `.xs("col", axis=1)` all populate the cacher.
In theory you can get a Series as a result of a calculation that doesn't do this (`df.mean().replace(..)`), but of course that never could have done something useful, so we don't need to care about that. We only need to care about getting a Series that is a view.
So to summarize: your fix is probably fully correct, I only didn't understand why ;) You might want to add a comment about it.
Yes, your conclusion is correct as far as I can tell. That was the reason why I added the `hasattr` check. Sorry for omitting this information.
Added a comment
@@ -395,6 +396,17 @@ def test_replace_chained_assignment(using_copy_on_write):
         with tm.raises_chained_assignment_error():
             df[["a"]].replace(1, 100, inplace=True)
         tm.assert_frame_equal(df, df_orig)
     else:
         with tm.assert_produces_warning(FutureWarning, match="inplace method"):
             with option_context("mode.chained_assignment", None):
Does it otherwise also raise a SettingWithCopyWarning?
Yes, `[["a"]]` currently copies.
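The nesting in the test above (silence one warning so the assertion only sees the one under test) can be sketched with the standard library alone. This is an analogy, not the pandas test helpers: `UnrelatedWarning` and `noisy_inplace_op` are hypothetical names standing in for `SettingWithCopyWarning` and the chained `replace(..., inplace=True)` call.

```python
import warnings


class UnrelatedWarning(UserWarning):
    """Hypothetical stand-in for SettingWithCopyWarning."""


def noisy_inplace_op():
    # Hypothetical operation that emits both warnings, like the test case above.
    warnings.warn("unrelated copy warning", UnrelatedWarning)
    warnings.warn("inplace method on a temporary object", FutureWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Silence the unrelated warning so only the one under test is recorded.
    # filterwarnings() inserts at the front of the filter list, so this
    # "ignore" rule takes precedence over the "always" rule above.
    warnings.filterwarnings("ignore", category=UnrelatedWarning)
    noisy_inplace_op()

categories = [w.category for w in caught]
messages = [str(w.message) for w in caught]
```

Without the `ignore` filter, both warnings would be recorded and a strict "exactly this warning" assertion would have to account for the unrelated one.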
elif not PYPY and not using_copy_on_write():
    ctr = sys.getrefcount(self)
    ref_count = REF_COUNT
    if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):
- if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):
+ # in non-CoW mode, chained Series access will populate the
+ # `_item_cache` which results in an increased ref count not below
+ # the threshold, while we still need to warn. We detect this case
+ # of a Series derived from a DataFrame through the presence of
+ # `_cacher`
+ if isinstance(self, ABCSeries) and hasattr(self, "_cacher"):
(maybe a bit long to put in every place that has this check (after the other PRs), but a comment like this would have explained it to me)
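As a rough illustration of the threshold logic this comment describes, here is a pure-Python sketch that takes the observed reference count as a plain integer. The function names, the `ctr <= threshold` comparison, and the base value of 2 are assumptions for illustration, not the pandas implementation.

```python
def warning_threshold(base_ref_count, is_series, has_cacher):
    """Return the ref count at or below which a chained call is assumed."""
    threshold = base_ref_count
    # A Series derived from a DataFrame is also held by the parent's item
    # cache, so one extra reference is expected even for a temporary object.
    if is_series and has_cacher:
        threshold += 1
    return threshold


def should_warn(ctr, base_ref_count, is_series, has_cacher):
    """Assumed rule: warn when the observed count does not exceed the threshold."""
    return ctr <= warning_threshold(base_ref_count, is_series, has_cacher)


BASE = 2  # hypothetical baseline; the real value depends on the interpreter

# Temporary Series taken from a DataFrame: the cache adds one reference.
cached_temp = should_warn(3, BASE, is_series=True, has_cacher=True)
# Same count without the cacher bump: the chained call would go undetected.
missed_without_bump = should_warn(3, BASE, is_series=True, has_cacher=False)
# Temporary DataFrame: no bump needed.
frame_temp = should_warn(2, BASE, is_series=False, has_cacher=False)
```

The second case shows why the fixed baseline alone is not enough for a Series that sits in the item cache.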
I am fine with adding this here. Maybe add a link to this comment in the other PRs?
Are you ok with merging this so that I can adjust the others?
Yes, merge away
On Mon, 20 Nov 2023 at 22:32, Patrick Hoefler wrote:
> Are you ok with merging this so that I can adjust the others?
xref #56019