-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API / CoW: detect and raise error for chained assignment under Copy-on-Write #49467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API / CoW: detect and raise error for chained assignment under Copy-on-Write #49467
Conversation
pandas/errors/__init__.py
Outdated
@@ -298,6 +298,28 @@ class SettingWithCopyError(ValueError): | |||
""" | |||
|
|||
|
|||
class ChainedAssignmentError(ValueError): | |||
""" | |||
Exception raised when trying to set on a copied slice from a ``DataFrame``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to update this docstring (copied from SettingWithCopyError for now)
this seems counter intuitive. isn't the this the entire rationale for copy-on-write ? |
@jreback No, one of the main goals of it is to have a clear and consistent rules, and in the proposal the main rule is "any indexing operation returns a new object that behaves as a copy". This is mentioned explicitly in the proposal under discussion. Quoting the summary:
And there is also an explicit section about it at Chained assignment. It was also repeatedly called out as one of the major changes with this proposal (eg last bullet point in #36195 (comment), one of the bullet points when I revived the discussion with the current proposal at #36195 (comment), last bullet point of #36195 (comment) of Kevin supporting this change, #36195 (comment), etc, as some references in case someone wants to reread the discussion on this topic), and mentioned on the pandas-dev mailing list thread about this. Also note that this PR doesn't actually change any behaviour regarding that. The fact that chained assignment never works was already included in the initial PR #46958 (I am only adding an error here to signal this behaviour better to the user; it could also be a warning). So if we want to further discuss the core aspect of chained assignment, let's use the original issue for this (#36195)? |
ok that fair - agree consistency here is the important point |
9fb2148
to
017ed3e
Compare
Rebased this to get rid of the first commit now #49450 is merged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to keep the commented print statements?
_chained_assignment_msg = ( | ||
"A value is trying to be set on a copy of a DataFrame or Series " | ||
"through chained assignment.\n" | ||
"When using the Copy-on-Write mode, such chained assignment never works " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never updates the original....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's on the next line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, should have been clearer:
I would reword to:
When using the Copy-on-Write mode, such chained assignment never updates the original DataFrame or Series, ...
This sounds better to me
No, that was just useful while implementing / debugging, and first wanted to make sure that all tests are passing ;) Will remove now since that seems to be the case. |
I think pre-commit complains about missing match statements in the pytest raises calls |
Refcounts work differently on PyPy (there is no |
I don't think its too urgent to get this working on PyPy at the moment. When I added the PyPy build, in addition to the segfaults, I remember that there was a bunch of failures relating to hashing (I think our hash functions might have been completely broken on PyPy). Once #50817 is fixed, I'll try to set up a mechanism to selectively enable some tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an error in the typing/doctest etc ci, otherwise lgtm
xref #48998
One of the consequences of the copy / view rules with the Copy-on-Write proposal is that direct chained assignement (i.e.
df[..][..] = ..
, without using any intermediate variable likesub = df[..]; sub[..] = ..
) consistently never works (so not depending on which type of indexing operation (eg column selection vs row mask) or on the order of the operations).Given that this will be one of the significant backwards incompatible aspects of the CoW change (and so we will also need to add warnings for this in advance (before using / switching to CoW); but this PR is focusing on the eventual behaviour with CoW enabled, adding such warnings is for another PR), I think it can be useful to keep raising an error for this even after the CoW behaviour would have become the default (in addition to warning about it in advance).
And given that this consistently never works, I think it is also fine to keep raising an error in the future for this, as there should never be a reason (in the future) to actually do this.
This is somewhat similar to the
SettingWithCopyError
we already have (the error that can optionally be enabled, to get errors instead of warnings). But I decided to not reuse this error but create a new exception class, because it is different enough (it is raised (or not) in different situations; for example SettingWithCopyError will not raise for chained assignment in the cases it is know the work at the moment, and it can raise in non-chained cases (using an intermediate variable) while the newChainedAssignmentError
would solely focus on chained cases).This includes the changes from #49450 as well (the first commit here), since that change was needed to have the refcounting work correctly. But it was a sufficiently stand-alone change, so I broke it off in its own PR.
How does this work? I am relying on the refcount of the object on which
__setitem__
is being called. If you consider the two simple chained and non-chained cases:In the first case, the temporary object (from
df[col]
) only lives in this chain, and doesn't have any references to it otherwise (and would also be cleaned up after the setitem operation), and does has a lower reference count compared to the second, non-chained case where the intermediate object (here calledsubset
) is explicitly created by the user. In the second case we don't want an error (because this is valid code to updatesubset
; because of triggering CoW it will just not update the parentdf
).From testing, the reference count in the first (chained) case seems to be 3, and if there is another reference to the object, it is always higher than 3, and for now this seems to be robustly so for our full test suite.
One problem with this approach is that it is CPython specific. For example PyPy doesn't use refcounting, and so this PR won't work to raise the error on PyPy (I probably still need to add a platform check together with the refcount check, to avoid we raise an error about
sys.getrefcount
not being available on PyPy). For example the numpyresize
method also uses refcounting for certain cases, and therefore requires the user to pass a keyword to disable this check on PyPy (numpy/numpy#8050).I am not familiar enough with PyPy to know if there might be alternatives ways to check this when using PyPy. And it is certainly a downside of the current approach that it would not work consistently across Python implementations. But it is also mostly a user convenience to signal they are doing something that won't work, so my feeling is that in this case this difference can be OK (it is not that correct code would behave differently. The refcount is not used to know if CoW needs to happen or not (which affects actual output), but only to know if it would be useful to signal the user about code that has no effect anyway).
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.