You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is done based on the refcount of the calling DataFrame of __setitem__: if this is only a temporary variable in a chain, it has a lower refcount than when it is assigned to a variable (using sys.getrefcount).
It seemed that we could robustly check for this with the above approach, but we discovered that this fails if the setitem call is done from a function compiled with cython. We actually encounter this in our own test suite, in the read_spss tests using pyreadstats (those are skipped for CoW now). pyreadstats has some functionality written in cython, and for example this function does a valid (non-chained) setitem call to update a column of a dataframe. However, this incorrectly triggers the ChainedAssignmentError.
This is because the cython generated code results in a lower refcount when getting to DataFrame/Series.__setitem__ (a refcount of 2 instead of 3 for a chained case, but so for a valid non-chained case it can give a refcount of 3, which triggers the error).
My current understanding is that both when calling a pure python function that gets interpreted, or calling a compiled cython function, it ends up calling PyObject_SetItem (which will then eventually call the __setitem__). However, when calling this through the interpreter, there is an extra refcount increment from having the object on the stack.
It's an annoying issue, and I am not directly sure of a good solution. Some options that I can currently think of:
Somehow "fix" it and find a way to not raise the error incorrectly:
Can we detect that we are being called from cython? Currently I didn't find anything.
Analyse the stack frames. The cython frames are missing, but so we could analyse the code of the frame above to see if this actually contains a chained setitem operation (but this doesn't sound very robust / fun to write ..)
Make it a warning instead of an error (which makes it less serious if raised as a false positive), and then we can document that libraries should best catch this warning to silence it if calling setitem from cython. Or have some option to enable/disable the warning
The text was updated successfully, but these errors were encountered:
I'm not a huge fan of using refcounting, since it seems very fragile(it depends on a magic number).
(I also don't think this approach will work for accessors too, since accessors store a reference to the DataFrame, so that could prevent the error from happenning)
One thing we could try is using weakref.finalize on the object produced by the first __getitem__ to raise an error if it goes out of scope (The error line might be off though).
(There is also zero chance of that working with PyPy, since the finalizer will not get called when the refcount hits zero like in CPython because there is no refcount)
Digging through the call stack seems more robust, but I think that that is going to absolutely kill perf.
One thing we could try is using weakref.finalize on the object produced by the first __getitem__ to raise an error if it goes out of scope (The error line might be off though).
How would this be able to distinguish between "valid" cases of objects from getitem that at a later stage go out of scope, versus the cases where it goes out of scope on the same line directly after the setitem call?
#49467 added the feature to detect "chained assignment" and raise an error for this when CoW is enabled (since it will then never work):
This is done based on the refcount of the calling DataFrame of
__setitem__
: if this is only a temporary variable in a chain, it has a lower refcount than when it is assigned to a variable (usingsys.getrefcount
).It seemed that we could robustly check for this with the above approach, but we discovered that this fails if the setitem call is done from a function compiled with cython. We actually encounter this in our own test suite, in the read_spss tests using
pyreadstats
(those are skipped for CoW now).pyreadstats
has some functionality written in cython, and for example this function does a valid (non-chained) setitem call to update a column of a dataframe. However, this incorrectly triggers the ChainedAssignmentError.This is because the cython generated code results in a lower refcount when getting to
DataFrame/Series.__setitem__
(a refcount of 2 instead of 3 for a chained case, but so for a valid non-chained case it can give a refcount of 3, which triggers the error).My current understanding is that both when calling a pure python function that gets interpreted, or calling a compiled cython function, it ends up calling
PyObject_SetItem
(which will then eventually call the__setitem__
). However, when calling this through the interpreter, there is an extra refcount increment from having the object on the stack.It's an annoying issue, and I am not directly sure of a good solution. Some options that I can currently think of:
The text was updated successfully, but these errors were encountered: