BUG: ChainedAssignmentError for CoW not working when setitem is called from cython #51315

jorisvandenbossche · 2023-02-10T22:58:02Z

#49467 added the feature to detect "chained assignment" and raise an error for this when CoW is enabled (since it will then never work):

>>> pd.options.mode.copy_on_write = True
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df["a"][0] = 10
...
ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment.
When using the Copy-on-Write mode, such chained assignment never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy.
...

This is done based on the refcount of the calling DataFrame of __setitem__: if this is only a temporary variable in a chain, it has a lower refcount than when it is assigned to a variable (using sys.getrefcount).

It seemed that we could robustly check for this with the above approach, but we discovered that this fails if the setitem call is done from a function compiled with cython. We actually encounter this in our own test suite, in the read_spss tests using pyreadstats (those are skipped for CoW now). pyreadstats has some functionality written in cython, and for example this function does a valid (non-chained) setitem call to update a column of a dataframe. However, this incorrectly triggers the ChainedAssignmentError.

This is because the cython generated code results in a lower refcount when getting to DataFrame/Series.__setitem__ (a refcount of 2 instead of 3 for a chained case, but so for a valid non-chained case it can give a refcount of 3, which triggers the error).
My current understanding is that both when calling a pure python function that gets interpreted, or calling a compiled cython function, it ends up calling PyObject_SetItem (which will then eventually call the __setitem__). However, when calling this through the interpreter, there is an extra refcount increment from having the object on the stack.

It's an annoying issue, and I am not directly sure of a good solution. Some options that I can currently think of:

Somehow "fix" it and find a way to not raise the error incorrectly:
- Can we detect that we are being called from cython? Currently I didn't find anything.
- Analyse the stack frames. The cython frames are missing, but so we could analyse the code of the frame above to see if this actually contains a chained setitem operation (but this doesn't sound very robust / fun to write ..)
Make it a warning instead of an error (which makes it less serious if raised as a false positive), and then we can document that libraries should best catch this warning to silence it if calling setitem from cython. Or have some option to enable/disable the warning

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2023-02-10T23:06:50Z

A small reproducer from the pyreadstats case, is the following function compiled by cython:

%%cython

def _test_setitem_from_cython(df):
    df_copy = df.copy()

    for var_name in df_copy.columns:
        df_copy[var_name] = df_copy[var_name].apply(lambda x: x)

    return df_copy

If you call that, it incorrectly triggers the ChainedAssignmentError:

def test_setitem_from_cython():
    df = DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})
    df2 = _test_setitem_from_cython(df)

lithomas1 · 2023-02-11T23:50:19Z

I'm not a huge fan of using refcounting, since it seems very fragile(it depends on a magic number).

(I also don't think this approach will work for accessors too, since accessors store a reference to the DataFrame, so that could prevent the error from happenning)

One thing we could try is using weakref.finalize on the object produced by the first __getitem__ to raise an error if it goes out of scope (The error line might be off though).

(There is also zero chance of that working with PyPy, since the finalizer will not get called when the refcount hits zero like in CPython because there is no refcount)

Digging through the call stack seems more robust, but I think that that is going to absolutely kill perf.

jorisvandenbossche · 2023-02-15T00:48:01Z

One thing we could try is using weakref.finalize on the object produced by the first __getitem__ to raise an error if it goes out of scope (The error line might be off though).

How would this be able to distinguish between "valid" cases of objects from getitem that at a later stage go out of scope, versus the cases where it goes out of scope on the same line directly after the setitem call?

jorisvandenbossche added Bug Copy / view semantics labels Feb 10, 2023

This was referenced Mar 13, 2023

CoW: change ChainedAssignmentError exception to a warning #51926

Merged

Copy-on-Write (PDEP-7) follow-up overview issue #48998

Open

jorisvandenbossche mentioned this issue Dec 11, 2023

BUG: ChainedAssignmentError for CoW not working for chained inplace methods when passing *args or **kwargs #56456

Open

Dr-Irv mentioned this issue Feb 13, 2024

CI: fix nightly pandas-dev/pandas-stubs#869

Merged

2 tasks

da-woods mentioned this issue Feb 19, 2024

[ENH] got pandas warning after cythonize cython/cython#6013

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: ChainedAssignmentError for CoW not working when setitem is called from cython #51315

BUG: ChainedAssignmentError for CoW not working when setitem is called from cython #51315

jorisvandenbossche commented Feb 10, 2023 •

edited

Loading

jorisvandenbossche commented Feb 10, 2023

lithomas1 commented Feb 11, 2023

jorisvandenbossche commented Feb 15, 2023

BUG: ChainedAssignmentError for CoW not working when setitem is called from cython #51315

BUG: ChainedAssignmentError for CoW not working when setitem is called from cython #51315

Comments

jorisvandenbossche commented Feb 10, 2023 • edited Loading

jorisvandenbossche commented Feb 10, 2023

lithomas1 commented Feb 11, 2023

jorisvandenbossche commented Feb 15, 2023

jorisvandenbossche commented Feb 10, 2023 •

edited

Loading