API / CoW: detect and raise error for chained assignment under Copy-on-Write #49467

jorisvandenbossche · 2022-11-02T09:10:10Z

One of the consequences of the copy / view rules in the Copy-on-Write proposal is that direct chained assignment (i.e. df[..][..] = .., without an intermediate variable like sub = df[..]; sub[..] = ..) consistently never works, regardless of the type of indexing operation (e.g. column selection vs row mask) or the order of the operations.

This will be one of the significant backwards-incompatible aspects of the CoW change, so we will also need to add warnings for it in advance (before using / switching to CoW). Adding such warnings is for another PR; this PR focuses on the eventual behaviour with CoW enabled. I think it can be useful to keep raising an error for this even after CoW has become the default (in addition to warning about it in advance).
And given that this consistently never works, I think it is also fine to keep raising an error in the future, as there should never be a reason to actually do this.

This is somewhat similar to the SettingWithCopyError we already have (the error that can optionally be enabled to get errors instead of warnings). But I decided not to reuse that error and to create a new exception class instead, because it is different enough: it is raised (or not) in different situations. For example, SettingWithCopyError will not raise for chained assignment in the cases where it is known to work at the moment, and it can raise in non-chained cases (using an intermediate variable), while the new ChainedAssignmentError would focus solely on chained cases.

This includes the changes from #49450 as well (the first commit here), since that change was needed to have the refcounting work correctly. But it was a sufficiently stand-alone change, so I broke it off in its own PR.

How does this work? I am relying on the refcount of the object on which __setitem__ is being called. If you consider the two simple chained and non-chained cases:

df[col][mask] = ..
# vs
subset = df[col]
subset[mask] = ..

In the first case, the temporary object (from df[col]) only lives in this chain and has no other references to it (it would also be cleaned up after the setitem operation), and thus has a lower reference count than in the second, non-chained case, where the intermediate object (here called subset) is explicitly created by the user. In the second case we don't want an error, because this is valid code to update subset (because CoW is triggered, it will just not update the parent df).

From testing, the reference count in the first (chained) case seems to be 3, and if there is another reference to the object, it is always higher than 3, and for now this seems to be robustly so for our full test suite.
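To make the refcount idea concrete, here is a minimal CPython-only sketch. It does not use pandas: ToyFrame and ToyColumn are hypothetical stand-ins, and the absolute counts it prints are not claimed to match the value 3 mentioned above (they depend on the interpreter and call path). It only shows that an object reached through a chain has a lower refcount inside __setitem__ than one that is also held elsewhere. The intermediate object is held in a list here purely to make the extra reference explicit.

```python
import sys

observed = []  # refcounts seen inside __setitem__, in call order


class ToyColumn:
    """Hypothetical stand-in for the object returned by df[col]."""

    def __setitem__(self, key, value):
        # Record how many references to `self` exist during this call.
        observed.append(sys.getrefcount(self))


class ToyFrame:
    """Hypothetical stand-in for a DataFrame: each lookup returns a new object."""

    def __getitem__(self, key):
        return ToyColumn()


df = ToyFrame()

# Chained: the ToyColumn is a temporary that only lives inside this statement.
df["a"][0] = 1
chained_count = observed[-1]

# Non-chained: something else (here a list) keeps the intermediate object alive.
holder = [df["a"]]
holder[0][0] = 1
held_count = observed[-1]

print(chained_count, held_count)
```

A check along the lines of "refcount at or below some threshold means chained" could then be built on top of this difference; the threshold itself would need to be determined empirically, as described above.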

One problem with this approach is that it is CPython-specific. For example, PyPy doesn't use refcounting, so this PR won't be able to raise the error on PyPy (I probably still need to add a platform check alongside the refcount check, to avoid raising an error about sys.getrefcount not being available on PyPy). As a precedent, the numpy resize method also uses refcounting in certain cases, and therefore requires the user to pass a keyword to disable this check on PyPy (numpy/numpy#8050).
I am not familiar enough with PyPy to know whether there are alternative ways to check this there. It is certainly a downside of the current approach that it does not work consistently across Python implementations. But the error is mostly a user convenience to signal that they are doing something that won't work, so my feeling is that this difference can be OK here: correct code would not behave differently. The refcount is not used to decide whether CoW needs to happen (which would affect actual output), only to decide whether it is useful to tell the user about code that has no effect anyway.
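As an illustration of the platform check mentioned above, a guard could look roughly like the following sketch. The names IS_PYPY and refcount_or_none are hypothetical and chosen for this example only; this is not necessarily how the PR implements it.

```python
import platform
import sys

# Assumption for this sketch: detect PyPy via the implementation name.
IS_PYPY = platform.python_implementation() == "PyPy"


def refcount_or_none(obj):
    """Return the refcount of `obj`, or None where refcounts are unavailable."""
    if IS_PYPY or not hasattr(sys, "getrefcount"):
        # No refcount available: skip the check instead of raising.
        return None
    return sys.getrefcount(obj)


count = refcount_or_none(object())
print("refcount check available:", count is not None)
```

With a guard like this, the informative error is simply skipped where refcounts are unavailable, which matches the point that the check is a convenience and does not affect results.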

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

jorisvandenbossche · 2022-11-02T09:11:28Z

pandas/errors/__init__.py

@@ -298,6 +298,28 @@ class SettingWithCopyError(ValueError):
    """


+class ChainedAssignmentError(ValueError):
+    """
+    Exception raised when trying to set on a copied slice from a ``DataFrame``.


Still need to update this docstring (copied from SettingWithCopyError for now)

jreback · 2022-11-02T11:21:12Z

One of the consequences of the copy / view rules with the Copy-on-Write proposal is that direct chained assignment (i.e. df[..][..] = .., without using any intermediate variable like sub = df[..]; sub[..] = ..) consistently never works (so not depending on which type of indexing operation (eg column selection vs row mask) or on the order of the operations)

this seems counterintuitive. isn't this the entire rationale for copy-on-write?

jorisvandenbossche · 2022-11-02T17:21:49Z

One of the consequences of the copy / view rules with the Copy-on-Write proposal is that direct chained assignment consistently never works ...

this seems counterintuitive. isn't this the entire rationale for copy-on-write?

@jreback No, one of the main goals of it is to have clear and consistent rules, and in the proposal the main rule is "any indexing operation returns a new object that behaves as a copy".
If you apply this rule consistently, this has the consequence that it doesn't matter if you do subset = df[..]; subset[..] = .. or df[..][..] = .., since in both cases the first df[..] returns a new object that behaves as a copy, and thus further modifying this resulting object doesn't modify the parent df. As a consequence, chained assignment consistently never works (as opposed to the current situation, where chained assignment sometimes works depending on the exact indexing operation and the order of the steps in the chain).
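As a small illustration of that rule (a pure-Python toy, not pandas: ToyFrame is hypothetical and copies eagerly rather than lazily), both spellings leave the parent untouched once every indexing step returns something that behaves as a copy:

```python
class ToyFrame:
    """Hypothetical container whose indexing always behaves as a copy."""

    def __init__(self, columns):
        self._columns = {name: list(values) for name, values in columns.items()}

    def __getitem__(self, name):
        # Hand back a new list so that callers can never mutate the parent.
        return list(self._columns[name])


df = ToyFrame({"a": [1, 2, 3]})

# Chained assignment: modifies a temporary copy that is immediately discarded.
df["a"][0] = 100
chained_result = df["a"]

# Intermediate variable: modifies `subset`, which is useful, but not the parent.
subset = df["a"]
subset[0] = 100

print(chained_result, subset, df["a"])
```

The only difference between the two spellings is that the second leaves the user with a modified subset, while the first has no observable effect at all.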

This is mentioned explicitly in the proposal under discussion. Quoting the summary:

Because every single indexing step behaves as a copy, this also means that with this proposal, “chained assignment” (with multiple setitem steps) will never work.

And there is also an explicit section about it at Chained assignment. It was also repeatedly called out as one of the major changes with this proposal (eg last bullet point in #36195 (comment), one of the bullet points when I revived the discussion with the current proposal at #36195 (comment), last bullet point of #36195 (comment) of Kevin supporting this change, #36195 (comment), etc, as some references in case someone wants to reread the discussion on this topic), and mentioned on the pandas-dev mailing list thread about this.

Also note that this PR doesn't actually change any behaviour regarding that. The fact that chained assignment never works was already included in the initial PR #46958 (I am only adding an error here to signal this behaviour better to the user; it could also be a warning). So if we want to further discuss the core aspect of chained assignment, let's use the original issue for this (#36195)?

jreback · 2022-11-03T00:34:53Z

ok that's fair - agree consistency here is the important point

jorisvandenbossche · 2023-01-13T13:49:40Z

Rebased this to get rid of the first commit now #49450 is merged

phofl

Do you want to keep the commented print statements?

phofl · 2023-01-13T14:54:32Z

pandas/errors/__init__.py

+_chained_assignment_msg = (
+    "A value is trying to be set on a copy of a DataFrame or Series "
+    "through chained assignment.\n"
+    "When using the Copy-on-Write mode, such chained assignment never works "


never updates the original....

That's on the next line?

Ah sorry, should have been clearer:

I would reword to:

When using the Copy-on-Write mode, such chained assignment never updates the original DataFrame or Series, ...

This sounds better to me

jorisvandenbossche · 2023-01-16T15:59:06Z

Do you want to keep the commented print statements?

No, that was just useful while implementing / debugging, and first wanted to make sure that all tests are passing ;) Will remove now since that seems to be the case.

phofl · 2023-01-18T21:29:08Z

I think pre-commit complains about missing match statements in the pytest raises calls

jorisvandenbossche · 2023-01-23T16:21:46Z

Refcounts work differently on PyPy (sys.getrefcount doesn't exist there), so we won't be able to give this informative error on PyPy (as mentioned in the top post). So I put the check and error behind an if not PyPy check.
I also already updated the tests to handle this, but this currently can't actually be tested in practice, since our PyPy test build is segfaulting (#50817)

lithomas1 · 2023-01-23T17:13:18Z

I don't think it's too urgent to get this working on PyPy at the moment. When I added the PyPy build, in addition to the segfaults, I remember that there were a bunch of failures relating to hashing (I think our hash functions might have been completely broken on PyPy).

Once #50817 is fixed, I'll try to set up a mechanism to selectively enable some tests.

phofl

There is an error in the typing/doctest etc ci, otherwise lgtm

phofl · 2023-01-24T15:10:34Z

thx @jorisvandenbossche

jorisvandenbossche added the Copy / view semantics label Nov 2, 2022

jorisvandenbossche mentioned this pull request Nov 2, 2022

Copy-on-Write (PDEP-7) follow-up overview issue #48998

Open

jorisvandenbossche mentioned this pull request Nov 2, 2022

TST: avoid chained assignment in tests outside of specific tests on chaining #49474

Merged

github-actions bot added the Stale label Dec 4, 2022

jorisvandenbossche added 3 commits January 13, 2023 14:31

API: detect and raise error for chained assignment under Copy-on-Write

16e31e9

skip SPSS tests for now (to further investigate)

2961701

use helper for option

017ed3e

jorisvandenbossche force-pushed the cow-error-chained-setitem branch from 9fb2148 to 017ed3e Compare January 13, 2023 13:49

pandas-dev deleted a comment from github-actions bot Jan 13, 2023

update error message

e170b6c

jorisvandenbossche marked this pull request as ready for review January 13, 2023 14:17

jorisvandenbossche requested a review from phofl January 13, 2023 14:22

phofl reviewed Jan 13, 2023

Merge remote-tracking branch 'upstream/main' into cow-error-chained-s…

08adb25

…etitem

jorisvandenbossche added 2 commits January 16, 2023 16:59

address feedback

8d50d16

add whatsnew

4521d87

jorisvandenbossche added 5 commits January 23, 2023 10:57

Merge remote-tracking branch 'upstream/main' into cow-error-chained-s…

446863a

…etitem

add test build for CoW using PyPy

dc71236

don't try to raise on PYPY + update tests for that

085dd70

try fix typing

1c3a4bb

remove PyPy with CoW build again

a7089ba

phofl reviewed Jan 24, 2023

convert raises helper to function to avoid pytest import

e2bb7e1

phofl approved these changes Jan 24, 2023

phofl added this to the 2.0 milestone Jan 24, 2023

phofl merged commit 73840ef into pandas-dev:main Jan 24, 2023

jorisvandenbossche deleted the cow-error-chained-setitem branch January 24, 2023 15:28

pooja-subramaniam pushed a commit to pooja-subramaniam/pandas that referenced this pull request Jan 25, 2023

API / CoW: detect and raise error for chained assignment under Copy-o…

0fdb55b

…n-Write (pandas-dev#49467)

jorisvandenbossche mentioned this pull request Feb 10, 2023

BUG: ChainedAssignmentError for CoW not working when setitem is called from cython #51315

Open
