Skip to content

API / CoW: detect and raise error for chained assignment under Copy-on-Write #49467

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jan 24, 2023

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Nov 2, 2022

xref #48998

One of the consequences of the copy / view rules with the Copy-on-Write proposal is that direct chained assignement (i.e. df[..][..] = .., without using any intermediate variable like sub = df[..]; sub[..] = ..) consistently never works (so not depending on which type of indexing operation (eg column selection vs row mask) or on the order of the operations).

Given that this will be one of the significant backwards incompatible aspects of the CoW change (and so we will also need to add warnings for this in advance (before using / switching to CoW); but this PR is focusing on the eventual behaviour with CoW enabled, adding such warnings is for another PR), I think it can be useful to keep raising an error for this even after the CoW behaviour would have become the default (in addition to warning about it in advance).
And given that this consistently never works, I think it is also fine to keep raising an error in the future for this, as there should never be a reason (in the future) to actually do this.

This is somewhat similar to the SettingWithCopyError we already have (the error that can optionally be enabled, to get errors instead of warnings). But I decided to not reuse this error but create a new exception class, because it is different enough (it is raised (or not) in different situations; for example SettingWithCopyError will not raise for chained assignment in the cases it is know the work at the moment, and it can raise in non-chained cases (using an intermediate variable) while the new ChainedAssignmentError would solely focus on chained cases).

This includes the changes from #49450 as well (the first commit here), since that change was needed to have the refcounting work correctly. But it was a sufficiently stand-alone change, so I broke it off in its own PR.

How does this work? I am relying on the refcount of the object on which __setitem__ is being called. If you consider the two simple chained and non-chained cases:

df[col][mask] = ..
# vs
subset = df[col]
subset[mask] = ..

In the first case, the temporary object (from df[col]) only lives in this chain, and doesn't have any references to it otherwise (and would also be cleaned up after the setitem operation), and does has a lower reference count compared to the second, non-chained case where the intermediate object (here called subset) is explicitly created by the user. In the second case we don't want an error (because this is valid code to update subset; because of triggering CoW it will just not update the parent df).

From testing, the reference count in the first (chained) case seems to be 3, and if there is another reference to the object, it is always higher than 3, and for now this seems to be robustly so for our full test suite.

One problem with this approach is that it is CPython specific. For example PyPy doesn't use refcounting, and so this PR won't work to raise the error on PyPy (I probably still need to add a platform check together with the refcount check, to avoid we raise an error about sys.getrefcount not being available on PyPy). For example the numpy resize method also uses refcounting for certain cases, and therefore requires the user to pass a keyword to disable this check on PyPy (numpy/numpy#8050).
I am not familiar enough with PyPy to know if there might be alternatives ways to check this when using PyPy. And it is certainly a downside of the current approach that it would not work consistently across Python implementations. But it is also mostly a user convenience to signal they are doing something that won't work, so my feeling is that in this case this difference can be OK (it is not that correct code would behave differently. The refcount is not used to know if CoW needs to happen or not (which affects actual output), but only to know if it would be useful to signal the user about code that has no effect anyway).

@@ -298,6 +298,28 @@ class SettingWithCopyError(ValueError):
"""


class ChainedAssignmentError(ValueError):
"""
Exception raised when trying to set on a copied slice from a ``DataFrame``.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to update this docstring (copied from SettingWithCopyError for now)

@jreback
Copy link
Contributor

jreback commented Nov 2, 2022

One of the consequences of the copy / view rules with the Copy-on-Write proposal is that direct chained assignement (i.e. df[..][..] = .., without using any intermediate variable like sub = df[..]; sub[..] = ..) consistently never works (so not depending on which type of indexing operation (eg column selection vs row mask) or on the order of the operations)

this seems counter intuitive. isn't the this the entire rationale for copy-on-write ?

@jorisvandenbossche
Copy link
Member Author

One of the consequences of the copy / view rules with the Copy-on-Write proposal is that direct chained assignement consistently never works ...

this seems counter intuitive. isn't the this the entire rationale for copy-on-write ?

@jreback No, one of the main goals of it is to have a clear and consistent rules, and in the proposal the main rule is "any indexing operation returns a new object that behaves as a copy".
If you apply this rule consistently, this has the consequence that it doesn't matter if you do subset = df[..]; subset[..] = .. or df[..][..] = .., since in both cases the first df[..] returns a new object that behaves as a copy, and thus further modifying this resulting object doesn't modify the parent df. As a consequence, chained assignment consistently never works (as opposed to the current situation, where chained assignment sometimes works depending on the exact indexing operation and the order of the steps in the chain).

This is mentioned explicitly in the proposal under discussion. Quoting the summary:

Because every single indexing step behaves as a copy, this also means that with this proposal, “chained assignment” (with multiple setitem steps) will never work.

And there is also an explicit section about it at Chained assignment. It was also repeatedly called out as one of the major changes with this proposal (eg last bullet point in #36195 (comment), one of the bullet points when I revived the discussion with the current proposal at #36195 (comment), last bullet point of #36195 (comment) of Kevin supporting this change, #36195 (comment), etc, as some references in case someone wants to reread the discussion on this topic), and mentioned on the pandas-dev mailing list thread about this.

Also note that this PR doesn't actually change any behaviour regarding that. The fact that chained assignment never works was already included in the initial PR #46958 (I am only adding an error here to signal this behaviour better to the user; it could also be a warning). So if we want to further discuss the core aspect of chained assignment, let's use the original issue for this (#36195)?

@jreback
Copy link
Contributor

jreback commented Nov 3, 2022

ok that fair - agree consistency here is the important point

@github-actions github-actions bot added the Stale label Dec 4, 2022
@jorisvandenbossche jorisvandenbossche force-pushed the cow-error-chained-setitem branch from 9fb2148 to 017ed3e Compare January 13, 2023 13:49
@pandas-dev pandas-dev deleted a comment from github-actions bot Jan 13, 2023
@jorisvandenbossche
Copy link
Member Author

Rebased this to get rid of the first commit now #49450 is merged

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review January 13, 2023 14:17
Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to keep the commented print statements?

_chained_assignment_msg = (
"A value is trying to be set on a copy of a DataFrame or Series "
"through chained assignment.\n"
"When using the Copy-on-Write mode, such chained assignment never works "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never updates the original....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's on the next line?

Copy link
Member

@phofl phofl Jan 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, should have been clearer:

I would reword to:

When using the Copy-on-Write mode, such chained assignment never updates the original DataFrame or Series, ...

This sounds better to me

@jorisvandenbossche
Copy link
Member Author

Do you want to keep the commented print statements?

No, that was just useful while implementing / debugging, and first wanted to make sure that all tests are passing ;) Will remove now since that seems to be the case.

@phofl
Copy link
Member

phofl commented Jan 18, 2023

I think pre-commit complains about missing match statements in the pytest raises calls

@jorisvandenbossche
Copy link
Member Author

Refcounts work differently on PyPy (there is no sys.getrefcount doesn't exist there), so we won't be able tot give this informative error on PyPy (as mentioned in the top post). So I did put the check and error behind a if not PyPy check.
I also already updated the tests to handle this, but this currently can't actually be tested in practice, since our PyPy test build is segfaulting (#50817)

@lithomas1
Copy link
Member

I don't think its too urgent to get this working on PyPy at the moment. When I added the PyPy build, in addition to the segfaults, I remember that there was a bunch of failures relating to hashing (I think our hash functions might have been completely broken on PyPy).

Once #50817 is fixed, I'll try to set up a mechanism to selectively enable some tests.

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an error in the typing/doctest etc ci, otherwise lgtm

@phofl phofl added this to the 2.0 milestone Jan 24, 2023
@phofl phofl merged commit 73840ef into pandas-dev:main Jan 24, 2023
@phofl
Copy link
Member

phofl commented Jan 24, 2023

thx @jorisvandenbossche

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants