DOC: Added indexing views to roadmap #36988
.. _roadmap.indexing_views: | ||
|
||
================== | ||
Indexing and Views | ||
================== | ||
|
||
*A proposal for consistent, clear copy vs. view semantics in pandas' indexing.* | ||
|
||
**Issue**: https://github.com/pandas-dev/pandas/issues/36195 | ||
|
||
Motivation | ||
---------- | ||
|
||
pandas’ current behavior on whether indexing returns a view or copy is | ||
confusing. Even for experienced users, it’s hard to tell whether a view or copy | ||
will be returned (see below for a summary). We’d like to provide an API that is | ||
consistent and sensible about returning views vs. copies. | ||
|
||
We also care about performance. Returning views from indexing operations is | ||
faster and reduces memory usage (at least for that operation; whether it’s | ||
faster for a full workflow depends on whether downstream operations trigger a | ||
copy (possibly through block consolidation)). | ||
|
||
Finally, there are API / usability issues around views. It can be challenging to | ||
know the user’s intent in operations that modify a subset of a DataFrame (column | ||
and/or row selection), like: | ||
|
||
.. code-block:: python

    >>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
    >>> df2 = df[["A"]]
    >>> df2.iloc[:, 0] = 10
|
||
Did the user intend to modify ``df`` when they modified ``df2`` (setting aside | ||
issues with the current implementation)? In other words, if we had a perfectly | ||
consistent world where indexing the columns always returned views or always | ||
returned a copy, does the code above imply that the user wants to mutate ``df``? | ||
|
||
There are two possible behaviours the user might intend: | ||
|
||
1. I know my subset might be a view of the original and I want to modify the | ||
original as well. | ||
2. I just want to modify the subset without modifying the original. | ||
|
||
Today, pandas’ inconsistency means neither of these workflows is really | ||
possible. The first is difficult, because indexing operations often (though not | ||
always) return copies, and even when a view is returned you sometimes get a | ||
``SettingWithCopyWarning`` when mutating. The second is somewhat possible, but | ||
requires many defensive copies (to avoid ``SettingWithCopyWarning``, or to | ||
ensure that you have a copy when a view was returned). | ||
|
||
Proposal Summary | ||
---------------- | ||
|
||
For these reasons (consistency, performance, code clarity), we propose three | ||
changes: | ||
|
||
1. Indexing always returns a view when possible. This means that indexing
   columns of a dataframe always returns a view
   (https://github.com/pandas-dev/pandas/pull/33597), and indexing rows may
   return a view, depending on the type of the row indexer.
2. We implement Error-on-Write (explained below).
3. We provide APIs for explicitly marking a DataFrame as a “mutable view”
   (mutating the dataframe would mutate its parents) and for copying a
   dataframe only if needed, so that the result is guaranteed not to be a view
   on another dataframe and can be mutated without affecting other dataframes.
|
||
The intent is to capture the performance benefits of views, while allowing users | ||
to explicitly choose the behavior they want for inplace operations that might | ||
mutate other dataframes. This essentially makes returning views an internal | ||
optimization, without the user needing to know if the specific indexing | ||
operation would return a view or a copy. | ||
|
||
Taking the example from above, if the user wants to make use of the fact that | ||
``df2`` is a view to modify the original ``df``, they would write: | ||
|
||
.. code-block:: python

    # Case 1: user wants mutations of df2 to be reflected in df
    >>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
    >>> df2 = df[["A"]].as_mutable_view()  # name TBD
    >>> df2.iloc[:, 0] = 10
    >>> df.iloc[0, 0]  # df was mutated
    10
|
||
For the user who does not wish to mutate the parent, “Error on Write” requires
that they explicitly break the reference from ``df2`` to ``df``.

.. code-block:: python

    # Case 2: The user does not want mutating df2 to mutate df, via EoW
    >>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
    >>> df2 = df[["A"]]
    >>> df2.iloc[0, 0] = 10
    MutableViewError("error on write to subset of other dataframe")
    >>> df2 = df2.copy_if_needed()  # API is TBD. Could be a keyword argument to copy.
    >>> df2.iloc[:, 0] = 10
    >>> df.iloc[0, 0]  # df was not mutated
    1
|
||
Copy-on-Write vs. Error-on-Write | ||
-------------------------------- | ||
|
||
Consider the following example: | ||
|
||
.. code-block:: python | ||
|
||
>>> df2 = df[['A']] | ||
>>> df2.iloc[0, 0] = 10 # df2 can be a view of df, what happens by default? | ||
>>> df3 = df[df['A'] == 1] | ||
>>> df3.iloc[0, 0] = 10 # df3 is already a copy of df, what happens by default? | ||
|
||
We have a few options for the default: | ||
|
||
1. Well-Defined copy/view rules: ensure we have more consistent rules (e.g.
   selecting columns is always a view); writes to a view then mutate the
   parent, while writes to a copy do not. This comes down to fixing some bugs
   and clearly documenting and testing which operations are views, and which
   are copies.
2. Copy-on-Write: The setitem would check if it’s a view on another dataframe.
   If it is, then we would copy our data before mutating.
3. Error-on-Write: The setitem would check if it’s a subset of another dataframe
   (either a view or a copy). Rather than copying in the case of a view, we would
   raise an exception telling the user to either copy the data with
   ``.copy_if_needed()`` (name TBD) or mark the frame as “a mutable view” with
   ``.as_mutable_view()`` (name TBD).
|
||
We propose "Error on Write" by default. This forces a decision on the user, and | ||
is the most explicit in terms of code. | ||
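
The difference between the Copy-on-Write and Error-on-Write options can be sketched with a toy container in plain Python. This is only an illustration under stated assumptions: ``ToyColumn``, its ``policy`` argument, and ``WriteToViewError`` are hypothetical names invented for the sketch, not pandas APIs and not the proposed implementation.

```python
class WriteToViewError(Exception):
    """Hypothetical stand-in for the proposal's MutableViewError."""


class ToyColumn:
    """Toy one-column container; ``policy`` is "cow" or "eow" (illustrative only)."""

    def __init__(self, values, policy, parent=None):
        self._values = values      # list that may be shared with a parent
        self._policy = policy
        self._parent = parent      # None means "owns its data"

    def view(self):
        # Zero-copy subset: the child shares the parent's list.
        return ToyColumn(self._values, self._policy, parent=self)

    def __getitem__(self, i):
        return self._values[i]

    def __setitem__(self, i, value):
        if self._parent is not None:
            if self._policy == "eow":
                raise WriteToViewError("write to a subset of another object")
            # Copy-on-Write: detach from the parent before mutating.
            self._values = list(self._values)
            self._parent = None
        self._values[i] = value


cow_parent = ToyColumn([1, 2], policy="cow")
cow_child = cow_parent.view()
cow_child[0] = 10            # silently copies; the parent keeps its value

eow_parent = ToyColumn([1, 2], policy="eow")
eow_child = eow_parent.view()
try:
    eow_child[0] = 10
    eow_raised = False
except WriteToViewError:
    eow_raised = True        # the user must now choose a behavior explicitly
```

In the toy, Copy-on-Write never surprises the parent but hides the copy, while Error-on-Write surfaces the decision at the first write.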
|
||
Additionally, consider the "classic" case of chained indexing, which was the
original motivation for the ``SettingWithCopy`` warning:
|
||
.. code-block:: python | ||
|
||
>>> df[df['B'] > 4]['B'] = 10 | ||
|
||
That is roughly equivalent to | ||
|
||
.. code-block:: python | ||
|
||
>>> df2 = df[df['B'] > 4] # Copy under NumPy’s rules | ||
>>> df2['B'] = 10 # Update (the copy) df2, df not changed | ||
>>> del df2 # All references to df2 are lost, goes out of scope | ||
|
||
And so ``df`` is not modified. If we adopted Copy On Write to completely replace the
current ``SettingWithCopy`` warning, we would restore the old behavior of silently
“failing” to update ``df``. Under Error on Write, we’d track that the ``df2`` created
by the first getitem references ``df`` and raise an exception when it was being
mutated.
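
A rough, pure-Python analogue of the chained-indexing case under Error on Write might look like the following. It assumes a hypothetical ``ToyFrame`` whose row selection copies data but remembers the frame it came from; ``select_rows``, ``set_column``, and ``SubsetWriteError`` are made-up names for illustration, not pandas APIs.

```python
class SubsetWriteError(Exception):
    """Hypothetical stand-in for the proposal's MutableViewError."""


class ToyFrame:
    """Toy frame: a dict of equal-length column lists (illustrative only)."""

    def __init__(self, columns, parent=None):
        self._columns = {name: list(vals) for name, vals in columns.items()}
        self._parent = parent      # set when created by row selection

    def select_rows(self, mask):
        # Boolean-mask selection copies the data but remembers its origin.
        picked = {
            name: [v for v, keep in zip(vals, mask) if keep]
            for name, vals in self._columns.items()
        }
        return ToyFrame(picked, parent=self)

    def column(self, name):
        return list(self._columns[name])

    def set_column(self, name, value):
        if self._parent is not None:
            raise SubsetWriteError("write to a subset of another frame")
        n_rows = len(next(iter(self._columns.values())))
        self._columns[name] = [value] * n_rows


df = ToyFrame({"A": [1, 2], "B": [3, 5]})
mask = [b > 4 for b in df.column("B")]

# Toy analogue of ``df[df['B'] > 4]['B'] = 10``:
tmp = df.select_rows(mask)     # a copy that still references df
try:
    tmp.set_column("B", 10)    # raises instead of updating a temporary
    raised = False
except SubsetWriteError:
    raised = True
```

The point of the sketch is that the write fails loudly even though ``tmp`` is a copy, because the subset still records which frame it was taken from.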
Review comment: It may be worth including what error-on-write would be "roughly equivalent to" (including comments on each line) to make it clear what would happen if you did break things up like you did in lines 137-139.
||
|
||
New methods | ||
----------- | ||
|
||
In addition to the behavior changes to indexing columns, this proposal includes | ||
two new methods for controlling behavior in operations downstream of an indexing | ||
operation. | ||
|
||
.. code-block:: python | ||
|
||
    def as_mutable_view(self):  # name TBD
        """
        Mark a DataFrame as mutable so that setitem operations propagate.

        Any setitem operations on the returned DataFrame will propagate
        to the DataFrame(s) this DataFrame is a view on.

        Examples
        --------
        >>> df1 = pd.DataFrame({"A": [1, 2]})
        >>> df2 = df1[["A"]].as_mutable_view()  # df2 is a view on df1
        >>> df2.iloc[0, 0] = 10
        >>> df1.iloc[0, 0]  # The parent df1 was mutated.
        10
        """
|
||
If we implement Error-On-Write, a ``copy_if_needed`` method is necessary for | ||
libraries and user code to avoid unnecessary defensive copying. | ||
|
||
.. code-block:: python | ||
|
||
    def copy_if_needed(self):  # name TBD
        """
        Copy the data in a Series / DataFrame if it is a view on some other object.

        This will copy the data backing a DataFrame only if it's a view
        on some other dataframe. If it's not a view then no data is
        copied.

        Examples
        --------
        >>> df1 = pd.DataFrame({"A": [1, 2]})
        >>> df2 = df1[["A"]]  # df2 is a view on df1
        >>> df3 = df2.copy_if_needed()  # triggers a copy

        When no copy is necessary (the object is not a view on another dataframe)
        then no copy is performed.

        >>> # No copy, since boolean masking already returned a copy
        >>> df4 = df1[df1["A"] == 1].copy_if_needed()
        """
|
||
|
||
These two methods give users the control to say whether setitem operations on a | ||
dataframe that is a view on another dataframe should mutate the “parent” | ||
dataframe. Users wishing to mutate the parent will make it explicit with | ||
``.as_mutable_view()``. Users wishing to “break the chain” will call | ||
``.copy_if_needed()``. | ||
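
The intended contract of the two methods can be modelled with a small stand-in class. This is a sketch under assumptions: ``ToySeries`` and its flags are hypothetical, and whether ``copy_if_needed`` returns the same object or a new wrapper over the same data when no copy is required is a design detail the proposal leaves open (the toy returns ``self``).

```python
class ToySeries:
    """Toy container tracking whether it is a view (illustrative only)."""

    def __init__(self, values, is_view=False, mutable=False):
        self._values = values
        self._is_view = is_view
        self._mutable = mutable

    def view(self):
        # Share the underlying list with the new object.
        return ToySeries(self._values, is_view=True)

    def as_mutable_view(self):
        # Same buffer; writes are allowed and reach the parent.
        return ToySeries(self._values, is_view=self._is_view, mutable=True)

    def copy_if_needed(self):
        # Copy only when the data is shared with another object.
        if self._is_view:
            return ToySeries(list(self._values))
        return self

    def __getitem__(self, i):
        return self._values[i]

    def __setitem__(self, i, value):
        if self._is_view and not self._mutable:
            raise RuntimeError("write to a view; choose a policy first")
        self._values[i] = value


parent = ToySeries([1, 2])
linked = parent.view().as_mutable_view()
linked[0] = 10                       # propagates to parent

detached = parent.view().copy_if_needed()
detached[1] = 99                     # parent is untouched

owner = ToySeries([7, 8])
same = owner.copy_if_needed()        # not a view, so no copy is made
```

A library function that needs to mutate its input could call ``copy_if_needed`` once at the top and pay for a copy only when the caller passed in a view.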
|
||
Extended proposal | ||
----------------- | ||
|
||
In principle, there’s nothing special about indexing when it comes to defensive | ||
copying. Any method that returns a new ``NDFrame`` without altering existing data | ||
(rename, set_index, possibly assign, dropping columns, etc.) is a candidate for | ||
returning a view. That said, we think it’d be unfortunate if something like the | ||
following was the behavior | ||
|
||
.. code-block:: python

    >>> df2 = df.rename(lambda x: x)  # suppose df2 is a view on df
    >>> df2.iloc[0, 0] = 10
    MutableViewError("This DataFrame is a view on another DataFrame. Set .as_mutable_view() or copy with .copy_if_needed()")
|
||
Now we have to ask: does a reasonable consumer of the pandas API expect ``df2``
to be a view, such that mutating ``df2`` would mutate ``df``? I’d argue no,
people wouldn’t expect that. If that’s the case, then I think requiring people
to include a ``.as_mutable_view()`` or ``.copy_if_needed()`` would be unfortunate | ||
line noise. So in this extended proposal we would probably prefer Copy-on-Write | ||
over Error-on-Write. That said, we don’t wish to discuss the extended proposal | ||
much here. We wish to focus primarily on indexing, and we can make a choice that | ||
is best for indexing. We only mention it here to inform our choice of | ||
Copy-on-Write vs. Error-on-Write. | ||
|
||
Propagating mutation forwards | ||
----------------------------- | ||
|
||
Thus far we’ve considered the (more common) case of taking a subset, mutating | ||
the subset, and how that should affect the parent. What about the other | ||
direction, where the parent is mutated? | ||
|
||
.. code-block:: python | ||
|
||
>>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}) | ||
>>> df2 = df[["A"]] | ||
>>> df.iloc[0, 0] = 10 | ||
>>> df2.iloc[0, 0] # what is this value? | ||
|
||
We might value symmetry with the “backwards” case, which would argue that the | ||
setitem above should raise (under Error on Write) or copy (under Copy on Write). | ||
Users wishing that setitem operations on the parent should propagate to the
child would need to call ``.as_mutable_view()``.
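
One way the symmetric "forwards" check could work is for the parent to keep weak references to its live views and refuse writes while any exist. The sketch below is an assumption about a possible mechanism, not something the proposal specifies; ``ToyParent``, ``ToyView``, and ``ParentWriteError`` are hypothetical names.

```python
import gc
import weakref


class ParentWriteError(Exception):
    """Hypothetical error for writing to a parent that has live views."""


class ToyView:
    """Read-only toy view sharing the parent's list."""

    def __init__(self, values):
        self._values = values

    def __getitem__(self, i):
        return self._values[i]


class ToyParent:
    """Toy parent that tracks its live views (illustrative only)."""

    def __init__(self, values):
        self._values = values
        self._views = weakref.WeakSet()   # does not keep views alive

    def view(self):
        child = ToyView(self._values)
        self._views.add(child)
        return child

    def __setitem__(self, i, value):
        if len(self._views) > 0:
            raise ParentWriteError("live views would observe this write")
        self._values[i] = value


parent = ToyParent([1, 2])
child = parent.view()
try:
    parent[0] = 10
    raised = False
except ParentWriteError:
    raised = True

del child          # drop the only strong reference to the view
gc.collect()       # make collection explicit on non-refcounting interpreters
parent[0] = 10     # no live views remain, so the write succeeds
```

Weak references keep the bookkeeping from extending the lifetime of views, at the cost of making the behavior of a parent write depend on which subsets are still alive.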
|
||
Deprecation or breaking change? | ||
------------------------------- | ||
|
||
Because of the subtleties around views vs. copies and mutation, we propose doing | ||
this as an API breaking change accompanying a major version bump. We think that | ||
simply always returning a view is too large a behavior change (even if the | ||
current semantics aren’t well tested / documented, people have written code | ||
that’s tailored to the current implementation). We also think a deprecation | ||
warning is too noisy. Indexing is too common an operation to include a warning | ||
(even if we limit it to just those operations that previously returned copies). | ||
|
||
Interaction with BlockManager, ArrayManager, and Consolidation | ||
-------------------------------------------------------------- | ||
|
||
This proposal is consistent with either the BlockManager or a proposed | ||
ArrayManager. However, there is a subtle interaction with the BlockManager’s | ||
*inplace* consolidation. Today, some operations (e.g. reductions) perform an
inplace consolidation:
|
||
.. code-block:: python | ||
|
||
>>> df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}) | ||
>>> df2 = df1[["A"]].as_mutable_view() # df2 is a view | ||
>>> df2.mean() # mean consolidates inplace, causing a copy, breaking the view. | ||
>>> df2.iloc[0, 0] = 1 | ||
|
||
It would be unfortunate if the presence or absence of a ``.mean()`` call changed the
behavior of the later setitem. We likely have the tools to detect these cases | ||
and warn or raise if they occur. But this proposal would likely work better with | ||
a modified BlockManager that doesn’t do inplace consolidation. This will cause | ||
apparent regressions in the performance for workloads that do indexing followed | ||
by many operations that benefit from consolidation. We might consider exposing | ||
consolidation in the public API, though the details of that are left for a | ||
separate discussion. | ||
|
||
This proposal is consistent with the proposed ArrayManager. | ||
|
||
Background: Current behaviour of views vs copy | ||
---------------------------------------------- | ||
|
||
To the best of our knowledge, indexing operations currently return views in the | ||
following cases: | ||
|
||
* Selecting a single column (as a Series) out of a DataFrame is always a view
  (``df['a']``).
* Slicing columns from a DataFrame, creating a subset DataFrame
  (``df.loc[:, 'a':'b']``), is a view if the original DataFrame consists of a
  single block (single dtype, consolidated) and if you are slicing (so not a
  list selection). In all other cases, getting a subset is always a copy.
* Slicing rows can return a view, when the row indexer is a slice object.
|
||
Remaining operations (subsetting rows with a list indexer or boolean mask) in | ||
practice return a copy, and we will raise a ``SettingWithCopy`` warning when the | ||
user tries to modify the subset. | ||
|
||
Background: Previous attempts | ||
----------------------------- | ||
|
||
We’ve discussed this general issue before, in
https://github.com/pandas-dev/pandas/issues/10954 and a few pull requests
(https://github.com/pandas-dev/pandas/pull/12036,
https://github.com/pandas-dev/pandas/pull/11207,
https://github.com/pandas-dev/pandas/pull/11500).
I'm wondering how often users want this behavior? Could be a lack of imagination on my part, but I would expect that the intention is almost always to change only the thing itself when modifying an object.
If it is indeed uncommon then perhaps could simplify things by not making it a part of the API? Users who want to change both objects can still do so without this shortcut (assuming copy on write).
I personally agree with you. The issue is the inconsistency (today) between doing `df["A"]` (which is a mutable view - it's a `Series`) and `df[["A"]]`, which today is not. The proposal is to always make `df[["A"]]` a view, and do copy-on-write or error-on-write if someone tries to modify it.