pandas-dev · TomAugspurger · Oct 8, 2020 · Oct 9, 2020 · Oct 9, 2020 · Oct 9, 2020
diff --git a/doc/source/development/roadmap-indexing-views.rst b/doc/source/development/roadmap-indexing-views.rst
@@ -0,0 +1,312 @@
+.. _roadmap.indexing_views:
+
+==================
+Indexing and Views
+==================
+
+*A proposal for consistent, clear copy vs. view semantics in pandas' indexing.*
+
+**Issue**: https://github.com/pandas-dev/pandas/issues/36195
+
+Motivation
+----------
+
+pandas’ current behavior on whether indexing returns a view or copy is
+confusing. Even for experienced users, it’s hard to tell whether a view or copy
+will be returned (see below for a summary). We’d like to provide an API that is
+consistent and sensible about returning views vs. copies.
+
+We also care about performance. Returning views from indexing operations is
+faster and reduces memory usage (at least for that operation; whether it’s
+faster for a full workflow depends on whether downstream operations trigger a
+copy (possibly through block consolidation)).
+
+Finally, there are API / usability issues around views. It can be challenging to
+know the user’s intent in operations that modify a subset of a DataFrame (column
+and/or row selection), like:
+
+.. code-block:: python
+
+   >>> df = pd.DataFrame({"A”": [1, 2], "B": [3, 4]})
+   >>> df2 = df[["A"]]
+   >>> df2.iloc[:, 0] = 10
+
+Did the user intend to modify ``df`` when they modified ``df2`` (setting aside
+issues with the current implementation)? In other words, if we had a perfectly
+consistent world where indexing the columns always returned views or always
+returned a copy, does the code above imply that the user wants to mutate ``df``?
+
+There are two possible behaviours the user might intend:
+
+1. I know my subset might be a view of the original and I want to modify the
+   original as well.
+2. I just want to modify the subset without modifying the original.
+
+Today, pandas’ inconsistency means neither of these workflows is really
+possible. The first is difficult, because indexing operations often (though not
+always) return copies, and even when a view is returned you sometimes get a
+``SettingWithCopyWarning`` when mutating. The second is somewhat possible, but
+requires many defensive copies (to avoid ``SettingWithCopyWarning``, or to
+ensure that you have a copy when a view was returned).
+
+Proposal Summary
+----------------
+
+For these reasons (consistency, performance, code clarity), we propose three
+changes:
+
+1. Indexing always returns a view when possible. This means that indexing
+   columns of a dataframe always returns a view
+   (https://github.com/pandas-dev/pandas/pull/33597), and indexing rows may
+   return a view, depending on the type of the row indexer.
+2. We implement Error-on-Write (explained below)
+3. We provide APIs for explicitly marking a DataFrame as a “mutable view”
+   (mutating the dataframe would mutate its parents) and copying a dataframe
+   only if needed to avoid concerns with mutating other dataframes (i.e. it is
+   not a view on another dataframe).
+
+The intent is to capture the performance benefits of views, while allowing users
+to explicitly choose the behavior they want for inplace operations that might
+mutate other dataframes. This essentially makes returning views an internal
+optimization, without the user needing to know if the specific indexing
+operation would return a view or a copy.
+
+Taking the example from above, if the user wants to make use of the fact that
+``df2`` is a view to modify the original ``df``, they would write:
+
+.. code-block:: python
+
+   # Case 1: user wants mutations of df2 to be reflected in df
+   >>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
+   >>> df2 = df[["A"]].as_mutable_view() # name TBD
+   >>> df2.iloc[:, 0] = 10
+   >>> df.iloc[0, 0] # df was mutated 10
+
+For the user who wishes to not mutate the parent, we require that the user
+explicitly break the reference from ``df2`` to ``df`` by implementing “Error on Write”.
+
+.. code-block:: python
+
+   # Case 2: The user does not want mutating df2 to mutate df, via EoW
+   >>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
+   >>> df2 = df[["A"]]
+   >>> df2.iloc[0, 0] = 10
+   MutableViewError("error on write to subset of other dataframe")
+   >>> df2 = df2.copy_if_needed() # API is TBD. Could be a keyword argument to copy.
+   >>> df2.iloc[:, 0] = 10
+   >>> df.iloc[0, 0] # df was not mutated 1
+
+Copy-on-Write vs. Error-on-Write
+--------------------------------
+
+Consider the following example:
+
+.. code-block:: python
+
+   >>> df2 = df[['A']]
+   >>> df2.iloc[0, 0] = 10 # df2 can be a view of df, what happens by default?
+   >>> df3 = df[df['A'] == 1]
+   >>> df3.iloc[0, 0] = 10 # df3 is already a copy of df, what happens by default?
+
+We have a few options for the default:
+
+1. Well-Defined copy/view rules: ensure we have more consistent rules (e.g.
+   selecting columns is always a view), and then views result in mutating the
+   parent, copies not. This comes down to fixing some bugs and clearly
+   documenting and testing which operations are views, and which are copies.
+2. Copy-on-Write: The setitem would check if it’s a view on another dataframe.
+   If it is, then we would copy our data before mutating.
+3. Error-on-Write: The setitem would check if it’s a subset of another dataframe
+   (both view of copy). Only rather than copying in case of a view we would
+   raise an exception telling the user to either copy the data with
+   ``.copy_if_needed()`` (name TBD) or mark the frame as “a mutable view” with
+   ``.as_mutable_view()`` (name TBD).
+
+We propose "Error on Write" by default. This forces a decision on the user, and
+is the most explicit in terms of code.
+
+Additionally, consider the "classic" case of chained indexing, which was the
+original motivation for the ``SettingWithCopy`` warning
+
+.. code-block:: python
+
+   >>> df[df['B'] > 4]['B'] = 10
+
+That is roughly equivalent to
+
+.. code-block:: python
+
+   >>> df2 = df[df['B'] > 4] # Copy under NumPy’s rules
+   >>> df2['B'] = 10 # Update (the copy) df2, df not changed
+   >>> del df2 # All references to df2 are lost, goes out of scope
+
+And so ``df`` is not modified. If we adopted Copy On Write to completely replace the
+current ``SettingWithCopy`` warning, we would restore the old behavior of silently
+“failing” to update ``df2``. Under Error on Write, we’d track that the ``df2`` created
+by the first getitem references ``df`` and raise an exception when it was being
+mutated.
+
+New methods
+-----------
+
+In addition to the behavior changes to indexing columns, this proposal includes
+two new methods for controlling behavior in operations downstream of an indexing
+operation.
+
+.. code-block:: python
+
+   def as_mutable_view(self):  # name TBD
+       """
+       Mark a DataFrame as mutable so that setitem operations propagate.
+
+       Any setitem operations on the returned DataFrame will propagate
+       to the DataFrame(s) this DataFrame is a view on.
+
+       Examples
+       --------
+       >>> df1 = pd.DataFrame({"A": [1, 2]})
+       >>> df2 = df[["A"]].as_mutable_view()  # df2 is a view on df
+       >>> df2.iloc[0, 0] = 10
+       >>> df1.iloc[0, 0]  # The parent df1 was mutated.
+       10
+       """
+
+If we implement Error-On-Write, a ``copy_if_needed`` method is necessary for
+libraries and user code to avoid unnecessary defensive copying.
+
+.. code-block:: python
+
+   def copy_if_needed(self):  # name TBD
+       """
+       Copy the data in a Series / DataFrame if it is a view on some other.
+
+       This will copy the data backing a DataFrame only if it's a view
+       on other some other dataframe. If it's not a view then no data is
+       copied.
+
+       Examples
+       --------
+       >>> df1 = pd.DataFrame({"A": [1, 2]})
+       >>> df2 = df1[["A"]]  # df2 is a view on df1
+       >>> df3 = df2.copy_if_needed()  # triggers a copy
+
+       When no copy is necessary (the object is not a view on another dataframe)
+       then no copy is performed.
+
+       >>> df4 = df1[df1['a'] == 1].copy_if_needed()  # No copy, since boolean masking already returned a copy
+       """
+
+
+These two methods give users the control to say whether setitem operations on a
+dataframe that is a view on another dataframe should mutate the “parent”
+dataframe. Users wishing to mutate the parent will make it explicit with
+``.as_mutable_view()``. Users wishing to “break the chain” will call
+``.copy_if_needed()``.
+
+Extended proposal
+-----------------
+
+In principle, there’s nothing special about indexing when it comes to defensive
+copying. Any method that returns a new ``NDFrame`` without altering existing data
+(rename, set_index, possibly assign, dropping columns, etc.) is a candidate for
+returning a view. That said, we think it’d be unfortunate if something like the
+following was the behavior
+
+.. code-block:: python
+
+   >>> df2 = df.rename(lambda x: x) # suppose df2 is a view on df
+   >>> df2.iloc[0, 0] = 10
+   MutableViewError("This DataFrame is a view on another DataFrame. Set .as_mutable_view() or copy with ".copy_if_needed()"")
+
+Now we have to ask: does a reasonable consumer of the pandas API expect ``df2``
+to be a view? Such that mutating ``df2`` would mutate ``df``? I’d argue no,
+people wouldn’t expect that. If that’s the case, then I think requiring people
+to include a ``.as_mutable_view()`` or ``.copy_if_needed()`` would be unfortunate
+line noise. So in this extended proposal we would probably prefer Copy-on-Write
+over Error-on-Write. That said, we don’t wish to discuss the extended proposal
+much here. We wish to focus primarily on indexing, and we can make a choice that
+is best for indexing. We only mention it here to inform our choice of
+Copy-on-Write vs. Error-on-Write.
+
+Propagating mutation forwards
+-----------------------------
+
+Thus far we’ve considered the (more common) case of taking a subset, mutating
+the subset, and how that should affect the parent. What about the other
+direction, where the parent is mutated?
+
+.. code-block:: python
+
+   >>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
+   >>> df2 = df[["A"]]
+   >>> df.iloc[0, 0] = 10
+   >>> df2.iloc[0, 0] # what is this value?
+
+We might value symmetry with the “backwards” case, which would argue that the
+setitem above should raise (under Error on Write) or copy (under Copy on Write).
+Users wishing that setitem operations on the parent should propagate to the
+child would need to call .as_mutable_view().
+
+Deprecation or breaking change?
+-------------------------------
+
+Because of the subtleties around views vs. copies and mutation, we propose doing
+this as an API breaking change accompanying a major version bump. We think that
+simply always returning a view is too large a behavior change (even if the
+current semantics aren’t well tested / documented, people have written code
+that’s tailored to the current implementation). We also think a deprecation
+warning is too noisy. Indexing is too common an operation to include a warning
+(even if we limit it to just those operations that previously returned copies).
+
+Interaction with BlockManager, ArrayManager, and Consolidation
+--------------------------------------------------------------
+
+This proposal is consistent with either the BlockManager or a proposed
+ArrayManager. However, there is a subtle interaction with the BlockManager’s
+*inplace* consolidation. Today, some operations (e.g. reductions) perform an
+inplace consolidation
+
+.. code-block:: python
+
+   >>> df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
+   >>> df2 = df1[["A"]].as_mutable_view() # df2 is a view
+   >>> df2.mean() # mean consolidates inplace, causing a copy, breaking the view.
+   >>> df2.iloc[0, 0] = 1
+
+It would be unfortunate if the presence or absence of a .mean() call changed the
+behavior of the later setitem. We likely have the tools to detect these cases
+and warn or raise if they occur. But this proposal would likely work better with
+a modified BlockManager that doesn’t do inplace consolidation. This will cause
+apparent regressions in the performance for workloads that do indexing followed
+by many operations that benefit from consolidation. We might consider exposing
+consolidation in the public API, though the details of that are left for a
+separate discussion.
+
+This proposal is consistent with the proposed ArrayManager.
+
+Background: Current behaviour of views vs copy
+----------------------------------------------
+
+To the best of our knowledge, indexing operations currently return views in the
+following cases:
+
+Selecting a single column (as a Series) out of a DataFrame is always a view
+(``df['a']``) Slicing columns from a DataFrame creating a subset DataFrame
+(``df[['a':'b']]`` or ``df.loc[:, 'a': 'b']``) is a view if the the original
+DataFrame consists of a single block (single dtype, consolidated) and if you are
+slicing (so not a list selection). In all other cases, getting a subset is
+always a copy. Slicing rows can return a view, when the row indexer is a slice
+object.
+
+Remaining operations (subsetting rows with a list indexer or boolean mask) in
+practice return a copy, and we will raise a ``SettingWithCopy`` warning when the
+user tries to modify the subset.
+
+Background: Previous attempts
+-----------------------------
+
+We’ve discussed this general issue before.
+https://github.com/pandas-dev/pandas/issues/10954 and a few pull requests
+(https://github.com/pandas-dev/pandas/pull/12036,
+https://github.com/pandas-dev/pandas/pull/11207,
+https://github.com/pandas-dev/pandas/pull/11500).
diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
@@ -162,6 +162,39 @@ We'd like to fund improvements and maintenance of these tools to
 * Build a GitHub bot to request ASV runs *before* a PR is merged. Currently, the
   benchmarks are only run nightly.
 
+Indexing and Views
+------------------
+
+pandas’ current behavior on whether indexing returns a view or copy is confusing
+and slow. Consider the following example:
+
+.. code-block:: python
+
+   >>> df = pd.DataFrame({"A”": [1, 2], "B": [3, 4]})
+   >>> df2 = df[["A"]]
+   >>> df2.iloc[:, 0] = 10
+
+What should happen to ``df``? Does the user's code intend to modify just ``df2``,
+or should ``df`` be modified as well?
+
+This item proposes to standardize copy vs. view behavior in pandas' indexing,
+and to provide users the APIs necessary to safely and efficiently express their
+desired operation.
+
+In particular, we propose three changes:
+
+1. Indexing always returns a view when possible. This means that indexing
+   columns of a dataframe always returns a view
+   (https://github.com/pandas-dev/pandas/pull/33597), and indexing rows may
+   return a view, depending on the type of the row indexer.
+2. We implement Error-on-Write (explained below)
+3. We provide APIs for explicitly marking a DataFrame as a “mutable view”
+   (mutating the dataframe would mutate its parents) and copying a dataframe
+   only if needed to avoid concerns with mutating other dataframes (i.e. it is
+   not a view on another dataframe).
+
+See :ref:`roadmap.indexing_views` for a detailed proposal.
+
 .. _roadmap.evolution:
 
 Roadmap evolution
@@ -206,3 +239,8 @@ We improved the pandas documentation
   pandas users coming from a variety of backgrounds (:issue:`26831`).
 
 .. _pydata-sphinx-theme: https://github.com/pandas-dev/pydata-sphinx-theme
+
+.. toctree::
+   :maxdepth: 1
+
+   roadmap-indexing-views.rst