Skip to content

DOC: Added indexing views to roadmap #36988

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
312 changes: 312 additions & 0 deletions doc/source/development/roadmap-indexing-views.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,312 @@
.. _roadmap.indexing_views:

==================
Indexing and Views
==================

*A proposal for consistent, clear copy vs. view semantics in pandas' indexing.*

**Issue**: https://github.com/pandas-dev/pandas/issues/36195

Motivation
----------

pandas’ current behavior on whether indexing returns a view or copy is
confusing. Even for experienced users, it’s hard to tell whether a view or copy
will be returned (see below for a summary). We’d like to provide an API that is
consistent and sensible about returning views vs. copies.

We also care about performance. Returning views from indexing operations is
faster and reduces memory usage (at least for that operation; whether it’s
faster for a full workflow depends on whether downstream operations trigger a
copy (possibly through block consolidation)).

Finally, there are API / usability issues around views. It can be challenging to
know the user’s intent in operations that modify a subset of a DataFrame (column
and/or row selection), like:

.. code-block:: python

>>> df = pd.DataFrame({"A”": [1, 2], "B": [3, 4]})
>>> df2 = df[["A"]]
>>> df2.iloc[:, 0] = 10

Did the user intend to modify ``df`` when they modified ``df2`` (setting aside
issues with the current implementation)? In other words, if we had a perfectly
consistent world where indexing the columns always returned views or always
returned a copy, does the code above imply that the user wants to mutate ``df``?

There are two possible behaviours the user might intend:

1. I know my subset might be a view of the original and I want to modify the
original as well.
2. I just want to modify the subset without modifying the original.

Today, pandas’ inconsistency means neither of these workflows is really
possible. The first is difficult, because indexing operations often (though not
always) return copies, and even when a view is returned you sometimes get a
``SettingWithCopyWarning`` when mutating. The second is somewhat possible, but
requires many defensive copies (to avoid ``SettingWithCopyWarning``, or to
ensure that you have a copy when a view was returned).

Proposal Summary
----------------

For these reasons (consistency, performance, code clarity), we propose three
changes:

1. Indexing always returns a view when possible. This means that indexing
columns of a dataframe always returns a view
(https://github.com/pandas-dev/pandas/pull/33597), and indexing rows may
return a view, depending on the type of the row indexer.
2. We implement Error-on-Write (explained below)
3. We provide APIs for explicitly marking a DataFrame as a “mutable view”
(mutating the dataframe would mutate its parents) and copying a dataframe
only if needed to avoid concerns with mutating other dataframes (i.e. it is
not a view on another dataframe).

The intent is to capture the performance benefits of views, while allowing users
to explicitly choose the behavior they want for inplace operations that might
mutate other dataframes. This essentially makes returning views an internal
optimization, without the user needing to know if the specific indexing
operation would return a view or a copy.

Taking the example from above, if the user wants to make use of the fact that
``df2`` is a view to modify the original ``df``, they would write:

.. code-block:: python

# Case 1: user wants mutations of df2 to be reflected in df
>>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df2 = df[["A"]].as_mutable_view() # name TBD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering how often users want this behavior? Could be a lack of imagination on my part, but I would expect that the intention is almost always to change only the thing itself when modifying an object.

If it is indeed uncommon then perhaps could simplify things by not making it a part of the API? Users who want to change both objects can still do so without this shortcut (assuming copy on write).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally agree with you. The issue is the inconsistency (today) between doing df["A"] (which is a mutable view - it's a Series) and df[["A"]], which today is not. The proposal is to always make df[["A"]] a view, and do copy-on-write or error-on-write if someone tries to modify it.

>>> df2.iloc[:, 0] = 10
>>> df.iloc[0, 0] # df was mutated 10

For the user who wishes to not mutate the parent, we require that the user
explicitly break the reference from ``df2`` to ``df`` by implementing “Error on Write”.

.. code-block:: python

# Case 2: The user does not want mutating df2 to mutate df, via EoW
>>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df2 = df[["A"]]
>>> df2.iloc[0, 0] = 10
MutableViewError("error on write to subset of other dataframe")
>>> df2 = df2.copy_if_needed() # API is TBD. Could be a keyword argument to copy.
>>> df2.iloc[:, 0] = 10
>>> df.iloc[0, 0] # df was not mutated 1

Copy-on-Write vs. Error-on-Write
--------------------------------

Consider the following example:

.. code-block:: python

>>> df2 = df[['A']]
>>> df2.iloc[0, 0] = 10 # df2 can be a view of df, what happens by default?
>>> df3 = df[df['A'] == 1]
>>> df3.iloc[0, 0] = 10 # df3 is already a copy of df, what happens by default?

We have a few options for the default:

1. Well-Defined copy/view rules: ensure we have more consistent rules (e.g.
selecting columns is always a view), and then views result in mutating the
parent, copies not. This comes down to fixing some bugs and clearly
documenting and testing which operations are views, and which are copies.
2. Copy-on-Write: The setitem would check if it’s a view on another dataframe.
If it is, then we would copy our data before mutating.
3. Error-on-Write: The setitem would check if it’s a subset of another dataframe
(both view of copy). Only rather than copying in case of a view we would
raise an exception telling the user to either copy the data with
``.copy_if_needed()`` (name TBD) or mark the frame as “a mutable view” with
``.as_mutable_view()`` (name TBD).

We propose "Error on Write" by default. This forces a decision on the user, and
is the most explicit in terms of code.

Additionally, consider the "classic" case of chained indexing, which was the
original motivation for the ``SettingWithCopy`` warning

.. code-block:: python

>>> df[df['B'] > 4]['B'] = 10

That is roughly equivalent to

.. code-block:: python

>>> df2 = df[df['B'] > 4] # Copy under NumPy’s rules
>>> df2['B'] = 10 # Update (the copy) df2, df not changed
>>> del df2 # All references to df2 are lost, goes out of scope

And so ``df`` is not modified. If we adopted Copy On Write to completely replace the
current ``SettingWithCopy`` warning, we would restore the old behavior of silently
“failing” to update ``df2``. Under Error on Write, we’d track that the ``df2`` created
by the first getitem references ``df`` and raise an exception when it was being
mutated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth including what error-on-write would be "roughly equivalent to" (including comments on each line) to make it clear what would happen if you did break things up like you did in lines 137-139


New methods
-----------

In addition to the behavior changes to indexing columns, this proposal includes
two new methods for controlling behavior in operations downstream of an indexing
operation.

.. code-block:: python

def as_mutable_view(self): # name TBD
"""
Mark a DataFrame as mutable so that setitem operations propagate.

Any setitem operations on the returned DataFrame will propagate
to the DataFrame(s) this DataFrame is a view on.

Examples
--------
>>> df1 = pd.DataFrame({"A": [1, 2]})
>>> df2 = df[["A"]].as_mutable_view() # df2 is a view on df
>>> df2.iloc[0, 0] = 10
>>> df1.iloc[0, 0] # The parent df1 was mutated.
10
"""

If we implement Error-On-Write, a ``copy_if_needed`` method is necessary for
libraries and user code to avoid unnecessary defensive copying.

.. code-block:: python

def copy_if_needed(self): # name TBD
"""
Copy the data in a Series / DataFrame if it is a view on some other.

This will copy the data backing a DataFrame only if it's a view
on other some other dataframe. If it's not a view then no data is
copied.

Examples
--------
>>> df1 = pd.DataFrame({"A": [1, 2]})
>>> df2 = df1[["A"]] # df2 is a view on df1
>>> df3 = df2.copy_if_needed() # triggers a copy

When no copy is necessary (the object is not a view on another dataframe)
then no copy is performed.

>>> df4 = df1[df1['a'] == 1].copy_if_needed() # No copy, since boolean masking already returned a copy
"""


These two methods give users the control to say whether setitem operations on a
dataframe that is a view on another dataframe should mutate the “parent”
dataframe. Users wishing to mutate the parent will make it explicit with
``.as_mutable_view()``. Users wishing to “break the chain” will call
``.copy_if_needed()``.

Extended proposal
-----------------

In principle, there’s nothing special about indexing when it comes to defensive
copying. Any method that returns a new ``NDFrame`` without altering existing data
(rename, set_index, possibly assign, dropping columns, etc.) is a candidate for
returning a view. That said, we think it’d be unfortunate if something like the
following was the behavior

.. code-block:: python

>>> df2 = df.rename(lambda x: x) # suppose df2 is a view on df
>>> df2.iloc[0, 0] = 10
MutableViewError("This DataFrame is a view on another DataFrame. Set .as_mutable_view() or copy with ".copy_if_needed()"")

Now we have to ask: does a reasonable consumer of the pandas API expect ``df2``
to be a view? Such that mutating ``df2`` would mutate ``df``? I’d argue no,
people wouldn’t expect that. If that’s the case, then I think requiring people
to include a ``.as_mutable_view()`` or ``.copy_if_needed()`` would be unfortunate
line noise. So in this extended proposal we would probably prefer Copy-on-Write
over Error-on-Write. That said, we don’t wish to discuss the extended proposal
much here. We wish to focus primarily on indexing, and we can make a choice that
is best for indexing. We only mention it here to inform our choice of
Copy-on-Write vs. Error-on-Write.

Propagating mutation forwards
-----------------------------

Thus far we’ve considered the (more common) case of taking a subset, mutating
the subset, and how that should affect the parent. What about the other
direction, where the parent is mutated?

.. code-block:: python

>>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df2 = df[["A"]]
>>> df.iloc[0, 0] = 10
>>> df2.iloc[0, 0] # what is this value?

We might value symmetry with the “backwards” case, which would argue that the
setitem above should raise (under Error on Write) or copy (under Copy on Write).
Users wishing that setitem operations on the parent should propagate to the
child would need to call .as_mutable_view().

Deprecation or breaking change?
-------------------------------

Because of the subtleties around views vs. copies and mutation, we propose doing
this as an API breaking change accompanying a major version bump. We think that
simply always returning a view is too large a behavior change (even if the
current semantics aren’t well tested / documented, people have written code
that’s tailored to the current implementation). We also think a deprecation
warning is too noisy. Indexing is too common an operation to include a warning
(even if we limit it to just those operations that previously returned copies).

Interaction with BlockManager, ArrayManager, and Consolidation
--------------------------------------------------------------

This proposal is consistent with either the BlockManager or a proposed
ArrayManager. However, there is a subtle interaction with the BlockManager’s
*inplace* consolidation. Today, some operations (e.g. reductions) perform an
inplace consolidation

.. code-block:: python

>>> df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df2 = df1[["A"]].as_mutable_view() # df2 is a view
>>> df2.mean() # mean consolidates inplace, causing a copy, breaking the view.
>>> df2.iloc[0, 0] = 1

It would be unfortunate if the presence or absence of a .mean() call changed the
behavior of the later setitem. We likely have the tools to detect these cases
and warn or raise if they occur. But this proposal would likely work better with
a modified BlockManager that doesn’t do inplace consolidation. This will cause
apparent regressions in the performance for workloads that do indexing followed
by many operations that benefit from consolidation. We might consider exposing
consolidation in the public API, though the details of that are left for a
separate discussion.

This proposal is consistent with the proposed ArrayManager.

Background: Current behaviour of views vs copy
----------------------------------------------

To the best of our knowledge, indexing operations currently return views in the
following cases:

Selecting a single column (as a Series) out of a DataFrame is always a view
(``df['a']``) Slicing columns from a DataFrame creating a subset DataFrame
(``df[['a':'b']]`` or ``df.loc[:, 'a': 'b']``) is a view if the the original
DataFrame consists of a single block (single dtype, consolidated) and if you are
slicing (so not a list selection). In all other cases, getting a subset is
always a copy. Slicing rows can return a view, when the row indexer is a slice
object.

Remaining operations (subsetting rows with a list indexer or boolean mask) in
practice return a copy, and we will raise a ``SettingWithCopy`` warning when the
user tries to modify the subset.

Background: Previous attempts
-----------------------------

We’ve discussed this general issue before.
https://github.com/pandas-dev/pandas/issues/10954 and a few pull requests
(https://github.com/pandas-dev/pandas/pull/12036,
https://github.com/pandas-dev/pandas/pull/11207,
https://github.com/pandas-dev/pandas/pull/11500).
38 changes: 38 additions & 0 deletions doc/source/development/roadmap.rst
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,39 @@ We'd like to fund improvements and maintenance of these tools to
* Build a GitHub bot to request ASV runs *before* a PR is merged. Currently, the
benchmarks are only run nightly.

Indexing and Views
------------------

pandas’ current behavior on whether indexing returns a view or copy is confusing
and slow. Consider the following example:

.. code-block:: python

>>> df = pd.DataFrame({"A”": [1, 2], "B": [3, 4]})
>>> df2 = df[["A"]]
>>> df2.iloc[:, 0] = 10

What should happen to ``df``? Does the user's code intend to modify just ``df2``,
or should ``df`` be modified as well?

This item proposes to standardize copy vs. view behavior in pandas' indexing,
and to provide users the APIs necessary to safely and efficiently express their
desired operation.

In particular, we propose three changes:

1. Indexing always returns a view when possible. This means that indexing
columns of a dataframe always returns a view
(https://github.com/pandas-dev/pandas/pull/33597), and indexing rows may
return a view, depending on the type of the row indexer.
2. We implement Error-on-Write (explained below)
3. We provide APIs for explicitly marking a DataFrame as a “mutable view”
(mutating the dataframe would mutate its parents) and copying a dataframe
only if needed to avoid concerns with mutating other dataframes (i.e. it is
not a view on another dataframe).

See :ref:`roadmap.indexing_views` for a detailed proposal.

.. _roadmap.evolution:

Roadmap evolution
Expand Down Expand Up @@ -206,3 +239,8 @@ We improved the pandas documentation
pandas users coming from a variety of backgrounds (:issue:`26831`).

.. _pydata-sphinx-theme: https://github.com/pandas-dev/pydata-sphinx-theme

.. toctree::
:maxdepth: 1

roadmap-indexing-views.rst