PDEP-5: NoRowIndex #49694

Merged
merged 16 commits into from
Mar 2, 2023
380 changes: 380 additions & 0 deletions web/pandas/pdeps/0005-no-default-index-mode.md
@@ -0,0 +1,380 @@
# PDEP-5: NoRowIndex

- Created: 14 November 2022
- Status: Withdrawn
- Discussion: [#49693](https://github.com/pandas-dev/pandas/pull/49693)
- Author: [Marco Gorelli](https://github.com/MarcoGorelli)
- Revision: 2

## Abstract

The suggestion is to add a ``NoRowIndex`` class. Internally, it would act a bit like
a ``RangeIndex``, but some methods would be stricter. This would be one
step towards enabling users who do not want to think about indices to not need to.

## Motivation

The Index can be a source of confusion and frustration for pandas users. For example, let's consider the inputs

```python
In [37]: ser1 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 5])

In [38]: ser2 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 4])
```

Then:

- it can be unexpected that adding `Series` with the same length (but different indices) produces `NaN`s in the result (https://stackoverflow.com/q/66094702/4451315):

```python
In [41]: ser1 + ser2
Out[41]:
1    20.0
2    30.0
3    40.0
4     NaN
5     NaN
dtype: float64
```

- concatenation, even with `ignore_index=True`, still aligns on the index (https://github.com/pandas-dev/pandas/issues/25349):

```python
In [42]: pd.concat([ser1, ser2], axis=1, ignore_index=True)
Out[42]:
      0     1
1  10.0  10.0
2  15.0  15.0
3  20.0  20.0
5  25.0   NaN
4   NaN  25.0
```

- it can be frustrating to have to repeatedly call `.reset_index()` (https://twitter.com/chowthedog/status/1559946277315641345):

```python
In [3]: ser1.reset_index(drop=True) + ser2.reset_index(drop=True)
Out[3]:
0    20
1    30
2    40
3    50
dtype: int64
```

If a user did not want to think about row labels (which they may have ended up with after slicing / concatenating operations),
then ``NoRowIndex`` would enable the above to work in a more intuitive
manner (details and examples to follow below).
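The behaviour described above can be reproduced with current pandas. The snippet below is a small self-contained check that reuses ``ser1`` and ``ser2`` from the examples. It only relies on existing pandas API and shows the ``.reset_index(drop=True)`` workaround that ``NoRowIndex`` was meant to make unnecessary.

```python
import pandas as pd

ser1 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 5])
ser2 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 4])

# Addition aligns on labels: 4 and 5 each appear in only one operand,
# so those rows become NaN and the result is upcast to float.
aligned = ser1 + ser2

# Dropping the labels first gives a purely positional sum.
positional = ser1.reset_index(drop=True) + ser2.reset_index(drop=True)

print(aligned)
print(positional)
```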

## Scope

This proposal deals exclusively with the ``NoRowIndex`` class. To allow users to fully "opt-out" of having to think
about row labels, the following could also be useful:
- a ``pd.set_option('mode.no_row_index', True)`` mode which would default to creating new ``DataFrame``s and
``Series`` with ``NoRowIndex`` instead of ``RangeIndex``;
- giving ``as_index`` options to methods which currently create an index
(e.g. ``value_counts``, ``.sum()``, ``.pivot_table``) to just insert a new column instead of creating an
``Index``.

However, neither of the above will be discussed here.

## Detailed Description

The core pandas code would change as little as possible. The additional complexity should be handled
within the ``NoRowIndex`` object. It would act just like ``RangeIndex``, but would be a bit stricter
in some cases:
- `name` could only be `None`;
- `start` could only be `0`, `step` `1`;
- when appending one ``NoRowIndex`` to another ``NoRowIndex``, the result would still be ``NoRowIndex``.
Appending a ``NoRowIndex`` to any other index (or vice-versa) would raise;
- the ``NoRowIndex`` class would be preserved under slicing;
- a ``NoRowIndex`` could only be aligned with another ``Index`` if it's also ``NoRowIndex`` and if it's of the same length;
- ``DataFrame`` columns cannot be `NoRowIndex` (so ``transpose`` would need some adjustments when called on a ``NoRowIndex`` ``DataFrame``);
- `insert` and `delete` should raise. As a consequence, if ``df`` is a ``DataFrame`` with a
``NoRowIndex``, then `df.drop` with `axis=0` would always raise;
- arithmetic operations (e.g. `NoRowIndex(3) + 2`) would always raise;
- when printing a ``DataFrame``/``Series`` with a ``NoRowIndex``, then the row labels would not be printed;
- a ``MultiIndex`` could not be created with a ``NoRowIndex`` as one of its levels.

Let's go into more detail for some of these. In the examples that follow, the ``NoRowIndex`` will be passed explicitly,
but this is not how users would be expected to use it (see "Usage and Impact" section for details).

### NoRowIndex.append

If one has two ``DataFrame``s with ``NoRowIndex``, then one would expect that concatenating them would
result in a ``DataFrame`` which still has ``NoRowIndex``. To do this, the following rule could be introduced:

> If appending a ``NoRowIndex`` of length ``y`` to a ``NoRowIndex`` of length ``x``, the result will be a
``NoRowIndex`` of length ``x + y``.

Example:

```python
In [6]: df1 = pd.DataFrame({'a': [1, 2], 'b': [4, 5]}, index=NoRowIndex(2))

In [7]: df2 = pd.DataFrame({'a': [4], 'b': [0]}, index=NoRowIndex(1))

In [8]: df1
Out[8]:
a b
1 4
2 5

In [9]: df2
Out[9]:
a b
4 0

In [10]: pd.concat([df1, df2])
Out[10]:
a b
1 4
2 5
4 0

In [11]: pd.concat([df1, df2]).index
Out[11]: NoRowIndex(len=3)
```

Appending anything other than another ``NoRowIndex`` would raise.
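As a rough illustration of the append rule, here is a minimal sketch in plain Python. ``NoRowIndexSketch`` is a hypothetical stand-in that only stores a length; it is not the class from the draft pull request, and the error message is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoRowIndexSketch:
    """Hypothetical stand-in for the proposed NoRowIndex; it only stores a length."""

    length: int

    def append(self, other):
        # Appending anything other than another label-free index raises.
        if not isinstance(other, NoRowIndexSketch):
            raise TypeError("Can only append a NoRowIndex to another NoRowIndex")
        # Lengths add up; there are no labels to carry over or renumber.
        return NoRowIndexSketch(self.length + other.length)


combined = NoRowIndexSketch(2).append(NoRowIndexSketch(1))
print(combined)  # NoRowIndexSketch(length=3)
```

Because the object carries no labels, nothing can be duplicated or left non-monotonic after concatenation, which is what makes the rule simple to state.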

### Slicing a ``NoRowIndex``

If one has a ``DataFrame`` with ``NoRowIndex``, then one would expect that a slice of it would still have
a ``NoRowIndex``. This could be accomplished with:

> If a slice of length ``x`` is taken from a ``NoRowIndex`` of length ``y``, then one gets a
``NoRowIndex`` of length ``x``. Label-based slicing would not be allowed.

Example:

```python
In [12]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))

In [13]: df.loc[df['a']>1, 'b']
# Review comment (Contributor): for readability, it would be useful to have spaces around the > operator here and a few lines below
Out[13]:
5
6
Name: b, dtype: int64

In [14]: df.loc[df['a']>1, 'b'].index
Out[14]: NoRowIndex(len=2)
```

Slicing by label, however, would be disallowed:
```python
In [15]: df.loc[0, 'b']
---------------------------------------------------------------------------
IndexError: Cannot use label-based indexing on NoRowIndex!
```

Note too that:
- other uses of ``.loc``, such as boolean masks, would still be allowed (see F.A.Q);
- ``.iloc`` and ``.iat`` would keep working as before;
- ``.at`` would raise.
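The ``.loc`` rules can be sketched as a small dispatch on the row key. ``select_rows`` is a hypothetical helper written for illustration only: it models a full slice and a boolean mask as allowed, and treats everything else as a label lookup.

```python
def select_rows(length, key):
    """Length of the NoRowIndex left after selecting rows with ``key`` (hypothetical helper)."""
    # A full slice, as in df.loc[:, col], keeps every row.
    if isinstance(key, slice) and key == slice(None):
        return length
    # A boolean mask of matching length keeps the rows marked True.
    if (
        isinstance(key, list)
        and len(key) == length
        and all(isinstance(flag, bool) for flag in key)
    ):
        return sum(key)
    # Anything else would be a label lookup, which has nothing to match against.
    raise IndexError("Cannot use label-based indexing on NoRowIndex!")


print(select_rows(3, [False, True, True]))  # 2
print(select_rows(3, slice(None)))  # 3
```

The mask check comes before the fallback on purpose: in Python ``bool`` is a subclass of ``int``, so a check that looks for integer labels first would misclassify masks.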

### Aligning ``NoRowIndex``s

To minimise surprises, the rule would be:

> A ``NoRowIndex`` can only be aligned with another ``NoRowIndex`` of the same length.
> Attempting to align it with anything else would raise.

Example:
```python
In [1]: ser1 = pd.Series([1, 2, 3], index=NoRowIndex(3))

In [2]: ser2 = pd.Series([4, 5, 6], index=NoRowIndex(3))

In [3]: ser1 + ser2 # works!
Out[3]:
5
7
9
dtype: int64

In [4]: ser1 + ser2.iloc[1:] # errors!
---------------------------------------------------------------------------
TypeError: Cannot join NoRowIndex of different lengths
```
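In other words, with no labels to match on, alignment reduces to a length check followed by a positional operation. A minimal sketch using plain lists (``add_without_labels`` is a hypothetical name, not part of pandas):

```python
def add_without_labels(left, right):
    """Positional addition that refuses operands of different lengths (hypothetical)."""
    if len(left) != len(right):
        # Mirrors the error shown in the example above.
        raise TypeError("Cannot join NoRowIndex of different lengths")
    return [a + b for a, b in zip(left, right)]


print(add_without_labels([1, 2, 3], [4, 5, 6]))  # [5, 7, 9]
```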

### Columns cannot be NoRowIndex

This proposal deals exclusively with allowing users to not need to think about
row labels. There's no suggestion to remove the column labels.

In particular, calling ``transpose`` on a ``NoRowIndex`` ``DataFrame``
would error. The error would come with a helpful error message, informing
users that they should first set an index. E.g.:
```python
In [4]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))

In [5]: df.transpose()
---------------------------------------------------------------------------
ValueError: Columns cannot be NoRowIndex.
If you got here via `transpose` or an `axis=1` operation, then you should first set an index, e.g.: `df.pipe(lambda _df: _df.set_axis(pd.RangeIndex(len(_df))))`
```

### DataFrameFormatter and SeriesFormatter changes

When printing an object with a ``NoRowIndex``, the row labels would not be shown:

```python
In [15]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))

In [16]: df
Out[16]:
a b
1 4
2 5
3 6
```

Of the above changes, this may be the only one that would need implementing within
``DataFrameFormatter`` / ``SeriesFormatter``, as opposed to within ``NoRowIndex``.

## Usage and Impact

Users would not be expected to work with the ``NoRowIndex`` class itself directly.
Usage would probably involve a mode which would change the ``default_index``
function to return a ``NoRowIndex`` rather than a ``RangeIndex``.
Then, if a ``mode.no_row_index`` option was introduced and a user opted in to it with

```python
pd.set_option("mode.no_row_index", True)
```

then the following would all create a ``DataFrame`` with a ``NoRowIndex`` (as they
all call ``default_index``):

- ``df.reset_index()``;
- ``pd.concat([df1, df2], ignore_index=True)``
- ``df1.merge(df2, on=col)``;
- ``df = pd.DataFrame({'col_1': [1, 2, 3]})``

Further discussion of such a mode is out-of-scope for this proposal. A ``NoRowIndex`` would
just be a first step towards getting there.
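For reference, the operations listed above can be checked against current pandas, where no such mode exists. The sketch below only verifies that each result carries fresh ``0..n-1`` row labels, which is the spot a ``NoRowIndex`` would take over under the hypothetical mode. The frames and column names are made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame({"key": [1, 2, 3], "a": [4, 5, 6]})
df2 = pd.DataFrame({"key": [1, 2, 3], "b": [7, 8, 9]})

results = {
    "constructor": df1,
    "reset_index": df1[df1["a"] > 4].reset_index(drop=True),
    "concat": pd.concat([df1, df1], ignore_index=True),
    "merge": df1.merge(df2, on="key"),
}

for name, frame in results.items():
    # Each result carries fresh 0..n-1 row labels rather than inherited ones.
    print(name, list(frame.index))
```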

## Implementation

Draft pull request showing proof of concept: https://github.com/pandas-dev/pandas/pull/49693.

Note that implementation details could well change even if this PDEP were
accepted. For example, ``NoRowIndex`` would not necessarily need to subclass
``RangeIndex``, and it would not necessarily need to be accessible to the user
(``df.index`` could well return ``None``).

## Likely FAQ

**Q: Could users not just use ``RangeIndex``? Why do we need a new class?**

**A**: ``RangeIndex`` is not preserved under slicing and appending, e.g.:
```python
In [1]: ser = pd.Series([1, 2, 3])

In [2]: ser[ser != 2].index
Out[2]: Int64Index([0, 2], dtype='int64')
```
If someone does not want to think about row labels and starts off
with a ``RangeIndex``, they'll very quickly lose it.
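A quick way to see this with current pandas is shown below. The mask is chosen so that the surviving labels (``0, 1, 3``) cannot be written as a range, which keeps the check independent of how a particular pandas version names or optimises the resulting index class.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4])
print(type(ser.index).__name__)  # RangeIndex

# Filtering keeps the original labels of the surviving rows.
filtered = ser[ser != 3]
print(list(filtered.index))  # [0, 1, 3]
print(isinstance(filtered.index, pd.RangeIndex))  # False

# Appending keeps both sets of labels, so they repeat.
stacked = pd.concat([ser, ser])
print(list(stacked.index))  # [0, 1, 2, 3, 0, 1, 2, 3]
```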

**Q: Are indices not really powerful?**

**A:** Yes! And they're also confusing to many users, even experienced developers.
Often users are using ``.reset_index`` to avoid issues with indices and alignment.
Such users would benefit from being able to not think about indices
and alignment. Indices would be here to stay, and ``NoRowIndex`` would not be the
default.

**Q: How could one switch a ``NoRowIndex`` ``DataFrame`` back to one with an index?**

**A:** The simplest way would probably be:
```python
df.set_axis(pd.RangeIndex(len(df)))
```
There's probably no need to introduce a new method for this.

Conversely, to get rid of the index, if the ``mode.no_row_index`` option were introduced,
one could simply do ``df.reset_index(drop=True)``.

**Q: How would ``tz_localize`` and other methods which operate on the index work on a ``NoRowIndex`` ``DataFrame``?**

**A:** Same way they work on other ``NumericIndex``s, which would typically be to raise:

```python
In [2]: ser.tz_localize('UTC')
---------------------------------------------------------------------------
TypeError: index is not a valid DatetimeIndex or PeriodIndex
```

**Q: Why not let transpose switch ``NoRowIndex`` to ``RangeIndex`` under the hood before swapping index and columns?**

**A:** This is the kind of magic that can lead to surprising behaviour that's
difficult to debug. For example, ``df.transpose().transpose()`` would not
round-trip. It's easy enough to set an index, after all, so it is better to "force" users
to be intentional about what they want and end up with fewer surprises later
on.

**Q: What would df.sum(), and other methods which introduce an index, return?**

**A:** Such methods would still set an index and would work the same way they
do now. There may be some way to change that (e.g. introducing ``as_index``
arguments and introducing a mode to set its default) but that's out of scope
for this particular PDEP.

**Q: How would a user opt-in to a ``NoRowIndex`` DataFrame?**

**A:** This PDEP would only allow it via the constructor, passing
``index=NoRowIndex(len(df))``. A mode could be introduced to toggle
making that the default, but would be out-of-scope for the current PDEP.

Review comment (Contributor):

This is inconsistent with what you wrote above about having the ``mode.no_row_index`` option.

Reply (Member, Author):

How so? Even when I mention it above, I make it clear that this PDEP only deals with the ``NoRowIndex`` class, and that this mode would be out-of-scope for this PDEP.

Reply (Contributor):

There is text in the PDEP that refers to the mode that makes it seem like it is part of the proposal.

For example, lines 297-298 read:

Conversely, to get rid of the index, then (so long as one has enabled the ``mode.no_row_index`` option)
  one could simply do ``df.reset_index(drop=True)``.

When I read that, it seems like the mode is part of the proposal. So I think you may need to qualify all references to mode.no_row_index to describe what would happen if the mode.no_row_index option is not available.

Or make a decision to include mode.no_row_index as part of the proposal. (Which I think makes it easier to understand, IMHO)


**Q: Would ``.loc`` stop working?**

**A:** No. It would only raise if used for label-based selection. Other uses
of ``.loc``, such as ``df.loc[:, col_1]`` or ``df.loc[boolean_mask, col_1]``, would
continue working.

**Q: What's unintuitive about ``Series`` aligning indices when summing?**

**A:** Not sure, but I once asked a group of experienced developers what the
output of
```python
ser1 = pd.Series([1, 1, 1], index=[1, 2, 3])
ser2 = pd.Series([1, 1, 1], index=[3, 4, 5])
print(ser1 + ser2)
```
would be, and _nobody_ got it right.
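For readers who want to check their own guess, the snippet can be run as is with current pandas. Only label ``3`` is present in both operands, so it is the only row with a non-missing value; the result is indexed by the union of the two sets of labels.

```python
import pandas as pd

ser1 = pd.Series([1, 1, 1], index=[1, 2, 3])
ser2 = pd.Series([1, 1, 1], index=[3, 4, 5])

# The result has one row per label in the union of both indices;
# rows whose label is missing from either operand are NaN.
result = ser1 + ser2
print(result)
```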

## Reasons for withdrawal

After some discussions, it has become clear that there is not enough support for the proposal in its current state.
In short, it would add too much complexity to justify the potential benefits: it would unacceptably increase
the maintenance burden and the testing requirements, while the benefits would be minimal.

Concretely:
- maintenance burden: it would not be possible to handle all the complexity within the ``NoRowIndex`` class itself, so some
extra logic would need to go into the pandas core codebase, which is already very complex and hard to maintain;
- the testing burden would be too high. Properly testing this would mean almost doubling the size of the test suite.
Coverage for options already is not great: for example [this issue](https://github.com/pandas-dev/pandas/issues/49732)
was caused by a PR which passed CI, but CI did not (and still does not) cover that option (plotting backends);
- it will not benefit most users, as users tend not to use or discover options which are not the default;
- it would be difficult to reconcile with some existing behaviours: for example, ``df.sum()`` returns a Series with the
column names in the index.

In order to make no-index the pandas default and have a chance of benefiting users, a more comprehensive set of changes
would need to be made at the same time. This would require a proposal much larger in scope, and would be a much more radical change.
It may be that this proposal will be revisited in the future, but in its current state (as an option) it cannot be accepted.

This has still been a useful exercise, though, as it has resulted in two related proposals (see below).

## Related proposals

- Deprecate automatic alignment, at least in some cases: https://github.com/pandas-dev/pandas/issues/49939;
- ``.value_counts`` behaviour change: https://github.com/pandas-dev/pandas/issues/49497

## PDEP History

- 14 November 2022: Initial draft
- 18 November 2022: First revision (limited the proposal to a new class, leaving a ``mode`` to a separate proposal)
- 14 December 2022: Withdrawal (difficulty reconciling with some existing methods, lack of strong support,
maintenance burden increasing unjustifiably)
8 changes: 7 additions & 1 deletion web/pandas_web.py
@@ -252,7 +252,13 @@ def roadmap_pdeps(context):
and linked from there. This preprocessor obtains the list of
PDEP's in different status from the directory tree and GitHub.
"""
KNOWN_STATUS = {"Under discussion", "Accepted", "Implemented", "Rejected"}
KNOWN_STATUS = {
"Under discussion",
"Accepted",
"Implemented",
"Rejected",
"Withdrawn",
}
context["pdeps"] = collections.defaultdict(list)

# accepted, rejected and implemented