| 1 | +# PDEP-5: No-default-index mode |
| 2 | + |
| 3 | +- Created: 14 November 2022 |
| 4 | +- Status: Draft |
| 5 | +- Discussion: [#49693](https://github.com/pandas-dev/pandas/pull/49693) |
| 6 | +- Author: [Marco Gorelli](https://github.com/MarcoGorelli) |
| 7 | +- Revision: 1 |
| 8 | + |
| 9 | +## Abstract |
| 10 | + |
| 11 | +This PDEP proposes adding a `mode.no_default_index` option which, if enabled, |
| 12 | +would ensure: |
| 13 | +- if a ``DataFrame`` / ``Series`` is created, then by default it won't have an ``Index``; |
| 14 | +- nobody will get an ``Index`` unless they ask for one - this would affect the default behaviour of ``groupby``, ``value_counts``, ``pivot_table``, and more. |
| 15 | + |
| 16 | +This option would not be the default. Users would need to explicitly opt in to it, via ``pd.set_option('mode.no_default_index', True)``, via ``pd.option_context``, or via the ``PANDAS_NO_DEFAULT_INDEX`` environment variable. |
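
The opt-in mechanisms above follow the existing pandas options API. A minimal sketch of the scoped versus global pattern, assuming `display.max_rows` as a stand-in because the proposed `mode.no_default_index` option is not registered in released pandas:

```python
import pandas as pd

# "display.max_rows" is a stand-in: the proposed "mode.no_default_index"
# option is not registered in released pandas, so setting it would raise.
OPTION = "display.max_rows"

original = pd.get_option(OPTION)

# Scoped opt-in: the value is changed only inside the ``with`` block.
with pd.option_context(OPTION, 5):
    inside = pd.get_option(OPTION)

after_context = pd.get_option(OPTION)

# Global opt-in: the value persists until it is changed again.
pd.set_option(OPTION, 5)
after_set = pd.get_option(OPTION)

# Restore the previous value so the sketch leaves no side effects.
pd.set_option(OPTION, original)
after_restore = pd.get_option(OPTION)
```

Under this proposal, the same two calls would presumably be used with `'mode.no_default_index'` in place of the stand-in.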
| 17 | + |
| 18 | +## Motivation and Scope |
| 19 | + |
| 20 | +The Index can be a source of confusion and frustration for pandas users. For example, consider the following inputs: |
| 21 | + |
| 22 | +```python |
| 23 | +In [37]: ser1 = df.groupby('sender')['amount'].sum() |
| 24 | + |
| 25 | +In [38]: ser2 = df.groupby('receiver')['amount'].sum() |
| 26 | + |
| 27 | +In [39]: ser1 |
| 28 | +Out[39]: |
| 29 | +sender |
| 30 | +1 10 |
| 31 | +2 15 |
| 32 | +3 20 |
| 33 | +5 25 |
| 34 | +Name: amount, dtype: int64 |
| 35 | + |
| 36 | +In [40]: ser2 |
| 37 | +Out[40]: |
| 38 | +receiver |
| 39 | +1 10 |
| 40 | +2 15 |
| 41 | +3 20 |
| 42 | +4 25 |
| 43 | +Name: amount, dtype: int64 |
| 44 | +``` |
| 45 | +Then: |
| 46 | + |
| 47 | +- it can be unexpected that summing `Series` with the same length (but different indices) produces `NaN`s in the result (https://stackoverflow.com/q/66094702/4451315): |
| 48 | + |
| 49 | + ```python |
| 50 | + In [41]: ser1 + ser2 |
| 51 | + Out[41]: |
| 52 | + 1 20.0 |
| 53 | + 2 30.0 |
| 54 | + 3 40.0 |
| 55 | + 4 NaN |
| 56 | + 5 NaN |
| 57 | + Name: amount, dtype: float64 |
| 58 | + ``` |
| 59 | + |
| 60 | +- concatenation, even with `ignore_index=True`, still aligns on the index (https://github.com/pandas-dev/pandas/issues/25349): |
| 61 | + |
| 62 | + ```python |
| 63 | + In [42]: pd.concat([ser1, ser2], axis=1, ignore_index=True) |
| 64 | + Out[42]: |
| 65 | + 0 1 |
| 66 | + 1 10.0 10.0 |
| 67 | + 2 15.0 15.0 |
| 68 | + 3 20.0 20.0 |
| 69 | + 5 25.0 NaN |
| 70 | + 4 NaN 25.0 |
| 71 | + ``` |
| 72 | + |
| 73 | +- it can be frustrating to have to repeatedly call `.reset_index()` (https://twitter.com/chowthedog/status/1559946277315641345): |
| 74 | + |
| 75 | + ```python |
| 76 | + In [45]: df.value_counts(['sender', 'receiver']).reset_index().rename(columns={0: 'count'}) |
| 77 | + Out[45]: |
| 78 | + sender receiver count |
| 79 | + 0 1 1 1 |
| 80 | + 1 2 2 1 |
| 81 | + 2 3 3 1 |
| 82 | + 3 5 4 1 |
| 83 | + ``` |
| 84 | + |
| 85 | +With this option enabled, users who don't want to worry about indices wouldn't need to. |
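
The alignment behaviour described in this section can be reproduced with current pandas. The sketch below assumes a small `df` whose values were chosen to match the `ser1` and `ser2` outputs shown earlier (the source does not show how `df` was built), and contrasts label-aligned addition with a positional workaround:

```python
import pandas as pd

# Assumed input data, chosen to reproduce ser1 and ser2 from the text.
df = pd.DataFrame(
    {
        "sender": [1, 2, 3, 5],
        "receiver": [1, 2, 3, 4],
        "amount": [10, 15, 20, 25],
    }
)

ser1 = df.groupby("sender")["amount"].sum()
ser2 = df.groupby("receiver")["amount"].sum()

# Label-aligned addition: labels 4 and 5 appear in only one operand,
# so the result has five rows and two of them are NaN.
aligned = ser1 + ser2

# Positional addition: drop the group labels first so that both
# operands carry the same default 0..n-1 index.
positional = ser1.reset_index(drop=True) + ser2.reset_index(drop=True)
```

The positional result is what a user who ignores the index would typically expect. Today it requires an explicit `reset_index(drop=True)` on each operand.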
| 86 | + |
| 87 | +## Detailed Description |
| 88 | + |
| 89 | +This would require 3 steps: |
| 90 | +1. creation of a ``NoIndex`` object, which would be a subclass of ``RangeIndex`` on which |
| 91 | + some operations such as ``append`` would behave differently. |
| 92 | + The ``default_index`` function would then return ``NoIndex`` (rather than ``RangeIndex``) if this mode is enabled; |
| 93 | +2. adjusting ``DataFrameFormatter`` and ``SeriesFormatter`` to not print row labels for objects with a ``NoIndex``; |
| 94 | +3. adjusting methods which currently return an index to just insert a new column instead. |
| 95 | + |
| 96 | +Let's expand on all three below. |
| 97 | + |
| 98 | +### 1. NoIndex object |
| 99 | + |
| 100 | +Most of the logic could be handled within the ``NoIndex`` object. |
| 101 | +It would be like a ``RangeIndex``, but with the following differences: |
| 102 | +- `name` could only be `None`; |
| 103 | +- `start` could only be `0` and `step` could only be `1`; |
| 104 | +- when appending an extra element, the new `Index` would still be `NoIndex`; |
| 105 | +- when slicing, one would still get a `NoIndex`; |
| 106 | +- two ``NoIndex`` objects can't be aligned. Either they're the same length, or pandas raises; |
| 107 | +- aligning a ``NoIndex`` object with one which has an index will raise, always; |
| 108 | +- ``DataFrame`` columns can't be `NoIndex` (so ``transpose`` would need some adjustments when called on a ``NoIndex`` ``DataFrame``); |
| 109 | +- `insert` and `delete` should raise. As a consequence, `.drop` with `axis=0` would always raise; |
| 110 | +- arithmetic operations (e.g. `NoIndex(3) + 2`) would all raise. |
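
The "same length, or pandas raises" rule in the list above can be approximated with current pandas. This is only an illustration of the intended semantics, not the proposed `NoIndex` implementation, and `add_by_position` is a hypothetical helper that does not exist in pandas:

```python
import pandas as pd


def add_by_position(left: pd.Series, right: pd.Series) -> pd.Series:
    """Add two Series row by row, ignoring their labels.

    Hypothetical helper, not part of pandas.
    """
    if len(left) != len(right):
        raise ValueError(f"Length mismatch: {len(left)} vs {len(right)}")
    # Operating on the underlying arrays bypasses index alignment.
    return pd.Series(left.to_numpy() + right.to_numpy(), name=left.name)


a = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 5], name="amount")
b = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 4], name="amount")

# The labels differ, but the lengths match, so rows are paired by position.
total = add_by_position(a, b)
```

Passing operands of different lengths, for example `add_by_position(a, b.iloc[:3])`, raises `ValueError` instead of silently producing `NaN`s.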
| 111 | + |
| 112 | +### 2. DataFrameFormatter and SeriesFormatter changes |
| 113 | + |
| 114 | +When printing an object with a ``NoIndex``, the row labels wouldn't be shown: |
| 115 | + |
| 116 | +```python |
| 117 | +In [14]: pd.set_option('mode.no_default_index', True) |
| 118 | + |
| 119 | +In [15]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}) |
| 120 | + |
| 121 | +In [16]: df |
| 122 | +Out[16]: |
| 123 | + a b c |
| 124 | + 1 4 7 |
| 125 | + 2 5 8 |
| 126 | + 3 6 9 |
| 127 | +``` |
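
For comparison, current pandas can already omit row labels on request through the `index=False` argument of `DataFrame.to_string`. The sketch below shows that existing behaviour; whether the proposed formatter change would produce identical spacing is an assumption:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# index=False omits the row labels from the text representation.
text = df.to_string(index=False)
print(text)
```

The difference under this proposal is that a plain `df` at the prompt would render this way by default, without an explicit formatting call.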
| 128 | + |
| 129 | +### 3. Nobody should get an index unless they ask for one |
| 130 | + |
| 131 | +The following two snippets would produce the same result: |
| 132 | +```python |
| 133 | +pivot = ( |
| 134 | + pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"], aggfunc=np.sum) |
| 135 | +).reset_index() |
| 136 | + |
| 137 | +with pd.option_context('mode.no_default_index', True): |
| 138 | + pivot = ( |
| 139 | + pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"], aggfunc=np.sum) |
| 140 | + ) |
| 141 | +``` |
| 142 | + |
| 143 | +Likewise for ``value_counts``. In ``groupby``, the default would be ``as_index=False``. |
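
The calls that this mode would make unnecessary can be seen with current pandas. A minimal sketch, assuming a small illustrative `df` (the column values are made up for the example):

```python
import pandas as pd

# Illustrative data, not taken from the proposal.
df = pd.DataFrame(
    {
        "A": ["x", "x", "y", "y"],
        "B": ["p", "q", "p", "q"],
        "C": ["u", "v", "u", "v"],
        "D": [1, 2, 3, 4],
    }
)

# groupby: as_index=False keeps the group keys as ordinary columns.
grouped = df.groupby("A", as_index=False)["D"].sum()

# pivot_table and value_counts have no such flag, so reset_index()
# is used to move the labels back into columns.
pivot = pd.pivot_table(
    df, values="D", index=["A", "B"], columns=["C"], aggfunc="sum"
).reset_index()
counts = df.value_counts(["A", "B"]).reset_index()
```

Each result carries a plain default index and keeps the keys as columns, which is the shape the proposed mode would return without the extra `as_index=False` or `.reset_index()`.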
| 144 | + |
| 145 | +## Usage and Impact |
| 146 | + |
| 147 | +Users who like the power of the ``Index`` could continue using pandas exactly as it is, |
| 148 | +without changing anything. |
| 149 | + |
| 150 | +The addition of this mode would let users who don't want to think about indices |
| 151 | +avoid them entirely. |
| 152 | + |
| 153 | +The implementation would be quite simple: most of the logic would be handled within the |
| 154 | +``NoIndex`` class, and only some minor adjustments (e.g. to the ``default_index`` function) |
| 155 | +would be needed in core pandas. |
| 156 | + |
| 157 | +## Implementation |
| 158 | + |
| 159 | +Draft pull request showing proof of concept: https://github.com/pandas-dev/pandas/pull/49693. |
| 160 | + |
| 161 | +## Likely FAQ |
| 162 | + |
| 163 | +**Q: Aren't indices really powerful?** |
| 164 | + |
| 165 | +**A:** Yes! And they're also confusing to many users, even experienced developers. |
| 166 | + It's fairly common to see pandas code with ``.reset_index`` scattered around every |
| 167 | + other line. Such users would benefit from a mode in which they wouldn't need to think |
| 168 | + about indices and alignment. |
| 169 | + |
| 170 | +**Q: In this mode, could users still get an ``Index`` if they really wanted to?** |
| 171 | + |
| 172 | +**A:** Yes! For example with |
| 173 | + ```python |
| 174 | + df.set_index(pd.Index(range(len(df)))) |
| 175 | + ``` |
| 176 | + or, if they don't have a column named ``'index'``: |
| 177 | + ```python |
| 178 | + df.reset_index().set_index('index') |
| 179 | + ``` |
| 180 | + |
| 181 | +**Q: Why is it necessary to change the behaviour of ``value_counts``? Isn't the introduction of a ``NoIndex`` object enough?** |
| 182 | + |
| 183 | +**A:** The objective of this mode is to enable users to not have to think about indices if they don't want to. If they have to call |
| 184 | + ``.reset_index`` after each ``value_counts`` / ``pivot_table`` call, or remember to pass ``as_index=False`` to each ``groupby`` |
| 185 | + call, then this objective has arguably not quite been reached. |
| 186 | + |
| 187 | +## PDEP History |
| 188 | + |
| 189 | +- 14 November 2022: Initial draft |