Commit 92d5545

Author: MarcoGorelli
Commit message: [skip ci] pdep-5 initial draft
1 parent 8da8743 commit 92d5545

1 file changed: +189 -0 lines changed

@@ -0,0 +1,189 @@
# PDEP-5: No-default-index mode

- Created: 14 November 2022
- Status: Draft
- Discussion: [#49693](https://github.com/pandas-dev/pandas/pull/49693)
- Author: [Marco Gorelli](https://github.com/MarcoGorelli)
- Revision: 1

## Abstract

This PDEP proposes adding a `mode.no_default_index` option which, if enabled,
would ensure that:
- if a ``DataFrame`` / ``Series`` is created, then by default it won't have an ``Index``;
- nobody will get an ``Index`` unless they ask for one - this would affect the default behaviour of ``groupby``, ``value_counts``, ``pivot_table``, and more.

This option would not be the default. Users would need to explicitly opt in to it, via ``pd.set_option('mode.no_default_index', True)``, via ``pd.option_context``, or via the ``PANDAS_NO_DEFAULT_INDEX`` environment variable.

## Motivation and Scope

The Index can be a source of confusion and frustration for pandas users. For example, consider the following inputs:

```python
In [37]: ser1 = df.groupby('sender')['amount'].sum()

In [38]: ser2 = df.groupby('receiver')['amount'].sum()

In [39]: ser1
Out[39]:
sender
1    10
2    15
3    20
5    25
Name: amount, dtype: int64

In [40]: ser2
Out[40]:
receiver
1    10
2    15
3    20
4    25
Name: amount, dtype: int64
```
Then:

- it can be unexpected that adding two `Series` of the same length (but with different indices) produces `NaN`s in the result (https://stackoverflow.com/q/66094702/4451315):

  ```python
  In [41]: ser1 + ser2
  Out[41]:
  1    20.0
  2    30.0
  3    40.0
  4     NaN
  5     NaN
  Name: amount, dtype: float64
  ```

- column-wise concatenation (`axis=1`), even with `ignore_index=True`, still aligns on the row index (https://github.com/pandas-dev/pandas/issues/25349):

  ```python
  In [42]: pd.concat([ser1, ser2], axis=1, ignore_index=True)
  Out[42]:
        0     1
  1  10.0  10.0
  2  15.0  15.0
  3  20.0  20.0
  5  25.0   NaN
  4   NaN  25.0
  ```

- it can be frustrating to have to repeatedly call `.reset_index()` (https://twitter.com/chowthedog/status/1559946277315641345):

  ```python
  In [45]: df.value_counts(['sender', 'receiver']).reset_index().rename(columns={0: 'count'})
  Out[45]:
     sender  receiver  count
  0       1         1      1
  1       2         2      1
  2       3         3      1
  3       5         4      1
  ```

With this option enabled, users who don't want to worry about indices wouldn't need to.

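The examples above do not show how `df` was built. As a minimal sketch, the frame below is a hypothetical one, chosen only so that the `groupby` results match the outputs shown. It reproduces the label alignment in current pandas and shows one way to add by position instead, by discarding the labels first.

```python
import pandas as pd

# Hypothetical data (an assumption): chosen only so that the groupby
# results match the outputs shown in this section.
df = pd.DataFrame(
    {
        "sender": [1, 2, 3, 5],
        "receiver": [1, 2, 3, 4],
        "amount": [10, 15, 20, 25],
    }
)

ser1 = df.groupby("sender")["amount"].sum()
ser2 = df.groupby("receiver")["amount"].sum()

# Label-based alignment: labels 4 and 5 each appear in only one Series,
# so the result is NaN at those labels.
aligned_sum = ser1 + ser2

# Positional addition in current pandas: drop the labels first.
positional_sum = ser1.reset_index(drop=True) + ser2.reset_index(drop=True)

print(aligned_sum)
print(positional_sum)
```

The need for the two `reset_index(drop=True)` calls is the kind of friction the proposed mode is meant to remove.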
## Detailed Description

This would require three steps:
1. creating a ``NoIndex`` object, which would be a subclass of ``RangeIndex`` on which
  some operations such as ``append`` would behave differently.
  The ``default_index`` function would then return ``NoIndex`` (rather than ``RangeIndex``) if this mode is enabled;
2. adjusting ``DataFrameFormatter`` and ``SeriesFormatter`` to not print row labels for objects with a ``NoIndex``;
3. adjusting methods which currently return an index to just insert a new column instead.

Let's expand on all three below.

### 1. NoIndex object

Most of the logic could be handled within the ``NoIndex`` object.
It would be like a ``RangeIndex``, but with the following differences:
- `name` could only be `None`;
- `start` could only be `0`, and `step` could only be `1`;
- appending to a `NoIndex` would still give a `NoIndex`;
- slicing a `NoIndex` would still give a `NoIndex`;
- two ``NoIndex`` objects can't be aligned: either they have the same length, or pandas raises;
- aligning a ``NoIndex`` object with one which has an index would always raise;
- ``DataFrame`` columns can't be `NoIndex` (so ``transpose`` would need some adjustments when called on a ``NoIndex`` ``DataFrame``);
- `insert` and `delete` would raise. As a consequence, `.drop` with `axis=0` would always raise;
- arithmetic operations (e.g. `NoIndex(3) + 2`) would all raise.

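The rules listed above can be modelled in a few lines of plain Python. This is only a toy sketch of the intended semantics. It is not pandas code, it does not subclass ``RangeIndex``, and it is not taken from the draft pull request. The names `NoIndexSketch` and `check_compatible` are hypothetical.

```python
class NoIndexSketch:
    """Toy model of the proposed NoIndex rules (hypothetical, not pandas)."""

    # Fixed attributes: name is always None, start is 0, step is 1.
    name = None
    start = 0
    step = 1

    def __init__(self, length):
        if length < 0:
            raise ValueError("length must be non-negative")
        self._length = length

    def __len__(self):
        return self._length

    def __repr__(self):
        return f"NoIndexSketch({self._length})"

    def __getitem__(self, key):
        # Slicing keeps the type; only the length changes.
        if isinstance(key, slice):
            return NoIndexSketch(len(range(self._length)[key]))
        raise TypeError("only slicing is modelled in this sketch")

    def append(self, other):
        # Appending keeps the type as well.
        return NoIndexSketch(self._length + len(other))

    def insert(self, loc, item):
        raise NotImplementedError("insert is not supported")

    def delete(self, loc):
        raise NotImplementedError("delete is not supported")

    def __add__(self, other):
        raise TypeError("arithmetic is not supported")

    __radd__ = __add__

    def check_compatible(self, other):
        # No alignment: mixing with a labelled index, or with a
        # different length, raises.
        if not isinstance(other, NoIndexSketch):
            raise TypeError("cannot align with an object that has an index")
        if len(other) != len(self):
            raise ValueError("lengths differ and no alignment is possible")


idx = NoIndexSketch(5)
print(idx[1:4], idx.append(NoIndexSketch(2)))
```

Because there are no labels to match on, a length check is the only compatibility test such an object could perform, which is why mismatched lengths raise instead of producing `NaN`s.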
### 2. DataFrameFormatter and SeriesFormatter changes

When printing an object with a ``NoIndex``, the row labels would not be shown:

```python
In [14]: pd.set_option('mode.no_default_index', True)

In [15]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [16]: df
Out[16]:
 a  b  c
 1  4  7
 2  5  8
 3  6  9
```

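For comparison, current pandas can already render a frame without row labels on request, through the existing `index=False` parameter of `DataFrame.to_string`. The sketch below uses that parameter only to show roughly what the proposed default display would look like. It does not use the proposed option, which does not exist in released pandas.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# index=False omits the row labels from the rendered text.
rendered = df.to_string(index=False)
print(rendered)
```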
### 3. Nobody should get an index unless they ask for one

The following two snippets would produce the same result:
```python
pivot = (
    pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"], aggfunc=np.sum)
).reset_index()

with pd.option_context('mode.no_default_index', True):
    pivot = (
        pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"], aggfunc=np.sum)
    )
```

Likewise for ``value_counts``. In ``groupby``, the default would be ``as_index=False``.

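In current pandas, the same results can already be obtained explicitly, one call at a time. The sketch below uses a hypothetical transactions frame (an assumption, not data from this document) to show the existing `as_index=False` and `reset_index` spellings that the proposed mode would make the default.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame(
    {
        "sender": [1, 2, 3, 5],
        "receiver": [1, 2, 3, 4],
        "amount": [10, 15, 20, 25],
    }
)

# Two existing ways to keep the group keys as ordinary columns.
with_reset = df.groupby("sender")["amount"].sum().reset_index()
without_index = df.groupby("sender", as_index=False)["amount"].sum()

# value_counts returns a Series indexed by the counted columns;
# reset_index(name=...) turns it back into a plain frame.
counts = df.value_counts(["sender", "receiver"]).reset_index(name="count")

print(without_index)
print(counts)
```

Each of these calls needs its own extra argument or method call, which is the repetition the proposed mode aims to avoid.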
## Usage and Impact

Users who like the power of the ``Index`` could continue using pandas exactly as it is,
without changing anything.

The addition of this mode would allow users who don't want to think about indices to
avoid them.

The implementation would be quite simple: most of the logic would be handled within the
``NoIndex`` class, and only some minor adjustments (e.g. to the ``default_index`` function)
would be needed in core pandas.

## Implementation

Draft pull request showing proof of concept: https://github.com/pandas-dev/pandas/pull/49693.

## Likely FAQ

**Q: Aren't indices really powerful?**

**A:** Yes! And they're also confusing to many users, even experienced developers.
It's fairly common to see pandas code with ``.reset_index`` scattered around every
other line. Such users would benefit from a mode in which they wouldn't need to think
about indices and alignment.

**Q: In this mode, could users still get an ``Index`` if they really wanted to?**

**A:** Yes! For example with
```python
df.set_index(pd.Index(range(len(df))))
```
or, if they don't have a column named ``'index'``:
```python
df.reset_index().set_index('index')
```

**Q: Why is it necessary to change the behaviour of ``value_counts``? Isn't the introduction of a ``NoIndex`` object enough?**

**A:** The objective of this mode is to enable users to not have to think about indices if they don't want to. If they have to call
``.reset_index`` after each ``value_counts`` / ``pivot_table`` call, or remember to pass ``as_index=False`` to each ``groupby``
call, then this objective has arguably not quite been reached.

## PDEP History

- 14 November 2022: Initial draft
