Commit f2cfeb5

[skip ci] First revision
MarcoGorelli committed
1 parent bd8756b commit f2cfeb5

1 file changed

+73
-24
lines changed

web/pandas/pdeps/0005-no-default-index-mode.md

+73-24
@@ -62,7 +62,7 @@ Then:
6262
dtype: int64
6363
```
6464

65-
If a user didn't want to think about row labels (which they may have ended up after slicing some other ``DataFrame``/``Series``),
65+
If a user didn't want to think about row labels (which they may have ended up with after slicing or concatenating),
6666
then ``NoRowIndex`` would enable the above to work in a more intuitive
6767
manner (details and examples to follow below).
6868

@@ -107,27 +107,28 @@ result in a ``DataFrame`` which still has ``NoRowIndex``. To do this, the follow
107107
Example:
108108

109109
```python
110-
In [8]: df = pd.DataFrame({'a': [1, 2], 'b': [4, 5]}, index=NoRowIndex(2))
110+
In [7]: df1 = pd.DataFrame({'a': [1, 2], 'b': [4, 5]}, index=NoRowIndex(2))
111111

112-
In [9]: df
112+
In [8]: df2 = pd.DataFrame({'a': [4], 'b': [0]}, index=NoRowIndex(1))
113+
114+
In [9]: df1
113115
Out[9]:
114116
a b
115117
1 4
116118
2 5
117119

118-
In [10]: pd.concat([df, df])
120+
In [10]: pd.concat([df1, df2])
119121
Out[10]:
120122
a b
121123
1 4
122124
2 5
123-
1 4
124-
2 5
125+
4 0
125126

126-
In [11]: pd.concat([df, df]).index
127-
Out[11]: NoRowIndex(len=4)
127+
In [11]: pd.concat([df1, df2]).index
128+
Out[11]: NoRowIndex(len=3)
128129
```
129130

130-
Appending anything index other than another ``NoRowIndex`` would raise.
131+
Appending anything other than another ``NoRowIndex`` would raise.
131132

132133
### Slicing a ``NoRowIndex``
133134

@@ -158,13 +159,14 @@ In [15]: df.loc[0, 'b']
158159
---------------------------------------------------------------------------
159160
IndexError: Cannot use label-based indexing on NoRowIndex!
160161
```
162+
Note that other uses of ``.loc``, such as boolean masks, would still be allowed (see the FAQ).
161163

162164
### Aligning ``NoRowIndex``s
163165

164166
To minimise surprises, the rule would be:
165167

166-
A ``NoRowIndex`` can only be aligned with another ``NoRowIndex`` of the same length.
167-
Attempting to align it with anything else would raise.
168+
> A ``NoRowIndex`` can only be aligned with another ``NoRowIndex`` of the same length.
169+
> Attempting to align it with anything else would raise.
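
The rule above can be sketched in plain Python. This is only an illustration: ``NoRowIndexSketch`` and ``align_no_row_index`` are hypothetical names, not part of pandas or of the draft implementation, and raising ``ValueError`` is an assumption, since the text does not say which exception type would be used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoRowIndexSketch:
    """Hypothetical stand-in for the proposed NoRowIndex: it only knows its length."""

    length: int


def align_no_row_index(left, right):
    """Align a NoRowIndexSketch with another object, following the rule above.

    Alignment succeeds only when both sides are NoRowIndexSketch objects of
    the same length. Anything else raises (ValueError is an assumption).
    """
    both_unlabelled = isinstance(left, NoRowIndexSketch) and isinstance(
        right, NoRowIndexSketch
    )
    if both_unlabelled and left.length == right.length:
        return left
    raise ValueError(
        "A NoRowIndex can only be aligned with another NoRowIndex of the same length"
    )


# Same kind, same length: alignment is trivial
aligned = align_no_row_index(NoRowIndexSketch(3), NoRowIndexSketch(3))
print(aligned)
```

Under this sketch, a length mismatch or an ordinary list of labels on either side raises rather than silently reindexing.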
168170
169171
Example:
170172
```python
@@ -192,13 +194,13 @@ row labels. There's no suggestion to remove the column labels.
192194
In particular, calling ``transpose`` on a ``NoRowIndex`` ``DataFrame``
193195
would error. The error would come with a helpful error message, informing
194196
users that they should first set an index. E.g.:
195-
```
197+
```python
196198
In [4]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))
197199

198200
In [5]: df.transpose()
199201
---------------------------------------------------------------------------
200202
ValueError: Columns cannot be NoRowIndex.
201-
If you got here via `transpose` or an `axis=1` operation, then you should first set an index, e.g.: `df.pipe(lambda _df: _df.set_axis(pd.RangeIndex(len(df))))`
203+
If you got here via `transpose` or an `axis=1` operation, then you should first set an index, e.g.: `df.pipe(lambda _df: _df.set_axis(pd.RangeIndex(len(_df))))`
202204
```
203205

204206
### DataFrameFormatter and SeriesFormatter changes
@@ -222,21 +224,40 @@ Of the above changes, this may be the only one that would need implementing with
222224
## Usage and Impact
223225

224226
By itself, ``NoRowIndex`` would be of limited use. To become useful and user-friendly,
225-
a mode ``no_default_index`` could be introduced which, if enabled, would change
227+
a ``no_default_index`` mode could be introduced which, if enabled, would change
226228
the ``default_index`` function to return a ``NoRowIndex`` of the appropriate length.
227229
In particular, ``.reset_index()`` would result in a ``DataFrame`` with a ``NoRowIndex``.
228230
The same would apply to a ``DataFrame`` constructed without explicitly specifying ``index=``.
229231

230-
Then, if a user doesn't want to think about row labels, then with ``pd.set_option('no_default_index')``
231-
set, they wouldn't need to (barring methods such as `.pivot_table` which introduce an index).
232-
Discussion of such a mode is out-of-scope for this proposal.
232+
Furthermore, it could be useful to add ``as_index`` options to methods which currently
233+
set an index, and then allow for that mode to control the ``as_index`` default.
234+
235+
Discussion of such a mode is out-of-scope for this proposal. A ``NoRowIndex`` would
236+
just be a first step towards getting there.
233237

234238
## Implementation
235239

236240
Draft pull request showing proof of concept: https://github.com/pandas-dev/pandas/pull/49693.
237241

242+
Note that implementation details could well change even if this PDEP were
243+
accepted. For example, ``NoRowIndex`` wouldn't necessarily need to subclass
244+
``RangeIndex``, and it wouldn't necessarily need to be accessible to the user
245+
(``df.index`` could well return ``None``).
246+
238247
## Likely FAQ
239248

249+
**Q: Couldn't users just use ``RangeIndex``? Why do we need a new class?**
250+
251+
**A:** ``RangeIndex`` isn't preserved under slicing and appending, e.g.:
252+
```python
253+
In [1]: ser = pd.Series([1,2,3])
254+
255+
In [2]: ser[ser!=2].index
256+
Out[2]: Int64Index([0, 2], dtype='int64')
257+
```
258+
If someone doesn't want to think about row labels and starts off
259+
with a ``RangeIndex``, they'll very quickly lose it.
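
The same loss happens when appending. The following sketch uses only existing pandas behaviour. It reuses the ``df1`` and ``df2`` values from the concatenation example, but with the default index rather than the proposed ``NoRowIndex``.

```python
import pandas as pd

# Both frames start with a default RangeIndex
df1 = pd.DataFrame({'a': [1, 2], 'b': [4, 5]})
df2 = pd.DataFrame({'a': [4], 'b': [0]})

combined = pd.concat([df1, df2])

# The row labels of each input are carried over, so they repeat
print(list(combined.index))  # [0, 1, 0]

# The result no longer has a RangeIndex
print(isinstance(combined.index, pd.RangeIndex))  # False
```

Passing ``ignore_index=True`` to ``pd.concat`` avoids the repeated labels here, but the user has to remember to ask for it each time.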
260+
240261
**Q: Aren't indices really powerful?**
241262

242263
**A:** Yes! And they're also confusing to many users, even experienced developers.
@@ -245,16 +266,13 @@ Draft pull request showing proof of concept: https://github.com/pandas-dev/panda
245266
and alignment. Indices would be here to stay, and ``NoRowIndex`` would not be the
246267
default.
247268

248-
**Q: In this mode, could users still get an ``Index`` if they really wanted to?**
269+
**Q: How could one switch a ``NoRowIndex`` ``DataFrame`` back to one with an index?**
249270

250-
**A:** Yes! For example with
271+
**A:** The simplest way would probably be:
251272
```python
252-
df.set_index(Index(range(len(df))))
253-
```
254-
or, if they don't have a column named ``'index'``:
255-
```python
256-
df.reset_index().set_index('index')
273+
df.set_axis(pd.RangeIndex(len(df)))
257274
```
275+
There's probably no need to introduce a new method for this.
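
As a small illustration of that call, here is how ``set_axis`` behaves in current pandas. ``NoRowIndex`` does not exist yet, so an ordinary labelled ``DataFrame`` stands in for it, and the variable names are only for the example.

```python
import pandas as pd

# An ordinary DataFrame with arbitrary row labels stands in for a NoRowIndex one
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=[10, 20, 30])

# set_axis returns a new object whose row labels are replaced (axis=0 by default)
relabelled = df.set_axis(pd.RangeIndex(len(df)))

print(list(relabelled.index))  # [0, 1, 2]
print(list(df.index))          # [10, 20, 30], the original is unchanged
```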
258276

259277
**Q: Why not let transpose switch ``NoRowIndex`` to ``RangeIndex`` under the hood before swapping index and columns?**
260278

@@ -264,6 +282,37 @@ Draft pull request showing proof of concept: https://github.com/pandas-dev/panda
264282
to be intentional about what they want and end up with fewer surprises later
265283
on.
266284

285+
**Q: What would ``df.sum()``, and other methods which introduce an index, return?**
286+
287+
**A:** Such methods would still set an index and would work the same way they
288+
do now. There may be some way to change that (e.g. introducing ``as_index``
289+
arguments and introducing a mode to set its default) but that's out of scope
290+
for this particular PDEP.
291+
292+
**Q: How would a user opt-in to a ``NoRowIndex`` DataFrame?**
293+
294+
**A:** This PDEP would only allow it via the constructor, passing
295+
``index=NoRowIndex(len(df))``. A mode could be introduced to toggle
296+
making that the default, but that would be out of scope for the current PDEP.
297+
298+
**Q: Would ``.loc`` stop working?**
299+
300+
**A:** No. It would only raise if used for label-based selection. Other uses
301+
of ``.loc``, such as ``df.loc[:, col_1]`` or ``df.loc[mask, col_1]``, would
302+
continue working.
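
For reference, this is what those two uses look like in current pandas. Neither refers to a row label, which is presumably why they could keep working. The frame and column names are only for the example.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Select a whole column: no row label is involved
col_b = df.loc[:, 'b']

# Select rows with a boolean mask: rows are chosen by condition, not by label
mask = df['a'] > 1
filtered = df.loc[mask, 'b']

print(col_b.tolist())     # [4, 5, 6]
print(filtered.tolist())  # [5, 6]
```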
303+
304+
**Q: What's unintuitive about ``Series`` aligning indices when summing?**
305+
306+
**A:** Not sure, but I once asked a group of experienced developers what the
307+
output of
308+
```python
309+
ser1 = pd.Series([1,1,1], index=[1,2,3])
310+
ser2 = pd.Series([1,1,1], index=[3,4,5])
311+
print(ser1 + ser2)
312+
```
313+
would be, and _nobody_ got it right.
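
For readers who want to check, this sketch runs the snippet on current pandas. The ``positional`` comparison at the end is an addition for contrast and is not part of the original question.

```python
import pandas as pd

ser1 = pd.Series([1, 1, 1], index=[1, 2, 3])
ser2 = pd.Series([1, 1, 1], index=[3, 4, 5])

# Addition aligns on row labels: only label 3 exists in both Series
result = ser1 + ser2
print(result)
# 1    NaN
# 2    NaN
# 3    2.0
# 4    NaN
# 5    NaN
# dtype: float64

# Adding the underlying arrays ignores labels and works position by position
positional = ser1.to_numpy() + ser2.to_numpy()
print(positional.tolist())  # [2, 2, 2]
```

The result has five rows rather than three, and four of them are missing, because the union of the two sets of labels is used.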
314+
267315
## PDEP History
268316

269317
- 14 November: Initial draft
318+
- 18 November: First revision
