Commit 646e081

author
MarcoGorelli
committed
[skip ci] clarify some points as per reviews
1 parent 7dd4920 commit 646e081

1 file changed

+64
-32
lines changed

web/pandas/pdeps/0005-no-default-index-mode.md

@@ -10,7 +10,7 @@
1010

1111
The suggestion is to add a ``NoRowIndex`` class. Internally, it would act a bit like
1212
a ``RangeIndex``, but some methods would be stricter. This would be one
13-
step towards enabling users who don't want to think about indices to not have to.
13+
step towards enabling users who do not want to think about indices to not need to.
1414

1515
## Motivation
1616

@@ -24,7 +24,7 @@ In [38]: ser2 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 4])
2424

2525
Then:
2626

27-
- it can be unexpected that summing `Series` with the same length (but different indices) produces `NaN`s in the result (https://stackoverflow.com/q/66094702/4451315):
27+
- it can be unexpected that adding `Series` with the same length (but different indices) produces `NaN`s in the result (https://stackoverflow.com/q/66094702/4451315):
2828

2929
```python
3030
In [41]: ser1 + ser2
@@ -62,15 +62,15 @@ Then:
6262
dtype: int64
6363
```
6464

65-
If a user didn't want to think about row labels (which they may have ended up after slicing / concatenating operations),
65+
If a user did not want to think about row labels (which they may have ended up after slicing / concatenating operations),
6666
then ``NoRowIndex`` would enable the above to work in a more intuitive
6767
manner (details and examples to follow below).
6868

6969
## Scope
7070

7171
This proposal deals exclusively with the ``NoRowIndex`` class. To allow users to fully "opt-out" of having to think
7272
about row labels, the following could also be useful:
73-
- a ``pd.set_option('mode.no_default_index')`` mode which would default to creating new ``DataFrame``s and
73+
- a ``pd.set_option('mode.no_row_index', True)`` mode which would default to creating new ``DataFrame``s and
7474
``Series`` with ``NoRowIndex`` instead of ``RangeIndex``;
7575
- giving ``as_index`` options to methods which currently create an index
7676
(e.g. ``value_counts``, ``.sum()``, ``.pivot_table``) to just insert a new column instead of creating an
@@ -85,17 +85,19 @@ within the ``NoRowIndex`` object. It would act just like ``RangeIndex``, but wou
8585
in some cases:
8686
- `name` could only be `None`;
8787
- `start` could only be `0`, `step` `1`;
88-
- when appending a ``NoRowIndex``, the result would still be ``NoRowIndex``;
88+
- when appending one ``NoRowIndex`` to another ``NoRowIndex``, the result would still be ``NoRowIndex``.
89+
Appending a ``NoRowIndex`` to any other index (or vice-versa) would raise;
8990
- the ``NoRowIndex`` class would be preserved under slicing;
90-
- it could only be aligned with another ``Index`` if it's also ``NoRowIndex`` and if it's of the same length;
91-
- ``DataFrame`` columns can't be `NoRowIndex` (so ``transpose`` would need some adjustments when called on a ``NoRowIndex`` ``DataFrame``);
91+
- a ``NoRowIndex`` could only be aligned with another ``Index`` if it's also ``NoRowIndex`` and if it's of the same length;
92+
- ``DataFrame`` columns cannot be `NoRowIndex` (so ``transpose`` would need some adjustments when called on a ``NoRowIndex`` ``DataFrame``);
9293
- `insert` and `delete` should raise. As a consequence, if ``df`` is a ``DataFrame`` with a
9394
``NoRowIndex``, then `df.drop` with `axis=0` would always raise;
9495
- arithmetic operations (e.g. `NoRowIndex(3) + 2`) would always raise;
95-
- when printing a ``DataFrame``/``Series`` with a ``NoRowIndex``, then the row labels wouldn't be printed;
96+
- when printing a ``DataFrame``/``Series`` with a ``NoRowIndex``, then the row labels would not be printed;
9697
- a ``MultiIndex`` could not be created with a ``NoRowIndex`` as one of its levels.
9798

98-
Let's go into more detail for some of these.
99+
Let's go into more detail for some of these. In the examples that follow, the ``NoRowIndex`` will be passed explicitly,
100+
but this is not how users would be expected to use it (see "Usage and Impact" section for details).
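
As a rough illustration of the rules listed above, here is a minimal pure-Python sketch. It is a toy stand-in, not the draft pandas implementation: the class body, the exception types, and the error messages are all assumptions made for illustration, and only the behaviours named in the list (fixed `name`, append, slicing, `insert`/`delete`, arithmetic) are modelled.

```python
class NoRowIndex:
    """Toy model of the proposed class (hypothetical, not pandas code)."""

    # start could only be 0 and step could only be 1
    start = 0
    step = 1

    def __init__(self, length):
        if length < 0:
            raise ValueError("length must be non-negative")
        self._length = length

    @property
    def name(self):
        # name could only be None; no setter is defined, so assigning raises
        return None

    def __len__(self):
        return self._length

    def append(self, other):
        # NoRowIndex appended to NoRowIndex stays NoRowIndex;
        # mixing with any other index type raises (exception type assumed)
        if not isinstance(other, NoRowIndex):
            raise TypeError("Can only append NoRowIndex to NoRowIndex")
        return NoRowIndex(len(self) + len(other))

    def __getitem__(self, key):
        # The class is preserved under slicing; only the length changes
        if isinstance(key, slice):
            return NoRowIndex(len(range(self._length)[key]))
        raise TypeError("this sketch only supports slices")

    def insert(self, loc, item):
        raise TypeError("Cannot insert into NoRowIndex")

    def delete(self, loc):
        raise TypeError("Cannot delete from NoRowIndex")

    def __add__(self, other):
        # Arithmetic such as NoRowIndex(3) + 2 would always raise
        raise TypeError("Cannot do arithmetic on NoRowIndex")
```

With this sketch, `NoRowIndex(2).append(NoRowIndex(1))` yields a `NoRowIndex` of length 3, whereas `NoRowIndex(2).append(range(1))` raises.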
99101

100102
### NoRowIndex.append
101103

@@ -108,16 +110,21 @@ result in a ``DataFrame`` which still has ``NoRowIndex``. To do this, the follow
108110
Example:
109111

110112
```python
111-
In [7]: df1 = pd.DataFrame({'a': [1, 2], 'b': [4, 5]}, index=NoRowIndex(2))
113+
In [6]: df1 = pd.DataFrame({'a': [1, 2], 'b': [4, 5]}, index=NoRowIndex(2))
112114

113-
In [8]: df2 = pd.DataFrame({'a': [4], 'b': [0]}, index=NoRowIndex(1))
115+
In [7]: df2 = pd.DataFrame({'a': [4], 'b': [0]}, index=NoRowIndex(1))
114116

115-
In [9]: df1
116-
Out[9]:
117+
In [8]: df1
118+
Out[8]:
117119
a b
118120
1 4
119121
2 5
120122

123+
In [9]: df2
124+
Out[9]:
125+
a b
126+
4 0
127+
121128
In [10]: pd.concat([df1, df2])
122129
Out[10]:
123130
a b
@@ -160,7 +167,11 @@ In [15]: df.loc[0, 'b']
160167
---------------------------------------------------------------------------
161168
IndexError: Cannot use label-based indexing on NoRowIndex!
162169
```
163-
Note that other uses of ``.loc``, such as boolean masks, would still be allowed (see F.A.Q).
170+
171+
Note too that:
172+
- other uses of ``.loc``, such as boolean masks, would still be allowed (see F.A.Q);
173+
- ``.iloc`` and ``.iat`` would keep working as before;
174+
- ``.at`` would raise.
164175

165176
### Aligning ``NoRowIndex``s
166177

@@ -184,10 +195,10 @@ dtype: int64
184195

185196
In [4]: ser1 + ser2.iloc[1:] # errors!
186197
---------------------------------------------------------------------------
187-
TypeError: Can't join NoRowIndex of different lengths
198+
TypeError: Cannot join NoRowIndex of different lengths
188199
```
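
The alignment rule shown above can be sketched in plain Python: with no row labels, addition is purely positional, and operands of different lengths cannot be joined. The function name `add_positionally` is hypothetical, and plain lists stand in for `Series` values; the error message mirrors the one in the example above.

```python
def add_positionally(left, right):
    """Add two label-free sequences element by element (illustrative only)."""
    if len(left) != len(right):
        # Without labels there is nothing to align on, so lengths must match
        raise TypeError("Cannot join NoRowIndex of different lengths")
    return [a + b for a, b in zip(left, right)]


same_length = add_positionally([1, 2, 3], [10, 20, 30])

try:
    add_positionally([1, 2, 3], [20, 30])
    mismatch_raised = False
except TypeError:
    mismatch_raised = True
```

Raising on a length mismatch, rather than padding with `NaN`, is what makes the behaviour predictable for users who never set row labels.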
189200

190-
### Columns can't be NoRowIndex
201+
### Columns cannot be NoRowIndex
191202

192203
This proposal deals exclusively with letting users not have to think about
193204
row labels. There's no suggestion to remove the column labels.
@@ -206,7 +217,7 @@ If you got here via `transpose` or an `axis=1` operation, then you should first
206217

207218
### DataFrameFormatter and SeriesFormatter changes
208219

209-
When printing an object with a ``NoRowIndex``, then the row labels wouldn't be shown:
220+
When printing an object with a ``NoRowIndex``, then the row labels would not be shown:
210221

211222
```python
212223
In [15]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))
@@ -224,42 +235,50 @@ Of the above changes, this may be the only one that would need implementing with
224235

225236
## Usage and Impact
226237

227-
By itself, ``NoRowIndex`` would be of limited use. To become useful and user-friendly,
228-
a ``no_default_index`` mode could be introduced which, if enabled, would change
229-
the ``default_index`` function to return a ``NoRowIndex`` of the appropriate length.
230-
In particular, ``.reset_index()`` would result in a ``DataFrame`` with a ``NoRowIndex``.
231-
Likewise, a ``DataFrame`` constructed without explicitly specifying ``index=``.
238+
Users would not be expected to work with the ``NoRowIndex`` class itself directly.
239+
Usage would probably involve a mode which would change how the ``default_index``
240+
function to return a ``NoRowIndex`` rather than a ``RangeIndex``.
241+
Then, if a user opted in to this mode with
242+
243+
```python
244+
pd.set_option('mode.no_row_index', True)
245+
```
246+
247+
then the following would all create a ``DataFrame`` with a ``NoRowIndex`` (as they
248+
all call ``default_index``):
232249

233-
Furthermore, it could be useful to add ``as_index`` options to methods which currently
234-
set an index, and then allow for that mode to control the ``as_index`` default.
250+
- ``df.reset_index(drop=True)``;
251+
- ``pd.concat([df1, df2], ignore_index=True)``
252+
- ``df1.merge(df2, on=col)``;
253+
- ``df = pd.DataFrame({'col_1': [1, 2, 3]})``
235254

236-
Discussion of such a mode is out-of-scope for this proposal. A ``NoRowIndex`` would
255+
Further discussion of such a mode is out-of-scope for this proposal. A ``NoRowIndex`` would
237256
just be a first step towards getting there.
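
To make the mode described above concrete, here is a small sketch of how a `default_index` helper might dispatch on such an option. Everything in it is a stand-in: the `_options` dictionary, this `set_option`, and the placeholder `NoRowIndex` dataclass are hypothetical and do not reflect pandas internals, and `range` is used in place of `RangeIndex`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoRowIndex:
    """Placeholder for the proposed class (hypothetical)."""
    length: int


# Toy option registry; in pandas this would be the real options machinery
_options = {"mode.no_row_index": False}


def set_option(key, value):
    if key not in _options:
        raise KeyError(key)
    _options[key] = value


def default_index(n):
    """Return the index used when the caller does not supply one."""
    if _options["mode.no_row_index"]:
        return NoRowIndex(n)
    return range(n)  # stands in for RangeIndex(n)


before = default_index(3)
set_option("mode.no_row_index", True)
after = default_index(3)
```

Because every listed operation goes through one helper, a single switch is enough to change what all of them return.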
238257

239258
## Implementation
240259

241260
Draft pull request showing proof of concept: https://github.com/pandas-dev/pandas/pull/49693.
242261

243262
Note that implementation details could well change even if this PDEP were
244-
accepted. For example, ``NoRowIndex`` wouldn't necessarily need to subclass
245-
``RangeIndex``, and it wouldn't necessarily need to be accessible to the user
263+
accepted. For example, ``NoRowIndex`` would not necessarily need to subclass
264+
``RangeIndex``, and it would not necessarily need to be accessible to the user
246265
(``df.index`` could well return ``None``)
247266

248267
## Likely FAQ
249268

250-
**Q: Couldn't users just use ``RangeIndex``? Why do we need a new class?**
269+
**Q: Could not users just use ``RangeIndex``? Why do we need a new class?**
251270

252-
**A**: ``RangeIndex`` isn't preserved under slicing and appending, e.g.:
271+
**A**: ``RangeIndex`` is not preserved under slicing and appending, e.g.:
253272
```python
254273
In [1]: ser = pd.Series([1,2,3])
255274

256275
In [2]: ser[ser!=2].index
257276
Out[2]: Int64Index([0, 2], dtype='int64')
258277
```
259-
If someone doesn't want to think about row labels and starts off
278+
If someone does not want to think about row labels and starts off
260279
with a ``RangeIndex``, they'll very quickly lose it.
261280

262-
**Q: Aren't indices really powerful?**
281+
**Q: Are not indices really powerful?**
263282

264283
**A:** Yes! And they're also confusing to many users, even experienced developers.
265284
It's fairly common to see pandas code with ``.reset_index`` scattered around every
@@ -275,10 +294,23 @@ accepted. For example, ``NoRowIndex`` wouldn't necessarily need to subclass
275294
```
276295
There's probably no need to introduce a new method for this.
277296

297+
Conversely, to get rid of the index, then (so long as one has enabled the ``mode.no_row_index`` option)
298+
one could simply do ``df.reset_index(drop=True)``.
299+
300+
**Q: How would ``tz_localize`` and other methods which operate on the index work on a ``NoRowIndex`` ``DataFrame``?**
301+
302+
**A:** Same way they work on other ``NumericIndex``s, which would typically be to raise:
303+
304+
```python
305+
In [2]: ser.tz_localize('UTC')
306+
---------------------------------------------------------------------------
307+
TypeError: index is not a valid DatetimeIndex or PeriodIndex
308+
```
309+
278310
**Q: Why not let transpose switch ``NoRowIndex`` to ``RangeIndex`` under the hood before swapping index and columns?**
279311

280312
**A:** This is the kind of magic that can lead to surprising behaviour that's
281-
difficult to debug. For example, ``df.transpose().transpose()`` wouldn't
313+
difficult to debug. For example, ``df.transpose().transpose()`` would not
282314
round-trip. It's easy enough to set an index after all, better to "force" users
283315
to be intentional about what they want and end up with fewer surprises later
284316
on.
