diff --git a/doc/source/conf.py b/doc/source/conf.py
index 2a6ec8947c8d7..81ff14d33758a 100644
--- a/doc/source/conf.py
+++ b/doc/source/conf.py
@@ -453,7 +453,6 @@
 # extlinks alias
 extlinks = {
     "issue": ("https://github.com/pandas-dev/pandas/issues/%s", "GH"),
-    "wiki": ("https://github.com/pandas-dev/pandas/wiki/%s", "wiki "),
 }


diff --git a/web/pandas/about/roadmap.md b/web/pandas/about/roadmap.md
index 35a6b3361f32e..3c6c4d4fdf9a2 100644
--- a/web/pandas/about/roadmap.md
+++ b/web/pandas/about/roadmap.md
@@ -103,9 +103,52 @@ uses label-based, rather than position-based, indexing. We propose that
 it should only work with positional indexing, and the translation of
 keys to positions should be entirely done at a higher level.

-Indexing is a complicated API with many subtleties. This refactor will
-require care and attention. More details are discussed at
-
+Indexing is a complicated API with many subtleties. This refactor will require care
+and attention. The following principles should inspire refactoring of indexing code and
+should result in cleaner, simpler, and more performant code.
+
+1. Label indexing must never involve looking in an axis twice for the same label(s).
+This implies that any validation step must either:
+
+  * limit validation to general features (e.g. dtype/structure of the key/index), or
+  * reuse the result for the actual indexing.
+
+2. Indexers must never rely on an explicit call to other indexers.
+For instance, it is OK to have some internal method of `.loc` call some
+internal method of `__getitem__` (or of their common base class),
+but never in the code flow of `.loc` should `the_obj[something]` appear.
+
+3. Execution of positional indexing must never involve labels (as currently, sadly, happens).
+That is, the code flow of a getter call (or a setter call in which the right hand side is non-indexed)
+to `.iloc` should never involve the axes of the object in any way.
+
+4. Indexing must never involve accessing/modifying values (i.e., act on `._data` or `.values`) more than once.
+The following steps must hence be clearly decoupled:
+
+  * find positions we need to access/modify on each axis
+  * (if we are accessing) derive the type of object we need to return (dimensionality)
+  * actually access/modify the values
+  * (if we are accessing) construct the return object
+
+5. As a corollary to the decoupling between 4.i and 4.iii, any code which deals with how data is stored
+(including any combination of handling multiple dtypes, sparse storage, categoricals, third-party types)
+must be independent from code that deals with identifying affected rows/columns,
+and take place only once step 4.i is completed.
+
+  * In particular, such code should most probably not live in `pandas/core/indexing.py`
+  * ... and must not depend in any way on the type(s) of axes (e.g. no `MultiIndex` special cases)
+
+6. As a corollary to point 1.i, `Index` (sub)classes must provide separate methods for any desired validity check of label(s) which does not involve actual lookup,
+on the one side, and for any required conversion/adaptation/lookup of label(s), on the other.
+
+7. Use of trial and error should be limited, and anyway restricted to catch only exceptions
+which are actually expected (typically `KeyError`).
+
+  * In particular, code should never (intentionally) raise new exceptions in the `except` portion of a `try... except`
+
+8. Any code portion which is not specific to setters and getters must be shared,
+and when small differences in behavior are expected (e.g. getting with `.loc` raises for
+missing labels, setting still doesn't), they can be managed with a specific parameter.

 ## Numba-accelerated operations
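
Note on principle 4 in the roadmap hunk: the four decoupled steps can be illustrated with a small toy container. This is only a sketch under stated assumptions. `ToySeries`, `_find_positions`, `_take`, and `loc_get` are hypothetical names invented for illustration and are not pandas APIs, and a plain dict stands in for the axis. Labels are resolved once (4.i), the return shape is decided before any values are read (4.ii), values are read once and purely by position (4.iii), and the result is built last (4.iv). A missing label surfaces as the `KeyError` raised by the lookup itself, with no re-raise inside an `except` block.

```python
class ToySeries:
    """Hypothetical labelled 1-D container (not a pandas class)."""

    def __init__(self, values, labels):
        self._values = list(values)
        self._labels = list(labels)
        # Built once; every label lookup goes through this map only.
        self._label_to_pos = {label: i for i, label in enumerate(self._labels)}

    def _find_positions(self, key):
        # Step 4.i: one pass over the axis turns labels into positions.
        # A missing label raises the dict's own KeyError.
        if isinstance(key, list):
            return [self._label_to_pos[k] for k in key], True
        return self._label_to_pos[key], False

    def _take(self, positions):
        # Step 4.iii: purely positional access; labels are not consulted.
        if isinstance(positions, list):
            return [self._values[p] for p in positions]
        return self._values[positions]

    def loc_get(self, key):
        # 4.i: find positions; 4.ii: is_list fixes the return dimensionality.
        positions, is_list = self._find_positions(key)
        # 4.iii: the stored values are read exactly once.
        values = self._take(positions)
        if not is_list:
            return values
        # 4.iv: construct the return object from data already gathered.
        return ToySeries(values, [self._labels[p] for p in positions])
```

For example, `ToySeries([10, 20, 30], ["a", "b", "c"]).loc_get(["c", "a"])` returns a new `ToySeries` holding the values for `"c"` and `"a"` in that order, while a scalar key returns a scalar.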