|
| 1 | +.. _roadmap: |
| 2 | + |
| 3 | +======= |
| 4 | +Roadmap |
| 5 | +======= |
| 6 | + |
| 7 | +This page provides an overview of the major themes in pandas' development. Each of |
| 8 | +these items requires a relatively large amount of effort to implement. These may |
| 9 | +be achieved more quickly with dedicated funding or interest from contributors. |
| 10 | + |
| 11 | +An item being on the roadmap does not mean that it will *necessarily* happen, even |
| 12 | +with unlimited funding. During the implementation period we may discover issues |
| 13 | +preventing the adoption of the feature. |
| 14 | + |
| 15 | +Additionally, an item *not* being on the roadmap does not exclude it from inclusion |
| 16 | +in pandas. The roadmap is intended for larger, fundamental changes to the project that |
| 17 | +are likely to take months or years of developer time. Smaller-scoped items will continue |
| 18 | +to be tracked on our `issue tracker <https://github.com/pandas-dev/pandas/issues>`__. |
| 19 | + |
| 20 | +See :ref:`roadmap.evolution` for proposing changes to this document. |
| 21 | + |
| 22 | +Extensibility |
| 23 | +------------- |
| 24 | + |
| 25 | +Pandas :ref:`extending.extension-types` allow for extending NumPy types with custom |
| 26 | +data types and array storage. Pandas uses extension types internally, and provides |
| 27 | +an interface for third-party libraries to define their own custom data types. |
| 28 | + |
| 29 | +Many parts of pandas still unintentionally convert data to a NumPy array. |
| 30 | +These problems are especially pronounced for nested data. |
| 31 | + |
| 32 | +We'd like to improve the handling of extension arrays throughout the library, |
| 33 | +making their behavior more consistent with the handling of NumPy arrays. We'll do this |
| 34 | +by cleaning up pandas' internals and adding new methods to the extension array interface. |
| 35 | + |
| 36 | +String data type |
| 37 | +---------------- |
| 38 | + |
| 39 | +Currently, pandas stores text data in an ``object``-dtype NumPy array. |
| 40 | +The current implementation has two primary drawbacks. First, ``object``-dtype |
| 41 | +is not specific to strings: any Python object can be stored in an ``object``-dtype |
| 42 | +array, not just strings. Second, it is not efficient: the NumPy memory model |
| 43 | +isn't especially well-suited to variable-width text data. |
| 44 | + |
| 45 | +To solve the first issue, we propose a new extension type for string data. This |
| 46 | +will initially be opt-in, with users explicitly requesting ``dtype="string"``. |
| 47 | +The array backing this string dtype may initially be the current implementation: |
| 48 | +an ``object``-dtype NumPy array of Python strings. |
| 49 | + |
| 50 | +To solve the second issue (performance), we'll explore alternative in-memory |
| 51 | +array libraries (for example, Apache Arrow). As part of the work, we may |
| 52 | +need to implement certain operations expected by pandas users (for example, |
| 53 | +the algorithm used in ``Series.str.upper``). That work may be done outside of |
| 54 | +pandas. |
| 55 | + |
| 56 | +Apache Arrow interoperability |
| 57 | +----------------------------- |
| 58 | + |
| 59 | +`Apache Arrow <https://arrow.apache.org>`__ is a cross-language development |
| 60 | +platform for in-memory data. The Arrow logical types are closely aligned with |
| 61 | +typical pandas use cases. |
| 62 | + |
| 63 | +We'd like to provide better-integrated support for Arrow memory and data types |
| 64 | +within pandas. This will let us take advantage of its I/O capabilities and |
| 65 | +provide for better interoperability with other languages and libraries |
| 66 | +using Arrow. |
| 67 | + |
| 68 | +Block manager rewrite |
| 69 | +--------------------- |
| 70 | + |
| 71 | +We'd like to replace pandas' current internal data structures (a collection of |
| 72 | +1-D or 2-D arrays) with a simpler collection of 1-D arrays. |
| 73 | + |
| 74 | +Pandas' internal data model is quite complex. A DataFrame is made up of |
| 75 | +one or more 2-dimensional "blocks", with one or more blocks per dtype. This |
| 76 | +collection of 2-D arrays is managed by the BlockManager. |
| 77 | + |
| 78 | +The primary benefit of the BlockManager is improved performance on certain |
| 79 | +operations (construction from a 2D array, binary operations, reductions across the columns), |
| 80 | +especially for wide DataFrames. However, the BlockManager substantially increases the |
| 81 | +complexity and maintenance burden of pandas. |
| 82 | + |
| 83 | +By replacing the BlockManager we hope to achieve: |
| 84 | + |
| 85 | +* Substantially simpler code |
| 86 | +* Easier extensibility with new logical types |
| 87 | +* Better user control over memory use and layout |
| 88 | +* Improved micro-performance |
| 89 | +* Option to provide a C / Cython API to pandas' internals |
| 90 | + |
| 91 | +See `these design documents <https://dev.pandas.io/pandas2/internal-architecture.html#removal-of-blockmanager-new-dataframe-internals>`__ |
| 92 | +for more. |
| 93 | + |
| 94 | +Decoupling of indexing and internals |
| 95 | +------------------------------------ |
| 96 | + |
| 97 | +The code for getting and setting values in pandas' data structures needs refactoring. |
| 98 | +In particular, we must clearly separate code that converts keys (e.g., the argument |
| 99 | +to ``DataFrame.loc``) to positions from code that uses these positions to get |
| 100 | +or set values. This is related to the proposed BlockManager rewrite. Currently, the |
| 101 | +BlockManager sometimes uses label-based, rather than position-based, indexing. |
| 102 | +We propose that it should only work with positional indexing, and the translation of keys |
| 103 | +to positions should be entirely done at a higher level. |
| 104 | + |
| 105 | +Indexing is a complicated API with many subtleties. This refactor will require care |
| 106 | +and attention. More details are discussed at |
| 107 | +https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code |
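The separation described above can be illustrated with a minimal, pure-Python sketch. It is not pandas code: ``PositionalStore``, ``LabelIndexer``, ``loc_get`` and ``loc_set`` are hypothetical names invented for this illustration. The point is only that the storage layer sees integer positions, while a higher layer translates keys to positions.

```python
from typing import Any, Hashable, List, Sequence


class PositionalStore:
    """Hypothetical stand-in for the internals: it only understands positions."""

    def __init__(self, values: Sequence[Any]) -> None:
        self._values: List[Any] = list(values)

    def get(self, position: int) -> Any:
        return self._values[position]

    def set(self, position: int, value: Any) -> None:
        self._values[position] = value


class LabelIndexer:
    """Hypothetical higher layer: it only translates labels to positions."""

    def __init__(self, labels: Sequence[Hashable]) -> None:
        self._positions = {label: i for i, label in enumerate(labels)}

    def to_position(self, key: Hashable) -> int:
        try:
            return self._positions[key]
        except KeyError:
            raise KeyError(f"label not found: {key!r}") from None


def loc_get(indexer: LabelIndexer, store: PositionalStore, key: Hashable) -> Any:
    # Step 1: convert the key to a position. Step 2: positional lookup.
    return store.get(indexer.to_position(key))


def loc_set(
    indexer: LabelIndexer, store: PositionalStore, key: Hashable, value: Any
) -> None:
    # The store never sees the label, only the resolved position.
    store.set(indexer.to_position(key), value)


indexer = LabelIndexer(["a", "b", "c"])
store = PositionalStore([10, 20, 30])
loc_set(indexer, store, "b", 99)
```

Under this assumed design, all label handling (including missing-key errors) lives in one place, and the storage layer can be tested with plain integer positions.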
| 108 | + |
| 109 | +Numba-accelerated operations |
| 110 | +---------------------------- |
| 111 | + |
| 112 | +`Numba <https://numba.pydata.org>`__ is a JIT compiler for Python code. We'd like to provide |
| 113 | +ways for users to apply their own Numba-jitted functions where pandas accepts user-defined functions |
| 114 | +(for example, :meth:`Series.apply`, :meth:`DataFrame.apply`, :meth:`DataFrame.applymap`, |
| 115 | +and in groupby and window contexts). This will improve the performance of |
| 116 | +user-defined functions in these operations by staying within compiled code. |
| 117 | + |
| 118 | + |
| 119 | +Documentation improvements |
| 120 | +-------------------------- |
| 121 | + |
| 122 | +We'd like to improve the content, structure, and presentation of the pandas documentation. |
| 123 | +Some specific goals include: |
| 124 | + |
| 125 | +* Overhaul the HTML theme with a modern, responsive design (:issue:`15556`) |
| 126 | +* Improve the "Getting Started" documentation, designing and writing learning paths |
| 127 | + for users with different backgrounds (e.g. brand new to programming, familiar with |
| 128 | + other languages like R, already familiar with Python). |
| 129 | +* Improve the overall organization of the documentation and specific subsections |
| 130 | + of the documentation to make navigation and finding content easier. |
| 131 | + |
| 132 | +Package docstring validation |
| 133 | +---------------------------- |
| 134 | + |
| 135 | +To improve the quality and consistency of pandas docstrings, we've developed |
| 136 | +tooling to check docstrings in a variety of ways. |
| 137 | +https://github.com/pandas-dev/pandas/blob/master/scripts/validate_docstrings.py |
| 138 | +contains the checks. |
| 139 | + |
| 140 | +Like many other projects, pandas uses the |
| 141 | +`numpydoc <https://numpydoc.readthedocs.io/en/latest/>`__ style for writing |
| 142 | +docstrings. With the collaboration of the numpydoc maintainers, we'd like to |
| 143 | +move the checks to a package other than pandas so that other projects can easily |
| 144 | +use them as well. |
| 145 | + |
| 146 | +Performance monitoring |
| 147 | +---------------------- |
| 148 | + |
| 149 | +Pandas uses `airspeed velocity <https://asv.readthedocs.io/en/stable/>`__ to |
| 150 | +monitor for performance regressions. ASV itself is a fabulous tool, but requires |
| 151 | +some additional work to be integrated into an open source project's workflow. |
| 152 | + |
| 153 | +The `asv-runner <https://github.com/asv-runner>`__ organization, currently made up |
| 154 | +of pandas maintainers, provides tools built on top of ASV. We have a physical |
| 155 | +machine for running a number of projects' benchmarks, and tools for managing the |
| 156 | +benchmark runs and reporting on results. |
| 157 | + |
| 158 | +We'd like to fund improvements and maintenance of these tools to: |
| 159 | + |
| 160 | +* Be more stable. Currently, they're maintained on the nights and weekends when |
| 161 | + a maintainer has free time. |
| 162 | +* Tune the system for benchmarks to improve stability, following |
| 163 | + https://pyperf.readthedocs.io/en/latest/system.html |
| 164 | +* Build a GitHub bot to request ASV runs *before* a PR is merged. Currently, the |
| 165 | + benchmarks are only run nightly. |
| 166 | + |
| 167 | +.. _roadmap.evolution: |
| 168 | + |
| 169 | +Roadmap Evolution |
| 170 | +----------------- |
| 171 | + |
| 172 | +Pandas continues to evolve. The direction is primarily determined by community |
| 173 | +interest. Everyone is welcome to review existing items on the roadmap and |
| 174 | +to propose a new item. |
| 175 | + |
| 176 | +Each item on the roadmap should be a short summary of a larger design proposal. |
| 177 | +The proposal should include: |
| 178 | + |
| 179 | +1. Short summary of the changes, which would be appropriate for inclusion in |
| 180 | + the roadmap if accepted. |
| 181 | +2. Motivation for the changes. |
| 182 | +3. An explanation of why the change is in scope for pandas. |
| 183 | +4. Detailed design: preferably with example usage (even if not implemented yet) |
| 184 | + and API documentation. |
| 185 | +5. API Change: Any API changes that may result from the proposal. |
| 186 | + |
| 187 | +That proposal may then be submitted as a GitHub issue, where the pandas maintainers |
| 188 | +can review and comment on the design. The `pandas mailing list <https://mail.python.org/mailman/listinfo/pandas-dev>`__ |
| 189 | +should be notified of the proposal. |
| 190 | + |
| 191 | +When there's agreement that an implementation |
| 192 | +would be welcome, the roadmap should be updated to include the summary and a |
| 193 | +link to the discussion issue. |