Commit e3f8348

meeseeksmachine authored and jreback committed
Backport PR #27478: Add a Roadmap (#27698)
1 parent 9d4237b commit e3f8348

2 files changed

+194
-0
lines changed

doc/source/development/index.rst

+1
@@ -16,3 +16,4 @@ Development
1616
internals
1717
extending
1818
developer
19+
roadmap

doc/source/development/roadmap.rst

+193
@@ -0,0 +1,193 @@
1+
.. _roadmap:
2+
3+
=======
4+
Roadmap
5+
=======
6+
7+
This page provides an overview of the major themes in pandas' development. Each of
8+
these items requires a relatively large amount of effort to implement. These may
9+
be achieved more quickly with dedicated funding or interest from contributors.
10+
11+
An item being on the roadmap does not mean that it will *necessarily* happen, even
12+
with unlimited funding. During the implementation period we may discover issues
13+
preventing the adoption of the feature.
14+
15+
Additionally, an item *not* being on the roadmap does not exclude it from inclusion
16+
in pandas. The roadmap is intended for larger, fundamental changes to the project that
17+
are likely to take months or years of developer time. Smaller-scoped items will continue
18+
to be tracked on our `issue tracker <https://github.com/pandas-dev/pandas/issues>`__.
19+
20+
See :ref:`roadmap.evolution` for proposing changes to this document.
21+
22+
Extensibility
23+
-------------
24+
25+
Pandas :ref:`extending.extension-types` allow for extending NumPy types with custom
26+
data types and array storage. Pandas uses extension types internally, and provides
27+
an interface for third-party libraries to define their own custom data types.
28+
29+
Many parts of pandas still unintentionally convert data to a NumPy array.
30+
These problems are especially pronounced for nested data.
31+
32+
We'd like to improve the handling of extension arrays throughout the library,
33+
making their behavior more consistent with the handling of NumPy arrays. We'll do this
34+
by cleaning up pandas' internals and adding new methods to the extension array interface.
35+
36+
String data type
37+
----------------
38+
39+
Currently, pandas stores text data in an ``object``-dtype NumPy array.
40+
The current implementation has two primary drawbacks. First, ``object``-dtype
41+
is not specific to strings: any Python object can be stored in an ``object``-dtype
42+
array, not just strings. Second, this is not efficient. The NumPy memory model
43+
isn't especially well-suited to variable-width text data.
44+
45+
To solve the first issue, we propose a new extension type for string data. This
46+
will initially be opt-in, with users explicitly requesting ``dtype="string"``.
47+
The array backing this string dtype may initially be the current implementation:
48+
an ``object``-dtype NumPy array of Python strings.
49+
50+
To solve the second issue (performance), we'll explore alternative in-memory
51+
array libraries (for example, Apache Arrow). As part of the work, we may
52+
need to implement certain operations expected by pandas users (for example
53+
the algorithm used in ``Series.str.upper``). That work may be done outside of
54+
pandas.
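
As a minimal sketch of the opt-in behavior described above, the snippet below
contrasts ``object`` dtype with ``dtype="string"``. It assumes a pandas release
in which the string extension type has already been implemented (1.0 or later).
The variable names are illustrative only.

```python
import pandas as pd

# object dtype accepts any Python object, not only strings.
mixed = pd.Series(["a", 1, None], dtype="object")

# Opt-in string extension dtype (assumption: pandas >= 1.0).
text = pd.Series(["a", "b", None], dtype="string")

# String methods such as Series.str.upper keep working on the new dtype,
# and missing values stay missing.
upper = text.str.upper()
```

Which array backs the ``"string"`` dtype (``object``-dtype NumPy or an
alternative such as Arrow) is an implementation detail that this sketch does
not depend on.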
55+
56+
Apache Arrow interoperability
57+
-----------------------------
58+
59+
`Apache Arrow <https://arrow.apache.org>`__ is a cross-language development
60+
platform for in-memory data. The Arrow logical types are closely aligned with
61+
typical pandas use cases.
62+
63+
We'd like to provide better-integrated support for Arrow memory and data types
64+
within pandas. This will let us take advantage of its I/O capabilities and
65+
provide for better interoperability with other languages and libraries
66+
using Arrow.
67+
68+
Block manager rewrite
69+
---------------------
70+
71+
We'd like to replace pandas' current internal data structures (a collection of
72+
1-D or 2-D arrays) with a simpler collection of 1-D arrays.
73+
74+
Pandas' internal data model is quite complex. A DataFrame is made up of
75+
one or more 2-dimensional "blocks", with one or more blocks per dtype. This
76+
collection of 2-D arrays is managed by the BlockManager.
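
The difference between the two layouts can be illustrated with a toy model in
plain Python. This is only a sketch: ``consolidate`` and ``get_column`` are
hypothetical helpers, not pandas' internals, and for simplicity the model keeps
exactly one block per type rather than "one or more".

```python
# Toy model only: hypothetical names, not pandas' actual internals.

# Proposed layout: a plain collection of 1-D arrays, one per column.
columns = {"a": [1, 2, 3], "b": [0.5, 1.5, 2.5], "c": [4, 5, 6]}


def consolidate(columns):
    """Group same-typed columns into 2-D "blocks" (one row per column)."""
    blocks = {}
    for name, values in columns.items():
        kind = type(values[0]).__name__
        block = blocks.setdefault(kind, {"names": [], "rows": []})
        block["names"].append(name)
        block["rows"].append(values)
    return blocks


def get_column(blocks, name):
    """Find a column by searching every block for its name."""
    for block in blocks.values():
        if name in block["names"]:
            return block["rows"][block["names"].index(name)]
    raise KeyError(name)


# Block layout: columns "a" and "c" share one integer block.
blocks = consolidate(columns)
```

In the block layout, even a single-column lookup has to go through the
bookkeeping that maps column names to block rows. In the 1-D layout it is just
``columns["b"]``.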
77+
78+
The primary benefit of the BlockManager is improved performance on certain
79+
operations (construction from a 2D array, binary operations, reductions across the columns),
80+
especially for wide DataFrames. However, the BlockManager substantially increases the
81+
complexity and maintenance burden of pandas.
82+
83+
By replacing the BlockManager we hope to achieve:
84+
85+
* Substantially simpler code
86+
* Easier extensibility with new logical types
87+
* Better user control over memory use and layout
88+
* Improved micro-performance
89+
* Option to provide a C / Cython API to pandas' internals
90+
91+
See `these design documents <https://dev.pandas.io/pandas2/internal-architecture.html#removal-of-blockmanager-new-dataframe-internals>`__
92+
for more.
93+
94+
Decoupling of indexing and internals
95+
------------------------------------
96+
97+
The code for getting and setting values in pandas' data structures needs refactoring.
98+
In particular, we must clearly separate code that converts keys (e.g., the argument
99+
to ``DataFrame.loc``) to positions from code that uses these positions to get
100+
or set values. This is related to the proposed BlockManager rewrite. Currently, the
101+
BlockManager sometimes uses label-based, rather than position-based, indexing.
102+
We propose that it should only work with positional indexing, and the translation of keys
103+
to positions should be entirely done at a higher level.
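
The proposed separation can be sketched in plain Python. The function names
below are hypothetical and stand in for the two layers; they are not part of
pandas' API.

```python
def labels_to_positions(index, key):
    """Higher level: translate a label, or a list of labels, to positions."""
    lookup = {label: pos for pos, label in enumerate(index)}
    if isinstance(key, list):
        return [lookup[k] for k in key]
    return lookup[key]


def take_positions(values, positions):
    """Lower level: knows only about integer positions, never labels."""
    if isinstance(positions, list):
        return [values[p] for p in positions]
    return values[positions]


def loc(index, values, key):
    """Label-based access composed from the two independent layers."""
    return take_positions(values, labels_to_positions(index, key))


index = ["x", "y", "z"]
values = [10, 20, 30]
single = loc(index, values, "y")
several = loc(index, values, ["z", "x"])
```

Under this split, the storage layer (``take_positions`` here) can be swapped
out without touching the label-handling rules, and vice versa.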
104+
105+
Indexing is a complicated API with many subtleties. This refactor will require care
106+
and attention. More details are discussed at
107+
https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code
108+
109+
Numba-accelerated operations
110+
----------------------------
111+
112+
`Numba <https://numba.pydata.org>`__ is a JIT compiler for Python code. We'd like to provide
113+
ways for users to apply their own Numba-jitted functions where pandas accepts user-defined functions
114+
(for example, :meth:`Series.apply`, :meth:`DataFrame.apply`, :meth:`DataFrame.applymap`,
115+
and in groupby and window contexts). This will improve the performance of
116+
user-defined functions in these operations by staying within compiled code.
117+
118+
119+
Documentation improvements
120+
--------------------------
121+
122+
We'd like to improve the content, structure, and presentation of the pandas documentation.
123+
Some specific goals include:
124+
125+
* Overhaul the HTML theme with a modern, responsive design (:issue:`15556`)
126+
* Improve the "Getting Started" documentation, designing and writing learning paths
127+
for users with different backgrounds (e.g. brand new to programming, familiar with
128+
other languages like R, already familiar with Python).
129+
* Improve the overall organization of the documentation and specific subsections
130+
of the documentation to make navigation and finding content easier.
131+
132+
Package docstring validation
133+
----------------------------
134+
135+
To improve the quality and consistency of pandas docstrings, we've developed
136+
tooling to check docstrings in a variety of ways.
137+
https://github.com/pandas-dev/pandas/blob/master/scripts/validate_docstrings.py
138+
contains the checks.
139+
140+
Like many other projects, pandas uses the
141+
`numpydoc <https://numpydoc.readthedocs.io/en/latest/>`__ style for writing
142+
docstrings. With the collaboration of the numpydoc maintainers, we'd like to
143+
move the checks to a package other than pandas so that other projects can easily
144+
use them as well.
145+
146+
Performance monitoring
147+
----------------------
148+
149+
Pandas uses `airspeed velocity <https://asv.readthedocs.io/en/stable/>`__ to
150+
monitor for performance regressions. ASV itself is a fabulous tool, but requires
151+
some additional work to be integrated into an open source project's workflow.
152+
153+
The `asv-runner <https://github.com/asv-runner>`__ organization, currently made up
154+
of pandas maintainers, provides tools built on top of ASV. We have a physical
155+
machine for running a number of projects' benchmarks, and tools for managing the
156+
benchmark runs and reporting on results.
157+
158+
We'd like to fund improvements and maintenance of these tools to:
159+
160+
* Be more stable. Currently, they're maintained on the nights and weekends when
161+
a maintainer has free time.
162+
* Tune the system for benchmarks to improve stability, following
163+
https://pyperf.readthedocs.io/en/latest/system.html
164+
* Build a GitHub bot to request ASV runs *before* a PR is merged. Currently, the
165+
benchmarks are only run nightly.
166+
167+
.. _roadmap.evolution:
168+
169+
Roadmap Evolution
170+
-----------------
171+
172+
Pandas continues to evolve. The direction is primarily determined by community
173+
interest. Everyone is welcome to review existing items on the roadmap and
174+
to propose a new item.
175+
176+
Each item on the roadmap should be a short summary of a larger design proposal.
177+
The proposal should include:
178+
179+
1. Short summary of the changes, which would be appropriate for inclusion in
180+
the roadmap if accepted.
181+
2. Motivation for the changes.
182+
3. An explanation of why the change is in scope for pandas.
183+
4. Detailed design: Preferably with example usage (even if not implemented yet)
184+
and API documentation.
185+
5. API Change: Any API changes that may result from the proposal.
186+
187+
That proposal may then be submitted as a GitHub issue, where the pandas maintainers
188+
can review and comment on the design. The `pandas mailing list <https://mail.python.org/mailman/listinfo/pandas-dev>`__
189+
should be notified of the proposal.
190+
191+
When there's agreement that an implementation
192+
would be welcome, the roadmap should be updated to include the summary and a
193+
link to the discussion issue.
