From 965ecd18fd944a47040b27059894aa525f0ae2b0 Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Thu, 18 Jul 2019 19:43:51 -0500
Subject: [PATCH 01/24] added roadmap

---
 doc/source/roadmap.rst | 112 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)
 create mode 100644 doc/source/roadmap.rst

diff --git a/doc/source/roadmap.rst b/doc/source/roadmap.rst
new file mode 100644
index 0000000000000..118e00540f366
--- /dev/null
+++ b/doc/source/roadmap.rst
@@ -0,0 +1,112 @@
+
+.. _roadmap:
+
+=======
+Roadmap
+=======
+
+This page provides an overview of the major themes pandas' development.
+
+Extensibility
+-------------
+
+Pandas Extension Arrays provide 3rd party libraries the ability to
+extend pandas' supported types. In theory, these 3rd party types can do
+everything one of pandas. In practice, many places in pandas will unintentionally
+convert the ExtensionArray to a NumPy array of Python objects, causing
+performance issues and the loss of type information.
+
+Additionally, there are known issues when the scalar type of an ExtensionArray
+is actual a sequence (nested data). Developing APIs for working with nested data,
+and ensuring that pandas' internal routines can handled it, will be a major effort.
+
+String Dtype
+------------
+
+Currently, pandas stores text data in an ``object`` -dtype NumPy array.
+Each array stores Python strings. While pragmatic, since we rely on NumPy
+for storage and Python for string operations, this is memory inefficient
+and slow. We'd like to provide a native string type for pandas.
+
+The most obvious candidate is Apache Arrow. Currently, Arrow provides
+storage for string data. We can work with the Arrow developers to implement
+the operations needed for pandas users (for example, ``Series.str.upper``).
+
+Apache Arrow Interoperability
+-----------------------------
+
+`Apache Arrow `__ is a cross-language development
+platform for in-memory data. The Arrow logical types are closely aligned with
+typical pandas use cases (for example, support for nullable integers).
+
+We'd like have a pandas DataFrame be backed by a collection of Apache Arrow
+arrays. This should simplify pandas internals and ensure more consistent
+handling of data types through operations.
+
+Block Manager Rewrite
+---------------------
+
+Pandas internal data model is quite complex. A DataFrame is made up of
+one or more 2-dimension "blocks", with one or more blocks per dtype. This
+collection of 2-D arrays is managed by the BlockManager.
+
+The primary benefit of the BlockManager is improved performance on certain
+operations, especially for wide DataFrames. Consider summing a table with ``P``
+columns. When stored as a 2-D array, this results in a single call to
+``numpy.sum``. If this were stored as ``P`` arrays, we'd have a Python for loop
+going calling ``numpy.sum`` P times.
+
+By replacing the BlockManager we hope to achieve
+
+* Substantially simpler code
+* Easier extensibility with new logical types
+* Better user control over memory use and layout
+* Improved microperformance
+
+See `these design documents `__
+for more.
+
+Weighted Operations
+-------------------
+
+In many fields, sample weights are necessary to correctly estimate population
+statistics. We'd like to support weighted operations (like `mean`, `sum`, `std`,
+etc.), possibly with an API similar to `DataFrame.groupby`.
+
+See https://github.com/pandas-dev/pandas/issues/10030 for more.
+
+Package Docstring Validation
+----------------------------
+
+To improve the quality and consistency of pandas docstrings, we've developed
+tooling to check docstrings in a variety of ways.
+https://github.com/pandas-dev/pandas/blob/master/scripts/validate_docstrings.py
+contains the checks.
+
+Like many other projects, pandas uses the
+[numpydoc](https://numpydoc.readthedocs.io/en/latest/) style for writing
+docstrings. With the collaboration of the numpydoc maintainers, we'd like to
+move the checks to a package other than pandas so that other projects can easily
+use them as well.
+
+Performance Monitoring
+----------------------
+
+Pandas uses [airspeed velocity](https://asv.readthedocs.io/en/stable/) to
+monitor for performance regressions. ASV itself is a fabulous tool, but requires
+some additional work to be integrated into an open source project's workflow.
+
+The [asv-runner](https://github.com/asv-runner) organization, currently made up
+of pandas maintainers, provides tools built on top of ASV. We have a physical
+machine for running a number of project's benchmarks, and tools managing the
+benchmark runs and reporting on results.
+
+We'd like to fund improvements and maintenance of these tools to
+
+* Be more stable. Currently, they're maintained on the nights and weekends when
+  a maintainer has free time.
+* Tune the system for benchmarks to improve stability, following
+  https://pyperf.readthedocs.io/en/latest/system.html
+* Build a GitHub bot to request ASV runs *before* a PR is merged. Currently, the
+  benchmarks are only run nightly.
+
From 98656c87d5f15bf55ad8c01708af596a1f8a7823 Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Fri, 19 Jul 2019 09:43:25 -0500
Subject: [PATCH 02/24] added roadmap

---
 doc/source/development/index.rst |  1 +
 doc/source/roadmap.rst           | 29 ++++++++++++++++-------------
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/doc/source/development/index.rst b/doc/source/development/index.rst
index a149f31118ed5..c7710ff19f078 100644
--- a/doc/source/development/index.rst
+++ b/doc/source/development/index.rst
@@ -16,3 +16,4 @@ Development
     internals
     extending
     developer
+    roadmap
diff --git a/doc/source/roadmap.rst b/doc/source/roadmap.rst
index 118e00540f366..65960d635467c 100644
--- a/doc/source/roadmap.rst
+++ b/doc/source/roadmap.rst
@@ -1,11 +1,11 @@
-
 .. _roadmap:

 =======
 Roadmap
 =======

-This page provides an overview of the major themes pandas' development.
+This page provides an overview of the major themes pandas' development. Implementation
+of these goals may be hastened with dedicated funding.

 Extensibility
 -------------
@@ -14,11 +14,11 @@ Pandas Extension Arrays provide 3rd party libraries the ability to
 extend pandas' supported types. In theory, these 3rd party types can do
 everything one of pandas. In practice, many places in pandas will unintentionally
 convert the ExtensionArray to a NumPy array of Python objects, causing
-performance issues and the loss of type information.
+performance issues and the loss of type information. These problems are especially
+pronounced for nested data.

-Additionally, there are known issues when the scalar type of an ExtensionArray
-is actual a sequence (nested data). Developing APIs for working with nested data,
-and ensuring that pandas' internal routines can handled it, will be a major effort.
+We'd like to improve the handling of extension arrays throughout the library,
+making their behavior more consistent with the handling of NumPy arrays.

 String Dtype
 ------------
@@ -28,7 +28,7 @@ Each array stores Python strings. While pragmatic, since we rely on NumPy
 for storage and Python for string operations, this is memory inefficient
 and slow. We'd like to provide a native string type for pandas.

-The most obvious candidate is Apache Arrow. Currently, Arrow provides
+The most obvious alternative is Apache Arrow. Currently, Arrow provides
 storage for string data. We can work with the Arrow developers to implement
 the operations needed for pandas users (for example, ``Series.str.upper``).

@@ -39,22 +39,24 @@ Apache Arrow Interoperability
 platform for in-memory data. The Arrow logical types are closely aligned with
 typical pandas use cases (for example, support for nullable integers).

-We'd like have a pandas DataFrame be backed by a collection of Apache Arrow
-arrays. This should simplify pandas internals and ensure more consistent
+We'd like have a pandas DataFrame be backed by Arrow memory and data types
+by default. This should simplify pandas internals and ensure more consistent
 handling of data types through operations.

 Block Manager Rewrite
 ---------------------

+We'd like to replace pandas current internal data structures (a collection of
+1 or 2-D arrays) with a simpler collection of 1-D arrays.
+
 Pandas internal data model is quite complex. A DataFrame is made up of
 one or more 2-dimension "blocks", with one or more blocks per dtype. This
 collection of 2-D arrays is managed by the BlockManager.

 The primary benefit of the BlockManager is improved performance on certain
-operations, especially for wide DataFrames. Consider summing a table with ``P``
-columns. When stored as a 2-D array, this results in a single call to
-``numpy.sum``. If this were stored as ``P`` arrays, we'd have a Python for loop
-going calling ``numpy.sum`` P times.
+operations (construction from a 2D array, binary operations, reductions across the columns),
+especially for wide DataFrames. However, the BlockManager substantially increases the
+complexity and maintenance burden of pandas'.

 By replacing the BlockManager we hope to achieve

@@ -62,6 +64,7 @@ By replacing the BlockManager we hope to achieve
 * Easier extensibility with new logical types
 * Better user control over memory use and layout
 * Improved microperformance
+* Option to provide a C / Cython API to pandas' internals

 See `these design documents `__
 for more.
From 12f1f6749dd9a6582205e278902263bf539b096e Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Fri, 19 Jul 2019 13:07:53 -0500
Subject: [PATCH 03/24] update roadmap

---
 doc/source/roadmap.rst | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/doc/source/roadmap.rst b/doc/source/roadmap.rst
index 65960d635467c..4194e0403e0ba 100644
--- a/doc/source/roadmap.rst
+++ b/doc/source/roadmap.rst
@@ -31,6 +31,9 @@ and slow. We'd like to provide a native string type for pandas.
 The most obvious alternative is Apache Arrow. Currently, Arrow provides
 storage for string data. We can work with the Arrow developers to implement
 the operations needed for pandas users (for example, ``Series.str.upper``).
+These operations could be implemented in Numba (
+as prototyped in `Fletcher `__)
+or in the Apache Arrow C++ library.

 Apache Arrow Interoperability
 -----------------------------
@@ -78,6 +81,19 @@ etc.), possibly with an API similar to `DataFrame.groupby`.

 See https://github.com/pandas-dev/pandas/issues/10030 for more.

+Documentation Improvements
+--------------------------
+
+We'd like to improve the content, structure, and presentation of pandas documentation.
+Some specific goals include
+
+* Overhaul the HTML theme with a modern, responsive design.
+* Improve the "Getting Started" documentation, designing and writing learning paths
+  for users different backgrounds (e.g. brand new to programming, familiar with
+  other languages like R, already familiar with Python).
+* Improve the overall organization of the documentation and specific subsections
+  of the documentation to make navigation and finding content easier.
+
 Package Docstring Validation
 ----------------------------

@@ -87,7 +103,7 @@ https://github.com/pandas-dev/pandas/blob/master/scripts/validate_docstrings.py
 contains the checks.

 Like many other projects, pandas uses the
-[numpydoc](https://numpydoc.readthedocs.io/en/latest/) style for writing
+`numpydoc `__ style for writing
 docstrings. With the collaboration of the numpydoc maintainers, we'd like to
 move the checks to a package other than pandas so that other projects can easily
 use them as well.
@@ -95,11 +111,11 @@ use them as well.
 Performance Monitoring
 ----------------------

-Pandas uses [airspeed velocity](https://asv.readthedocs.io/en/stable/) to
+Pandas uses `airspeed velocity `__ to
 monitor for performance regressions. ASV itself is a fabulous tool, but requires
 some additional work to be integrated into an open source project's workflow.

-The [asv-runner](https://github.com/asv-runner) organization, currently made up
+The `asv-runner `__ organization, currently made up
 of pandas maintainers, provides tools built on top of ASV. We have a physical
 machine for running a number of project's benchmarks, and tools managing the
 benchmark runs and reporting on results.
@@ -112,4 +128,3 @@ We'd like to fund improvements and maintenance of these tools to
   https://pyperf.readthedocs.io/en/latest/system.html
 * Build a GitHub bot to request ASV runs *before* a PR is merged. Currently, the
   benchmarks are only run nightly.
-
From fb844aea088bbafa417964b8a0d1a2fe5fadd76f Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Fri, 19 Jul 2019 15:12:41 -0500
Subject: [PATCH 04/24] move to development

---
 doc/source/{ => development}/roadmap.rst | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename doc/source/{ => development}/roadmap.rst (100%)

diff --git a/doc/source/roadmap.rst b/doc/source/development/roadmap.rst
similarity index 100%
rename from doc/source/roadmap.rst
rename to doc/source/development/roadmap.rst
From c3103704e58706960a1385a595b3098e9e4c5530 Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Mon, 22 Jul 2019 09:55:33 -0500
Subject: [PATCH 05/24] indexing

---
 doc/source/development/roadmap.rst | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index 4194e0403e0ba..51524d0b0d83e 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -72,6 +72,21 @@ By replacing the BlockManager we hope to achieve
 See `these design documents `__
 for more.

+Decoupling of Indexing and Internals
+------------------------------------
+
+The code for getting and setting values in pandas' data structures needs refactoring.
+In particular, a clear separation must be made between code that
+converts indexes (e.g., the argument to ``DataFrame.loc``) to positions from code that uses
+uses these positions to get or set values. This is related to the proposed BlockManager rewrite.
+Currently, the BlockManager sometimes label-based, rather than position-based, indexing.
+We propose that it should only work with positional indexing, and the translation of keys
+to positions should be entirely done at a higher level.
+
+Indexing is a complicated API with many subtleties. This refactor will require care
+and attention. More details are discussed at
+https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code
+
 Weighted Operations
 -------------------

From 4aef936389c1312e4f0151a18ec869307976fc0e Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Tue, 23 Jul 2019 05:42:33 -0500
Subject: [PATCH 06/24] typos

---
 doc/source/development/roadmap.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index 51524d0b0d83e..3dd294de572f1 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -77,9 +77,9 @@ Decoupling of Indexing and Internals

 The code for getting and setting values in pandas' data structures needs refactoring.
 In particular, a clear separation must be made between code that
-converts indexes (e.g., the argument to ``DataFrame.loc``) to positions from code that uses
+converts keys (e.g., the argument to ``DataFrame.loc``) to positions from code that uses
 uses these positions to get or set values. This is related to the proposed BlockManager rewrite.
-Currently, the BlockManager sometimes label-based, rather than position-based, indexing.
+Currently, the BlockManager sometimes uses label-based, rather than position-based, indexing.
 We propose that it should only work with positional indexing, and the translation of keys
 to positions should be entirely done at a higher level.

From 200ac632ee67dd834be2912bae439c0d23c31483 Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Tue, 23 Jul 2019 08:36:24 -0500
Subject: [PATCH 07/24] numba

---
 doc/source/development/roadmap.rst | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index 3dd294de572f1..e4cacabe2ff5e 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -87,6 +87,15 @@ Indexing is a complicated API with many subtleties. This refactor will require c
 and attention. More details are discussed at
 https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code

+Numba-Accelerated Operations
+----------------------------
+
+[Numba](https://numba.pydata.org) is a JIT compiler for Python code. We'd like to provide
+ways for users to apply their own Numba-jitted functions within pandas' groupby and window
+contexts. This will improve the performance of user-defined-functions in these operations
+by staying within compiled code.
+
+
 Weighted Operations
 -------------------

From 8dbd981be3377d91480d0f635a99a177bb722e73 Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Tue, 23 Jul 2019 08:39:39 -0500
Subject: [PATCH 08/24] reword

---
 doc/source/development/roadmap.rst | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index e4cacabe2ff5e..fa8a166114e18 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -10,15 +10,13 @@ of these goals may be hastened with dedicated funding.
 Extensibility
 -------------

-Pandas Extension Arrays provide 3rd party libraries the ability to
-extend pandas' supported types. In theory, these 3rd party types can do
-everything one of pandas. In practice, many places in pandas will unintentionally
-convert the ExtensionArray to a NumPy array of Python objects, causing
-performance issues and the loss of type information. These problems are especially
-pronounced for nested data.
+Pandas :ref:`extending.extension-types` provide 3rd-party libraries the ability to
+extend pandas' supported data types. Many parts of pandas still unintentionally
+convert data to a NumPy array. These problems are especially pronounced for nested data.

 We'd like to improve the handling of extension arrays throughout the library,
-making their behavior more consistent with the handling of NumPy arrays.
+making their behavior more consistent with the handling of NumPy arrays. We'll do this
+by cleaning up pandas' internals and adding new methods do the extension array interface.

 String Dtype
 ------------
@@ -42,9 +40,9 @@ Apache Arrow Interoperability
 platform for in-memory data. The Arrow logical types are closely aligned with
 typical pandas use cases (for example, support for nullable integers).

-We'd like have a pandas DataFrame be backed by Arrow memory and data types
-by default. This should simplify pandas internals and ensure more consistent
-handling of data types through operations.
+We'd like better support for a DataFrame be backed by Arrow memory and data types.
+This should simplify pandas internals and ensure more consistent handling
+of data types through operations.

 Block Manager Rewrite
 ---------------------
From 9ac38f06f3c1ec57304bb3e8b045581c357db003 Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Thu, 25 Jul 2019 09:49:43 -0500
Subject: [PATCH 09/24] arrow

---
 doc/source/development/roadmap.rst | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index fa8a166114e18..160d16ce14c01 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -38,11 +38,12 @@ Apache Arrow Interoperability

 `Apache Arrow `__ is a cross-language development
 platform for in-memory data. The Arrow logical types are closely aligned with
-typical pandas use cases (for example, support for nullable integers).
+typical pandas use cases.

-We'd like better support for a DataFrame be backed by Arrow memory and data types.
-This should simplify pandas internals and ensure more consistent handling
-of data types through operations.
+We'd like to provide better-integrated support for Arrow memory and data types
+within pandas. This will let us take advantage of its I/O capabilities and
+provides for better interoperability with other languages and libraries
+using Arrow.

 Block Manager Rewrite
 ---------------------
From 4e1af8226e0de252f0904adee3890e3c099c80f9 Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Mon, 29 Jul 2019 08:40:24 -0500
Subject: [PATCH 10/24] Intro

---
 doc/source/development/roadmap.rst | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index 160d16ce14c01..8371420de80f5 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -4,15 +4,22 @@
 Roadmap
 =======

-This page provides an overview of the major themes pandas' development. Implementation
-of these goals may be hastened with dedicated funding.
+This page provides an overview of the major themes in pandas' development. Implementation
+of these goals may be hastened with dedicated funding or interest from contributors.
+
+An item being on the roadmap does not mean that pandas will *necessarily* happen, even
+with unlimited funding. During the implementation period we may discover issues
+preventing the adoption of the feature.

 Extensibility
 -------------

-Pandas :ref:`extending.extension-types` provide 3rd-party libraries the ability to
-extend pandas' supported data types. Many parts of pandas still unintentionally
-convert data to a NumPy array. These problems are especially pronounced for nested data.
+Pandas :ref:`extending.extension-types` allow for extending NumPy types with custom
+data types and array storage. Pandas uses extension types internally, and provides
+an interface for 3rd-party libraries to define their own custom data types.
+
+Many parts of pandas still unintentionally convert data to a NumPy array.
+These problems are especially pronounced for nested data.
 We'd like to improve the handling of extension arrays throughout the library,
 making their behavior more consistent with the handling of NumPy arrays. We'll do this
From 5702a18717da515075893f952697bc3608af0df2 Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Mon, 29 Jul 2019 08:42:01 -0500
Subject: [PATCH 11/24] cleanup

---
 doc/source/development/roadmap.rst | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index 8371420de80f5..1ca95a585d825 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -49,7 +49,7 @@ typical pandas use cases.

 We'd like to provide better-integrated support for Arrow memory and data types
 within pandas. This will let us take advantage of its I/O capabilities and
-provides for better interoperability with other languages and libraries
+provide for better interoperability with other languages and libraries
 using Arrow.

 Block Manager Rewrite
@@ -82,10 +82,10 @@ Decoupling of Indexing and Internals
 ------------------------------------

 The code for getting and setting values in pandas' data structures needs refactoring.
-In particular, a clear separation must be made between code that
-converts keys (e.g., the argument to ``DataFrame.loc``) to positions from code that uses
-uses these positions to get or set values. This is related to the proposed BlockManager rewrite.
-Currently, the BlockManager sometimes uses label-based, rather than position-based, indexing.
+In particular, we must clearly separate code that converts keys (e.g., the argument
+to ``DataFrame.loc``) to positions from code that uses uses these positions to get
+or set values. This is related to the proposed BlockManager rewrite. Currently, the
+BlockManager sometimes uses label-based, rather than position-based, indexing.
 We propose that it should only work with positional indexing, and the translation of keys
 to positions should be entirely done at a higher level.

@@ -97,9 +97,10 @@ Numba-Accelerated Operations
 ----------------------------

 [Numba](https://numba.pydata.org) is a JIT compiler for Python code. We'd like to provide
-ways for users to apply their own Numba-jitted functions within pandas' groupby and window
-contexts. This will improve the performance of user-defined-functions in these operations
-by staying within compiled code.
+ways for users to apply their own Numba-jitted where pandas accepts user-defined functions
+(for example, :meth:`Series.apply`, :meth:`DataFrame.apply`, :meth:`DataFrame.applymap`,
+and in groupby and window contexts). This will improve the performance of
+user-defined-functions in these operations by staying within compiled code.


 Weighted Operations
From 755a5e471d4198ad670902de0efaca978b8ecd77 Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Mon, 29 Jul 2019 08:43:22 -0500
Subject: [PATCH 12/24] case

---
 doc/source/development/roadmap.rst | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index 1ca95a585d825..b7f0f03f64a09 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -25,8 +25,8 @@ We'd like to improve the handling of extension arrays throughout the library,
 making their behavior more consistent with the handling of NumPy arrays. We'll do this
 by cleaning up pandas' internals and adding new methods do the extension array interface.

-String Dtype
-------------
+String data type
+----------------

 Currently, pandas stores text data in an ``object`` -dtype NumPy array.
 Each array stores Python strings. While pragmatic, since we rely on NumPy
@@ -40,7 +40,7 @@ These operations could be implemented in Numba (
 as prototyped in `Fletcher `__)
 or in the Apache Arrow C++ library.
-Apache Arrow Interoperability
+Apache Arrow interoperability
 -----------------------------

 `Apache Arrow `__ is a cross-language development
@@ -52,7 +52,7 @@ within pandas. This will let us take advantage of its I/O capabilities and
 provide for better interoperability with other languages and libraries
 using Arrow.

-Block Manager Rewrite
+Block manager rewrite
 ---------------------

 We'd like to replace pandas current internal data structures (a collection of
@@ -78,7 +78,7 @@ By replacing the BlockManager we hope to achieve
 See `these design documents `__
 for more.

-Decoupling of Indexing and Internals
+Decoupling of indexing and internals
 ------------------------------------

 The code for getting and setting values in pandas' data structures needs refactoring.
@@ -93,7 +93,7 @@ Indexing is a complicated API with many subtleties. This refactor will require c
 and attention. More details are discussed at
 https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code

-Numba-Accelerated Operations
+Numba-accelerated operations
 ----------------------------

 [Numba](https://numba.pydata.org) is a JIT compiler for Python code. We'd like to provide
@@ -103,7 +103,7 @@ and in groupby and window contexts). This will improve the performance of
 user-defined-functions in these operations by staying within compiled code.


-Weighted Operations
+Weighted operations
 -------------------

 In many fields, sample weights are necessary to correctly estimate population
@@ -112,7 +112,7 @@ etc.), possibly with an API similar to `DataFrame.groupby`.

 See https://github.com/pandas-dev/pandas/issues/10030 for more.

-Documentation Improvements
+Documentation improvements
 --------------------------

 We'd like to improve the content, structure, and presentation of pandas documentation.
@@ -125,7 +125,7 @@ Some specific goals include
 * Improve the overall organization of the documentation and specific subsections
   of the documentation to make navigation and finding content easier.

-Package Docstring Validation
+Package docstring validation
 ----------------------------

 To improve the quality and consistency of pandas docstrings, we've developed
@@ -139,7 +139,7 @@ docstrings. With the collaboration of the numpydoc maintainers, we'd like to
 move the checks to a package other than pandas so that other projects can easily
 use them as well.

-Performance Monitoring
+Performance monitoring
 ----------------------

 Pandas uses `airspeed velocity `__ to
From a549cf7b0662d719b39641312220285d98741c27 Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Mon, 29 Jul 2019 08:58:45 -0500
Subject: [PATCH 13/24] str

---
 doc/source/development/roadmap.rst | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index b7f0f03f64a09..a2fef6b3f1984 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -29,16 +29,21 @@ String data type
 ----------------

 Currently, pandas stores text data in an ``object`` -dtype NumPy array.
-Each array stores Python strings. While pragmatic, since we rely on NumPy
-for storage and Python for string operations, this is memory inefficient
-and slow. We'd like to provide a native string type for pandas.
-
-The most obvious alternative is Apache Arrow. Currently, Arrow provides
-storage for string data. We can work with the Arrow developers to implement
-the operations needed for pandas users (for example, ``Series.str.upper``).
-These operations could be implemented in Numba (
-as prototyped in `Fletcher `__)
-or in the Apache Arrow C++ library.
+The current implementation has two primary drawbacks: First, ``object`` -dtype
+is not specific to strings: any Python object can be stored in an ``object`` -dtype
+array, not just strings. Second: this is not efficient. The NumPy memory model
+isn't especially well-suited to variable width text data.
+
+To solve the first issue, we propose a new extension type for string data. This
+will initially be opt-in, with users explicitly requesting ``dtype="string"``.
+The array backing this string dtype may initially be the current implementation:
+an ``object`` -dtype NumPy array of Python strings.
+
+To solve the second issue (performance), we'll explore alternative in-memory
+array libraries (for example, Apache Arrow). As part of the work, we may
+need to implement certain operations expected by pandas users (for example
+the algorithm used in, ``Series.str.upper``). That work may be done outside of
+pandas.

 Apache Arrow interoperability
 -----------------------------
From da01cb42cc658d9ebdc5a4419f3fc053494759b3 Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Mon, 29 Jul 2019 09:11:34 -0500
Subject: [PATCH 14/24] added evolution

---
 doc/source/development/roadmap.rst | 39 ++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index a2fef6b3f1984..f0cd83322de0b 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -64,13 +64,13 @@ We'd like to replace pandas current internal data structures (a collection of
 1 or 2-D arrays) with a simpler collection of 1-D arrays.

 Pandas internal data model is quite complex. A DataFrame is made up of
-one or more 2-dimension "blocks", with one or more blocks per dtype. This
+one or more 2-dimensional "blocks", with one or more blocks per dtype. This
 collection of 2-D arrays is managed by the BlockManager.
 The primary benefit of the BlockManager is improved performance on certain
 operations (construction from a 2D array, binary operations, reductions across the columns),
 especially for wide DataFrames. However, the BlockManager substantially increases the
-complexity and maintenance burden of pandas'.
+complexity and maintenance burden of pandas.

 By replacing the BlockManager we hope to achieve

@@ -108,22 +108,13 @@ and in groupby and window contexts). This will improve the performance of
 user-defined-functions in these operations by staying within compiled code.


-Weighted operations
--------------------
-
-In many fields, sample weights are necessary to correctly estimate population
-statistics. We'd like to support weighted operations (like `mean`, `sum`, `std`,
-etc.), possibly with an API similar to `DataFrame.groupby`.
-
-See https://github.com/pandas-dev/pandas/issues/10030 for more.
-
 Documentation improvements
 --------------------------

-We'd like to improve the content, structure, and presentation of pandas documentation.
+We'd like to improve the content, structure, and presentation of the pandas documentation.
 Some specific goals include

-* Overhaul the HTML theme with a modern, responsive design.
+* Overhaul the HTML theme with a modern, responsive design (:issue:`15556`)
 * Improve the "Getting Started" documentation, designing and writing learning paths
   for users different backgrounds (e.g. brand new to programming, familiar with
   other languages like R, already familiar with Python).
@@ -164,3 +155,25 @@ We'd like to fund improvements and maintenance of these tools to
   https://pyperf.readthedocs.io/en/latest/system.html
 * Build a GitHub bot to request ASV runs *before* a PR is merged. Currently, the
   benchmarks are only run nightly.
+
+Roadmap Evolution
+-----------------
+
+Pandas continues to evolve. The direction is primarily determined by community
+interest. Everyone is welcome to review exissting items on the roadmap and
+to propose a new item.
+ +Each item on the roadmap should be a short summary of a larger design proposal. +The proposal should include + +1. Short summary of the changes, which would be appropriate for inclusion in + the roadmap if accepted. +2. Motivation for the changes. +3. Detailed design: Preferably with example-usage (even if not implemented yet) + and API documentation +4. API Change: Any API changes that may result from the proposal. + +That proposal may then be submitted as a GitHub issue, where the pandas maintainers +can review and comment on the design. When there's agreement that an implementation +would be welcome, the roadmap can be updated to include the summary and a +link to the discussion issue. From b52d6b9b3d245aa9f154052ef400c370d22935b3 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Mon, 29 Jul 2019 09:34:36 -0500 Subject: [PATCH 15/24] typos --- doc/source/development/roadmap.rst | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst index f0cd83322de0b..d54d05b779df4 100644 --- a/doc/source/development/roadmap.rst +++ b/doc/source/development/roadmap.rst @@ -11,6 +11,8 @@ An item being on the roadmap does not mean that pandas will *necessarily* happen with unlimited funding. During the implementation period we may discover issues preventing the adoption of the feature. +See :ref:`roadmap.evolution` for proposing changes to this document. + Extensibility ------------- @@ -77,7 +79,7 @@ By replacing the BlockManager we hope to achieve * Substantially simpler code * Easier extensibility with new logical types * Better user control over memory use and layout -* Improved microperformance +* Improved micro-performance * Option to provide a C / Cython API to pandas' internals See `these design documents `__ @@ -156,11 +158,13 @@ We'd like to fund improvements and maintenance of these tools to * Build a GitHub bot to request ASV runs *before* a PR is merged. 
Currently, the benchmarks are only run nightly. +.. _roadmap.evolution: + Roadmap Evolution ----------------- Pandas continues to evolve. The direction is primarily determined by community -interest. Everyone is welcome to review exissting items on the roadmap and +interest. Everyone is welcome to review existing items on the roadmap and to propose a new item. Each item on the roadmap should be a short summary of a larger design proposal. From bf1338b7041239a403896a8475bc6dd8749658b7 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Mon, 29 Jul 2019 11:43:22 -0500 Subject: [PATCH 16/24] missing function --- doc/source/development/roadmap.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst index d54d05b779df4..e9c1011cd3c9f 100644 --- a/doc/source/development/roadmap.rst +++ b/doc/source/development/roadmap.rst @@ -104,7 +104,7 @@ Numba-accelerated operations ---------------------------- [Numba](https://numba.pydata.org) is a JIT compiler for Python code. We'd like to provide -ways for users to apply their own Numba-jitted where pandas accepts user-defined functions +ways for users to apply their own Numba-jitted functions where pandas accepts user-defined functions (for example, :meth:`Series.apply`, :meth:`DataFrame.apply`, :meth:`DataFrame.applymap`, and in groupby and window contexts). This will improve the performance of user-defined-functions in these operations by staying within compiled code. 
From fb6980ca0ebc8d839e620e71d75412d0441003bd Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Mon, 29 Jul 2019 16:36:01 -0500 Subject: [PATCH 17/24] scope and ML --- doc/source/development/roadmap.rst | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst index e9c1011cd3c9f..2d06f1ee5ef48 100644 --- a/doc/source/development/roadmap.rst +++ b/doc/source/development/roadmap.rst @@ -173,11 +173,15 @@ The proposal should include 1. Short summary of the changes, which would be appropriate for inclusion in the roadmap if accepted. 2. Motivation for the changes. -3. Detailed design: Preferably with example-usage (even if not implemented yet) +3. An explanation of why the change is in scope for pandas. +4. Detailed design: Preferably with example-usage (even if not implemented yet) and API documentation -4. API Change: Any API changes that may result from the proposal. +5. API Change: Any API changes that may result from the proposal. That proposal may then be submitted as a GitHub issue, where the pandas maintainers -can review and comment on the design. When there's agreement that an implementation -would be welcome, the roadmap can be updated to include the summary and a +can review and comment on the design. The `pandas mailing `_ +should be notified of the proposal. + +When there's agreement that an implementation +would be welcome, the roadmap should be updated to include the summary and a link to the discussion issue. 
From 65653ee8c201e7013eba390f73c1cff23b866da1 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Wed, 31 Jul 2019 09:49:07 -0500 Subject: [PATCH 18/24] add note on in / out --- doc/source/development/roadmap.rst | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst index 2d06f1ee5ef48..a510180688e63 100644 --- a/doc/source/development/roadmap.rst +++ b/doc/source/development/roadmap.rst @@ -4,13 +4,18 @@ Roadmap ======= -This page provides an overview of the major themes in pandas' development. Implementation -of these goals may be hastened with dedicated funding or interest from contributors. +This page provides an overview of the major themes in pandas' development. Each of +these items requires a relatively large amount of effort to implement. Implementation +may be hastened with dedicated funding or interest from contributors. -An item being on the roadmap does not mean that pandas will *necessarily* happen, even +An item being on the roadmap does not mean that it will *necessarily* happen, even with unlimited funding. During the implementation period we may discover issues preventing the adoption of the feature. +Additionally, an item *not* being on the roadmap does not exclude it from inclusion +in pandas. The roadmap is intended for larger, fundamental changes to the project that +are likely to take months or years of developer time. + See :ref:`roadmap.evolution` for proposing changes to this document. 
Extensibility From c3b5b5f59748ed615dccd5016694fc5b8c21c485 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Thu, 1 Aug 2019 09:00:48 -0500 Subject: [PATCH 19/24] Update doc/source/development/roadmap.rst Co-Authored-By: Simon Hawkins --- doc/source/development/roadmap.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst index a510180688e63..e766658ffce83 100644 --- a/doc/source/development/roadmap.rst +++ b/doc/source/development/roadmap.rst @@ -6,7 +6,7 @@ Roadmap This page provides an overview of the major themes in pandas' development. Each of these items requires a relatively large amount of effort to implement. Implementation -may be hastened with dedicated funding or interest from contributors. +may be expedited with dedicated funding or interest from contributors. An item being on the roadmap does not mean that it will *necessarily* happen, even with unlimited funding. During the implementation period we may discover issues From d3c9424d7a410ab8a0371e941a9b858e337214cd Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Thu, 1 Aug 2019 09:01:01 -0500 Subject: [PATCH 20/24] Update doc/source/development/roadmap.rst Co-Authored-By: Simon Hawkins --- doc/source/development/roadmap.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst index e766658ffce83..3671aef0a6df4 100644 --- a/doc/source/development/roadmap.rst +++ b/doc/source/development/roadmap.rst @@ -30,7 +30,7 @@ These problems are especially pronounced for nested data. We'd like to improve the handling of extension arrays throughout the library, making their behavior more consistent with the handling of NumPy arrays. We'll do this -by cleaning up pandas' internals and adding new methods do the extension array interface. +by cleaning up pandas' internals and adding new methods to the extension array interface. 
String data type ---------------- From a10f78cbcfd075240c8bfa870743f1dbf5a76343 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Thu, 1 Aug 2019 09:01:29 -0500 Subject: [PATCH 21/24] Update doc/source/development/roadmap.rst Co-Authored-By: Simon Hawkins --- doc/source/development/roadmap.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst index 3671aef0a6df4..0473ff6880339 100644 --- a/doc/source/development/roadmap.rst +++ b/doc/source/development/roadmap.rst @@ -184,7 +184,7 @@ The proposal should include 5. API Change: Any API changes that may result from the proposal. That proposal may then be submitted as a GitHub issue, where the pandas maintainers -can review and comment on the design. The `pandas mailing `_ +can review and comment on the design. The `pandas mailing list`_ should be notified of the proposal. When there's agreement that an implementation From 6a05c2b00e253b07077e400ee5450b8084ee9b44 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Thu, 1 Aug 2019 09:13:10 -0500 Subject: [PATCH 22/24] link to tracker --- doc/source/development/roadmap.rst | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst index 0473ff6880339..75af00a9fe7fa 100644 --- a/doc/source/development/roadmap.rst +++ b/doc/source/development/roadmap.rst @@ -5,8 +5,8 @@ Roadmap ======= This page provides an overview of the major themes in pandas' development. Each of -these items requires a relatively large amount of effort to implement. Implementation -may be expedited with dedicated funding or interest from contributors. +these items requires a relatively large amount of effort to implement. These may +be achieved more quickly with dedicated funding or interest from contributors. An item being on the roadmap does not mean that it will *necessarily* happen, even with unlimited funding. 
During the implementation period we may discover issues @@ -14,7 +14,8 @@ preventing the adoption of the feature. Additionally, an item *not* being on the roadmap does not exclude it from inclusion in pandas. The roadmap is intended for larger, fundamental changes to the project that -are likely to take months or years of developer time. +are likely to take months or years of developer time. Smaller-scoped items will continue +to be tracked on our `issue tracker `__. See :ref:`roadmap.evolution` for proposing changes to this document. From ce5a2e00b77be35223159eacd30118d1e065d1f5 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Thu, 1 Aug 2019 09:13:47 -0500 Subject: [PATCH 23/24] numba link --- doc/source/development/roadmap.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst index 75af00a9fe7fa..dcd9fbcbae668 100644 --- a/doc/source/development/roadmap.rst +++ b/doc/source/development/roadmap.rst @@ -109,7 +109,7 @@ https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-in Numba-accelerated operations ---------------------------- -[Numba](https://numba.pydata.org) is a JIT compiler for Python code. We'd like to provide +`Numba `__ is a JIT compiler for Python code. We'd like to provide ways for users to apply their own Numba-jitted functions where pandas accepts user-defined functions (for example, :meth:`Series.apply`, :meth:`DataFrame.apply`, :meth:`DataFrame.applymap`, and in groupby and window contexts). 
This will improve the performance of From ecdffeb8835c3cf59d32150d8e1f0787ffee00f3 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Thu, 1 Aug 2019 09:53:17 -0500 Subject: [PATCH 24/24] fix link --- doc/source/development/roadmap.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst index dcd9fbcbae668..88e0a18e6b81a 100644 --- a/doc/source/development/roadmap.rst +++ b/doc/source/development/roadmap.rst @@ -185,7 +185,7 @@ The proposal should include 5. API Change: Any API changes that may result from the proposal. That proposal may then be submitted as a GitHub issue, where the pandas maintainers -can review and comment on the design. The `pandas mailing list`_ +can review and comment on the design. The `pandas mailing list `__ should be notified of the proposal. When there's agreement that an implementation