-
Notifications
You must be signed in to change notification settings - Fork 21
Adding introduction, goals, scope and use cases to the RFC #27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 5 commits
991ab82
cc92c2c
3d998f9
d047f57
1ac9a15
e9472c8
837f87d
293c652
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -2,19 +2,219 @@ | |||||
|
||||||
## Introduction | ||||||
|
||||||
This document defines a Python dataframe API. | ||||||
|
||||||
A dataframe is a programming interface for expressing data manipulations over a | ||||||
data structure consisting of rows and columns. Columns are named, and values in a | ||||||
column share a common data type. This definition is intentionally left broad. | ||||||
|
||||||
## History | ||||||
## History and dataframe implementations | ||||||
|
||||||
Data frame libraries in several programming language exist, such as | ||||||
[R](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame), | ||||||
[Scala](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html), | ||||||
[Julia](https://juliadata.github.io/DataFrames.jl/stable/) and others. | ||||||
|
||||||
In Python, the most popular dataframe library is [pandas](https://pandas.pydata.org/). | ||||||
pandas was initially developed at a hedge fund, with a focus on | ||||||
[panel data](https://en.wikipedia.org/wiki/Panel_data) and financial time series. | ||||||
It was open sourced in 2009, and since then it has been growing in popularity, including | ||||||
many other domains outside time series and financial data. While still rich in time series | ||||||
functionality, today is considered a general-purpose dataframe library. The original | ||||||
`Panel` class that gave name to the library was deprecated in 2017 and removed in 2019, | ||||||
to focus on the main `DataFrame` class. | ||||||
|
||||||
## Scope (includes out-of-scope / non-goals) | ||||||
Internally, pandas is implemented on top of NumPy, which is used to store the data | ||||||
and to perform many of the operations. Some parts of pandas are written in Cython. | ||||||
|
||||||
As of 2020 the pandas website has around one million and a half visitors per month. | ||||||
|
||||||
Other libraries emerged in the last years, to address some of the limitations of pandas. | ||||||
But in most cases, the libraries implemented a public API very similar to pandas, to | ||||||
make the transition to their libraries easier. Next, there is a short description of | ||||||
the main dataframe libraries in Python. | ||||||
|
||||||
[Dask](https://dask.org/) is a task scheduler built in Python, which implements a | ||||||
dataframe interface. Dask dataframe use pandas internally in the workers, and it provides | ||||||
datapythonista marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
an API similar to pandas, adapted to its distributed and lazy nature. | ||||||
|
||||||
[Vaex](https://vaex.io/) is an out-of-core alternative to pandas. Vaex uses hdf5 to | ||||||
create memory maps that avoid loading data sets to memory. Some parts of Vaex are | ||||||
implemented in C++. | ||||||
|
||||||
[Modin](https://github.com/modin-project/modin) is another distributed dataframe | ||||||
datapythonista marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
library based originally on [Ray](https://github.com/ray-project/ray). But built in | ||||||
datapythonista marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
a more modular way, that allows it to also use Dask as a scheduler, or replace the | ||||||
pandas-like public API by a SQLite-like one. | ||||||
|
||||||
[cuDF](https://github.com/rapidsai/cudf) is a GPU dataframe library built on top | ||||||
of Apache Arrow and RAPIDS. It provides an API similar to pandas. | ||||||
|
||||||
[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is a | ||||||
dataframe library that uses Spark as a backend. PySpark public API is based on the | ||||||
original Spark API, and not in pandas. | ||||||
|
||||||
[Koalas](https://github.com/databricks/koalas) is a dataframe library built on | ||||||
top of PySpark that provides a pandas-like API. | ||||||
|
||||||
[Ibis](https://ibis-project.org/) is a dataframe library with multiple SQL backends. | ||||||
It uses SQLAlchemy and a custom SQL compiler to translate its pandas-like API to | ||||||
SQL statements, executed by the backends. It supports conventional DBMS, as well | ||||||
as big data systems such as Apache Impala or BigQuery. | ||||||
|
||||||
Given the growing Python dataframe ecosystem, and its complexity, this document provides | ||||||
a standard Python dataframe API. Until recently, pandas has been a de-facto standard for | ||||||
Python dataframes. But currently there are a growing number of not only dataframe libraries, | ||||||
but also libraries that interact with dataframes (visualization, statistical or machine learning | ||||||
libraries for example). Interactions among libraries are becoming complex, and the pandas | ||||||
public API is suboptimal as a standard, for its size, complexity, and implementation details | ||||||
it exposes (for example, using NumPy data types or `NaN`). | ||||||
|
||||||
|
||||||
## Scope | ||||||
|
||||||
It is in the scope of this document the different elements of the API. This includes signatures | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is a verb missing in the first sentence ("to describe" ?) |
||||||
and semantics. To be more specific: | ||||||
|
||||||
- Data structures and Python classes | ||||||
- Functions, methods, attributes and other API elements | ||||||
- Expected returns of the different operations | ||||||
- Data types (Python and low-level types) | ||||||
|
||||||
The scope of this document is limited to generic dataframes, and not dataframes specific to | ||||||
certain domains. | ||||||
|
||||||
|
||||||
### Goals | ||||||
|
||||||
The goal of the API described in this document is to provide a standard interface that encapsulates | ||||||
implementation details of dataframe libraries. This will allow users and third-party libraries to | ||||||
write code that interacts with a standard dataframe, and not with specific implementations. | ||||||
|
||||||
The main goals for the API defined in this document are: | ||||||
|
||||||
- Provide a common API for dataframes so software can be developed to communicate with it | ||||||
- Provide a common API for dataframes to build user interfaces on top of it, for example | ||||||
libraries for interactive use or specific domains and industries | ||||||
- Simplify interactions between the projects of the ecosystem, for example, software that | ||||||
receives data as a dataframe | ||||||
- Make conversion of data among different implementations easier | ||||||
- Help user transition from one dataframe library to another | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If the goal is to initially limit the scope to a "data exchange protocol", the above description of the goals still sound too general, and not making it clear what the specific scope/goals are IMO. |
||||||
|
||||||
See the [use cases](02_use_cases.html) section for details on the exact use cases considered. | ||||||
|
||||||
|
||||||
### Out-of-scope | ||||||
|
||||||
#### Execution details | ||||||
|
||||||
Implementation details of the dataframes and execution of operations. This includes: | ||||||
|
||||||
- How data is represented and stored (whether the data is in memory, disk, distributed) | ||||||
- Expectations on when the execution is happening (in an eager or lazy way) | ||||||
- Other execution details | ||||||
|
||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd state here that an API designed for interactive usage is out of scope. |
||||||
**Rationale:** The API defined in this document needs to be used by libraries as diverse as Ibis, | ||||||
Dask, Vaex or cuDF. The data can live in databases, distributed systems, disk or GPU memory. | ||||||
Any decision that involves assumptions on where the data is stored, or where execution happens | ||||||
could prevent implementation from adopting the standard. | ||||||
|
||||||
#### High level APIs | ||||||
|
||||||
It is out of scope to provide an API designed for interactive use. While interactive use | ||||||
is a key aspect of dataframes, an API designed for interactive use can be built on top | ||||||
of the API defined in this document. | ||||||
|
||||||
Domain or industry specific APIs are also out of scope, but can benefit from the standard | ||||||
to better interact with the different dataframe implementation. | ||||||
|
||||||
**Rationale:** Interactive or domain specific users are key in the Python dataframe ecosystem. | ||||||
But the amount and diversity of users makes it unfeasible to standardize every dataframe feature | ||||||
that is currently used. In particular, functionality built as syntactic sugar for convenience in | ||||||
interactive use, or heavily overloaded create very complex APIs. For example, the pandas dataframe | ||||||
constructor, which accepts a huge number of formats, or its `__getitem__` (e.g. `df[something]`) | ||||||
which is heavily overloaded. Implementations can provide convenient functionality like this one | ||||||
for the users they are targeting, but it is out-of-scope for the standard, so the standard is | ||||||
simple and easy to adopt. | ||||||
|
||||||
|
||||||
### Non-goals | ||||||
|
||||||
- Build an API that is appropriate to all users | ||||||
- Have a unique dataframe implementation for Python | ||||||
- Standardize functionalities specific to a domain or industry | ||||||
|
||||||
|
||||||
## Stakeholders | ||||||
|
||||||
This section provides the list of stakeholders considered for the definition of this API. | ||||||
|
||||||
|
||||||
### Data frame library authors | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
||||||
We encourage dataframe libraries in Python to implement the API defined in this document | ||||||
in their libraries. | ||||||
|
||||||
The list of known Python dataframe libraries at the time of writing this document is next: | ||||||
|
||||||
- [cuDF](https://github.com/rapidsai/cudf) | ||||||
- [Dask](https://dask.org/) | ||||||
- [datatable](https://github.com/h2oai/datatable) | ||||||
- [dexplo](https://github.com/dexplo/dexplo/) | ||||||
- [Eland](https://github.com/elastic/eland) | ||||||
- [Grizzly](https://github.com/weld-project/weld#grizzly) | ||||||
- [Ibis](https://ibis-project.org/) | ||||||
- [Koalas](https://github.com/databricks/koalas) | ||||||
- [Mars](https://docs.pymars.org/en/latest/) | ||||||
- [Modin](https://github.com/modin-project/modin) | ||||||
- [pandas](https://pandas.pydata.org/) | ||||||
- [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) | ||||||
- [StaticFrame](https://static-frame.readthedocs.io/en/latest/) | ||||||
- [Turi Create](https://github.com/apple/turicreate) | ||||||
- [Vaex](https://vaex.io/) | ||||||
|
||||||
|
||||||
### Downstream library authors | ||||||
|
||||||
Authors of libraries that consume dataframes. They can use the API defined in this document | ||||||
to know how the data contained in a dataframe can be consumed, and which operations are implemented. | ||||||
|
||||||
A non-exhaustive list of downstream library categories is next: | ||||||
|
||||||
- Plotting and visualization (e.g. Matplotlib, Bokeh, Altair, Plotly) | ||||||
- Statistical libraries (e.g. statsmodels) | ||||||
- Machine learning libraries (e.g. scikit-learn) | ||||||
|
||||||
|
||||||
### Upstream library authors | ||||||
|
||||||
Authors of libraries that provide functionality used by dataframes. | ||||||
|
||||||
A non-exhaustive list of upstream categories is next: | ||||||
|
||||||
- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Numpy as well? It's used by dataframe libraries for their implementation |
||||||
- Task schedulers (e.g. Dask, Ray, Mars) | ||||||
- Big data systems (e.g. Spark, Hive, Impala, Presto) | ||||||
- Libraries for database access (e.g. SQLAlchemy) | ||||||
|
||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we add Database and Big Data systems? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point. I don't think we are planning to engage with developer of PostgreSQL, MySQL... I'm adding for now big data systems, and also Python libraries to access databases, which I guess we're more likely to engage with. But I'm open to further changes if there are different points of view. |
||||||
|
||||||
### Data frame power users | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
||||||
|
||||||
This group considers developers of reusable code that use dataframes. For example, developers of | ||||||
applications that use dataframes. Or authors of libraries that provide specialized dataframe | ||||||
APIs to be built on top of the standard API. | ||||||
|
||||||
People using dataframes in an interactive way are considered out of scope. These users include data | ||||||
analysts, data scientists and other users that are key for dataframes. But this type of user may need | ||||||
shortcuts, or libraries that take decisions for them to save them time. For example automatic type | ||||||
inference, or excessive use of very compact syntax like Python squared brackets / `__getitem__`. | ||||||
Standardizing on such practices can be extremely difficult, and it is out of scope. | ||||||
|
||||||
With the development of a standard API that targets developers writing reusable code we expected | ||||||
to also serve data analysts and other interactive users. But in an indirect way, by providing a | ||||||
standard API where other libraries can be built on top. Including libraries with the syntactic | ||||||
sugar required for fast analysis of data. | ||||||
|
||||||
|
||||||
## High-level API overview | ||||||
|
@@ -38,4 +238,3 @@ | |||||
|
||||||
|
||||||
## References | ||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.