Skip to content

Adding introduction, goals, scope and use cases to the RFC #27

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Sep 13, 2020
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
216 changes: 213 additions & 3 deletions spec/01_purpose_and_scope.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,19 +2,230 @@

## Introduction

This document defines a Python dataframe API.

A dataframe is a programming interface for expressing data manipulations over a
data structure consisting of rows and columns. Columns are named, and values in a
column share a common data type. This definition is intentionally left broad.

## History
## History and dataframe implementations

Dataframe libraries in several programming language exist, such as
[R](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame),
[Scala](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html),
[Julia](https://juliadata.github.io/DataFrames.jl/stable/) and others.

In Python, the most popular dataframe library is [pandas](https://pandas.pydata.org/).
pandas was initially developed at a hedge fund, with a focus on
[panel data](https://en.wikipedia.org/wiki/Panel_data) and financial time series.
It was open sourced in 2009, and since then it has been growing in popularity, including
many other domains outside time series and financial data. While still rich in time series
functionality, today is considered a general-purpose dataframe library. The original
`Panel` class that gave name to the library was deprecated in 2017 and removed in 2019,
to focus on the main `DataFrame` class.

## Scope (includes out-of-scope / non-goals)
Internally, pandas is implemented on top of NumPy, which is used to store the data
and to perform many of the operations. Some parts of pandas are written in Cython.

As of 2020 the pandas website has around one million and a half visitors per month.

Other libraries emerged in the last years, to address some of the limitations of pandas.
But in most cases, the libraries implemented a public API very similar to pandas, to
make the transition to their libraries easier. Next, there is a short description of
the main dataframe libraries in Python.

[Dask](https://dask.org/) is a task scheduler built in Python, which implements a
dataframe interface. Dask dataframe use pandas internally in the workers, and it provides
an API similar to pandas, adapted to its distributed and lazy nature.

[Vaex](https://vaex.io/) is an out-of-core alternative to pandas. Vaex uses hdf5 to
create memory maps that avoid loading data sets to memory. Some parts of Vaex are
implemented in C++.

[Modin](https://github.com/modin-project/modin) is another distributed dataframe
library based originally on [Ray](https://github.com/ray-project/ray). But built in
a more modular way, that allows it to also use Dask as a scheduler, or replace the
pandas-like public API by a SQLite-like one.

[cuDF](https://github.com/rapidsai/cudf) is a GPU dataframe library built on top
of Apache Arrow and RAPIDS. It provides an API similar to pandas.

[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is a
dataframe library that uses Spark as a backend. PySpark public API is based on the
original Spark API, and not in pandas.

[Koalas](https://github.com/databricks/koalas) is a dataframe library built on
top of PySpark that provides a pandas-like API.

[Ibis](https://ibis-project.org/) is a dataframe library with multiple SQL backends.
It uses SQLAlchemy and a custom SQL compiler to translate its pandas-like API to
SQL statements, executed by the backends. It supports conventional DBMS, as well
as big data systems such as Apache Impala or BigQuery.

Given the growing Python dataframe ecosystem, and its complexity, this document provides
a standard Python dataframe API. Until recently, pandas has been a de-facto standard for
Python dataframes. But currently there are a growing number of not only dataframe libraries,
but also libraries that interact with dataframes (visualization, statistical or machine learning
libraries for example). Interactions among libraries are becoming complex, and the pandas
public API is suboptimal as a standard, for its size, complexity, and implementation details
it exposes (for example, using NumPy data types or `NaN`).


## Scope

In the first iteration of the API standard, the scope is limited to create a data exchange
protocol. In future iterations the scope will be broader, including elements to operate with
the data.

It is in the scope of this document the different elements of the API. This includes signatures
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a verb missing in the first sentence ("to describe" ?)

and semantics. To be more specific:

- Data structures and Python classes
- Functions, methods, attributes and other API elements
- Expected returns of the different operations
- Data types (Python and low-level types)

The scope of this document is limited to generic dataframes, and not dataframes specific to
certain domains.


### Goals

The goal of the first iteration is to provide a data exchange protocol, so consumers of dataframes
can interact with a standard interface to access their data.

The goal of the of future iterations will be to provide a standard interface that encapsulates
implementation details of dataframe libraries. This will allow users and third-party libraries to
write code that interacts and operates with a standard dataframe, and not with specific implementations.

The main goals for the API defined in this document are:

- Make conversion of data among different implementations easier
- Let third party libraries consuming dataframes receive dataframes from any implementations

In the future, besides a data exchange protocol, the standard aims to include common operations
done with dataframe, with the next goals in mind:

- Provide a common API for dataframes so software using dataframes can work with all
implementations
- Provide a common API for dataframes to build user interfaces on top of it, for example
libraries for interactive use or specific domains and industries
- Help user transition from one dataframe library to another

See the [use cases](02_use_cases.html) section for details on the exact use cases considered.


### Out-of-scope

#### Execution details

Implementation details of the dataframes and execution of operations. This includes:

- How data is represented and stored (whether the data is in memory, disk, distributed)
- Expectations on when the execution is happening (in an eager or lazy way)
- Other execution details

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd state here that an API designed for interactive usage is out of scope.

**Rationale:** The API defined in this document needs to be used by libraries as diverse as Ibis,
Dask, Vaex or cuDF. The data can live in databases, distributed systems, disk or GPU memory.
Any decision that involves assumptions on where the data is stored, or where execution happens
could prevent implementation from adopting the standard.

#### High level APIs

It is out of scope to provide an API designed for interactive use. While interactive use
is a key aspect of dataframes, an API designed for interactive use can be built on top
of the API defined in this document.

Domain or industry specific APIs are also out of scope, but can benefit from the standard
to better interact with the different dataframe implementation.

**Rationale:** Interactive or domain specific users are key in the Python dataframe ecosystem.
But the amount and diversity of users makes it unfeasible to standardize every dataframe feature
that is currently used. In particular, functionality built as syntactic sugar for convenience in
interactive use, or heavily overloaded create very complex APIs. For example, the pandas dataframe
constructor, which accepts a huge number of formats, or its `__getitem__` (e.g. `df[something]`)
which is heavily overloaded. Implementations can provide convenient functionality like this one
for the users they are targeting, but it is out-of-scope for the standard, so the standard is
simple and easy to adopt.


### Non-goals

- Build an API that is appropriate to all users
- Have a unique dataframe implementation for Python
- Standardize functionalities specific to a domain or industry


## Stakeholders

This section provides the list of stakeholders considered for the definition of this API.


### Dataframe library authors

We encourage dataframe libraries in Python to implement the API defined in this document
in their libraries.

The list of known Python dataframe libraries at the time of writing this document is next:

- [cuDF](https://github.com/rapidsai/cudf)
- [Dask](https://dask.org/)
- [datatable](https://github.com/h2oai/datatable)
- [dexplo](https://github.com/dexplo/dexplo/)
- [Eland](https://github.com/elastic/eland)
- [Grizzly](https://github.com/weld-project/weld#grizzly)
- [Ibis](https://ibis-project.org/)
- [Koalas](https://github.com/databricks/koalas)
- [Mars](https://docs.pymars.org/en/latest/)
- [Modin](https://github.com/modin-project/modin)
- [pandas](https://pandas.pydata.org/)
- [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
- [StaticFrame](https://static-frame.readthedocs.io/en/latest/)
- [Turi Create](https://github.com/apple/turicreate)
- [Vaex](https://vaex.io/)


### Downstream library authors

Authors of libraries that consume dataframes. They can use the API defined in this document
to know how the data contained in a dataframe can be consumed, and which operations are implemented.

A non-exhaustive list of downstream library categories is next:

- Plotting and visualization (e.g. Matplotlib, Bokeh, Altair, Plotly)
- Statistical libraries (e.g. statsmodels)
- Machine learning libraries (e.g. scikit-learn)


### Upstream library authors

Authors of libraries that provide functionality used by dataframes.

A non-exhaustive list of upstream categories is next:

- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow, NumPy)
- Task schedulers (e.g. Dask, Ray, Mars)
- Big data systems (e.g. Spark, Hive, Impala, Presto)
- Libraries for database access (e.g. SQLAlchemy)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add Database and Big Data systems?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I don't think we are planning to engage with developer of PostgreSQL, MySQL... I'm adding for now big data systems, and also Python libraries to access databases, which I guess we're more likely to engage with. But I'm open to further changes if there are different points of view.


### Dataframe power users


This group considers developers of reusable code that use dataframes. For example, developers of
applications that use dataframes. Or authors of libraries that provide specialized dataframe
APIs to be built on top of the standard API.

People using dataframes in an interactive way are considered out of scope. These users include data
analysts, data scientists and other users that are key for dataframes. But this type of user may need
shortcuts, or libraries that take decisions for them to save them time. For example automatic type
inference, or excessive use of very compact syntax like Python squared brackets / `__getitem__`.
Standardizing on such practices can be extremely difficult, and it is out of scope.

With the development of a standard API that targets developers writing reusable code we expected
to also serve data analysts and other interactive users. But in an indirect way, by providing a
standard API where other libraries can be built on top. Including libraries with the syntactic
sugar required for fast analysis of data.


## High-level API overview
Expand All @@ -38,4 +249,3 @@


## References

Loading