Adding introduction, goals, scope and use cases to the RFC #27

Merged · 8 commits · Sep 13, 2020 (showing changes from 2 commits)
**spec/01_purpose_and_scope.md** — 169 additions, 2 deletions

## Introduction

This document defines a Python data frame API.
> **Member:** I prefer dataframe as one word, like database.
>
> I do not want to start a holy war, and I realize there are historical reasons to call it data frame, but data base was common even throughout the 90s. https://groups.google.com/g/alt.usage.english/c/jRB0g0zK85Q?pli=1

A data frame is a programming interface for expressing data manipulations over a
data structure consisting of rows and columns. Columns are named, and values in a
column share a common data type. This definition is intentionally left broad.

## History and data frame implementations

Data frame libraries exist in several programming languages, such as
[R](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame),
[Scala](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html),
[Julia](https://juliadata.github.io/DataFrames.jl/stable/) and others.

> **Collaborator:** Suggested change: "Data frame libraries" → "Dataframe libraries".

In Python, the most popular data frame library is [pandas](https://pandas.pydata.org/).
pandas was initially developed at a hedge fund, with a focus on
[panel data](https://en.wikipedia.org/wiki/Panel_data) and financial time series.
It was open sourced in 2009, and has been growing in popularity since then, expanding
into many domains beyond time series and financial data. While still rich in time series
functionality, today it is considered a general-purpose data frame library. The original
`Panel` class that gave the library its name was deprecated in 2017 and removed in 2019,
to focus on the main `DataFrame` class.

> **Comment:** develop -> developed

Internally, pandas is implemented on top of NumPy, which is used to store the data
and to perform many of the operations. Some parts of pandas are written in Cython.

As of 2020, the pandas website receives around one and a half million visitors per month.

Other libraries have emerged in recent years to address some of the limitations of pandas.
But in most cases they implement a public API very similar to pandas, to make the
transition easier. A short description of the main data frame libraries in Python follows.

[Dask](https://dask.org/) is a task scheduler built in Python, which implements a data
frame interface. Dask data frames use pandas internally in the workers, and provide an
API similar to pandas, adapted to Dask's distributed and lazy nature.

[Vaex](https://vaex.io/) is an out-of-core alternative to pandas. Vaex uses HDF5 to
create memory maps that avoid loading data sets into memory. Some parts of Vaex are
implemented in C++.

[Modin](https://github.com/modin-project/modin) is another distributed data frame
library, originally based on [Ray](https://github.com/ray-project/ray). It is built in
a more modular way, which allows it to also use Dask as a scheduler, or to replace the
pandas-like public API with a SQLite-like one.

[cuDF](https://github.com/rapidsai/cudf) is a GPU data frame library built on top
of Apache Arrow and RAPIDS. It provides an API similar to pandas.

[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is a data
frame library that uses Spark as a backend. PySpark's public API is based on the
original Spark API, not on pandas.

[Koalas](https://github.com/databricks/koalas) is a data frame library built on
top of PySpark that provides a pandas-like API.

[Ibis](https://ibis-project.org/) is a data frame library with multiple SQL backends.
It uses SQLAlchemy and a custom SQL compiler to translate its pandas-like API into
SQL statements, executed by the backends. It supports conventional DBMSs, as well
as big data systems such as Apache Impala or BigQuery.


## Goals

Given the growing Python data frame ecosystem and its complexity, this document provides
a standard Python data frame API. Until recently, pandas has been the de facto standard for
Python data frames. But there is now a growing number not only of data frame libraries, but
also of libraries that interact with data frames (for example visualization, statistical or
machine learning libraries). Interactions among libraries are becoming complex, and the
pandas public API is suboptimal as a standard, because of its size, its complexity, and the
implementation details it exposes (for example, NumPy data types or `NaN`).


The goal of the API described in this document is to provide a standard interface that encapsulates
implementation details of data frame libraries. This will allow users and third-party libraries to
write code that interacts with a standard data frame, and not with specific implementations.

The defined API does not aim to be a convenient API for all users of data frames. Libraries targeting
specific users (data analysts, data scientists, quants, etc.) can be implemented on top of the
standard API. The standard API is targeted at software developers, who will write reusable code
(as opposed to users performing fast interactive analysis of data).

> **Member:** Do we leave the standardization of the end-user API as potential future work for us, or do we not plan on doing any of that?

> **Member:** Good question, and one that many readers will have. I think it would be good to explicitly state this is out of scope for this version of the standard, but may be in scope for a future version. With the rationale that it's also important: one of the longer-term goals should be (I think) to make the learning curve less steep for users switching from one library to another one.

> **Member:** The structure of:
>
> - Goals
> - Scope
> - Out-of-scope and non-goals
>
> is a little inconsistent. I'd suggest making it symmetric (and adding rationales, as I just did in my array API scope PR); then this kind of thing may be easier to address.

See the [scope](#Scope) section for detailed information on what is in scope, and the
[use cases](02_use_cases.html) section for details on the exact use cases considered.


## Scope

It is in the scope of this document to describe the different elements of the API. This
includes signatures and semantics. To be more specific:

> **Member:** There is a verb missing in the first sentence ("to describe"?)

- Data structures and Python classes
- Functions, methods, attributes and other API elements
- Expected returns of the different operations
- Data types (Python and low-level types)

The scope of this document is limited to generic data frames, and not data frames specific to
certain domains.


### Out-of-scope and non-goals

Implementation details of data frames and the execution of operations are out of scope. This includes:

- How data is represented and stored (whether the data is in memory, disk, distributed)
- Expectations on when execution happens (eagerly or lazily)
- Other execution details

> **Member:** I'd state here that an API designed for interactive usage is out of scope.

The API defined in this document needs to be usable by libraries as diverse as Ibis, Dask,
Vaex or cuDF. The data can live in databases, distributed systems, disk or GPU memory.
Any decision that involves assumptions on where the data is stored, or where execution
happens, is out of the scope of this document.

## Stakeholders

This section provides the list of stakeholders considered for the definition of this API.


### Data frame library authors
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Data frame library authors
### Dataframe library authors


Authors of data frame libraries in Python are expected to implement the API defined
in this document in their libraries.

> **Collaborator:** This is a very heavy handed statement. Could we reword it to something a bit friendlier, like: "We encourage data frame libraries in Python to implement the API defined in this document in their libraries"?


The list of known Python data frame libraries at the time of writing this document follows:

- [cuDF](https://github.com/rapidsai/cudf)
- [Dask](https://dask.org/)
- [datatable](https://github.com/h2oai/datatable)
- [dexplo](https://github.com/dexplo/dexplo/)
- [Eland](https://github.com/elastic/eland)
- [Grizzly](https://github.com/weld-project/weld#grizzly)
- [Ibis](https://ibis-project.org/)
- [Koalas](https://github.com/databricks/koalas)
- [Mars](https://docs.pymars.org/en/latest/)
- [Modin](https://github.com/modin-project/modin)
- [pandas](https://pandas.pydata.org/)
- [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
- [StaticFrame](https://static-frame.readthedocs.io/en/latest/)
- [Turi Create](https://github.com/apple/turicreate)
- [Vaex](https://vaex.io/)


### Downstream library authors

Authors of libraries that consume data frames. They can use the API defined in this document
to know how the data contained in a data frame can be consumed, and which operations are implemented.

A non-exhaustive list of downstream library categories follows:

- Plotting and visualization (e.g. Matplotlib, Bokeh, Altair, Plotly)
- Statistical libraries (e.g. statsmodels)
- Machine learning libraries (e.g. scikit-learn)


### Upstream library authors

Authors of libraries that provide functionality used by data frames.

A non-exhaustive list of upstream library categories follows:

- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow)
- Task schedulers (e.g. Dask, Ray)

> **Member:** NumPy as well? It's used by dataframe libraries for their implementation.

> **Collaborator:** Would include Mars (https://github.com/mars-project/mars) here as well.

> **Comment:** Should we add database and big data systems?

> **Member (author):** Good point. I don't think we are planning to engage with the developers of PostgreSQL, MySQL... I'm adding big data systems for now, and also Python libraries to access databases, which I guess we're more likely to engage with. But I'm open to further changes if there are different points of view.


### Data frame power users

> **Collaborator:** Suggested change: "Data frame power users" → "Dataframe power users".



This group comprises developers of reusable code that uses data frames: for example, developers
of applications that use data frames, or authors of libraries that provide specialized data
frame APIs to be built on top of the standard API.

People using data frames in an interactive way are considered out of scope. These users include
data analysts, data scientists and other users that are key for data frames. But this type of
user may need shortcuts, or libraries that make decisions for them to save them time, such as
automatic type inference, or extensive use of very compact syntax like Python square
brackets / `__getitem__`. Standardizing on such practices can be extremely difficult, and it
is out of scope.

With the development of a standard API that targets developers writing reusable code, we expect
to also serve data analysts and other interactive users, though indirectly: by providing a
standard API on top of which other libraries can be built, including libraries with the
syntactic sugar required for fast analysis of data.


## High-level API overview
---

**spec/02_use_cases.md** — 177 additions
# Use cases

## Introduction

This section discusses the use cases considered for the standard data frame API.

The goals and scope of this API are defined in the [goals](01_purpose_and_scope.html#Goals)
and [scope](01_purpose_and_scope.html#Scope) sections.

The target audience and stakeholders are presented in the
[stakeholders](01_purpose_and_scope.html#Stakeholders) section.


## Types of use cases

The following types of use cases can be accomplished using the standard Python data frame
API defined in this document:

- Downstream library receiving a data frame as a parameter
- Converting a data frame from one implementation to another (try to clarify)

Other types of use cases, not related to data interchange, will be added later.


## Concrete use cases

In this section we define concrete examples of the types of use cases defined above.

### Plotting library receiving data as a data frame

One use case we facilitate with the API defined in this document is a plotting library
receiving the data to be plotted as a data frame object.

Consider the case of a scatter plot, to be drawn from the data contained in a
data frame structure. For example, consider this data:

| petal length | petal width |
|--------------|-------------|
| 1.4 | 0.2 |
| 1.7 | 0.4 |
| 1.3 | 0.2 |
| 1.5 | 0.1 |

If we consider a pure Python implementation, we could for example receive the information
as two lists, one for the _petal length_ and one for the _petal width_.

```python
petal_length = [1.4, 1.7, 1.3, 1.5]
petal_width = [0.2, 0.4, 0.2, 0.1]

def scatter_plot(x: list, y: list):
    """
    Generate a scatter plot with the information provided in `x` and `y`.
    """
    ...
```

When we consider data frames, we would like to provide them directly to the `scatter_plot`
function, and we would like the plotting library to be agnostic about which specific library
will be used when calling the function. The code should work whether a pandas, Dask, Vaex
or other current or future implementation is used.

An implementation of the `scatter_plot` function could be:

```python
def scatter_plot(data: dataframe, x_column: str, y_column: str):
    """
    Generate a scatter plot with the information provided in the columns
    `x_column` and `y_column` of `data`.
    """
    ...
```

The API documented here describes what the developer of the plotting library can expect
from the object `data`: in which ways they can interact with the data frame object to
extract the desired information.

An example of this is Seaborn plots. For instance, the
[scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) function
accepts a parameter `data`, which is expected to be a `DataFrame`.

When providing a pandas `DataFrame`, the following code generates the intended scatter plot:

```python
import pandas
import seaborn

pandas_df = pandas.DataFrame({'bill': [15, 32, 28],
                              'tip': [2, 5, 3]})

seaborn.scatterplot(data=pandas_df, x='bill', y='tip')
```

But if we instead provide a Vaex data frame, then an exception occurs:

```python
import vaex

vaex_df = vaex.from_pandas(pandas_df)

# Raises an exception: Seaborn expects a pandas DataFrame
seaborn.scatterplot(data=vaex_df, x='bill', y='tip')
```

This is caused by Seaborn expecting a pandas `DataFrame` object: while Vaex provides an
interface very similar to pandas, it does not implement 100% of its API, and Seaborn is
trying to use parts that differ.

With the definition of the standard API, Seaborn developers should be able to expect a
generic data frame, and data frames from any library implementing the standard API
(Vaex, cuDF, Ibis, Dask, Modin, etc.) could be plotted with the previous example.
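As an illustration, here is a minimal sketch of what a library-agnostic `scatter_plot`
could look like. The column-access methods used (`get_column` and `to_array`) are
hypothetical, not part of any existing library or of the standard at this point; they only
show the kind of generic interaction the API should enable:

```python
import matplotlib.pyplot as plt


def scatter_plot(data, x_column: str, y_column: str):
    """
    Generate a scatter plot from any data frame implementing the standard API.
    """
    # `get_column` and `to_array` are hypothetical standard API methods,
    # used for illustration only; the standard will define the actual names.
    x = data.get_column(x_column).to_array()
    y = data.get_column(y_column).to_array()
    plt.scatter(x, y)
    plt.xlabel(x_column)
    plt.ylabel(y_column)
    plt.show()
```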


### Change object from one implementation to another

Another use case considered is transforming the data from one implementation to another.

As an example, consider that we are using Dask data frames because our data is too big to
fit in memory and we are working over a cluster. At some point in our pipeline we have
reduced the size of the data frame we are working on, by filtering and grouping, and we
want to transform the data frame from Dask to pandas, to use some functionality that
pandas implements but Dask does not.

Since Dask knows how the data in the data frame is represented, one option could be to
implement a `.to_pandas()` method in the Dask data frame. Another option could be to
implement this in pandas, in a `.from_dask()` method.

As the ecosystem grows, this solution implies that every implementation could end up
having a long list of functions or methods:

- `to_pandas()` / `from_pandas()`
- `to_vaex()` / `from_vaex()`
- `to_modin()` / `from_modin()`
- `to_dask()` / `from_dask()`
- ...

With a standard Python data frame API, every library could simply implement a method to
import a standard data frame. Since data frame libraries are expected to implement this
API, that would be enough to transform any data frame into any implementation.

So, the list above would be reduced to a single function or method in each implementation:

- `from_dataframe()`

Note that the function `from_dataframe()` is for illustration, and not proposed as part
of the standard at this point.
> **Collaborator** (on lines +139 to +144): A dataframe protocol similar to wesm/dataframe-protocol#1 is a prerequisite to this being possible in my mind. Without having a data exchange protocol defined as part of the spec / goals, how can we define `from_dataframe` / `to_dataframe` APIs?

> **Member (author):** At this point the data exchange protocol is what we're trying to define. This use case tries to illustrate why such a data exchange protocol is needed. Do you think I should clarify this is the goal for the use cases? Or am I not understanding you?

> **Collaborator:** I don't think that it was made clear that a dataframe data exchange protocol was in scope in this document. The only mention of a protocol is in talking about Apache Arrow, as far as I can tell.

> **Comment:** I think it's good as it is. We are talking about use cases in this document, not the implementation, right? So we can loosely define what `from_dataframe` does, from a high-level point of view, to make the use case clear.

> **Member (author):** Sorry, forgot to comment here. I edited the scope since the last comment from @kkraus14. I guess making clear in the goal/scope that we are defining a data exchange protocol solved your concern @kkraus14, or do you think this use case also needs editing? Thanks both for the comments!
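To make the idea concrete, here is a sketch of how pandas could implement such a method,
under the same hypothetical column-access names used in the plotting example above (none
of these names are proposed as part of the standard):

```python
import pandas


def from_dataframe(standard_df) -> pandas.DataFrame:
    """
    Build a pandas DataFrame from any object implementing the standard API.
    """
    # `column_names`, `get_column` and `to_array` are hypothetical names,
    # used here only to illustrate the shape of such a conversion.
    data = {name: standard_df.get_column(name).to_array()
            for name in standard_df.column_names()}
    return pandas.DataFrame(data)
```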


Every pair of data frame libraries could benefit from this conversion. But we can go
deeper with an actual example: the conversion from an xarray `DataArray` to a pandas
`DataFrame`, and the other way round.

Even though xarray is not a data frame library but a multidimensional labeled structure,
when a 2-D structure is used, the data can be converted from and to a data frame.

Currently, xarray implements a `.to_pandas()` method to convert a `DataArray` to a
pandas `DataFrame`:

```python
import xarray

xarray_data = xarray.DataArray([[15, 2], [32, 5], [28, 3]],
                               dims=('diners', 'features'),
                               coords={'features': ['bill', 'tip']})

pandas_df = xarray_data.to_pandas()
```

To convert the pandas data frame to an xarray `DataArray`, both libraries provide
implementations. Both lines below are equivalent:

```python
pandas_df.to_xarray()
xarray.DataArray(pandas_df)
```

Other data frame implementations may or may not implement a way to convert to xarray,
and passing a data frame to the `DataArray` constructor may or may not work.

With the standard data frame API implemented by pandas, xarray and other libraries,
each library could convert other representations via a single `to_dataframe()` function
or method, and be converted automatically to any other representation implementing
that function.
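With the hypothetical names used above, the xarray / pandas round trip could then be as
simple as the sketch below. Neither function exists today; this only illustrates how the
standard would collapse pairwise converters into one entry point per library:

```python
# Hypothetical: assumes both libraries implement the standard API and
# expose a `from_dataframe` entry point (names for illustration only).
pandas_df = pandas.from_dataframe(xarray_data)  # xarray -> pandas
xarray_da = xarray.from_dataframe(pandas_df)    # pandas -> xarray
```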

This would make conversions very simple, not only among data frame libraries, but
also with other libraries whose data can be expressed as tabular data, such as
xarray, SQLAlchemy and others.