Adding introduction, goals, scope and use cases to the RFC #27

Merged · 8 commits · Sep 13, 2020 (showing changes from 2 commits)
**spec/01_purpose_and_scope.md** — 169 additions, 2 deletions

## Introduction

This document defines a Python data frame API.
> **Member:** I prefer dataframe as one word, like database.
>
> I do not want to start a holy war, and I realize there are historical reasons to call it data frame, but data base was common even throughout the 90s. https://groups.google.com/g/alt.usage.english/c/jRB0g0zK85Q?pli=1

A data frame is a programming interface for expressing data manipulations over a
data structure consisting of rows and columns. Columns are named, and values in a
column share a common data type. This definition is intentionally left broad.

## History and data frame implementations

Data frame libraries exist in several programming languages, such as
[R](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame),
[Scala](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html),
[Julia](https://juliadata.github.io/DataFrames.jl/stable/) and others.

> **Collaborator:** Suggested change: "Data frame libraries" → "Dataframe libraries".

In Python, the most popular data frame library is [pandas](https://pandas.pydata.org/).
pandas was initially developed at a hedge fund, with a focus on
[panel data](https://en.wikipedia.org/wiki/Panel_data) and financial time series.
It was open sourced in 2009, and has been growing in popularity since then, expanding
into many domains beyond time series and financial data. While still rich in time series
functionality, today it is considered a general-purpose data frame library. The original
`Panel` class that gave the library its name was deprecated in 2017 and removed in 2019,
to focus on the main `DataFrame` class.

> **Comment:** develop -> developed

Internally, pandas is implemented on top of NumPy, which is used to store the data
and to perform many of the operations. Some parts of pandas are written in Cython.

As of 2020, the pandas website receives around one and a half million visitors per month.

Other libraries have emerged in recent years to address some of the limitations of pandas.
But in most cases they implement a public API very similar to pandas, to make the
transition easier. A short description of the main data frame libraries in Python follows.

[Dask](https://dask.org/) is a task scheduler built in Python, which implements a data
frame interface. Dask data frames use pandas internally in the workers, and provide an
API similar to pandas, adapted to Dask's distributed and lazy nature.

[Vaex](https://vaex.io/) is an out-of-core alternative to pandas. Vaex uses HDF5 to
create memory maps that avoid loading data sets into memory. Some parts of Vaex are
implemented in C++.

[Modin](https://github.com/modin-project/modin) is another distributed data frame
library, originally based on [Ray](https://github.com/ray-project/ray). It is built in
a more modular way, which allows it to also use Dask as a scheduler, or to replace the
pandas-like public API with a SQLite-like one.

[cuDF](https://github.com/rapidsai/cudf) is a GPU data frame library built on top
of Apache Arrow and RAPIDS. It provides an API similar to pandas.

[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is a data
frame library that uses Spark as a backend. PySpark's public API is based on the
original Spark API, not on pandas.

[Koalas](https://github.com/databricks/koalas) is a data frame library built on
top of PySpark that provides a pandas-like API.

[Ibis](https://ibis-project.org/) is a data frame library with multiple SQL backends.
It uses SQLAlchemy and a custom SQL compiler to translate its pandas-like API into
SQL statements, executed by the backends. It supports conventional DBMSs, as well
as big data systems such as Apache Impala or BigQuery.


## Goals

Given the growing Python data frame ecosystem and its complexity, this document provides
a standard Python data frame API. Until recently, pandas has been the de facto standard for
Python data frames. But there is now a growing number not only of data frame libraries, but
also of libraries that interact with data frames (for example visualization, statistical or
machine learning libraries). Interactions among libraries are becoming complex, and the
pandas public API is suboptimal as a standard, because of its size, its complexity, and the
implementation details it exposes (for example, NumPy data types or `NaN`).


The goal of the API described in this document is to provide a standard interface that encapsulates
implementation details of data frame libraries. This will allow users and third-party libraries to
write code that interacts with a standard data frame, and not with specific implementations.

The defined API does not aim to be a convenient API for all users of data frames. Libraries targeting
specific users (data analysts, data scientists, quants, etc.) can be implemented on top of the
standard API. The standard API is targeted at software developers, who will write reusable code
(as opposed to users performing fast interactive analysis of data).

> **Member:** Do we leave the standardization of the end-user API as potential future work for us, or do we not plan on doing any of that?

> **Member:** Good question, and one that many readers will have. I think it would be good to explicitly state this is out of scope for this version of the standard, but may be in scope for a future version. With the rationale that it's also important: one of the longer-term goals should be (I think) to make the learning curve less steep for users switching from one library to another one.

> **Member:** The structure of:
>
> - Goals
> - Scope
> - Out-of-scope and non-goals
>
> is a little inconsistent. I'd suggest making it symmetric (and adding rationales, as I just did in my array API scope PR); then this kind of thing may be easier to address.

See the [scope](#Scope) section for detailed information on what is in scope, and the
[use cases](02_use_cases.html) section for details on the exact use cases considered.


## Scope

It is in the scope of this document to describe the different elements of the API. This
includes signatures and semantics. To be more specific:

> **Member:** There is a verb missing in the first sentence ("to describe"?)

- Data structures and Python classes
- Functions, methods, attributes and other API elements
- Expected returns of the different operations
- Data types (Python and low-level types)

The scope of this document is limited to generic data frames, and not data frames specific to
certain domains.


### Out-of-scope and non-goals

Implementation details of data frames and the execution of operations are out of scope. This includes:

- How data is represented and stored (whether the data is in memory, disk, distributed)
- Expectations on when execution happens (eagerly or lazily)
- Other execution details

> **Member:** I'd state here that an API designed for interactive usage is out of scope.

The API defined in this document needs to be usable by libraries as diverse as Ibis, Dask,
Vaex or cuDF. The data can live in databases, distributed systems, disk or GPU memory.
Any decision that involves assumptions on where the data is stored, or where execution
happens, is out of the scope of this document.

## Stakeholders

This section provides the list of stakeholders considered for the definition of this API.


### Data frame library authors
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Data frame library authors
### Dataframe library authors


Authors of data frame libraries in Python are expected to implement the API defined
in this document in their libraries.

> **Collaborator:** This is a very heavy handed statement. Could we reword it to something a bit friendlier, like: "We encourage data frame libraries in Python to implement the API defined in this document in their libraries"?


The list of known Python data frame libraries at the time of writing this document follows:

- [cuDF](https://github.com/rapidsai/cudf)
- [Dask](https://dask.org/)
- [datatable](https://github.com/h2oai/datatable)
- [dexplo](https://github.com/dexplo/dexplo/)
- [Eland](https://github.com/elastic/eland)
- [Grizzly](https://github.com/weld-project/weld#grizzly)
- [Ibis](https://ibis-project.org/)
- [Koalas](https://github.com/databricks/koalas)
- [Mars](https://docs.pymars.org/en/latest/)
- [Modin](https://github.com/modin-project/modin)
- [pandas](https://pandas.pydata.org/)
- [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
- [StaticFrame](https://static-frame.readthedocs.io/en/latest/)
- [Turi Create](https://github.com/apple/turicreate)
- [Vaex](https://vaex.io/)


### Downstream library authors

Authors of libraries that consume data frames. They can use the API defined in this document
to know how the data contained in a data frame can be consumed, and which operations are implemented.

A non-exhaustive list of downstream library categories follows:

- Plotting and visualization (e.g. Matplotlib, Bokeh, Altair, Plotly)
- Statistical libraries (e.g. statsmodels)
- Machine learning libraries (e.g. scikit-learn)


### Upstream library authors

Authors of libraries that provide functionality used by data frames.

A non-exhaustive list of upstream library categories follows:

- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow)
- Task schedulers (e.g. Dask, Ray)

> **Member:** NumPy as well? It's used by dataframe libraries for their implementation.

> **Collaborator:** Would include Mars (https://github.com/mars-project/mars) here as well.

> **Comment:** Should we add database and big data systems?

> **Member (author):** Good point. I don't think we are planning to engage with the developers of PostgreSQL, MySQL... I'm adding big data systems for now, and also Python libraries to access databases, which I guess we're more likely to engage with. But I'm open to further changes if there are different points of view.


### Data frame power users

> **Collaborator:** Suggested change: "Data frame power users" → "Dataframe power users".



This group comprises developers of reusable code that uses data frames: for example, developers
of applications that use data frames, or authors of libraries that provide specialized data
frame APIs to be built on top of the standard API.

People using data frames in an interactive way are considered out of scope. These users include
data analysts, data scientists and other users that are key for data frames. But this type of
user may need shortcuts, or libraries that make decisions for them to save them time, such as
automatic type inference, or extensive use of very compact syntax like Python square
brackets / `__getitem__`. Standardizing on such practices can be extremely difficult, and it
is out of scope.

With the development of a standard API that targets developers writing reusable code, we expect
to also serve data analysts and other interactive users, though indirectly: by providing a
standard API on top of which other libraries can be built, including libraries with the
syntactic sugar required for fast analysis of data.


## High-level API overview
---

**spec/02_use_cases.md** — 177 additions
# Use cases

## Introduction

This section discusses the use cases considered for the standard data frame API.

The goals and scope of this API are defined in the [goals](01_purpose_and_scope.html#Goals)
and [scope](01_purpose_and_scope.html#Scope) sections.

The target audience and stakeholders are presented in the
[stakeholders](01_purpose_and_scope.html#Stakeholders) section.


## Types of use cases

The following types of use cases can be accomplished using the standard Python data frame
API defined in this document:

- Downstream library receiving a data frame as a parameter
- Converting a data frame from one implementation to another (try to clarify)

Other types of use cases, not related to data interchange, will be added later.


## Concrete use cases

In this section we define concrete examples of the types of use cases defined above.

### Plotting library receiving data as a data frame

One use case we facilitate with the API defined in this document is a plotting library
receiving the data to be plotted as a data frame object.

Consider the case of a scatter plot, to be drawn from the data contained in a
data frame structure. For example, consider this data:

| petal length | petal width |
|--------------|-------------|
| 1.4 | 0.2 |
| 1.7 | 0.4 |
| 1.3 | 0.2 |
| 1.5 | 0.1 |

If we consider a pure Python implementation, we could for example receive the information
as two lists, one for the _petal length_ and one for the _petal width_.

```python
petal_length = [1.4, 1.7, 1.3, 1.5]
petal_width = [0.2, 0.4, 0.2, 0.1]

def scatter_plot(x: list, y: list):
    """
    Generate a scatter plot with the information provided in `x` and `y`.
    """
    ...
```

When we consider data frames, we would like to provide them directly to the `scatter_plot`
function, and we would like the plotting library to be agnostic about which specific library
will be used when calling the function. The code should work whether a pandas, Dask, Vaex
or other current or future implementation is used.

An implementation of the `scatter_plot` function could be:

```python
def scatter_plot(data: dataframe, x_column: str, y_column: str):
    """
    Generate a scatter plot with the information provided in the columns
    `x_column` and `y_column` of `data`.
    """
    ...
```

The API documented here describes what the developer of the plotting library can expect
from the object `data`: in which ways they can interact with the data frame object to
extract the desired information.

An example of this is Seaborn plots. For instance, the
[scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) function
accepts a parameter `data`, which is expected to be a `DataFrame`.

When providing a pandas `DataFrame`, the following code generates the intended scatter plot:

```python
import pandas
import seaborn

pandas_df = pandas.DataFrame({'bill': [15, 32, 28],
                              'tip': [2, 5, 3]})

seaborn.scatterplot(data=pandas_df, x='bill', y='tip')
```

But if we instead provide a Vaex data frame, then an exception occurs:

```python
import vaex

vaex_df = vaex.from_pandas(pandas_df)

# Raises an exception: Seaborn expects a pandas DataFrame
seaborn.scatterplot(data=vaex_df, x='bill', y='tip')
```

This is caused by Seaborn expecting a pandas `DataFrame` object: while Vaex provides an
interface very similar to pandas, it does not implement 100% of its API, and Seaborn is
trying to use parts that differ.

With the definition of the standard API, Seaborn developers should be able to expect a
generic data frame, and data frames from any library implementing the standard API
(Vaex, cuDF, Ibis, Dask, Modin, etc.) could be plotted with the previous example.
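As an illustration, here is a minimal sketch of what a library-agnostic `scatter_plot`
could look like. The column-access methods used (`get_column` and `to_array`) are
hypothetical, not part of any existing library or of the standard at this point; they only
show the kind of generic interaction the API should enable:

```python
import matplotlib.pyplot as plt


def scatter_plot(data, x_column: str, y_column: str):
    """
    Generate a scatter plot from any data frame implementing the standard API.
    """
    # `get_column` and `to_array` are hypothetical standard API methods,
    # used for illustration only; the standard will define the actual names.
    x = data.get_column(x_column).to_array()
    y = data.get_column(y_column).to_array()
    plt.scatter(x, y)
    plt.xlabel(x_column)
    plt.ylabel(y_column)
    plt.show()
```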


### Change object from one implementation to another

Another use case considered is transforming the data from one implementation to another.

As an example, consider that we are using Dask data frames because our data is too big to
fit in memory and we are working over a cluster. At some point in our pipeline we have
reduced the size of the data frame we are working on, by filtering and grouping, and we
want to transform the data frame from Dask to pandas, to use some functionality that
pandas implements but Dask does not.

Since Dask knows how the data in the data frame is represented, one option could be to
implement a `.to_pandas()` method in the Dask data frame. Another option could be to
implement this in pandas, in a `.from_dask()` method.

As the ecosystem grows, this solution implies that every implementation could end up
having a long list of functions or methods:

- `to_pandas()` / `from_pandas()`
- `to_vaex()` / `from_vaex()`
- `to_modin()` / `from_modin()`
- `to_dask()` / `from_dask()`
- ...

With a standard Python data frame API, every library could simply implement a method to
import a standard data frame. Since data frame libraries are expected to implement this
API, that would be enough to transform any data frame into any implementation.

So, the list above would be reduced to a single function or method in each implementation:

- `from_dataframe()`

Note that the function `from_dataframe()` is for illustration, and not proposed as part
of the standard at this point.
> **Collaborator** (on lines +139 to +144): A dataframe protocol similar to wesm/dataframe-protocol#1 is a prerequisite to this being possible in my mind. Without having a data exchange protocol defined as part of the spec / goals, how can we define `from_dataframe` / `to_dataframe` APIs?

> **Member (author):** At this point the data exchange protocol is what we're trying to define. This use case tries to illustrate why such a data exchange protocol is needed. Do you think I should clarify this is the goal for the use cases? Or am I not understanding you?

> **Collaborator:** I don't think that it was made clear that a dataframe data exchange protocol was in scope in this document. The only mention of a protocol is in talking about Apache Arrow, as far as I can tell.

> **Comment:** I think it's good as it is. We are talking about use cases in this document, not the implementation, right? So we can loosely define what `from_dataframe` does, from a high-level point of view, to make the use case clear.

> **Member (author):** Sorry, forgot to comment here. I edited the scope since the last comment from @kkraus14. I guess making clear in the goal/scope that we are defining a data exchange protocol solved your concern @kkraus14, or do you think this use case also needs editing? Thanks both for the comments!
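To make the idea concrete, here is a sketch of how pandas could implement such a method,
under the same hypothetical column-access names used in the plotting example above (none
of these names are proposed as part of the standard):

```python
import pandas


def from_dataframe(standard_df) -> pandas.DataFrame:
    """
    Build a pandas DataFrame from any object implementing the standard API.
    """
    # `column_names`, `get_column` and `to_array` are hypothetical names,
    # used here only to illustrate the shape of such a conversion.
    data = {name: standard_df.get_column(name).to_array()
            for name in standard_df.column_names()}
    return pandas.DataFrame(data)
```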


Every pair of data frame libraries could benefit from this conversion. But we can go
deeper with an actual example: the conversion from an xarray `DataArray` to a pandas
`DataFrame`, and the other way round.

Even though xarray is not a data frame library but a multidimensional labeled structure,
when a 2-D structure is used, the data can be converted from and to a data frame.

Currently, xarray implements a `.to_pandas()` method to convert a `DataArray` to a
pandas `DataFrame`:

```python
import xarray

xarray_data = xarray.DataArray([[15, 2], [32, 5], [28, 3]],
                               dims=('diners', 'features'),
                               coords={'features': ['bill', 'tip']})

pandas_df = xarray_data.to_pandas()
```

To convert the pandas data frame to an xarray `DataArray`, both libraries provide
implementations. Both lines below are equivalent:

```python
pandas_df.to_xarray()
xarray.DataArray(pandas_df)
```

Other data frame implementations may or may not implement a way to convert to xarray,
and passing a data frame to the `DataArray` constructor may or may not work.

With the standard data frame API implemented by pandas, xarray and other libraries,
each library could convert other representations via a single `to_dataframe()` function
or method, and be converted automatically to any other representation implementing
that function.
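With the hypothetical names used above, the xarray / pandas round trip could then be as
simple as the sketch below. Neither function exists today; this only illustrates how the
standard would collapse pairwise converters into one entry point per library:

```python
# Hypothetical: assumes both libraries implement the standard API and
# expose a `from_dataframe` entry point (names for illustration only).
pandas_df = pandas.from_dataframe(xarray_data)  # xarray -> pandas
xarray_da = xarray.from_dataframe(pandas_df)    # pandas -> xarray
```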

This would make conversions very simple, not only among data frame libraries, but
also with other libraries whose data can be expressed as tabular data, such as
xarray, SQLAlchemy and others.