Adding introduction, goals, scope and use cases to the RFC #27
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -2,19 +2,186 @@ | |||||
|
||||||
## Introduction | ||||||
|
||||||
This document defines a Python data frame API. | ||||||
|
||||||
A data frame is a programming interface for expressing data manipulations over a | ||||||
data structure consisting of rows and columns. Columns are named, and values in a | ||||||
column share a common data type. This definition is intentionally left broad. | ||||||
|
||||||
## History and data frame implementations | ||||||
|
||||||
Data frame libraries exist in several programming languages, such as
|
||||||
[R](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame), | ||||||
[Scala](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html), | ||||||
[Julia](https://juliadata.github.io/DataFrames.jl/stable/) and others. | ||||||
|
||||||
In Python, the most popular data frame library is [pandas](https://pandas.pydata.org/). | ||||||
pandas was initially developed at a hedge fund, with a focus on
|
||||||
[panel data](https://en.wikipedia.org/wiki/Panel_data) and financial time series. | ||||||
It was open sourced in 2009, and since then it has grown in popularity and spread to
many domains beyond time series and financial data. While still rich in time series
functionality, today it is considered a general-purpose data frame library. The original
`Panel` class that gave the library its name was deprecated in 2017 and removed in 2019,
to focus on the main `DataFrame` class.
|
||||||
Internally, pandas is implemented on top of NumPy, which is used to store the data | ||||||
and to perform many of the operations. Some parts of pandas are written in Cython.
|
||||||
|
||||||
As of 2020, the pandas website has around one and a half million visitors per month.
|
||||||
Other libraries have emerged in recent years to address some of the limitations of pandas.
In most cases, these libraries implement a public API very similar to that of pandas, to
make the transition to them easier. A short description of the main data frame libraries
in Python follows.
|
||||||
[Dask](https://dask.org/) is a task scheduler built in Python, which implements a data
frame interface. Dask data frames use pandas internally in the workers, and provide
an API similar to pandas, adapted to Dask's distributed and lazy nature.
|
||||||
[Vaex](https://vaex.io/) is an out-of-core alternative to pandas. Vaex uses HDF5 to
create memory maps that avoid loading data sets into memory. Some parts of Vaex are
implemented in C++. | ||||||
|
||||||
[Modin](https://github.com/modin-project/modin) is another distributed data frame
library, originally based on [Ray](https://github.com/ray-project/ray). It is built in
a more modular way, which allows it to also use Dask as a scheduler, or to replace the
pandas-like public API with a SQLite-like one.
|
||||||
[cuDF](https://github.com/rapidsai/cudf) is a GPU data frame library built on top | ||||||
of Apache Arrow and RAPIDS. It provides an API similar to pandas. | ||||||
|
||||||
[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is a data
frame library that uses Spark as a backend. The PySpark public API is based on the
original Spark API, and not on pandas.
|
||||||
[Koalas](https://github.com/databricks/koalas) is a data frame library built on | ||||||
top of PySpark that provides a pandas-like API. | ||||||
|
||||||
[Ibis](https://ibis-project.org/) is a data frame library with multiple SQL backends. | ||||||
It uses SQLAlchemy and a custom SQL compiler to translate its pandas-like API to | ||||||
SQL statements, executed by the backends. It supports conventional DBMS, as well | ||||||
as big data systems such as Apache Impala or BigQuery. | ||||||
|
||||||
|
||||||
## Goals | ||||||
|
||||||
Given the growing Python data frame ecosystem, and its complexity, this document provides
a standard Python data frame API. Until recently, pandas has been the de facto standard for
Python data frames. But there is now a growing number not only of data frame libraries,
but also of libraries that interact with data frames (visualization, statistical or machine learning
libraries, for example). Interactions among libraries are becoming complex, and the pandas
public API is suboptimal as a standard because of its size, its complexity, and the implementation
details it exposes (for example, NumPy data types or `NaN`).
|
||||||
|
||||||
The goal of the API described in this document is to provide a standard interface that encapsulates | ||||||
implementation details of data frame libraries. This will allow users and third-party libraries to | ||||||
write code that interacts with a standard data frame, and not with specific implementations. | ||||||
|
||||||
The defined API does not aim to be a convenient API for all users of data frames. Libraries targeting
specific users (data analysts, data scientists, quants, etc.) can be implemented on top of the
standard API. The standard API is targeted at software developers who write reusable code
(as opposed to users performing fast interactive analysis of data).

Review comment: Do we leave the standardization of the end-user API as potential future work for us, or do we not plan on doing any of that?

Review comment: Good question, and one that many readers will have. I think it would be good to explicitly state that this is out of scope for this version of the standard, but may be in scope for a future version. With a rationale that it's also important, one of the longer-term goals should be (I think) to make the learning curve for users less steep when switching from one library to another one.

Review comment: The structure of:
|
||||||
See the [scope](#Scope) section for detailed information on what is in scope, and the | ||||||
[use cases](02_use_cases.html) section for details on the exact use cases considered. | ||||||
|
||||||
|
||||||
## Scope | ||||||
|
||||||
It is in the scope of this document the different elements of the API. This includes signatures | ||||||
Review comment: There is a verb missing in the first sentence ("to describe"?)
||||||
and semantics. To be more specific: | ||||||
|
||||||
- Data structures and Python classes | ||||||
- Functions, methods, attributes and other API elements | ||||||
- Expected returns of the different operations | ||||||
- Data types (Python and low-level types) | ||||||
|
||||||
The scope of this document is limited to generic data frames, and not data frames specific to | ||||||
certain domains. | ||||||
|
||||||
|
||||||
### Out-of-scope and non-goals | ||||||
|
||||||
Implementation details of the data frames and the execution of operations are out of scope. This includes:
|
||||||
- How data is represented and stored (whether the data is in memory, on disk, or distributed)
- Expectations on when the execution is happening (in an eager or lazy way) | ||||||
- Other execution details | ||||||
|
||||||
Review comment: I'd state here that an API designed for interactive usage is out of scope.
||||||
The API defined in this document needs to be used by libraries as diverse as Ibis, Dask,
Vaex or cuDF. The data can live in databases, distributed systems, disk or GPU memory.
Any decision that involves assumptions about where the data is stored, or where execution
happens, is out of the scope of this document.
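
To illustrate why storage and execution details can stay out of scope, here is a minimal sketch. It assumes a hypothetical one-method interface (`get_column`), which is not part of any proposed standard. The toy classes `EagerFrame` and `LazyFrame` are likewise invented for this example. The consumer function works the same way whether values are held in memory or computed on request.

```python
from typing import Callable, Dict, List, Protocol


class SupportsColumns(Protocol):
    """Hypothetical minimal interface; not part of any published standard."""

    def get_column(self, name: str) -> List[float]: ...


class EagerFrame:
    """Toy implementation that keeps every column materialized in memory."""

    def __init__(self, columns: Dict[str, List[float]]) -> None:
        self._columns = columns

    def get_column(self, name: str) -> List[float]:
        return list(self._columns[name])


class LazyFrame:
    """Toy implementation that stores a recipe per column and only
    computes the values when they are requested."""

    def __init__(self, loaders: Dict[str, Callable[[], List[float]]]) -> None:
        self._loaders = loaders

    def get_column(self, name: str) -> List[float]:
        return self._loaders[name]()


def column_mean(frame: SupportsColumns, name: str) -> float:
    """Consumer code: relies only on the shared interface."""
    values = frame.get_column(name)
    return sum(values) / len(values)


eager = EagerFrame({"tip": [2.0, 5.0, 3.0]})
lazy = LazyFrame({"tip": lambda: [2.0, 5.0, 3.0]})
```

Calling `column_mean(eager, "tip")` and `column_mean(lazy, "tip")` goes through the same code path in the consumer, which never needs to know when or where the values are produced.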
|
||||||
## Stakeholders | ||||||
|
||||||
This section provides the list of stakeholders considered for the definition of this API. | ||||||
|
||||||
|
||||||
### Data frame library authors | ||||||
|
||||||
|
||||||
Authors of data frame libraries in Python are expected to implement the API defined | ||||||
in this document in their libraries. | ||||||
Review comment: This is a very heavy handed statement. Could we reword it to something a bit friendlier of:
|
||||||
|
||||||
The known Python data frame libraries at the time of writing this document are:
|
||||||
- [cuDF](https://github.com/rapidsai/cudf) | ||||||
- [Dask](https://dask.org/) | ||||||
- [datatable](https://github.com/h2oai/datatable) | ||||||
- [dexplo](https://github.com/dexplo/dexplo/) | ||||||
- [Eland](https://github.com/elastic/eland) | ||||||
- [Grizzly](https://github.com/weld-project/weld#grizzly) | ||||||
- [Ibis](https://ibis-project.org/) | ||||||
- [Koalas](https://github.com/databricks/koalas) | ||||||
- [Mars](https://docs.pymars.org/en/latest/) | ||||||
- [Modin](https://github.com/modin-project/modin) | ||||||
- [pandas](https://pandas.pydata.org/) | ||||||
- [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) | ||||||
- [StaticFrame](https://static-frame.readthedocs.io/en/latest/) | ||||||
- [Turi Create](https://github.com/apple/turicreate) | ||||||
- [Vaex](https://vaex.io/) | ||||||
|
||||||
|
||||||
### Downstream library authors | ||||||
|
||||||
Authors of libraries that consume data frames. They can use the API defined in this document | ||||||
to know how the data contained in a data frame can be consumed, and which operations are implemented. | ||||||
|
||||||
A non-exhaustive list of downstream library categories follows:
|
||||||
- Plotting and visualization (e.g. Matplotlib, Bokeh, Altair, Plotly) | ||||||
- Statistical libraries (e.g. statsmodels) | ||||||
- Machine learning libraries (e.g. scikit-learn) | ||||||
|
||||||
|
||||||
### Upstream library authors | ||||||
|
||||||
Authors of libraries that provide functionality used by data frames. | ||||||
|
||||||
A non-exhaustive list of upstream categories follows:
|
||||||
- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow) | ||||||
Review comment: NumPy as well? It's used by dataframe libraries for their implementation.
||||||
- Task schedulers (e.g. Dask, Ray) | ||||||
Review comment: Would include Mars (https://github.com/mars-project/mars) here as well.
||||||
|
||||||
Review comment: Should we add Database and Big Data systems?

Review comment: Good point. I don't think we are planning to engage with developers of PostgreSQL, MySQL... I'm adding for now big data systems, and also Python libraries to access databases, which I guess we're more likely to engage with. But I'm open to further changes if there are different points of view.
||||||
|
||||||
### Data frame power users | ||||||
|
||||||
|
||||||
|
||||||
This group consists of developers of reusable code that uses data frames: for example, developers of
applications that use data frames, or authors of libraries that provide specialized data frame
APIs built on top of the standard API.
|
||||||
People using data frames in an interactive way are considered out of scope. These users include data
analysts, data scientists and other users that are key for data frames. But this type of user may need
shortcuts, or libraries that make decisions for them to save them time, for example automatic type
inference, or excessive use of very compact syntax like Python square brackets / `__getitem__`.
Standardizing on such practices can be extremely difficult, and it is out of scope.
|
||||||
With the development of a standard API that targets developers writing reusable code, we expect
to also serve data analysts and other interactive users, but in an indirect way: by providing a
standard API on top of which other libraries can be built, including libraries with the syntactic
sugar required for fast analysis of data.
|
||||||
|
||||||
## High-level API overview | ||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,184 @@ | ||
# Use cases | ||
|
||
## Introduction | ||
|
||
This section discusses the use cases considered for the standard data frame API. | ||
|
||
The goals and scope of this API are defined in the [goals](01_purpose_and_scope.html#Goals), | ||
and [scope](01_purpose_and_scope.html#Scope) sections. | ||
|
||
The target audience and stakeholders are presented in the | ||
[stakeholders](01_purpose_and_scope.html#Stakeholders) section. | ||
|
||
|
||
## Types of use cases | ||
|
||
The following types of use cases can be accomplished with the standard Python data frame
API defined in this document:
|
||
- Downstream library receiving a data frame as a parameter | ||
- Converting a data frame from one implementation to another (try to clarify) | ||
|
||
Other types of use cases not related to data interchange will be added later.
|
||
|
||
## Concrete use cases | ||
|
||
In this section we define concrete examples of the types of use cases defined above. | ||
|
||
### Plotting library receiving data as a data frame | ||
|
||
One use case we facilitate with the API defined in this document is a plotting library | ||
receiving the data to be plotted as a data frame object. | ||
|
||
Consider the case of a scatter plot, that will be plotted with the data contained in a | ||
data frame structure. For example, consider this data: | ||
|
||
| petal length | petal width | | ||
|--------------|-------------| | ||
| 1.4 | 0.2 | | ||
| 1.7 | 0.4 | | ||
| 1.3 | 0.2 | | ||
| 1.5 | 0.1 | | ||
|
||
If we consider a pure Python implementation, we could for example receive the information | ||
as two lists, one for the _petal length_ and one for the _petal width_. | ||
|
||
```python
petal_length = [1.4, 1.7, 1.3, 1.5]
petal_width = [0.2, 0.4, 0.2, 0.1]

def scatter_plot(x: list, y: list):
    """
    Generate a scatter plot with the information provided in `x` and `y`.
    """
    ...
```
|
||
When we consider data frames, we would like to pass them directly to the `scatter_plot`
function. And we would like the plotting library to be agnostic of which specific library
is used when calling the function. We would like the code to work whether a pandas,
Dask, Vaex or other current or future implementation is used.
|
||
An implementation of the `scatter_plot` function could be: | ||
|
||
```python
def scatter_plot(data: "dataframe", x_column: str, y_column: str):
    """
    Generate a scatter plot with the information provided in the columns
    `x_column` and `y_column` of `data`.

    `dataframe` is a placeholder for any standard-compliant data frame type.
    """
    ...
```
|
||
The API documented here describes what the developer of the plotting library can expect
from the object `data`, that is, the ways in which they can interact with the data frame
object to extract the desired information.
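
As a rough sketch of what such an interaction might look like, the following example assumes a hypothetical two-method interface (`column_names` and `get_column`). These names are placeholders chosen for illustration and are not proposed as part of the standard. The toy `DictFrame` class stands in for any compliant library, and `scatter_points` only computes the coordinates a plotting library would draw.

```python
from typing import Dict, List, Protocol, Sequence, Tuple


class StandardDataFrame(Protocol):
    """Hypothetical stand-in for the standard API (names are assumptions)."""

    def column_names(self) -> Sequence[str]: ...

    def get_column(self, name: str) -> Sequence[float]: ...


class DictFrame:
    """Toy data frame backed by a dict of lists, for demonstration only."""

    def __init__(self, columns: Dict[str, List[float]]) -> None:
        self._columns = columns

    def column_names(self) -> Sequence[str]:
        return list(self._columns)

    def get_column(self, name: str) -> Sequence[float]:
        return self._columns[name]


def scatter_points(
    data: StandardDataFrame, x_column: str, y_column: str
) -> List[Tuple[float, float]]:
    """Return the (x, y) pairs a scatter plot would draw."""
    for name in (x_column, y_column):
        if name not in data.column_names():
            raise KeyError(f"column not found: {name!r}")
    return list(zip(data.get_column(x_column), data.get_column(y_column)))


iris = DictFrame({
    "petal length": [1.4, 1.7, 1.3, 1.5],
    "petal width": [0.2, 0.4, 0.2, 0.1],
})
points = scatter_points(iris, "petal length", "petal width")
```

The plotting function depends only on the interface, so any object providing those two methods could be passed in place of `DictFrame`.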
|
||
An example of this are Seaborn plots. For example, the | ||
[scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) accepts a | ||
parameter `data`, which is expected to be a `DataFrame`. | ||
|
||
When providing a pandas `DataFrame`, the following code generates the intended scatter plot:
|
||
```python
import pandas
import seaborn

pandas_df = pandas.DataFrame({'bill': [15, 32, 28],
                              'tip': [2, 5, 3]})

seaborn.scatterplot(data=pandas_df, x='bill', y='tip')
```
|
||
But if we instead provide a Vaex data frame, then an exception occurs: | ||
|
||
```python
import vaex

vaex_df = vaex.from_pandas(pandas_df)

seaborn.scatterplot(data=vaex_df, x='bill', y='tip')
```
|
||
This is caused by Seaborn expecting a pandas `DataFrame` object. And while Vaex | ||
provides an interface very similar to pandas, it does not implement 100% of its | ||
API, and Seaborn is trying to use parts that differ. | ||
|
||
With the definition of the standard API, Seaborn developers should be able to | ||
expect a generic data frame. And any library implementing the standard data frame | ||
API could be plotted with the previous example (Vaex, cuDF, Ibis, Dask, Modin, etc.). | ||
|
||
|
||
### Change object from one implementation to another | ||
|
||
Another considered use case is transforming the data from one implementation to another. | ||
|
||
As an example, consider we are using Dask data frames, given that our data is too big to | ||
fit in memory, and we are working over a cluster. At some point in our pipeline, we | ||
reduced the size of the data frame we are working on, by filtering and grouping. And | ||
we are interested in transforming the data frame from Dask to pandas, to use some | ||
functionalities that pandas implements but Dask does not. | ||
|
||
Since Dask knows how the data in the data frame is represented, one option could be to | ||
implement a `.to_pandas()` method in the Dask data frame. Another option could be to | ||
implement this in pandas, in a `.from_dask()` method. | ||
|
||
As the ecosystem grows, this solution implies that every implementation could end up | ||
having a long list of functions or methods: | ||
|
||
- `to_pandas()` / `from_pandas()` | ||
- `to_vaex()` / `from_vaex()` | ||
- `to_modin()` / `from_modin()` | ||
- `to_dask()` / `from_dask()` | ||
- ... | ||
|
||
With a standard Python data frame API, every library could simply implement a method to | ||
import a standard data frame. And since data frame libraries are expected to implement | ||
this API, that would be enough to transform any data frame to one implementation. | ||
|
||
So, the list above would be reduced to a single function or method in each implementation: | ||
|
||
- `from_dataframe()` | ||
|
||
Note that the function `from_dataframe()` is for illustration, and not proposed as part | ||
of the standard at this point. | ||
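
The idea can be sketched with two toy libraries that share a hypothetical interface. The method names `column_names` and `get_column`, and the classes `ColumnFrame` and `RowFrame`, are assumptions made for this example only. Each class implements a single `from_dataframe()` that accepts any object exposing that interface, so no pairwise `to_x()` / `from_x()` methods are needed.

```python
from typing import Any, Dict, List, Protocol, Sequence


class Interchangeable(Protocol):
    """Hypothetical shared interface; method names are assumptions."""

    def column_names(self) -> Sequence[str]: ...

    def get_column(self, name: str) -> Sequence[Any]: ...


class ColumnFrame:
    """Toy column-oriented library."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self._columns = columns

    def column_names(self) -> Sequence[str]:
        return list(self._columns)

    def get_column(self, name: str) -> Sequence[Any]:
        return list(self._columns[name])

    @classmethod
    def from_dataframe(cls, other: Interchangeable) -> "ColumnFrame":
        return cls({n: list(other.get_column(n)) for n in other.column_names()})


class RowFrame:
    """Toy row-oriented library."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self._rows = rows

    def column_names(self) -> Sequence[str]:
        return list(self._rows[0]) if self._rows else []

    def get_column(self, name: str) -> Sequence[Any]:
        return [row[name] for row in self._rows]

    @classmethod
    def from_dataframe(cls, other: Interchangeable) -> "RowFrame":
        names = list(other.column_names())
        columns = [other.get_column(n) for n in names]
        return cls([dict(zip(names, values)) for values in zip(*columns)])


cols = ColumnFrame({"bill": [15, 32, 28], "tip": [2, 5, 3]})
rows = RowFrame.from_dataframe(cols)           # column-oriented -> row-oriented
round_trip = ColumnFrame.from_dataframe(rows)  # and back again
```

With N libraries, this design needs N importers instead of roughly N × N pairwise converters, at the cost of every library agreeing on the shared interface.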
Comments on lines +139 to +144:

Review comment: A dataframe protocol similar to wesm/dataframe-protocol#1 is a prerequisite to this being possible in my mind. Without having a data exchange protocol defined as part of the spec / goal how can we define

Review comment: At this point the data exchange protocol is what we're trying to define. This use case tries to illustrate why such a data exchange protocol is needed. Do you think I should clarify this is the goal for the use cases? Or am I not understanding you?

Review comment: I don't think that it was made clear that a dataframe data exchange protocol was in scope in this document. The only mention of a protocol is in talking about Apache Arrow as far as I can tell.

Review comment: I think it's good as it is. We are talking about use cases in this document, not the implementation right? So we can loosely define what
||
|
||
Every pair of data frame libraries could benefit from this conversion. But we can go
deeper with an actual example: the conversion from an xarray `DataArray` to a pandas
`DataFrame`, and the other way round.
|
||
Even if xarray is not a data frame library, but a multidimensional labeled structure,
in cases where a 2-D structure is used, the data can be converted from and to a data frame.
|
||
Currently, xarray implements a `.to_pandas()` method to convert a `DataArray` to a | ||
pandas `DataFrame`: | ||
|
||
```python
import xarray

xarray_data = xarray.DataArray([[15, 2], [32, 5], [28, 3]],
                               dims=('diners', 'features'),
                               coords={'features': ['bill', 'tip']})

pandas_df = xarray_data.to_pandas()
```
|
||
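To convert the pandas data frame back to xarray, both libraries have
implementations. Note that the two lines below are not strictly equivalent:
`DataFrame.to_xarray()` returns an xarray `Dataset` (one variable per column),
while the constructor returns a `DataArray`: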
|
||
```python
pandas_df.to_xarray()        # returns an xarray Dataset, one variable per column
xarray.DataArray(pandas_df)  # returns a 2-D xarray DataArray
```
|
||
Other data frame implementations may or may not implement a way to convert to xarray. | ||
And passing a data frame to the `DataArray` constructor may or may not work. | ||
|
||
The standard data frame API would allow pandas, xarray and other libraries to
implement the standard API. They could convert other representations via a single
`to_dataframe()` function or method. And they could be converted automatically to other
representations that implement that function.
|
||
This would make conversions very simple, not only among data frame libraries, but
also among other libraries whose data can be expressed as tabular data, such as
xarray, SQLAlchemy and others.
Review comment: I prefer `dataframe` as one word, like `database`. I do not want to start a holy war, and I realize there are historical reasons to call it `data frame`, but `data base` was common even throughout the 90s. https://groups.google.com/g/alt.usage.english/c/jRB0g0zK85Q?pli=1