# Add a summary document for the dataframe interchange protocol (#30)

`protocol/dataframe_protocol_summary.md` (34 additions, 21 deletions)
_[...] requirements/principles and functionality it needs to support._

## Purpose of `__dataframe__`

The purpose of `__dataframe__` is to be a _data interchange_ protocol. That
is, a way to convert one type of dataframe into another type (for example,
convert a Koalas dataframe into a Pandas dataframe, or a cuDF dataframe into
a Vaex dataframe).

Currently (Nov'20) there is no way to do this in an implementation-independent way.

The main use case this protocol intends to enable is to make it possible to
write code that can accept any type of dataframe instead of being tied to a
single type of dataframe. To illustrate that:

```python
import pandas as pd

def somefunc(df, *args, **kwargs):
    """`df` can be any dataframe supporting the protocol, rather than (say)
    only a pandas.DataFrame"""
    # could also be `cudf.from_dataframe(df)`, or `vaex.from_dataframe(df)`
    # note: this should raise a TypeError if it cannot be done without a device
    # transfer (e.g. move data from GPU to CPU) - pass `force=True` in that case
    new_pandas_df = pd.from_dataframe(df)
    # From now on, use Pandas dataframe internally
```
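
To make the dispatch concrete, here is a minimal sketch of what a
consumer-side `from_dataframe` might look like. Everything in it is an
assumption for illustration only: the `__dataframe__` call signature, the
`force` keyword, and the mapping-style column access are not settled API.

```python
import pandas as pd

def from_dataframe(df, force=False):
    """Hypothetical consumer-side entry point (not a real pandas function)."""
    if isinstance(df, pd.DataFrame):
        return df  # already the native type; nothing to convert
    if not hasattr(df, "__dataframe__"):
        raise TypeError(f"{type(df).__name__} does not support __dataframe__")
    # The producer is expected to raise TypeError if the export would need a
    # device transfer (e.g. GPU -> CPU) and force is False.
    exchange_df = df.__dataframe__(force=force)
    # Assume mapping-style access to 1-D column data here; a real
    # implementation would work from dtype and buffer descriptions instead.
    data = {name: exchange_df[name] for name in exchange_df.column_names()}
    return pd.DataFrame(data)
```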

_[...] this is a consequence, and that that should be acceptable to them._

## Protocol design requirements

1. Must be a standard Python-level API that is unambiguously specified, and
   not rely on implementation details of any particular dataframe library.
2. Must treat dataframes as a collection of columns (which are 1-D arrays
   with a dtype and missing data support).
   _Note: this relates to the API for `__dataframe__`, and does not imply
   that the underlying implementation must use columnar storage!_
3. Must allow the consumer to select a specific set of columns for conversion
   (see the sketch after this list).
4. Must allow the consumer to access the following "metadata" of the dataframe:
   number of rows, number of columns, column names, column data types.
   TBD: column data types weren't clearly decided on, nor are they present in
   https://github.com/wesm/dataframe-protocol
5. Must include device support.
6. Must avoid device transfers by default (e.g. copy data from GPU to CPU),
   and provide an explicit way to force such transfers (e.g. a `force=` or
   `copy=` keyword that the caller can set to `True`).
7. Must be zero-copy if possible.
8. Must be able to support "virtual columns" (e.g., a library like Vaex, which
   may not have data in memory because it uses lazy evaluation).
9. Must support missing values (`NA`) for all supported dtypes.
10. Must support string and categorical dtypes.
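
To make requirements 3 and 4 concrete, here is a minimal sketch of a
producer-side exchange object. All names (`ProtocolDataFrame`, `num_rows`,
`select_columns`, ...) are invented for illustration and are not part of any
agreed API; an object shaped like this would also work with the hypothetical
`from_dataframe` sketch shown earlier.

```python
class ProtocolDataFrame:
    """Toy producer-side exchange object (all names hypothetical)."""

    def __init__(self, columns):
        # columns: dict mapping column name -> 1-D list of values
        self._columns = dict(columns)

    # Metadata access (requirement 4).
    def num_rows(self):
        return len(next(iter(self._columns.values()), []))

    def num_columns(self):
        return len(self._columns)

    def column_names(self):
        return list(self._columns)

    def column_dtypes(self):
        # Placeholder: a real protocol would use a well-specified dtype
        # description, not Python types.
        return {n: type(c[0]) if c else None for n, c in self._columns.items()}

    # Column selection (requirement 3).
    def select_columns(self, names):
        return ProtocolDataFrame({n: self._columns[n] for n in names})

    def __getitem__(self, name):
        return self._columns[name]

pdf = ProtocolDataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
assert pdf.num_rows() == 3
assert pdf.select_columns(["a"]).column_names() == ["a"]
```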

We'll also list some things that were discussed but are not requirements:

1. Object dtype does not need to be supported.
2. Nested/structured dtypes within a single column do not need to be
   supported.
   _Rationale: not used a lot, additional design complexity not justified.
   May be added in the future; nested dtypes do have support in the Arrow C
   Data Interface (see the short sketch after this list)._
3. Extension dtypes do not need to be supported.
   _Rationale: same as (2)._
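
For context on point (2), this is what a nested dtype looks like in Arrow,
sketched here with the `pyarrow` bindings purely as an illustration:

```python
import pyarrow as pa

# A single column holding (x, y) pairs as a nested/struct dtype.
struct_type = pa.struct([("x", pa.float64()), ("y", pa.float64())])
col = pa.array([{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}], type=struct_type)
print(col.type)  # struct<x: double, y: double>
```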


## Frequently asked questions
What we are aiming for is quite similar to the Arrow C Data Interface (see
the [rationale for the Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html#rationale)),
except `__dataframe__` is a Python-level rather than C-level interface.
_TODO: one key thing is that the Arrow C Data Interface relies on providing a
deletion/finalization method similar to DLPack. The desired semantics here
need to be ironed out. See the Arrow docs on [release callback semantics](https://arrow.apache.org/docs/format/CDataInterface.html#release-callback-semantics-for-consumers)._

The main (only?) limitation seems to be:
- No device support (@kkraus14 will bring this up on the Arrow dev mailing list)

Note that categoricals are supported; Arrow uses the phrasing
"dictionary-encoded types" for categoricals.
`__dataframe__` is _not_ analogous to `__array__`, which is NumPy-specific.
`__array__` is a method attached to array/tensor-like objects, and calling it
is a request for the object to turn itself into a NumPy array. Hence, the
library that implements `__array__` must depend (optionally at least) on
NumPy, and call a NumPy `ndarray` constructor itself from within `__array__`.
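
For reference, a minimal toy implementation of `__array__` shows why that
dependency arises (`MyColumn` is a made-up class; `__array__(self, dtype=None)`
is the real NumPy protocol signature):

```python
import numpy as np

class MyColumn:
    """Toy array-like; implementing __array__ requires NumPy to be importable."""

    def __init__(self, data):
        self._data = list(data)

    def __array__(self, dtype=None):
        # NumPy invokes this from np.asarray(obj) / np.array(obj); the
        # implementer constructs and returns the ndarray itself.
        return np.asarray(self._data, dtype=dtype)

print(np.asarray(MyColumn([1.0, 2.0, 3.0])))  # [1. 2. 3.]
```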


### What is wrong with `.to_numpy()` and `.to_arrow()`?