From a1ddad8773fa26c011e926d69bf73d2b867af6e8 Mon Sep 17 00:00:00 2001
From: Ralf Gommers
Date: Tue, 15 Sep 2020 17:20:11 +0100
Subject: [PATCH 1/6] Add a summary document for the dataframe interchange
 protocol

Summarizes the various discussions about, and the goals/non-goals and
requirements for, the `__dataframe__` data interchange protocol.

The intended audience for this document is Consortium members and dataframe
library maintainers who may want to support this protocol.

@datapythonista will add a companion document that's a gentler
introduction/tutorial in a "from zero to a protocol" style.

The aim is to keep updating this until we have captured all the requirements
and answered all the FAQs, so that we can then design the actual protocol
and verify it meets all our requirements.

Closes gh-29
---
 protocol/dataframe_protocol_summary.md | 193 +++++++++++++++++++++++++
 1 file changed, 193 insertions(+)
 create mode 100644 protocol/dataframe_protocol_summary.md

diff --git a/protocol/dataframe_protocol_summary.md b/protocol/dataframe_protocol_summary.md
new file mode 100644
index 00000000..5132ef97
--- /dev/null
+++ b/protocol/dataframe_protocol_summary.md
@@ -0,0 +1,193 @@
+# `__dataframe__` protocol - summary
+
+_We've had a lot of discussion in a couple of GitHub issues and in meetings.
+This description attempts to summarize that, and extract the essential design
+requirements/principles and functionality it needs to support._
+
+## Purpose of `__dataframe__`
+
+The purpose of `__dataframe__` is to be a _data interchange_ protocol. I.e., a way to convert one type of dataframe into another type (for example, convert a Koalas dataframe into a Pandas dataframe, or a cuDF dataframe into a Vaex dataframe).
+
+Currently (Sep'20) there is no way to do this in an implementation-independent way.
+
+The main use case this protocol intends to enable is to make it possible to
+write code that can accept any type of dataframe instead of being tied to a
+single type of dataframe. To illustrate that:
+
+```python
+def somefunc(df, ...):
+    """`df` can be any dataframe supporting the protocol, rather than (say)
+    only a pandas.DataFrame"""
+    # could also be `cudf.from_dataframe(df)`, or `vaex.from_dataframe(df)`
+    df = pd.from_dataframe(df)
+    # From now on, use Pandas dataframe internally
+```
+
+### Non-goals
+
+Providing a _complete standardized dataframe API_ is not a goal of the
+`__dataframe__` protocol. Instead, this is a goal of the full dataframe API
+standard, which the Consortium for Python Data API Standards aims to provide
+in the future. When that full API standard is implemented by dataframe
+libraries, the example above can change to:
+
+```python
+def get_df_module(df):
+    """Utility function to support programming against a dataframe API"""
+    if hasattr(df, '__dataframe_namespace__'):
+        # Retrieve the namespace
+        pdx = df.__dataframe_namespace__()
+    else:
+        # Here we can raise an exception if we only want to support compliant dataframes,
+        # or convert to our default choice of dataframe if we want to accept (e.g.) dicts
+        pdx = pd
+        df = pd.DataFrame(df)
+
+    return pdx, df
+
+
+def somefunc(df, ...):
+    """`df` can be any dataframe conforming to the dataframe API standard"""
+    pdx, df = get_df_module(df)
+    # From now on, use `df` methods and `pdx` functions/objects
+```
+
+### Constraints
+
+An important constraint on the `__dataframe__` protocol is that it should not
+make the goal of providing a complete standardized dataframe API more
+difficult to achieve.
+
+There is a small concern here. Say that a library adopts `__dataframe__` first,
+and it goes from supporting only Pandas to officially supporting other
+dataframes like `modin.pandas.DataFrame`. At that point, changing to
+supporting the full dataframe API standard as a next step _implies a
+backwards compatibility break_ for users that have started relying on Modin
+dataframe support. E.g., the second transition will change `somefunc(df_modin)`
+from returning a Pandas dataframe to returning a Modin dataframe.
+It must be made very clear to libraries adopting `__dataframe__` that
+this is a consequence, and that this should be acceptable to them.
+
+
+### Progression / timeline
+
+- **Current status**: most dataframe-consuming libraries work _only_ with
+  Pandas, and rely on many Pandas-specific functions, methods and behavior.
+- **Status after `__dataframe__`**: with minor code changes (as in the first
+  example above), libraries can start supporting all conforming dataframes,
+  convert them to Pandas dataframes, and still rely on the same
+  Pandas-specific functions, methods and behavior.
+- **Status after standard dataframe API adoption**: libraries can start
+  supporting all conforming dataframes _without converting to Pandas or
+  relying on its implementation details_. At this point, it's possible to
+  "program to an interface" rather than to a specific library like Pandas.
+
+
+## Protocol design requirements
+
+1. Must be a standard API that is unambiguously specified, and not rely on
+   implementation details of any particular dataframe library.
+2. Must treat dataframes as a collection of columns (which are 1-D arrays
+   with a dtype and missing data support).
+3. Must include device support.
+4. Must avoid device transfers by default (e.g. copy data from GPU to CPU),
+   and provide an explicit way to force such transfers (e.g. a `force=` or
+   `copy=` keyword that the caller can set to `True`).
+5. Must be zero-copy if possible.
+6. Must be able to support "virtual columns" (e.g., a library like Vaex which
+   may not have data in memory because it uses lazy evaluation).
+7. Must support missing values (`NA`) for all supported dtypes.
+8. Must support string and categorical dtypes
+   (_TBD: not discussed a lot, is this a hard requirement?_)
+
+We'll also list some things that were discussed but are not requirements:
+
+1. Object dtype does not need to be supported (_TBD: this is what Joris said,
+   but doesn't Pandas use object dtype to represent strings?_).
+2. Heterogeneous/structured dtypes within a single column do not need to be
+   supported.
+   _Rationale: not used a lot, additional design complexity not justified._
+
+
+## Frequently asked questions
+
+### Can the Arrow C Data Interface be used for this?
+
+What we are aiming for is quite similar to the Arrow C Data Interface (see
+the [rationale for the Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html#rationale)),
+except `__dataframe__` is a Python-level rather than C-level interface.
+
+The limitations seem to be:
+- No device support (@kkraus14 will bring this up on the Arrow dev mailing list)
+- Specific to columnar data (_at least, this is what its docs say_).
+  TODO: are there any concerns for, e.g., Koalas or Ibis?
+
+Note that categoricals are supported; Arrow uses the phrasing
+"dictionary-encoded types" for categoricals.
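+
+_To make "Python-level interface" concrete, here is a minimal sketch of what
+a per-column description could look like - illustrative only, in the spirit
+of `__array_interface__` (discussed below); none of these keys are part of
+any agreed-upon design:_
+
+```python
+# Illustrative sketch only, not a proposed spec. The keys loosely mirror
+# __array_interface__ conventions, plus a hypothetical "device" entry --
+# the piece that CPU-only interfaces like the buffer protocol lack.
+column_description = {
+    "shape": (1000,),              # 1-D column with 1000 elements
+    "typestr": "<f8",              # little-endian float64
+    "data": (140234567890, True),  # (pointer as Python int, read-only flag)
+    "mask": None,                  # or a similar dict describing a validity buffer
+    "device": "cuda:0",            # hypothetical: where the memory lives
+}
+```
+
+_Whatever the eventual shape, the point is that such a description can stay
+at the Python level while still extending to non-CPU devices._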
+
+The Arrow C Data Interface says specifically it was inspired by [Python's
+buffer protocol](https://docs.python.org/3/c-api/buffer.html), which is also
+a C-only and CPU-only interface. See `__array_interface__` below for a
+Python-level equivalent of the buffer protocol.
+
+
+### Is `__dataframe__` analogous to `__array__` or `__array_interface__`?
+
+Yes, it is fairly analogous to `__array_interface__`. There will be some
+differences though; for example, `__array_interface__` doesn't know about
+devices, and it's a `dict` with a pointer to memory, so there's an assumption
+that the data lives in CPU memory (which may not be true, e.g. in the case of
+cuDF or Vaex).
+
+It is _not_ analogous to `__array__`, which is NumPy-specific. `__array__` is a
+method attached to array/tensor-like objects, and calling it is requesting
+the object it's attached to to turn itself into a NumPy array. Hence, the
+library that implements `__array__` must depend on NumPy, and call a NumPy
+`ndarray` constructor itself from within `__array__`.
+
+
+### What is wrong with `.to_numpy()` and `.to_arrow()`?
+
+Such methods ask the object they are attached to to turn itself into a NumPy
+or Arrow array, which means each library must have at least an optional
+dependency on NumPy and on Arrow if it implements those methods. This leads
+to unnecessary coupling between libraries, and hence is a suboptimal choice -
+we'd like to avoid this if we can.
+
+Instead, it should be dataframe consumers that rely on NumPy or Arrow, since
+they are the ones that need such a particular format. So, the consumer can
+call the constructor it needs. For example, `x = np.asarray(df['colname'])`
+(where `df` supports `__dataframe__`).
+
+
+### Does an interface describing memory work for virtual columns?
+
+Vaex is an example of a library that can have "virtual columns" (see
+@maartenbreddels'
+[comment here](https://github.com/data-apis/dataframe-api/issues/29#issuecomment-686373569)).
+If the protocol includes a description of data layout in memory, does that
+work for such a virtual column?
+
+Yes. Virtual columns need to be materialized in memory before they can be
+turned into a column for a different type of dataframe - that will be true
+for every discussed form of the protocol; whether there's a `to_arrow()` or
+something else does not matter. Vaex can choose _how_ to materialize (e.g.,
+to an Arrow array, a NumPy array, or a raw memory buffer) - as long as the
+returned description of memory layout is valid, all those options can later
+be turned into the desired column format without a data copy, so the
+implementation choice here really doesn't matter much.
+
+_Note: the above statement on materialization assumes that there are many
+forms in which a virtual column can be implemented, that those are all
+custom/different, and that at this point it makes little sense to standardize
+them. For example, one could implement virtual columns with a simple string
+DSL (`'col_C = col_A + col_B'`), with fancier C++-style lazy evaluation, with
+a computational graph approach like Dask uses, etc._
+
+
+## Possible direction for implementation
+
+The `cuDFDataFrame`, `cuDFColumn` and `cuDFBuffer` classes sketched out by
+@kkraus14 [here](https://github.com/data-apis/dataframe-api/issues/29#issuecomment-685123386)
+seem to be going in the right direction.
+
+TODO: work this out after making sure we're all on the same page regarding requirements.
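+
+_As a rough, non-normative illustration of the kind of three-level structure
+that comment describes - every name below is made up, and the actual design
+is exactly the TODO above:_
+
+```python
+# Rough sketch with placeholder names; not a proposed API.
+from typing import Dict, List, Optional
+
+
+class Buffer:
+    """A contiguous block of memory on a single device."""
+    def __init__(self, ptr: int, nbytes: int, device: str):
+        self.ptr = ptr          # raw pointer, as a Python int
+        self.nbytes = nbytes    # size of the block in bytes
+        self.device = device    # e.g. "cpu" or "cuda:0"
+
+
+class Column:
+    """A 1-D array: a data buffer, an optional validity buffer, a dtype."""
+    def __init__(self, data: Buffer, dtype: str, validity: Optional[Buffer] = None):
+        self.data = data
+        self.dtype = dtype        # e.g. "<f8"
+        self.validity = validity  # None means no missing values
+
+
+class DataFrame:
+    """An ordered collection of named columns."""
+    def __init__(self, columns: Dict[str, Column]):
+        self._columns = columns
+
+    def column_names(self) -> List[str]:
+        return list(self._columns)
+
+    def get_column(self, name: str) -> Column:
+        return self._columns[name]
+```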
From 9bf13db8a4a4c1ab5e9cc2ac191754bca7a91795 Mon Sep 17 00:00:00 2001
From: Ralf Gommers
Date: Thu, 17 Sep 2020 16:52:49 +0100
Subject: [PATCH 2/6] Process some review comments

---
 protocol/dataframe_protocol_summary.md | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/protocol/dataframe_protocol_summary.md b/protocol/dataframe_protocol_summary.md
index 5132ef97..4f50bff7 100644
--- a/protocol/dataframe_protocol_summary.md
+++ b/protocol/dataframe_protocol_summary.md
@@ -19,7 +19,9 @@ def somefunc(df, ...):
     """`df` can be any dataframe supporting the protocol, rather than (say)
     only a pandas.DataFrame"""
     # could also be `cudf.from_dataframe(df)`, or `vaex.from_dataframe(df)`
-    df = pd.from_dataframe(df)
+    # note: this should throw a TypeError if it cannot be done without a device
+    # transfer (e.g. moving data from GPU to CPU) - add `force=True` in that case
+    new_pandas_df = pd.from_dataframe(df)
     # From now on, use Pandas dataframe internally
 ```
@@ -98,12 +100,10 @@
    may not have data in memory because it uses lazy evaluation).
 7. Must support missing values (`NA`) for all supported dtypes.
 8. Must support string and categorical dtypes
-   (_TBD: not discussed a lot, is this a hard requirement?_)

 We'll also list some things that were discussed but are not requirements:

-1. Object dtype does not need to be supported (_TBD: this is what Joris said,
-   but doesn't Pandas use object dtype to represent strings?_).
+1. Object dtype does not need to be supported.
 2. Heterogeneous/structured dtypes within a single column do not need to be
    supported.
    _Rationale: not used a lot, additional design complexity not justified._
@@ -117,10 +117,8 @@ What we are aiming for is quite similar to the Arrow C Data Interface (see
 the [rationale for the Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html#rationale)),
 except `__dataframe__` is a Python-level rather than C-level interface.

-The limitations seem to be:
+The main (only?) limitation seems to be:
 - No device support (@kkraus14 will bring this up on the Arrow dev mailing list)
-- Specific to columnar data (_at least, this is what its docs say_).
-  TODO: are there any concerns for, e.g., Koalas or Ibis?

 Note that categoricals are supported; Arrow uses the phrasing
 "dictionary-encoded types" for categoricals.

From 220388d7b41344c16c41fef79e16c8da6885f5d2 Mon Sep 17 00:00:00 2001
From: Ralf Gommers
Date: Thu, 5 Nov 2020 15:00:14 +0000
Subject: [PATCH 3/6] Process a few more review comments.

---
 protocol/dataframe_protocol_summary.md | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/protocol/dataframe_protocol_summary.md b/protocol/dataframe_protocol_summary.md
index 4f50bff7..d1ce9d19 100644
--- a/protocol/dataframe_protocol_summary.md
+++ b/protocol/dataframe_protocol_summary.md
@@ -6,9 +6,12 @@ requirements/principles and functionality it needs to support._

 ## Purpose of `__dataframe__`

-The purpose of `__dataframe__` is to be a _data interchange_ protocol. I.e., a way to convert one type of dataframe into another type (for example, convert a Koalas dataframe into a Pandas dataframe, or a cuDF dataframe into a Vaex dataframe).
+The purpose of `__dataframe__` is to be a _data interchange_ protocol. I.e.,
+a way to convert one type of dataframe into another type (for example,
+convert a Koalas dataframe into a Pandas dataframe, or a cuDF dataframe into
+a Vaex dataframe).
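+
+To illustrate with a hypothetical sketch (no such `from_dataframe` function
+exists yet, and the `force=` behaviour is just the review note added to the
+example above, not a settled design):
+
+```python
+import pandas as pd
+
+def to_pandas(df):
+    """`df` may be a Koalas, cuDF, Vaex, ... dataframe -- anything
+    implementing `__dataframe__`. All names here are placeholders."""
+    try:
+        # should succeed only when no device transfer is needed
+        return pd.from_dataframe(df)
+    except TypeError:
+        # e.g. `df` lives on a GPU; explicitly opt in to the transfer
+        return pd.from_dataframe(df, force=True)
+```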

-Currently (Sep'20) there is no way to do this in an implementation-independent way.
+Currently (Nov'20) there is no way to do this in an implementation-independent way.

 The main use case this protocol intends to enable is to make it possible to
 write code that can accept any type of dataframe instead of being tied to a
@@ -87,10 +90,12 @@ this is a consequence, and that this should be acceptable to them.

 ## Protocol design requirements

-1. Must be a standard API that is unambiguously specified, and not rely on
-   implementation details of any particular dataframe library.
+1. Must be a standard Python-level API that is unambiguously specified, and
+   not rely on implementation details of any particular dataframe library.
 2. Must treat dataframes as a collection of columns (which are 1-D arrays
    with a dtype and missing data support).
+   _Note: this relates to the API for `__dataframe__`, and does not imply
+   that the underlying implementation must use columnar storage!_
 3. Must include device support.
 4. Must avoid device transfers by default (e.g. copy data from GPU to CPU),
    and provide an explicit way to force such transfers (e.g. a `force=` or
@@ -116,6 +121,9 @@ We'll also list some things that were discussed but are not requirements:
 What we are aiming for is quite similar to the Arrow C Data Interface (see
 the [rationale for the Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html#rationale)),
 except `__dataframe__` is a Python-level rather than C-level interface.
+_TODO: one key thing is that the Arrow C Data Interface relies on providing a
+deletion / finalization method similar to DLPack. The desired semantics here
+need to be ironed out._

 The main (only?) limitation seems to be:
 - No device support (@kkraus14 will bring this up on the Arrow dev mailing list)
@@ -140,8 +148,8 @@ cuDF or Vaex).
 It is _not_ analogous to `__array__`, which is NumPy-specific. `__array__` is a
 method attached to array/tensor-like objects, and calling it is requesting
 the object it's attached to to turn itself into a NumPy array. Hence, the
-library that implements `__array__` must depend on NumPy, and call a NumPy
-`ndarray` constructor itself from within `__array__`.
+library that implements `__array__` must depend (optionally at least) on
+NumPy, and call a NumPy `ndarray` constructor itself from within `__array__`.


 ### What is wrong with `.to_numpy()` and `.to_arrow()`?

From 645b26ce93c39b0317a08dd06a7e41ebd8824145 Mon Sep 17 00:00:00 2001
From: Ralf Gommers
Date: Thu, 5 Nov 2020 15:17:54 +0000
Subject: [PATCH 4/6] Link to Release callback semantics in Arrow C Data
 Interface docs

---
 protocol/dataframe_protocol_summary.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/protocol/dataframe_protocol_summary.md b/protocol/dataframe_protocol_summary.md
index d1ce9d19..b7310ae2 100644
--- a/protocol/dataframe_protocol_summary.md
+++ b/protocol/dataframe_protocol_summary.md
@@ -123,7 +123,7 @@ the [rationale for the Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html#rationale)),
 except `__dataframe__` is a Python-level rather than C-level interface.
 _TODO: one key thing is that the Arrow C Data Interface relies on providing a
 deletion / finalization method similar to DLPack. The desired semantics here
-need to be ironed out._
+need to be ironed out. See the Arrow docs on [release callback semantics](https://arrow.apache.org/docs/format/CDataInterface.html#release-callback-semantics-for-consumers)._

 The main (only?) limitation seems to be:
 - No device support (@kkraus14 will bring this up on the Arrow dev mailing list)

From 56103326cea9f4ec53af46cf6f5127d5b364f7ad Mon Sep 17 00:00:00 2001
From: Ralf Gommers
Date: Thu, 5 Nov 2020 15:35:37 +0000
Subject: [PATCH 5/6] Add design requirements for column selection and df
 metadata

---
 protocol/dataframe_protocol_summary.md | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/protocol/dataframe_protocol_summary.md b/protocol/dataframe_protocol_summary.md
index b7310ae2..925d49c3 100644
--- a/protocol/dataframe_protocol_summary.md
+++ b/protocol/dataframe_protocol_summary.md
@@ -96,15 +96,19 @@ this is a consequence, and that this should be acceptable to them.
    with a dtype and missing data support).
    _Note: this relates to the API for `__dataframe__`, and does not imply
    that the underlying implementation must use columnar storage!_
-3. Must include device support.
-4. Must avoid device transfers by default (e.g. copy data from GPU to CPU),
+3. Must allow the consumer to select a specific set of columns for conversion.
+4. Must allow the consumer to access the following "metadata" of the dataframe:
+   number of rows, number of columns, column names, column data types.
+   TBD: column data types weren't clearly decided on, nor are they present in https://github.com/wesm/dataframe-protocol
+5. Must include device support.
+6. Must avoid device transfers by default (e.g. copy data from GPU to CPU),
    and provide an explicit way to force such transfers (e.g. a `force=` or
    `copy=` keyword that the caller can set to `True`).
-5. Must be zero-copy if possible.
-6. Must be able to support "virtual columns" (e.g., a library like Vaex which
+7. Must be zero-copy if possible.
+8. Must be able to support "virtual columns" (e.g., a library like Vaex which
    may not have data in memory because it uses lazy evaluation).
-7. Must support missing values (`NA`) for all supported dtypes.
-8. Must support string and categorical dtypes
+9. Must support missing values (`NA`) for all supported dtypes.
+10. Must support string and categorical dtypes

 We'll also list some things that were discussed but are not requirements:

From 4a8e6dd1aa9bca64822bf8756cfb0911d455ab85 Mon Sep 17 00:00:00 2001
From: Ralf Gommers
Date: Thu, 5 Nov 2020 15:38:06 +0000
Subject: [PATCH 6/6] Edit the nested/heterogeneous dtypes non-requirement

---
 protocol/dataframe_protocol_summary.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/protocol/dataframe_protocol_summary.md b/protocol/dataframe_protocol_summary.md
index 925d49c3..708a8f17 100644
--- a/protocol/dataframe_protocol_summary.md
+++ b/protocol/dataframe_protocol_summary.md
@@ -113,9 +113,12 @@
 We'll also list some things that were discussed but are not requirements:

 1. Object dtype does not need to be supported.
-2. Heterogeneous/structured dtypes within a single column do not need to be
+2. Nested/structured dtypes within a single column do not need to be
    supported.
-   _Rationale: not used a lot, additional design complexity not justified._
+   _Rationale: not used a lot, additional design complexity not justified.
+   May be added in the future (nested dtypes do have support in the Arrow C Data Interface)._
+3. Extension dtypes do not need to be supported.
+   _Rationale: same as (2)._


 ## Frequently asked questions
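+
+_To make the column-selection and metadata requirements (3 and 4 in the
+updated list above) concrete, here is a hedged sketch of consumer-side code.
+Every method name below is a placeholder; none of this API is decided:_
+
+```python
+# Placeholder names throughout -- a sketch of requirements 3 and 4, not a spec.
+def summarize_and_select(df):
+    """`df` is whatever object a library's `__dataframe__` returns."""
+    print(df.num_rows(), df.num_columns())  # metadata access (requirement 4)
+    for name in df.column_names():          # column names and dtypes
+        print(name, df.get_column(name).dtype)
+    # hand only the needed columns to the conversion step (requirement 3)
+    return df.select_columns(["some_col", "other_col"])
+```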