# Add a summary document for the dataframe interchange protocol (#30)

`protocol/dataframe_protocol_summary.md` (34 additions, 21 deletions)
_[...] requirements/principles and functionality it needs to support._

## Purpose of `__dataframe__`

The purpose of `__dataframe__` is to be a _data interchange_ protocol. That
is, a way to convert one type of dataframe into another type (for example,
convert a Koalas dataframe into a Pandas dataframe, or a cuDF dataframe into
a Vaex dataframe).

Currently (Nov'20) there is no way to do this in an implementation-independent way.

The main use case this protocol intends to enable is to make it possible to
write code that can accept any type of dataframe instead of being tied to a
single type of dataframe. To illustrate that:

```python
import pandas as pd

def somefunc(df, *args, **kwargs):
    """`df` can be any dataframe supporting the protocol, rather than (say)
    only a pandas.DataFrame"""
    # could also be `cudf.from_dataframe(df)`, or `vaex.from_dataframe(df)`
    # note: this should raise a TypeError if it cannot be done without a device
    # transfer (e.g. move data from GPU to CPU) - pass `force=True` in that case
    new_pandas_df = pd.from_dataframe(df)
    # From now on, use Pandas dataframe internally
```
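
To make the dispatch concrete, here is a minimal sketch of what a
consumer-side `from_dataframe` might look like. Everything in it is an
assumption for illustration only: the `__dataframe__` call signature, the
`force` keyword, and the mapping-style column access are not settled API.

```python
import pandas as pd

def from_dataframe(df, force=False):
    """Hypothetical consumer-side entry point (not a real pandas function)."""
    if isinstance(df, pd.DataFrame):
        return df  # already the native type; nothing to convert
    if not hasattr(df, "__dataframe__"):
        raise TypeError(f"{type(df).__name__} does not support __dataframe__")
    # The producer is expected to raise TypeError if the export would need a
    # device transfer (e.g. GPU -> CPU) and force is False.
    exchange_df = df.__dataframe__(force=force)
    # Assume mapping-style access to 1-D column data here; a real
    # implementation would work from dtype and buffer descriptions instead.
    data = {name: exchange_df[name] for name in exchange_df.column_names()}
    return pd.DataFrame(data)
```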

_[...] this is a consequence, and that that should be acceptable to them._

## Protocol design requirements

1. Must be a standard Python-level API that is unambiguously specified, and
   not rely on implementation details of any particular dataframe library.
2. Must treat dataframes as a collection of columns (which are 1-D arrays
   with a dtype and missing data support).
   _Note: this relates to the API for `__dataframe__`, and does not imply
   that the underlying implementation must use columnar storage!_
3. Must allow the consumer to select a specific set of columns for conversion
   (see the sketch after this list).
4. Must allow the consumer to access the following "metadata" of the dataframe:
   number of rows, number of columns, column names, column data types.
   TBD: column data types weren't clearly decided on, nor are they present in
   https://github.com/wesm/dataframe-protocol
5. Must include device support.
6. Must avoid device transfers by default (e.g. copy data from GPU to CPU),
   and provide an explicit way to force such transfers (e.g. a `force=` or
   `copy=` keyword that the caller can set to `True`).
7. Must be zero-copy if possible.
8. Must be able to support "virtual columns" (e.g., a library like Vaex, which
   may not have data in memory because it uses lazy evaluation).
9. Must support missing values (`NA`) for all supported dtypes.
10. Must support string and categorical dtypes.
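
To make requirements 3 and 4 concrete, here is a minimal sketch of a
producer-side exchange object. All names (`ProtocolDataFrame`, `num_rows`,
`select_columns`, ...) are invented for illustration and are not part of any
agreed API; an object shaped like this would also work with the hypothetical
`from_dataframe` sketch shown earlier.

```python
class ProtocolDataFrame:
    """Toy producer-side exchange object (all names hypothetical)."""

    def __init__(self, columns):
        # columns: dict mapping column name -> 1-D list of values
        self._columns = dict(columns)

    # Metadata access (requirement 4).
    def num_rows(self):
        return len(next(iter(self._columns.values()), []))

    def num_columns(self):
        return len(self._columns)

    def column_names(self):
        return list(self._columns)

    def column_dtypes(self):
        # Placeholder: a real protocol would use a well-specified dtype
        # description, not Python types.
        return {n: type(c[0]) if c else None for n, c in self._columns.items()}

    # Column selection (requirement 3).
    def select_columns(self, names):
        return ProtocolDataFrame({n: self._columns[n] for n in names})

    def __getitem__(self, name):
        return self._columns[name]

pdf = ProtocolDataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
assert pdf.num_rows() == 3
assert pdf.select_columns(["a"]).column_names() == ["a"]
```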

We'll also list some things that were discussed but are not requirements:

1. Object dtype does not need to be supported.
2. Nested/structured dtypes within a single column do not need to be
   supported.
   _Rationale: not used a lot, additional design complexity not justified.
   May be added in the future; nested dtypes do have support in the Arrow C
   Data Interface (see the short sketch after this list)._
3. Extension dtypes do not need to be supported.
   _Rationale: same as (2)._
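
For context on point (2), this is what a nested dtype looks like in Arrow,
sketched here with the `pyarrow` bindings purely as an illustration:

```python
import pyarrow as pa

# A single column holding (x, y) pairs as a nested/struct dtype.
struct_type = pa.struct([("x", pa.float64()), ("y", pa.float64())])
col = pa.array([{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}], type=struct_type)
print(col.type)  # struct<x: double, y: double>
```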


## Frequently asked questions
What we are aiming for is quite similar to the Arrow C Data Interface (see
the [rationale for the Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html#rationale)),
except `__dataframe__` is a Python-level rather than C-level interface.
_TODO: one key thing is that the Arrow C Data Interface relies on providing a
deletion/finalization method similar to DLPack. The desired semantics here
need to be ironed out. See the Arrow docs on [release callback semantics](https://arrow.apache.org/docs/format/CDataInterface.html#release-callback-semantics-for-consumers)._

The main (only?) limitation seems to be:
- No device support (@kkraus14 will bring this up on the Arrow dev mailing list)

Note that categoricals are supported; Arrow uses the phrasing
"dictionary-encoded types" for categoricals.
`__dataframe__` is _not_ analogous to `__array__`, which is NumPy-specific.
`__array__` is a method attached to array/tensor-like objects, and calling it
is a request for the object to turn itself into a NumPy array. Hence, the
library that implements `__array__` must depend (optionally at least) on
NumPy, and call a NumPy `ndarray` constructor itself from within `__array__`.
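
For reference, a minimal toy implementation of `__array__` shows why that
dependency arises (`MyColumn` is a made-up class; `__array__(self, dtype=None)`
is the real NumPy protocol signature):

```python
import numpy as np

class MyColumn:
    """Toy array-like; implementing __array__ requires NumPy to be importable."""

    def __init__(self, data):
        self._data = list(data)

    def __array__(self, dtype=None):
        # NumPy invokes this from np.asarray(obj) / np.array(obj); the
        # implementer constructs and returns the ndarray itself.
        return np.asarray(self._data, dtype=dtype)

print(np.asarray(MyColumn([1.0, 2.0, 3.0])))  # [1. 2. 3.]
```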


### What is wrong with `.to_numpy()` and `.to_arrow()`?