Commit c7575c1

Add TBD notes on dataframe-array connection and from_dataframe
Also add more details on the Arrow C Data Interface.
1 parent 183851d commit c7575c1

1 file changed: +31 -6 lines changed

protocol/dataframe_protocol_summary.md

Lines changed: 31 additions & 6 deletions
@@ -106,7 +106,7 @@ this is a consequence, and that that should be acceptable to them.
 `copy=` keyword that the caller can set to `True`).
 7. Must be zero-copy if possible.
 8. Must support missing values (`NA`) for all supported dtypes.
-9. Must supports string and categorical dtypes.
+9. Must supports string, categorical and datetime dtypes.
 10. Must allow the consumer to inspect the representation for missing values
 that the producer uses for each column or data type.
 _Rationale: this enables the consumer to control how conversion happens,
@@ -145,6 +145,24 @@ We'll also list some things that were discussed but are not requirements:
 "programming to an interface"; this data interchange protocol is
 fundamentally built around describing data in memory_.
 
+### To be decided
+
+_The connection between dataframe and array interchange protocols_. If we
+treat a dataframe as a set of 1-D arrays, it may be expected that there is a
+connection to be made with the array data interchange method. The array
+interchange is based on DLPack; its major limitation from the point of view
+of dataframes is the lack of support of all required data types (string,
+categorical, datetime) and missing data. A requirement could be added that
+`__dlpack__` should be supported in case the data types in a column are
+supported by DLPack. Missing data via a boolean mask as a separate array
+could also be supported.
+
+_Should there be a standard `from_dataframe` constructor function?_ This
+isn't completely necessary, however it's expected that a full dataframe API
+standard will have such a function. The array API standard also has such a
+function, namely `from_dlpack`. Adding at least a recommendation on syntax
+for this function would make sense, e.g., `from_dataframe(df, stream=None)`.
+
 
 ## Frequently asked questions
 
@@ -153,12 +171,13 @@ We'll also list some things that were discussed but are not requirements:
 What we are aiming for is quite similar to the Arrow C Data Interface (see
 the [rationale for the Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html#rationale)),
 except `__dataframe__` is a Python-level rather than C-level interface.
-_TODO: one key thing is Arrow C Data interface relies on providing a deletion
-/ finalization method similar to DLPack. The desired semantics here need to
-be ironed out. See Arrow docs on [release callback semantics](https://arrow.apache.org/docs/format/CDataInterface.html#release-callback-semantics-for-consumers)_
+The data types format specification of that interface is something that could be used unchanged.
 
-The main (only?) limitation seems to be:
-- No device support (@kkraus14 will bring this up on the Arrow dev mailing list)
+The main (only?) limitation seems to be that it does not have device support
+- @kkraus14 will bring this up on the Arrow dev mailing list. Also note that
+that interface only talks about arrays; dataframes, chunking and the metadata
+inspection can all be layered on top in this Python-level protocol, but are
+not discussed in the interface itself.
 
 Note that categoricals are supported, Arrow uses the phrasing
 "dictionary-encoded types" for categorical.
@@ -168,6 +187,12 @@ buffer protocol](https://docs.python.org/3/c-api/buffer.html), which is also
 a C-only and CPU-only interface. See `__array_interface__` below for a
 Python-level equivalent of the buffer protocol.
 
+Note that specifying the precise semantics for implementers (both producing
+and consuming libraries) will be important. The Arrow C Data interface relies
+on providing a deletion / finalization method similar to DLPack. The desired
+semantics here need to be ironed out. See Arrow docs on
+[release callback semantics](https://arrow.apache.org/docs/format/CDataInterface.html#release-callback-semantics-for-consumers)_
+
 
 ### Is `__dataframe__` analogous to `__array__` or `__array_interface__`?
 
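
Note on the "To be decided" items added in this commit: the suggested `from_dataframe(df, stream=None)` signature and the idea of carrying missing data as a separate boolean mask could look roughly like the sketch below. This is only an illustration under stated assumptions, not the protocol's design. The `Column` and `ProducerFrame` classes, the dict returned by `__dataframe__`, and the convention that `True` in the mask means "valid" are all hypothetical. The meaning of `stream` is assumed to be a device stream, by analogy with DLPack, so this CPU-only sketch rejects any non-`None` value.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Column:
    """Hypothetical column: a 1-D list of values plus an optional mask."""
    values: List[float]
    # Assumed convention: mask[i] is True where values[i] is valid.
    # None means the column has no missing data.
    mask: Optional[List[bool]] = None


class ProducerFrame:
    """Hypothetical producer that exposes `__dataframe__`."""

    def __init__(self, columns: Dict[str, Column]) -> None:
        self._columns = columns

    def __dataframe__(self) -> Dict[str, Column]:
        # A real interchange object would also describe dtypes, chunks
        # and devices; a plain dict keeps the sketch short.
        return self._columns


def from_dataframe(df, stream=None) -> Dict[str, list]:
    """Consumer-side constructor using the signature suggested in the notes."""
    if not hasattr(df, "__dataframe__"):
        raise TypeError("object does not implement __dataframe__")
    if stream is not None:
        # Assumption: `stream` refers to a device stream; not handled here.
        raise NotImplementedError("this sketch is CPU-only")

    result = {}
    for name, col in df.__dataframe__().items():
        if col.mask is None:
            result[name] = list(col.values)
        else:
            # Apply the separate boolean mask: invalid slots become None.
            result[name] = [
                value if valid else None
                for value, valid in zip(col.values, col.mask)
            ]
    return result
```

A consumer would call `from_dataframe(ProducerFrame({...}))` on any object exposing `__dataframe__`. Keeping the mask separate from the values means the values buffer itself could, in principle, still be exchanged through `__dlpack__` when its dtype is supported.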