protocol/dataframe_protocol_summary.md
31 additions & 6 deletions
@@ -106,7 +106,7 @@ this is a consequence, and that that should be acceptable to them.
 `copy=` keyword that the caller can set to `True`).
 7. Must be zero-copy if possible.
 8. Must support missing values (`NA`) for all supported dtypes.
-9. Must supports string and categorical dtypes.
+9. Must supports string, categorical and datetime dtypes.
 10. Must allow the consumer to inspect the representation for missing values
 that the producer uses for each column or data type.
 _Rationale: this enables the consumer to control how conversion happens,
@@ -145,6 +145,24 @@ We'll also list some things that were discussed but are not requirements:
 "programming to an interface"; this data interchange protocol is
 fundamentally built around describing data in memory_.

+### To be decided
+
+_The connection between dataframe and array interchange protocols_. If we
+treat a dataframe as a set of 1-D arrays, it may be expected that there is a
+connection to be made with the array data interchange method. The array
+interchange is based on DLPack; its major limitation from the point of view
+of dataframes is the lack of support of all required data types (string,
+categorical, datetime) and missing data. A requirement could be added that
+`__dlpack__` should be supported in case the data types in a column are
+supported by DLPack. Missing data via a boolean mask as a separate array
+could also be supported.
+
+_Should there be a standard `from_dataframe` constructor function?_ This
+isn't completely necessary, however it's expected that a full dataframe API
+standard will have such a function. The array API standard also has such a
+function, namely `from_dlpack`. Adding at least a recommendation on syntax
+for this function would make sense, e.g., `from_dataframe(df, stream=None)`.
+

 ## Frequently asked questions

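For readers of this hunk, the suggested `from_dataframe(df, stream=None)` constructor could look roughly like the sketch below. This is only an illustration under stated assumptions: `ToyFrame` is a hypothetical consumer/producer class, `__dataframe__` is assumed to take no arguments, and a plain dict stands in for whatever interchange object the protocol ends up defining. The `stream` keyword is accepted only to mirror the proposed signature.

```python
class ToyFrame:
    """Hypothetical dataframe type used only for this sketch."""

    def __init__(self, columns):
        # Column name -> list of values; None marks a missing value here.
        self._columns = dict(columns)

    def __dataframe__(self):
        # Assumption: a real producer would return a protocol-defined
        # interchange object. A plain dict stands in for it in this sketch.
        return self._columns


def from_dataframe(df, stream=None):
    """Build a ToyFrame from any object exposing `__dataframe__`."""
    if isinstance(df, ToyFrame):
        # Already the consumer's own type, so nothing to convert.
        return df
    if not hasattr(df, "__dataframe__"):
        raise TypeError("`df` does not support the __dataframe__ protocol")
    # `stream` is unused in this CPU-only sketch; it is kept so the
    # signature matches the one suggested in the text above.
    return ToyFrame(df.__dataframe__())
```

The point of the sketch is the dispatch: the consumer never inspects the producer's concrete type, only whether it implements the protocol method.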
@@ -153,12 +171,13 @@ We'll also list some things that were discussed but are not requirements:
 What we are aiming for is quite similar to the Arrow C Data Interface (see
 the [rationale for the Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html#rationale)),
 except `__dataframe__` is a Python-level rather than C-level interface.
-_TODO: one key thing is Arrow C Data interface relies on providing a deletion
-/ finalization method similar to DLPack. The desired semantics here need to
-be ironed out. See Arrow docs on [release callback semantics](https://arrow.apache.org/docs/format/CDataInterface.html#release-callback-semantics-for-consumers)_
+The data types format specification of that interface is something that could be used unchanged.

-The main (only?) limitation seems to be:
-- No device support (@kkraus14 will bring this up on the Arrow dev mailing list)
+The main (only?) limitation seems to be that it does not have device support
+- @kkraus14 will bring this up on the Arrow dev mailing list. Also note that
+that interface only talks about arrays; dataframes, chunking and the metadata
+inspection can all be layered on top in this Python-level protocol, but are
+not discussed in the interface itself.

 Note that categoricals are supported, Arrow uses the phrasing
 "dictionary-encoded types" for categorical.
@@ -168,6 +187,12 @@ buffer protocol](https://docs.python.org/3/c-api/buffer.html), which is also
 a C-only and CPU-only interface. See `__array_interface__` below for a
 Python-level equivalent of the buffer protocol.

+Note that specifying the precise semantics for implementers (both producing
+and consuming libraries) will be important. The Arrow C Data interface relies
+on providing a deletion / finalization method similar to DLPack. The desired
+semantics here need to be ironed out. See Arrow docs on
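The FAQ hunk at `-153,12 +171,13` says the data types format specification of the Arrow C Data Interface could be used unchanged. As a rough illustration of what that would mean for a consumer, the sketch below decodes a small subset of Arrow format strings into readable dtype names. The helper name `describe_format` and the output spelling are assumptions made for this example, and the table covers only a few primitive, string and timestamp formats, not the full specification.

```python
# Subset of Arrow C Data Interface format strings (primitive and string types).
PRIMITIVE_FORMATS = {
    "b": "bool",
    "c": "int8", "C": "uint8",
    "s": "int16", "S": "uint16",
    "i": "int32", "I": "uint32",
    "l": "int64", "L": "uint64",
    "e": "float16", "f": "float32", "g": "float64",
    "u": "string", "U": "large_string",
}

# Timestamp formats look like "ts<unit>:<timezone>", e.g. "tsn:UTC".
TIMESTAMP_UNITS = {"s": "s", "m": "ms", "u": "us", "n": "ns"}


def describe_format(fmt):
    """Return a readable name for a (subset of) Arrow format strings."""
    if fmt in PRIMITIVE_FORMATS:
        return PRIMITIVE_FORMATS[fmt]
    if (
        fmt.startswith("ts")
        and len(fmt) >= 4
        and fmt[2] in TIMESTAMP_UNITS
        and fmt[3] == ":"
    ):
        unit = TIMESTAMP_UNITS[fmt[2]]
        timezone = fmt[4:]  # An empty timezone means a timezone-naive timestamp.
        if timezone:
            return f"timestamp[{unit}, tz={timezone}]"
        return f"timestamp[{unit}]"
    raise ValueError(f"format string not covered by this sketch: {fmt!r}")
```

A consumer could call `describe_format` on each column's format string to decide how to convert it. Dictionary-encoded (categorical) columns are not covered here, because Arrow describes them through the index type's format string plus a separate dictionary array.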