
Commit bd2975a

Add a prototype of the dataframe interchange protocol
Related to requirements in gh-35. TBD (to be discussed) comments and design decisions at the top of the file indicate topics for closer review/discussion.
1 parent 6af8c2a commit bd2975a

1 file changed, 382 additions, 0 deletions

protocol/dataframe_protocol.py

@@ -0,0 +1,382 @@
"""
Specification for objects to be accessed, for the purpose of dataframe
interchange between libraries, via the ``__dataframe__`` method on a library's
data frame object.

For guiding requirements, see https://github.com/data-apis/dataframe-api/pull/35

Design decisions
----------------

**1. Use a separate column abstraction in addition to a dataframe interface.**

Rationales:
- This is how it works in R, Julia and Apache Arrow.
- Semantically, most existing applications and users treat a column similarly
  to a 1-D array.
- We should be able to connect a column to the array data interchange
  mechanism(s).

Note that this does not imply that a library must have such a public
user-facing abstraction (e.g. ``pandas.Series``) - it may be accessible only
via ``__dataframe__``.

**2. Use methods and properties on an opaque object rather than returning
hierarchical dictionaries describing memory.**

This is better for implementations that may rely on, for example, lazy
computation. Another small detail: plain attributes cannot be checked for
(e.g. with ``hasattr``) without side effects.

**3. No row names. If a library uses row names, use a regular column for them.**

See discussion at https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241
Optional row names are not a good idea, because people will assume they're
present (see the cuDF experience: cuDF was forced to add them because pandas
has them). Requiring row names seems worse than leaving them out.
"""

from __future__ import annotations

from typing import Any, Dict, Iterable, Optional, Sequence, Tuple


class Buffer:
    """
    Data in the buffer is guaranteed to be contiguous in memory.
    """

    @property
    def bufsize(self) -> int:
        """
        Buffer size in bytes.
        """
        pass

    @property
    def ptr(self) -> int:
        """
        Pointer to the start of the buffer as an integer.
        """
        pass

    def __dlpack__(self):
        """
        Produce a DLPack capsule (see the array API standard).

        Raises:

        - TypeError : if the buffer contains unsupported dtypes.
        - NotImplementedError : if DLPack support is not implemented.

        Useful to have in order to connect to array libraries. Support is
        optional because it's not completely trivial to implement for a
        Python-only library.
        """
        raise NotImplementedError("__dlpack__")

    def __array_interface__(self):
        """
        TBD: implement or not? Will work for all dtypes except bit masks.
        """
        raise NotImplementedError("__array_interface__")
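

# A minimal consumer-side sketch, not part of the spec itself: wrapping a CPU
# Buffer's raw memory in a NumPy array of bytes. Assumes NumPy is available
# and the buffer lives in host memory; the helper name
# `_buffer_to_bytes_ndarray` is hypothetical.
def _buffer_to_bytes_ndarray(buf: "Buffer"):
    import ctypes

    import numpy as np

    # View ``bufsize`` bytes starting at ``ptr`` without copying.
    raw = (ctypes.c_uint8 * buf.bufsize).from_address(buf.ptr)
    return np.frombuffer(raw, dtype=np.uint8)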


class Column:
    """
    A column object, with only the methods and properties required by the
    interchange protocol defined.

    A column can contain one or more chunks. Each chunk can contain either one
    or two buffers - one data buffer and (depending on the null
    representation) it may have a mask buffer.

    TBD: Arrow has a separate "null" dtype, and has no separate mask concept.
         Instead, it seems to use "children" for both columns with a bit mask,
         and for nested dtypes. Unclear whether this is elegant or confusing.
         This design requires checking the null representation explicitly.

         The Arrow design requires checking:
         1. the ARROW_FLAG_NULLABLE (for sentinel values)
         2. if a column has two children, combined with one of those children
            having a null dtype.

         Making the mask concept explicit seems useful. One null dtype would
         not be enough to cover both bit and byte masks, so that would mean
         even more checking if we did it the Arrow way.

    TBD: there's also the "chunk" concept here, which is implicit in Arrow as
         multiple buffers per array (= column here). Semantically it may make
         sense to have both: chunks were meant, for example, for lazy
         evaluation of data which doesn't fit in memory, while multiple
         buffers per column could also come from doing a selection operation
         on a single contiguous buffer.

         Given these concepts, one would expect chunks to be all of the same
         size (say a 10,000 row dataframe could have 10 chunks of 1,000 rows),
         while multiple buffers could have data-dependent lengths. Not an
         issue in pandas if one column is backed by a single NumPy array, but
         in Arrow it seems possible.
         Are multiple chunks *and* multiple buffers per column necessary for
         the purposes of this interchange protocol, or must producers either
         reuse the chunk concept for this or copy the data?

    Note: this Column object can only be produced by ``__dataframe__``, so it
          doesn't need its own version or ``__column__`` protocol.

    """
    @property
    def name(self) -> str:
        pass

    @property
    def size(self) -> Optional[int]:
        """
        Size of the column, in elements.

        Corresponds to ``DataFrame.num_rows()`` if the column is a single
        chunk; equal to the size of the current chunk otherwise.
        """
        pass

    @property
    def offset(self) -> int:
        """
        Offset of the first element.

        May be > 0 if using chunks; for example for a column with N chunks of
        equal size M (only the last chunk may be shorter),
        ``offset = n * M``, ``n = 0 .. N-1``.
        """
        pass

    @property
    def dtype(self) -> Tuple[int, int, str, str]:
        """
        Dtype description as a tuple ``(kind, bit-width, format string,
        endianness)``.

        Kind :

        - 0 : signed integer
        - 1 : unsigned integer
        - 2 : IEEE floating point
        - 20 : boolean
        - 21 : string (UTF-8)
        - 22 : datetime
        - 23 : categorical

        Bit-width : the number of bits as an integer
        Format string : data type description format string in Apache Arrow C
                        Data Interface format.
        Endianness : currently only native endianness (``=``) is supported

        Notes:

        - Kind specifiers are aligned with DLPack where possible (hence the
          jump to 20, leaving enough room for future extension).
        - Masks must be specified as boolean with either bit width 1 (for bit
          masks) or 8 (for byte masks).
        - Dtype width in bits was preferred over bytes.
        - Endianness isn't too useful, but is included now in case we need to
          support non-native endianness in the future.
        - Went with Apache Arrow format strings over NumPy format strings
          because they're more complete from a dataframe perspective.
        - Format strings are mostly useful for datetime specification, and
          for categoricals.
        - For categoricals, the format string describes the type of the
          categorical in the data buffer. In case of a separate encoding of
          the categorical (e.g. an integer to string mapping), this can
          be derived from ``self.describe_categorical``.
        - Data types not included: complex, Arrow-style null, binary, decimal,
          and nested (list, struct, map, union) dtypes.
        """
        pass
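
    # Illustrative dtype tuples under the scheme above (the format strings
    # come from the Arrow C Data Interface; the examples are assumptions,
    # not normative):
    #
    #     (0, 64, "l", "=")    # signed 64-bit integer
    #     (2, 64, "g", "=")    # IEEE float64
    #     (20, 1, "b", "=")    # boolean bit mask (bit width 1)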

    @property
    def describe_categorical(self) -> Dict[str, Any]:
        """
        If the dtype is categorical, there are two options:

        - There are only values in the data buffer.
        - There is a separate dictionary-style encoding for categorical
          values.

        Raises RuntimeError if the dtype is not categorical.

        Content of the returned dict:

        - "is_ordered" : bool, whether the ordering of dictionary indices is
                         semantically meaningful.
        - "is_dictionary" : bool, whether a dictionary-style mapping of
                            categorical values to other objects exists.
        - "mapping" : dict, Python-level only (e.g. ``{int: str}``).
                      None if not a dictionary-style categorical.

        TBD: are there any other in-memory representations that are needed?
        """
        pass
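
    # An illustrative return value for a dictionary-encoded categorical,
    # following the docstring above (the category labels are made up):
    #
    #     {
    #         "is_ordered": False,
    #         "is_dictionary": True,
    #         "mapping": {0: "apple", 1: "banana"},
    #     }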

    @property
    def describe_null(self) -> Tuple[int, Any]:
        """
        Return the missing value (or "null") representation the column dtype
        uses, as a tuple ``(kind, value)``.

        Kind:

        - 0 : NaN/NaT
        - 1 : sentinel value
        - 2 : bit mask
        - 3 : byte mask

        Value : if kind is "sentinel value", the actual value. None otherwise.
        """
        pass
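
    # Illustrative ``(kind, value)`` pairs, consistent with the docstring
    # above (assumptions, not normative):
    #
    #     (0, None)   # float column using NaN for missing values
    #     (1, -1)     # integer column using -1 as a sentinel
    #     (2, None)   # validity bit mask, as in Arrow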

    @property
    def null_count(self) -> Optional[int]:
        """
        Number of null elements, if known.

        Note: Arrow uses -1 to indicate "unknown", but None seems cleaner.
        """
        pass

    def num_chunks(self) -> int:
        """
        Return the number of chunks the column consists of.
        """
        pass

    def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable[Column]:
        """
        Return an iterator yielding the chunks.

        See ``DataFrame.get_chunks`` for details on ``n_chunks``.
        """
        pass

    def get_buffer(self) -> Buffer:
        """
        Return the buffer containing the data.
        """
        pass

    def get_mask(self) -> Buffer:
        """
        Return the buffer containing the mask values indicating missing data.

        Raises RuntimeError if the null representation is not a bit or byte
        mask.
        """
        pass

    # # NOTE: not needed unless one considers nested dtypes
    # def get_children(self) -> Iterable[Column]:
    #     """
    #     Children columns underneath the column; each object in this iterator
    #     must adhere to the column specification.
    #     """
    #     pass
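

# A consumer-side sketch, not part of the spec itself: dispatching on
# ``describe_null`` to locate missing values in a Column ``col`` (any
# implementation of the interface above):
#
#     kind, value = col.describe_null
#     if kind == 1:                # sentinel value
#         ...                      # compare data buffer elements to `value`
#     elif kind in (2, 3):         # bit or byte mask
#         mask = col.get_mask()    # boolean buffer with bit width 1 or 8
#     else:                        # kind == 0: NaN/NaT
#         ...                      # check the data buffer for NaN/NaT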


class DataFrame:
    """
    A data frame class, with only the methods required by the interchange
    protocol defined.

    A "data frame" represents an ordered collection of named columns.
    A column's "name" must be a unique string.
    Columns may be accessed by name or by position.

    This could be a public data frame class, or an object with the methods and
    attributes defined on this DataFrame class could be returned from the
    ``__dataframe__`` method of a public data frame class in a library
    adhering to the dataframe interchange protocol specification.
    """
    def __dataframe__(self, nan_as_null: bool = False) -> dict:
        """
        Produce a dictionary object following the dataframe protocol
        specification.
        """
        self._nan_as_null = nan_as_null
        return {
            "dataframe": self,  # DataFrame object adhering to the protocol
            "version": 0,       # Version number of the protocol
        }
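
    # Illustrative consumer-side use of the protocol entry point, where
    # ``df`` is a hypothetical object exposing ``__dataframe__``:
    #
    #     d = df.__dataframe__()
    #     assert d["version"] == 0
    #     protocol_df = d["dataframe"]
    #     for name in protocol_df.column_names():
    #         col = protocol_df.get_column_by_name(name)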

    def num_columns(self) -> int:
        """
        Return the number of columns in the DataFrame.
        """
        pass

    def num_rows(self) -> Optional[int]:
        # TODO: not happy with Optional, but need to flag it may be expensive.
        #       Why include it if it may be None - what do we expect consumers
        #       to do here?
        """
        Return the number of rows in the DataFrame, if available.
        """
        pass

    def num_chunks(self) -> int:
        """
        Return the number of chunks the DataFrame consists of.
        """
        pass

    def column_names(self) -> Iterable[str]:
        """
        Return an iterator yielding the column names.
        """
        pass

    def get_column(self, i: int) -> Column:
        """
        Return the column at the indicated position.
        """
        pass

    def get_column_by_name(self, name: str) -> Column:
        """
        Return the column whose name is the indicated name.
        """
        pass

    def get_columns(self) -> Iterable[Column]:
        """
        Return an iterator yielding the columns.
        """
        pass

    def select_columns(self, indices: Sequence[int]) -> DataFrame:
        """
        Create a new DataFrame by selecting a subset of columns by index.
        """
        pass

    def select_columns_by_name(self, names: Sequence[str]) -> DataFrame:
        """
        Create a new DataFrame by selecting a subset of columns by name.
        """
        pass

    def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable[DataFrame]:
        """
        Return an iterator yielding the chunks.

        By default (None), yields the chunks that the data is stored as by the
        producer. If given, ``n_chunks`` must be a multiple of
        ``self.num_chunks()``, meaning the producer must subdivide each chunk
        before yielding it.
        """
        pass
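
    # Example of the ``n_chunks`` semantics (illustrative): for a producer
    # storing the data as 2 chunks of 500 rows each:
    #
    #     list(df.get_chunks())            # 2 chunks, 500 rows each
    #     list(df.get_chunks(n_chunks=4))  # 4 chunks, 250 rows each
    #     df.get_chunks(n_chunks=3)        # error: 3 not a multiple of 2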

    @property
    def device(self) -> int:
        """
        Device type the dataframe resides on.

        Uses device type codes matching DLPack:

        - 1 : CPU
        - 2 : CUDA
        - 3 : CPU pinned
        - 4 : OpenCL
        - 7 : Vulkan
        - 8 : Metal
        - 9 : Verilog
        - 10 : ROCm
        """
        pass
