Add a prototype of the dataframe interchange protocol #38

Merged
merged 13 commits into from
Jun 25, 2021
82 changes: 65 additions & 17 deletions protocol/dataframe_protocol.py
@@ -5,6 +5,32 @@

For guiding requirements, see https://github.com/data-apis/dataframe-api/pull/35


Concepts in this design
-----------------------

1. A `Buffer` class. A *buffer* is a contiguous block of memory - this is the
only thing that actually maps to a 1-D array, in the sense that it could be
converted to NumPy, CuPy, et al.
2. A `Column` class. A *column* has a name and a single dtype. It can consist
of multiple *chunks*. A single chunk of a column (which may be the whole
column if ``num_chunks == 1``) is again modeled as a `Column` instance, and
contains 1 data *buffer* and (optionally) one *mask* for missing data.
Collaborator

Saying a Column contains "1 data buffer" is a bit ambiguous. For a Column of strings, is the characters buffer or the offsets buffer the data buffer?

Member Author

@rgommers Mar 3, 2021

Didn't update this bit yet, sorry. We already decided that we need Arrow-like children to properly support strings and categoricals.
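To make the ambiguity concrete, here is a minimal sketch of an Arrow-style variable-length string layout, where a single column needs two buffers: one holding the UTF-8 bytes and one holding offsets into it. The names `data`, `offsets`, and `get_string` are illustrative assumptions and not part of the prototype.

```python
from itertools import accumulate

# Hypothetical example values; not taken from the PR.
strings = ["foo", "", "héllo"]

# Buffer 1: all characters, UTF-8 encoded and concatenated.
encoded = [s.encode("utf-8") for s in strings]
data = b"".join(encoded)

# Buffer 2: offsets into `data`; element i spans offsets[i]:offsets[i + 1].
offsets = [0] + list(accumulate(len(b) for b in encoded))


def get_string(i: int) -> str:
    """Decode element i from the two buffers."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


decoded = [get_string(i) for i in range(len(strings))]
```

Neither buffer alone is "the" data buffer here: the bytes are meaningless without the offsets, which is why a single data buffer per column chunk does not cover strings.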

3. A `DataFrame` class. A *data frame* is an ordered collection of *columns*.
It has a single device, and all its rows are the same length. It can consist
of multiple *chunks*. A single chunk of a data frame is again modeled as
a `DataFrame` instance.
4. A *mask* concept. A *mask* of a single-chunk column is a *buffer*.
5. A *chunk* concept. A *chunk* is a sub-dividing element that can be applied
to a *data frame* or a *column*.

Note that the only way to access these objects is through a call to
``__dataframe__`` on a data frame object. This is NOT meant as public API;
the classes here only serve to describe the API of what is returned by a
call to ``__dataframe__``. They are the concepts needed to capture the
memory layout and data access of a data frame.


Design decisions
----------------

@@ -31,12 +57,27 @@
(see cuDF experience, forced to add because pandas has them).
Requiring row names seems worse than leaving them out.

Note that row labels could be added in the future - right now there are no
clear requirements for more complex row labels that cannot be represented by
a single column. Those do exist; for example, Modin has table- and tree-based
row labels.

"""


class Buffer:
Member Author

One thing that didn't occur to me in the discussion we just had: if we get rid of this Buffer class and change it to a plain dict, then we cannot attach __dlpack__ to it.

Member

It could still be attached to the Column chunk, in which case it would only work if the column is backed by a single buffer. But I suppose that is limiting the use of dlpack too much? (because that would mean there is no easy way to get a separate buffer as a (numpy) array)
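The trade-off in this exchange can be sketched roughly as follows. This is only an illustration under stated assumptions: `ColumnChunk`, the `buffers` dict, and the `export` method are hypothetical, and `export` returns a `memoryview` as a stand-in for the DLPack capsule that a real `__dlpack__` would return.

```python
from typing import Dict


class Buffer:
    """Hypothetical stand-in for the protocol's Buffer: a block of bytes."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload

    def export(self) -> memoryview:
        # Stand-in for ``__dlpack__``; a real implementation would return
        # a DLPack capsule rather than a memoryview.
        return memoryview(self._payload)


class ColumnChunk:
    """Hypothetical column chunk holding one or more named buffers."""

    def __init__(self, buffers: Dict[str, Buffer]) -> None:
        self.buffers = buffers

    def export(self) -> memoryview:
        # Column-level export is only unambiguous with exactly one buffer.
        if len(self.buffers) != 1:
            raise ValueError("column-level export needs exactly one buffer")
        (only,) = self.buffers.values()
        return only.export()


# A fixed-width column backed by one buffer, and a string column backed by two.
ints = ColumnChunk({"data": Buffer(bytes([1, 2, 3, 4]))})
strings = ColumnChunk(
    {"data": Buffer(b"foobar"), "offsets": Buffer(bytes([0, 3, 6]))}
)
```

With export attached to each buffer, the offsets of the string column stay reachable individually. With export only on the column chunk, the two-buffer case has to be rejected.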

"""
Data in the buffer is guaranteed to be contiguous in memory.

Note that there is no dtype attribute present; a buffer can be thought of
as simply a block of memory. However, if the column that the buffer is
attached to has a dtype that's supported by DLPack and ``__dlpack__`` is
implemented, then that dtype information will be contained in the return
value from ``__dlpack__``.

This distinction is useful to support both (a) data exchange via DLPack on a
buffer and (b) dtypes like variable-length strings which do not have a
fixed number of bytes per element.
"""

@property
@@ -67,6 +108,25 @@ def __dlpack__(self):
"""
raise NotImplementedError("__dlpack__")

def __dlpack_device__(self) -> Tuple[enum.IntEnum, int]:
"""
Device type and device ID for where the data in the buffer resides.

Uses device type codes matching DLPack. Enum members are::

- CPU = 1
- CUDA = 2
- CPU_PINNED = 3
- OPENCL = 4
- VULKAN = 7
- METAL = 8
- VPI = 9
- ROCM = 10

Note: must be implemented even if ``__dlpack__`` is not.
"""
pass


class Column:
"""
@@ -279,6 +339,11 @@ class DataFrame:
def __dataframe__(self, nan_as_null : bool = False) -> dict:
"""
Produces a dictionary object following the dataframe protocol spec

``nan_as_null`` is a keyword intended for the consumer to tell the
producer to overwrite null values in the data with ``NaN`` (or ``NaT``).
It is intended for cases where the consumer does not support the bit
mask or byte mask that is the producer's native representation.
"""
self._nan_as_null = nan_as_null
return {
@@ -354,20 +419,3 @@ def get_chunks(self, n_chunks : Optional[int] = None) -> Iterable[DataFrame]:
"""
pass
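As a rough illustration of the chunk concept, the sketch below splits a toy frame into row-wise chunks that are themselves frames. `ToyFrame` and `num_rows` are hypothetical names, the frame is backed by plain Python lists, and treating `n_chunks=None` as "return a single chunk" is an assumption rather than something the prototype specifies here.

```python
from typing import Dict, Iterable, List, Optional


class ToyFrame:
    """Hypothetical frame: named columns of equal length, backed by lists."""

    def __init__(self, columns: Dict[str, List[int]]) -> None:
        if len({len(values) for values in columns.values()}) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns

    def num_rows(self) -> int:
        return len(next(iter(self.columns.values()), []))

    def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable["ToyFrame"]:
        # Assumption: None means "one chunk covering the whole frame".
        n = 1 if n_chunks is None else n_chunks
        if n < 1:
            raise ValueError("n_chunks must be a positive integer")
        total = self.num_rows()
        step = -(-total // n)  # ceiling division
        for i in range(n):
            start = i * step
            stop = min(start + step, total)
            # Each chunk is again a ToyFrame, mirroring the design above.
            yield ToyFrame(
                {name: values[start:stop] for name, values in self.columns.items()}
            )


frame = ToyFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})
chunks = list(frame.get_chunks(2))
```

Because every chunk is the same type as the whole frame, a consumer can handle chunked and unchunked producers with the same code path.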

@property
def device(self) -> int:
"""
Device type the dataframe resides on.

Uses device type codes matching DLPack:

- 1 : CPU
- 2 : CUDA
- 3 : CPU pinned
- 4 : OpenCL
- 7 : Vulkan
- 8 : Metal
- 9 : Verilog
- 10 : ROCm
"""
pass
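The move from a dataframe-level `device` integer to a per-buffer `__dlpack_device__` tuple can be sketched as below. The enum values are the ones listed in the docstrings in this diff; the class names `DLDeviceType` and `CpuBuffer`, and the choice of device ID `0` for CPU memory, are assumptions made for illustration.

```python
import enum
from typing import Tuple


class DLDeviceType(enum.IntEnum):
    """Device type codes as listed in the ``__dlpack_device__`` docstring."""

    CPU = 1
    CUDA = 2
    CPU_PINNED = 3
    OPENCL = 4
    VULKAN = 7
    METAL = 8
    VPI = 9
    ROCM = 10


class CpuBuffer:
    """Hypothetical buffer that lives in host memory."""

    def __dlpack_device__(self) -> Tuple[enum.IntEnum, int]:
        # Assumption: device ID 0 is used for CPU memory.
        return (DLDeviceType.CPU, 0)


device_type, device_id = CpuBuffer().__dlpack_device__()
```

Since `IntEnum` members compare equal to plain integers, a consumer that still checks `device_type == 1` keeps working, while newer code can use the named members.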