Add a prototype of the dataframe interchange protocol #38


Merged: 13 commits merged, Jun 25, 2021
Changes from 4 commits
373 changes: 373 additions & 0 deletions protocol/dataframe_protocol.py
@@ -0,0 +1,373 @@
"""
Specification for objects to be accessed, for the purpose of dataframe
interchange between libraries, via the ``__dataframe__`` method on a library's
data frame object.

For guiding requirements, see https://github.com/data-apis/dataframe-api/pull/35

Design decisions
----------------

**1. Use a separate column abstraction in addition to a dataframe interface.**
Member

If this is an interchange protocol I do not see value in having a column abstraction. If we want a way to treat a dataframe as an array we should make that explicit rather than through a column interface. The Modin implementation does not support a "column" abstraction; rather, it uses a dataframe with 1 column.

Member Author

Thanks Devin, this is a good discussion to have. As the note 7 lines below says, this is not a user-facing API. It seems a useful abstraction for eventually getting down to a description of memory. There are multiple reasonable ways to model a dataframe though. What this design does is use 5 concepts in 3 classes:

  1. A Buffer class. A "buffer" is a contiguous block of memory - this is the only thing that actually maps to a 1-D array, in the sense that it could be converted to NumPy, CuPy, et al.
  2. A Column class. A "column" has a name and a single dtype. It can consist of multiple "chunks". A single chunk of a column (which may be the whole column if num_chunks == 1) is again modeled as a Column instance, and contains 1 data buffer and (optionally) one "mask" for missing data.
  3. A DataFrame class. A "data frame" is an ordered collection of columns. It has a single device, and all its rows are the same length. It can consist of multiple "chunks". A single chunk of a data frame is again modeled as a DataFrame instance.
  4. A "mask" concept. A mask of a single-chunk column is a buffer.
  5. A "chunk" concept. A chunk is a sub-dividing element that can be applied to a data frame or a column.

These I think are all the actual abstractions we use when talking about data frames. It's fairly explicit and easy to understand. An alternative approach would be to do something like what Arrow did - its "buffer" is the same as in this design, but it doesn't use any of the other concepts - its "array" (which is a little like Column here and not at all like the "array" we use in the array API standard) and "children" are ways of combining buffers rather than direct mappings to data frame concepts. Its approach seems to be to use the minimal number of concepts to describe the memory layout of a data frame, and leave it up to its users to map that to their actual data frame concepts.

Taking into account our previous discussions of this topic as well as the early prototypes sketched out by @wesm and @kkraus14, I believe we did want something that looked more like a minimal dataframe API with explicitly named methods aimed at library developers. Given that, I think this is at least in the right direction. The main question regarding these concepts that I wasn't sure about is: should a column (within a single chunk) be allowed to have multiple buffers or not?
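As a rough illustration of how these concepts compose, here's a sketch of a consumer walking from a dataframe down to raw memory, using the draft method names in this PR (``df`` stands for the protocol object a library hands out via ``__dataframe__``; illustrative only, not part of the spec):

# Walk from the dataframe down to raw memory, chunk by chunk.
for chunk in df.get_chunks():               # each chunk is itself a DataFrame
    for col in chunk.get_columns():         # each column has a name and a single dtype
        data = col.get_data_buffer()        # a Buffer: one contiguous block of memory
        if col.describe_null[0] in (3, 4):  # null kind 3 = bit mask, 4 = byte mask
            mask = col.get_mask()           # the mask of a single-chunk column is also a Buffer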

Member

As the note 7 lines below says, this is not a user-facing API

Yes, I understood this. A better wording of my statement would have been "I do not see value in having a column abstraction in an interchange protocol."

I guess I do not understand the rationales here and how that maps to the interchange protocol.

- This is how it works in R, Julia and Apache Arrow.
- Semantically most existing applications and users treat a column similar to a 1-D array
- We should be able to connect a column to the array data interchange mechanism(s)

Here you have given examples, and make a point about applications and users, along with the array API. This does not seem to be about the interchange protocol.

Maybe I misunderstood the purpose of this PR based on your comment. It seems you intend to extend this interchange protocol into the developer API we have been talking about. Is that correct?

I may have missed a week or two scrambling toward paper/talk deadlines 😄. Perhaps I would have pushed back (or at least forced a more directed discussion) on some things that were decided if I were there.

Member Author

I may have missed a week or two scrambling toward paper/talk deadlines 😄. Perhaps I would have pushed back (or at least forced a more directed discussion) on some things that were decided if I were there.

Now is a good time to push :) Nothing is set in stone.

I do not see value in having a column abstraction in an interchange protocol.

So what would you do with the methods that are now on Column but not on DataFrame? Attach them to DataFrame and make them raise an exception unless it's a single-column dataframe?

I guess I do not understand the rationales here and how that maps to the interchange protocol.

Good point. Those rationales were a summary of opinions on wesm/dataframe-protocol#1; I didn't put enough thought into whether they did or did not apply here.

Maybe I misunderstood the purpose of this PR based on your comment. It seems you intend to extend this interchange protocol into the developer API we have been talking about. Is that correct?

In our last meeting we discussed that. There's currently no intention to extend it, I'd say, although it's a possibility. The mood was more "let's get this interchange protocol done, and then see based on experience and adoption what the right approach is for a more complete API".

Personally I think a developer API may have some similarities, but will not be a superset of this interchange protocol. For example, I'm not convinced it should have a column abstraction, and certainly not memory-oriented APIs like get_data_buffer or get_mask.


Though I am one of the people pushing for getting things out, I think we need to be careful when adding new abstractions. I am for a minimum viable protocol in the first version, and I second that adding a Column and even a Buffer seems to follow the current state of affairs without critically challenging whether they are needed in the first version. This is especially dangerous, as it has a tendency to become another attempt to please everyone, especially pandas' accustomed user base. IMHO we need to keep the balance between usefulness and adoption.

Member

Maybe we can call it Array instead of Column, then we don't have a Column abstraction any more ;-)

Member Author

GitHub does not have the range of emojis needed to accurately reflect my initial reaction to that :)


Rationales:
- This is how it works in R, Julia and Apache Arrow.
- Semantically most existing applications and users treat a column similar to a 1-D array
- We should be able to connect a column to the array data interchange mechanism(s)

Note that this does not imply a library must have such a public user-facing
abstraction (e.g. ``pandas.Series``) - the column abstraction may only be accessible via ``__dataframe__``.

**2. Use methods and properties on an opaque object rather than returning
hierarchical dictionaries describing memory**

This is better for implementations that may rely on, for example, lazy
computation.

**3. No row names. If a library uses row names, use a regular column for them.**
Member

I disagree with this. If two libraries support row labels it would generally be better for both libraries if there was a way to convert between row labels. In general, row labels are metadata and moving metadata into the data can be costly and require copying. I think this places too much of a constraint on dataframe implementations and assumes that moving metadata into data is cheap, which is certainly not the case in Modin.

Member Author

I don't have a strong opinion on this, this was my interpretation of the linked discussion. Happy to change, let's wait for a few more opinions though - @kkraus14, @maartenbreddels, @TomAugspurger and @datapythonista all seemed to lean towards "no row labels".

Maybe my adding "use a regular column for them" isn't correct though - perhaps it should simply have been "is not supported".

Collaborator

I strongly voiced my opinion on row names in that they should either be required and part of the spec, or entirely not part of the spec. Making them optional just ends up pushing everyone to implement them, which is what we've seen with everyone following the Pandas API.

In the early days of cuDF we tried to take the stance that we won't implement row names (indexes); eventually enough people pushed back that it's needed for Pandas compatibility, so we added a single-column Index, which then succumbed to the same argument for eventually supporting MultiIndexes. Generally, all of the arguments were just for Pandas compatibility / minimizing changes to existing code, as opposed to better implementations or better code paths in any way.

+1 to just saying "is not supported"

Member

@kkraus14 I don't agree with this idea:

Generally all of the arguments were always just for Pandas compatibility / minimizing changes to existing code as opposed to better implementations or better code paths in any way.

All of the discussions against row labels ignore the users and questions about why they exist in the first place. Yes, it is easier to have an implementation without them, I agree. I would argue that it is not about "legacy compatibility" of changing existing code. It has become a part of the mental model of (millions of) users, and row names were a part of S and R dataframes long before pandas. I have never been a fan of the idea that users should change their mental models of a system to fit what the library authors found easy to implement or optimize. Maybe this is not the place to discuss the validity of row labels as a concept, because it likely boils down to deep-seated opinions on user-centered designs vs. performance-centered designs. Sometimes those views are not compatible.

I do not think making row labels optional will force everyone to implement them. This is a developer API, so if something is optional, the developers using it will understand that and handle things appropriately. If this were an end user API things might be different.

I have a concern that narrowing this down to the least common denominator will result in something that is not particularly useful. I mentioned this in the live meetings, but throwing specific exceptions or warnings should absolutely be a part of the spec and can help with this question about optional implementation details.

If there is no support for labels then the only two options are to (1) implicitly move the metadata into the data, which as I mentioned might be expensive, or (2) drop the row labels altogether, even though they might contain relevant data. Without row labels, there is no operator to explicitly move row labels into the data (reset_index in pandas), so I don't see any other way to handle it.

Collaborator

All of the discussions against row labels ignore the users and questions about why they exist in the first place. Yes, it is easier to have an implementation without them, I agree. I would argue that it is not about "legacy compatibility" of changing existing code. It has become a part of the mental model of (millions of) users, and row names were a part of S and R dataframes long before pandas. I have never been a fan of the idea that users should change their mental models of a system to fit what the library authors found easy to implement or optimize. Maybe this is not the place to discuss the validity of row labels as a concept, because it likely boils down to deep-seated opinions on user-centered designs vs. performance-centered designs. Sometimes those views are not compatible.

95+% of the use cases we saw were people not actually using indexes in any meaningful way. They were constantly using reset_index to get back to RangeIndex objects and then using __getitem__ calls to access row numbers instead of using iloc. Similarly, with MultiIndex we almost always saw that, just because an operation like groupby or join returns a MultiIndex by default, instead of changing the code to use an as_index=False option or something similar, they'd just add a reset_index immediately afterwards.
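(For reference, a tiny illustrative pandas snippet of that pattern, not taken from the discussion itself:)

import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# Common pattern described above: the groupby result gets an index,
# which is immediately discarded again.
out1 = df.groupby("key").sum().reset_index()

# Equivalent result without ever creating the index.
out2 = df.groupby("key", as_index=False).sum()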

I do not think making row labels optional will force everyone to implement them. This is a developer API, so if something is optional, the developers using it will understand that and handle things appropriately. If this were an end user API things might be different.

If row labels are optional then presumably Pandas will support them and developers will end up continuing to use them unnecessarily and inefficiently, which will then push other libraries to implement them unnecessarily in order to be able to integrate.

For what it's worth, on the cuDF side we now somewhat nicely handle the cases of indexing / multi-indexing, but I am pushing to keep new DataFrame libraries from having to go through the same pain that we did unless they explicitly want developers to use indexing features.

Member Author

Again, absent the argument is any discussion on the user's mental model.

I'm not sure that's the argument. My experience is the same as @kkraus14's with code I got from less experienced colleagues in the past: they just use the defaults (which can be as_index=True), then get to "this stuff is weird, how do I get rid of it", then hit upon .reset_index() via a StackOverflow post. There may be some advanced users who make good use of row labels, but I suspect it's a small-ish minority. For the rest, row labels do not seem to be part of their mental model.

Would be good to get some pandas maintainer opinions on this one.

Member

(I am a bit of a biased pandas maintainer, as I would like to see pandas get rid of default row indices; "default" to be clear - still have the functionality if you explicitly ask for it)

I think that, regarding data exchange, the index can be passed through just like another column? In which case it might be a question of whether we want to make it possible to indicate that certain column(s) can be used as the index (e.g. a describe_index() that returns the names of index columns; someone not interested in that can just ignore that method)

Member

Revisiting this, passing the row labels as a separate buffer (or a "column buffer") would work for me as long as the labels would not need to be moved into the data.

Member Author

Cool, that's helpful @devin-petersohn. I think it could perhaps be combined with a metadata escape hatch, which would also be separate from the regular data, and which we discussed last time as being desirable to add.

Member Author

Update on this, since we had multiple discussions: in the end we settled on going with metadata (added in gh-43), and decided that libraries can store anything there (it's a Dict[str, Any] attribute), so things like row labels or which column is the index can indeed be stored there and don't need to be moved into the set of regular columns. This seemed to work for everyone.
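As a minimal sketch of how a producer might use that metadata escape hatch for row labels (the key names and values here are made-up illustrations, not something specified in this PR):

# Hypothetical producer-side metadata; consumers that don't care about an
# index can simply ignore these keys.
metadata = {
    "examplelib.index_labels": ["row0", "row1", "row2"],  # assumed key name
    "examplelib.index_column": None,                      # or the name of a regular column
}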


See discussion at https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241
Optional row names are not a good idea, because people will assume they're present
(see cuDF experience, forced to add because pandas has them).
Requiring row names seems worse than leaving them out.

"""


class Buffer:
Member Author

One thing that didn't occur to me in the discussion we just had: if we get rid of this Buffer class and change it to a plain dict, then we cannot attach __dlpack__ to it.

Member

It could still be attached to the Column chunk, in which case it would only work if the column is backed by a single buffer. But I suppose that is limiting the use of dlpack too much? (because that would mean there is no easy way to get a separate buffer as a (numpy) array)
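For context, a minimal sketch of how a consumer could turn such a Buffer into a NumPy array via DLPack, assuming NumPy >= 1.22 for ``np.from_dlpack`` and assuming the buffer also exposes ``__dlpack_device__`` (which DLPack consumers generally require); ``buf`` is a hypothetical Buffer instance:

import numpy as np

# np.from_dlpack consumes any object implementing the DLPack protocol and
# returns a zero-copy view of the buffer's memory as a 1-D array.
arr = np.from_dlpack(buf)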

"""
Data in the buffer is guaranteed to be contiguous in memory.
"""

@property
def bufsize(self) -> int:
"""
Buffer size in bytes
"""
pass

@property
def ptr(self) -> int:
"""
Pointer to start of the buffer as an integer
"""
pass

def __dlpack__(self):
"""
Produce DLPack capsule (see array API standard).

Raises:

- TypeError : if the buffer contains unsupported dtypes.
- NotImplementedError : if DLPack support is not implemented

Useful for connecting to array libraries. Support is optional because
it's not completely trivial to implement for a Python-only library.
"""
raise NotImplementedError("__dlpack__")


class Column:
"""
A column object, with only the methods and properties required by the
interchange protocol defined.

A column can contain one or more chunks. Each chunk can contain either one
Member

I find it a bit confusing that we use the same object for a collection of chunks as the chunk itself.

Eg what does get_data_buffer return for a Column with multiple chunks?

(in Arrow there is the concept of Array and ChunkedArray (and this represents a column of a DataFrame), where the second is just a vector of same-typed arrays)

Member Author

Eg what does get_data_buffer return for a Column with multiple chunks?

That should raise.

I find it a bit confusing that we use the same object for a collection of chunks as the chunk itself.

It's possible to have a separate object like ChunkedColumn, but then for symmetry there should probably also be a ChunkedDataFrame.

That'll be an interesting discussion to have today. @devin-petersohn preferred fewer concepts (no Column). I think the two extremes here are:

  • Only DataFrame and Buffer
  • All of DataFrame, ChunkedDataFrame, Column, ChunkedColumn, and Buffer.

In the end fewer separate classes just means more methods/properties on classes that will raise when they're called in a situation that's not appropriate.

Member
@jorisvandenbossche Feb 11, 2021

The DataFrame could always be chunked, so DataFrame -> ChunkedColumn (with potentially only 1 chunk), to avoid the additional class. But always having to first access the single chunk (if you don't work with multiple chunks) also has its disadvantages, of course.

or two buffers - one data buffer and (depending on null representation) it
may have a mask buffer.

TBD: Arrow has a separate "null" dtype, and has no separate mask concept.
Instead, it seems to use "children" for both columns with a bit mask,
and for nested dtypes. Unclear whether this is elegant or confusing.
This design requires checking the null representation explicitly.

The Arrow design requires checking:
1. the ARROW_FLAG_NULLABLE (for sentinel values)
2. if a column has two children, combined with one of those children
having a null dtype.

Making the mask concept explicit seems useful. One null dtype would
not be enough to cover both bit and byte masks, so that would mean
even more checking if we did it the Arrow way.

TBD: there's also the "chunk" concept here, which is implicit in Arrow as
multiple buffers per array (= column here). Semantically it may make
sense to have both: chunks were meant for example for lazy evaluation
of data which doesn't fit in memory, while multiple buffers per column
could also come from doing a selection operation on a single
contiguous buffer.

Given these concepts, one would expect chunks to be all of the same
size (say a 10,000 row dataframe could have 10 chunks of 1,000 rows),
while multiple buffers could have data-dependent lengths. Not an issue
in pandas if one column is backed by a single NumPy array, but in
Arrow it seems possible.
Are multiple chunks *and* multiple buffers per column necessary for
the purposes of this interchange protocol, or must producers either
reuse the chunk concept for this or copy the data?

Note: this Column object can only be produced by ``__dataframe__``, so
doesn't need its own version or ``__column__`` protocol.

"""

@property
def size(self) -> Optional[int]:
"""
Size of the column, in elements.

Corresponds to DataFrame.num_rows() if column is a single chunk;
equal to size of this current chunk otherwise.
"""
pass

@property
def offset(self) -> int:
"""
Offset of first element

May be > 0 if using chunks; for example for a column with N chunks of
equal size M (only the last chunk may be shorter),
``offset = n * M``, ``n = 0 .. N-1``.
"""
pass

@property
def dtype(self) -> Tuple[int, int, str, str]:
"""
Dtype description as a tuple ``(kind, bit-width, format string, endianness)``

Kind :

- 0 : signed integer
- 1 : unsigned integer
- 2 : IEEE floating point
- 20 : boolean
- 21 : string (UTF-8)
- 22 : datetime
- 23 : categorical
Collaborator

Why do we need both Kind and Format string? It seems like duplicate information, except Kind is less flexible.

Member Author

Format strings are quite annoying to parse, so for anything besides datetime I think library authors would have a preference for using kind + bit_width.

Collaborator

Should the answer be for us to expand Kind in order to sufficiently handle all of the cases that the format string handles, then? The way I read this spec, everyone will have to produce the format strings regardless as of now, so they'll have to deal with the annoyance anyway.

Member Author

True, however producing a format string is much easier than parsing one - I think it's a minor annoyance.

I don't see how we'd cover all the datetime formats including time zones without a format string.
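(For illustration, a few Apache Arrow C Data Interface format strings as I read that spec, to show where a plain kind + bit-width pair stops being enough:)

# "g"                   -> float64; kind + bit-width alone would cover this
# "tsn:"                -> timestamp in nanoseconds, timezone-naive
# "tsu:Europe/Brussels" -> timestamp in microseconds with a time zone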

Member Author

Maybe use 16 kind integers for the 16 formats in https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings, then a separate string field for timezone?

Collaborator

Do we care about supporting things like Decimal or Fixed Size Lists in the future, where we need to specify some semi-arbitrary attributes alongside the dtype? If we do, then we should make sure that whatever specification we decide on here can eventually support those gracefully.

Member

I think we should at least keep the possibility open to support more data types like decimal and nested types (which probably means we need format strings?)


Bit-width : the number of bits as an integer
Format string : data type description format string in Apache Arrow C
Data Interface format.
Endianness : currently only native endianness (``=``) is supported

Notes:

- Kind specifiers are aligned with DLPack where possible (hence the
jump to 20, leave enough room for future extension)
- Masks must be specified as boolean with either bit width 1 (for bit
masks) or 8 (for byte masks).
- Dtype width in bits was preferred over bytes
- Endianness isn't too useful, but included now in case in the future
we need to support non-native endianness
- Went with Apache Arrow format strings over NumPy format strings
because they're more complete from a dataframe perspective
- Format strings are mostly useful for datetime specification, and
for categoricals.
- For categoricals, the format string describes the type of the
categorical in the data buffer. In case of a separate encoding of
the categorical (e.g. an integer to string mapping), this can
be derived from ``self.describe_categorical``.
- Data types not included: complex, Arrow-style null, binary, decimal,
and nested (list, struct, map, union) dtypes.
"""
pass

@property
def describe_categorical(self) -> Dict[str, Any]:
"""
If the dtype is categorical, there are two options:

- There are only values in the data buffer.
- There is a separate dictionary-style encoding for categorical values.

Raises RuntimeError if the dtype is not categorical

Content of returned dict:

- "is_ordered" : bool, whether the ordering of dictionary indices is
semantically meaningful.
- "is_dictionary" : bool, whether a dictionary-style mapping of
categorical values to other objects exists
- "mapping" : dict, Python-level only (e.g. ``{int: str}``).
None if not a dictionary-style categorical.
Comment on lines +254 to +255
Collaborator

Is this the actual dictionary mapping? I.e. if I had a column of ['gold', 'bronze', 'silver', 'gold', 'silver', 'bronze'] with a dictionary of ['gold' < 'silver' < 'bronze'] would this mapping key be: {0: 'gold', 1: 'silver', 2: 'bronze'}?

Member Author

Yes indeed.
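To spell that out, a sketch of what ``describe_categorical`` could return for that example, following the docstring above (``col`` is a hypothetical Column; the values are illustrative):

# Column data: ['gold', 'bronze', 'silver', 'gold', 'silver', 'bronze']
# Ordered categories: 'gold' < 'silver' < 'bronze'
col.describe_categorical == {
    "is_ordered": True,      # the ordering of dictionary indices is meaningful
    "is_dictionary": True,   # there is a dictionary-style mapping
    "mapping": {0: 'gold', 1: 'silver', 2: 'bronze'},
}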

Collaborator

I'm a HUGE -1 on this. We'd want our dictionary to just be another column which is on the device for example.

Member Author
@rgommers Feb 10, 2021

How do you put a dictionary in a single column? It'd be a dataframe of two columns, index and value?

Collaborator

Take a categorical column of ['gold', 'bronze', 'silver', null, 'bronze', 'silver', 'gold'] with categories of ['gold' < 'silver' < 'bronze'] using Arrow formatting:

categorical column: {
    mask_buffer: [119], # 01110111 in binary
    data_buffer: [0, 2, 1, 127, 2, 1, 0], # the 127 value in here is undefined since it's null
    children: [
        string column: {
            mask_buffer: None,
            offsets_buffer: [0, 4, 10, 16],
            data_buffer: [103, 111, 108, 100, 115, 105, 108, 118, 101, 114, 98, 114, 111, 110, 122, 101]
        }
    ]
}

In the case of cuDF all of these buffers are on the GPU, and getting a host mapping of {0: 'gold', 1: 'silver', 2: 'bronze'} would be extremely expensive for any sufficiently large dictionary.

Member Author

A note on the above example: children is actually named dictionary in ArrowSchema (a separate struct in the schema rather than being one of the child arrays; the history of that decision is a little unclear).


TBD: are there any other in-memory representations that are needed?
"""
pass

@property
def describe_null(self) -> Tuple[int, Any]:
"""
Return the missing value (or "null") representation the column dtype
uses, as a tuple ``(kind, value)``.

Kind:

- 0 : non-nullable
- 1 : NaN/NaT
- 2 : sentinel value
- 3 : bit mask
- 4 : byte mask

Value : if kind is "sentinel value", the actual value. None otherwise.
"""
pass

@property
def null_count(self) -> Optional[int]:
"""
Number of null elements, if known.

Note: Arrow uses -1 to indicate "unknown", but None seems cleaner.
"""
pass

def num_chunks(self) -> int:
"""
Return the number of chunks the column consists of.
"""
pass

def get_chunks(self, n_chunks : Optional[int] = None) -> Iterable[Column]:
"""
Return an iterator yielding the chunks.

See `DataFrame.get_chunks` for details on ``n_chunks``.
"""
pass

def get_data_buffer(self) -> Buffer:
"""
Return the buffer containing the data.
"""
pass

def get_mask(self) -> Buffer:
"""
Return the buffer containing the mask values indicating missing data.

Raises RuntimeError if null representation is not a bit or byte mask.
"""
pass

# # NOTE: not needed unless one considers nested dtypes
# def get_children(self) -> Iterable[Column]:
# """
# Children columns underneath the column, each object in this iterator
# must adhere to the column specification
# """
# pass
Collaborator

How are we handling strings without this?

Member Author

I may be missing some context for this question. NumPy-like fixed-length strings would be handled as a single buffer; for variable-length strings it's unclear to me how you would handle them even with get_children.

Collaborator
@kkraus14 Feb 10, 2021

For variable-length strings, the way Arrow handles them is by having a Column of bytes (int8 / uint8 for example) and then a Column of integer offsets into the bytes. So for example, say you had a column of strings ['joe', null, 'bob', '']; then under the hood the column would look like:

string column: {
    data_buffer: None,
    mask_buffer: [13], # 00001101 in binary
    children: [
        characters int8 column: {
            data_buffer: [106, 111, 101, 98, 111, 98],
            mask_buffer: None
        },
        offsets int32 column: {
            data_buffer: [0, 3, 3, 6, 6],
            mask_buffer: None
        }
    ]
}

Collaborator

This is also how we handle strings in cuDF, and I suspect that generally people will want variable-length strings as opposed to fixed-length strings.

Member Author

For variable-length strings the way Arrow handles them is by having a Column of bytes (int8 / uint8 for example)

It has multiple ways of doing this? There's exactly zero specification of this in https://arrow.apache.org/docs/format/CDataInterface.html, and I can't find a separate description either. All I see is it uses format strings 'u' and 'U' for regular and large utf-8 strings.

Would you expect all libraries to handle multiple flavors of variable-length strings? So if the type specifiers say "utf-8 string", then there's some self-describing format for how that's laid out with multiple child buffers, each of which can have a different dtype?

Member Author

a specialization on Variable Binary https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout.

:( It's like they want people to not understand them. The writing is so terse - there's zero connection to strings in the docs there.

So Arrow doesn't necessarily use child Columns in the case of Strings, but 3 buffers: validity, offsets, and data.

Thanks for the explanation. I understand now. Okay, that indeed requires a change from what I have in this PR currently. There are three options (a rough sketch of option 2 follows the list):

  1. add children columns
  2. add multiple buffers per column (actually we have data and mask, so just adding offsets would be enough).
  3. do both (1) and (2) - perhaps handy to future-proof it for nested dtypes.
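For concreteness, a rough sketch of what option 2 could look like on the Column class (the ``get_offsets`` name is purely hypothetical; nothing was decided in this thread):

def get_offsets(self) -> Buffer:
    """
    Hypothetical: return the buffer containing the offset values for
    variable-size binary data (e.g. variable-length strings).

    Would raise RuntimeError if the dtype has no offsets buffer.
    """
    pass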

Collaborator

I'd be in favor of option 3 😄

Member

:( It's like they want people to not understand them. The writing is so terse - there's zero connection to strings in the docs there.

You are welcome to open an issue about that ;)
(but to be clear, you are totally right. I think the explanation of variable size binary is decent (https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout), but if you don't know that string is just variable size binary where the bytes represent UTF8, that doesn't help much ..)

Also in favor of option 3 (I would personally just follow Arrow's layout of multiple buffers + children arrays)

Member Author

You are welcome to open an issue about that ;)

I know :) I was thinking more about a whole docs section with examples, mapping dataframe concepts to Arrow concepts.

Also in favor of option 3 (I would personally just follow Arrow's layout of multiple buffers + children arrays)

A worry may be the amount of complexity that a consumer is expected to parse. Arbitrarily many buffers and nested child arrays are hugely complex; right now there only seems to be a need for:

  • <=3 buffers (data, offsets, mask)
  • one child level for categoricals or strings.

So it might make sense to do it the same way as Arrow, but restrict it to that - everything else is either a producer error or an unsupported dtype?

Collaborator

If we take the 3-buffer approach for string columns we actually don't need a child level for strings, as they would be represented directly by the 3 buffers.

If we don't take the 3-buffer approach for string columns and instead go with the children approach, then we'd need 2 child levels to handle a categorical column whose categories are a string column.



class DataFrame:
"""
A data frame class, with only the methods required by the interchange
protocol defined.

A "data frame" represents an ordered collection of named columns.
A column's "name" must be a unique string.
Columns may be accessed by name or by position.

This could be a public data frame class, or an object with the methods and
attributes defined on this DataFrame class could be returned from the
``__dataframe__`` method of a public data frame class in a library adhering
to the dataframe interchange protocol specification.
"""
def __dataframe__(self, nan_as_null : bool = False) -> dict:
"""
Produces a dictionary object following the dataframe protocol spec
"""
self._nan_as_null = nan_as_null
return {
"dataframe": self, # DataFrame object adhering to the protocol
"version": 0 # Version number of the protocol
}

def num_columns(self) -> int:
"""
Return the number of columns in the DataFrame
"""
pass

def num_rows(self) -> Optional[int]:
# TODO: not happy with Optional, but need to flag it may be expensive
# why include it if it may be None - what do we expect consumers
# to do here?
"""
Return the number of rows in the DataFrame, if available
"""
pass

def num_chunks(self) -> int:
"""
Return the number of chunks the DataFrame consists of
"""
pass

def column_names(self) -> Iterable[str]:
"""
Return an iterator yielding the column names.
"""
pass

def get_column(self, i: int) -> Column:
"""
Return the column at the indicated position.
"""
pass

def get_column_by_name(self, name: str) -> Column:
"""
Return the column whose name is the indicated name.
"""
pass

def get_columns(self) -> Iterable[Column]:
"""
Return an iterator yielding the columns.
"""
pass

def select_columns(self, indices: Sequence[int]) -> DataFrame:
"""
Create a new DataFrame by selecting a subset of columns by index
"""
pass

def select_columns_by_name(self, names: Sequence[str]) -> DataFrame:
"""
Create a new DataFrame by selecting a subset of columns by name.
"""
pass

def get_chunks(self, n_chunks : Optional[int] = None) -> Iterable[DataFrame]:
"""
Return an iterator yielding the chunks.

By default (None), yields the chunks that the data is stored as by the
producer. If given, ``n_chunks`` must be a multiple of
``self.num_chunks()``, meaning the producer must subdivide each chunk
before yielding it.
"""
pass

@property
def device(self) -> int:
"""
Device type the dataframe resides on.

Uses device type codes matching DLPack:

- 1 : CPU
- 2 : CUDA
- 3 : CPU pinned
- 4 : OpenCL
- 7 : Vulkan
- 8 : Metal
- 9 : Verilog
- 10 : ROCm
"""
pass