Add a prototype of the dataframe interchange protocol #38
@@ -0,0 +1,373 @@

"""
Specification for objects to be accessed, for the purpose of dataframe
interchange between libraries, via the ``__dataframe__`` method on a library's
data frame object.

For guiding requirements, see https://github.com/data-apis/dataframe-api/pull/35

Design decisions
----------------

**1. Use a separate column abstraction in addition to a dataframe interface.**

Rationales:

- This is how it works in R, Julia and Apache Arrow.
- Semantically, most existing applications and users treat a column much like
  a 1-D array.
- We should be able to connect a column to the array data interchange
  mechanism(s).

Note that this does not imply a library must have such a public user-facing
abstraction (e.g. ``pandas.Series``) - it may be accessible only via
``__dataframe__``.

**2. Use methods and properties on an opaque object rather than returning
hierarchical dictionaries describing memory.**

This is better for implementations that may rely on, for example, lazy
computation.

**3. No row names. If a library uses row names, use a regular column for them.**
I disagree with this. If two libraries support row labels, it would generally be better for both if there were a way to convert between row labels. In general, row labels are metadata, and moving metadata into the data can be costly and require copying. I think this places too much of a constraint on dataframe implementations and assumes that moving metadata into data is cheap, which is certainly not the case in Modin.

I don't have a strong opinion on this; it was my interpretation of the linked discussion. Happy to change - let's wait for a few more opinions though. @kkraus14, @maartenbreddels, @TomAugspurger and @datapythonista all seemed to lean towards "no row labels". Maybe my adding "use a regular column for them" isn't correct though - perhaps it should simply have been "is not supported".

I strongly voiced my opinion on row names: they should either be required and part of the spec, or entirely not part of the spec. Making them optional just ends up pushing everyone to implement them, which is what we've seen with everyone following the Pandas API. In the early days of cuDF we tried to take the stance that we wouldn't implement row names (indexes); eventually enough people pushed back that it was needed for Pandas compatibility, so we added a single-column Index, which then eventually succumbed to the same argument and grew support for MultiIndexes. Generally all of the arguments were for Pandas compatibility / minimizing changes to existing code, as opposed to better implementations or better code paths in any way. +1 to just saying "is not supported".

@kkraus14 I don't agree with this idea: […] All of the discussions against row labels ignore the users and the question of why row labels exist in the first place. Yes, it is easier to have an implementation without them, I agree. I would argue that it is not about "legacy compatibility" of changing existing code. Row labels have become a part of the mental model of (millions of) users, and row names were a part of S and R dataframes long before pandas. I have never been a fan of the idea that users should change their mental models of a system to fit what the library authors found easy to implement or optimize. Maybe this is not the place to discuss the validity of row labels as a concept, because it likely boils down to deep-seated opinions on user-centered versus performance-centered design. Sometimes those views are not compatible. I do not think making row labels optional will force everyone to implement them. This is a developer API, so if something is optional, the developers using it will understand that and handle things appropriately. If this were an end-user API, things might be different. I have a concern that narrowing this down to the least common denominator will result in something that is not particularly useful. I mentioned this in the live meetings, but throwing specific exceptions or warnings should absolutely be a part of the spec and can help with this question about optional implementation details. If there is no support for labels then the only two options are to (1) implicitly move the metadata into the data, which as I mentioned might be expensive, or (2) drop the row labels altogether, which might discard relevant data. Without row labels, there is no operator to explicitly move row labels into the data (…).

95+% of the use cases we saw were people not actually using indexes in any meaningful way. They were constantly using … If row labels are optional, then presumably Pandas will support them and developers will end up continuing to use them unnecessarily and inefficiently, which will then push other libraries to implement them unnecessarily out of wanting to be able to integrate. For what it's worth, on the cuDF side we now handle the cases of indexing / multi-indexing somewhat nicely, but I am pushing to avoid new DataFrame libraries having to go through the same pain that we did, unless they explicitly want developers to use indexing features.

> […]

I'm not sure that's the argument. My experience is the same as @kkraus14's with code I got from less experienced colleagues in the past: they just use the defaults (which can be …). Would be good to get some pandas maintainer opinions on this one.

(I am a bit of a biased pandas maintainer, as I would like to see pandas get rid of default row indices; "default" to be clear - still have the functionality if you explicitly ask for it.) I think that regarding data exchange, the index can be passed through just like another column? In which case it might be a question of whether we want to make it possible to indicate that certain column(s) can be used as an index (e.g. a …).

Revisiting this: passing the row labels as a separate buffer (or a "column buffer") would work for me, as long as the labels would not need to be moved into the data.

Cool, that's helpful @devin-petersohn. I think it could be combined with a …

Update on this, since we had multiple discussions on it: in the end we settled on going with …
See discussion at https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241
Optional row names are not a good idea, because people will assume they're
present (see the cuDF experience: it was forced to add them because pandas
has them). Requiring row names seems worse than leaving them out.
"""

from __future__ import annotations

from typing import Any, Dict, Iterable, Optional, Sequence, Tuple


class Buffer:
One thing that didn't occur to me in the discussion we just had: if we get rid of this …

It could still be attached to the Column chunk, in which case it would only work if the column is backed by a single buffer. But I suppose that limits the use of dlpack too much? (Because that would mean there is no easy way to get a separate buffer as a (numpy) array.)
""" | ||
Data in the buffer is guaranteed to be contiguous in memory. | ||
""" | ||
|
||
@property | ||
def bufsize(self) -> int: | ||
""" | ||
Buffer size in bytes | ||
""" | ||
pass | ||
|
||
@property | ||
def ptr(self) -> int: | ||
""" | ||
Pointer to start of the buffer as an integer | ||
""" | ||
pass | ||
|
||
def __dlpack__(self): | ||
""" | ||
Produce DLPack capsule (see array API standard). | ||
|
||
Raises: | ||
|
||
- TypeError : if the buffer contains unsupported dtypes. | ||
kkraus14 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
- NotImplementedError : if DLPack support is not implemented | ||
|
||
Useful to have to connect to array libraries. Support optional because | ||
it's not completely trivial to implement for a Python-only library. | ||
""" | ||
raise NotImplementedError("__dlpack__") | ||
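To make the `ptr`/`bufsize` contract concrete, here is a sketch of how a consumer could wrap such a buffer in a NumPy array without copying. `buf` is a hypothetical object implementing this `Buffer` interface, and the element dtype is assumed to come from the owning column; this helper is illustrative, not part of the PR:

```python
import ctypes

import numpy as np


def buffer_to_ndarray(buf, dtype=np.int64):
    """Zero-copy view of a protocol Buffer as a 1-D NumPy array (sketch)."""
    # Element count follows from the buffer size and the element size.
    n = buf.bufsize // np.dtype(dtype).itemsize
    # Wrap the raw memory at `buf.ptr`; the buffer is guaranteed contiguous.
    raw = (ctypes.c_byte * buf.bufsize).from_address(buf.ptr)
    return np.frombuffer(raw, dtype=dtype, count=n)
```

Note that with this kind of zero-copy wrap, the consumer is responsible for keeping the producing object alive while the view is in use.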


class Column:
    """
    A column object, with only the methods and properties required by the
    interchange protocol defined.

    A column can contain one or more chunks. Each chunk can contain either
    one or two buffers - one data buffer and (depending on the null
    representation) possibly a mask buffer.

    TBD: Arrow has a separate "null" dtype, and has no separate mask concept.
         Instead, it seems to use "children" for both columns with a bit mask,
         and for nested dtypes. Unclear whether this is elegant or confusing.
         This design requires checking the null representation explicitly.

         The Arrow design requires checking:

         1. the ARROW_FLAG_NULLABLE flag (for sentinel values)
         2. if a column has two children, combined with one of those children
            having a null dtype.

         Making the mask concept explicit seems useful. One null dtype would
         not be enough to cover both bit and byte masks, so that would mean
         even more checking if we did it the Arrow way.

    TBD: there's also the "chunk" concept here, which is implicit in Arrow as
         multiple buffers per array (= column here). Semantically it may make
         sense to have both: chunks were meant, for example, for lazy
         evaluation of data which doesn't fit in memory, while multiple
         buffers per column could also come from doing a selection operation
         on a single contiguous buffer.

         Given these concepts, one would expect chunks to be all of the same
         size (say a 10,000-row dataframe could have 10 chunks of 1,000 rows),
         while multiple buffers could have data-dependent lengths. This is not
         an issue in pandas if one column is backed by a single NumPy array,
         but in Arrow it seems possible. Are multiple chunks *and* multiple
         buffers per column necessary for the purposes of this interchange
         protocol, or must producers either reuse the chunk concept for this
         or copy the data?

    Note: this Column object can only be produced by ``__dataframe__``, so
    it doesn't need its own version or ``__column__`` protocol.
    """

I find it a bit confusing that we use the same object for a collection of chunks as for the chunk itself. E.g. what does … (In Arrow there is the concept of Array and ChunkedArray (and the latter represents a column of a DataFrame), where the second is just a vector of same-typed arrays.)

> […]

That should raise. It's possible to have a separate object like … That'll be an interesting discussion to have today. @devin-petersohn preferred fewer concepts (no …). In the end, fewer separate classes just means more methods/properties on classes that will raise when they're called in a situation that's not appropriate.

The DataFrame could always be chunked, so DataFrame -> ChunkedColumn (with potentially only 1 chunk), to avoid the additional class. But always having to first access the single chunk (if you don't work with multiple chunks) also has its disadvantages, of course.
    @property
    def size(self) -> Optional[int]:
        """
        Size of the column, in elements.

        Corresponds to ``DataFrame.num_rows()`` if the column is a single
        chunk; equal to the size of the current chunk otherwise.
        """
        pass

    @property
    def offset(self) -> int:
        """
        Offset of the first element.

        May be > 0 if using chunks; for example, for a column with N chunks
        of equal size M (only the last chunk may be shorter),
        ``offset = n * M``, ``n = 0 .. N-1``.
        """
        pass
    @property
    def dtype(self) -> Tuple[int, int, str, str]:
        """
        Dtype description as a tuple ``(kind, bit-width, format string,
        endianness)``.

        Kind:

        - 0 : signed integer
        - 1 : unsigned integer
        - 2 : IEEE floating point
        - 20 : boolean
        - 21 : string (UTF-8)
        - 22 : datetime
        - 23 : categorical

        Bit-width : the number of bits as an integer
        Format string : data type description format string in Apache Arrow C
                        Data Interface format
        Endianness : currently only native endianness (``=``) is supported

        Notes:

        - Kind specifiers are aligned with DLPack where possible (hence the
          jump to 20, to leave enough room for future extension).
        - Masks must be specified as boolean with either bit width 1 (for bit
          masks) or 8 (for byte masks).
        - Dtype width in bits was preferred over bytes.
        - Endianness isn't too useful now, but is included in case we need to
          support non-native endianness in the future.
        - Went with Apache Arrow format strings over NumPy format strings
          because they're more complete from a dataframe perspective.
        - Format strings are mostly useful for datetime specification, and
          for categoricals.
        - For categoricals, the format string describes the type of the
          categorical in the data buffer. In case of a separate encoding of
          the categorical (e.g. an integer-to-string mapping), this can be
          derived from ``self.describe_categorical``.
        - Data types not included: complex, Arrow-style null, binary, decimal,
          and nested (list, struct, map, union) dtypes.
        """
        pass

Why do we need both …?

Format strings are quite annoying to parse, so for anything besides datetime I think library authors would have a preference for using …

Should the answer be for us to expand …?

True - however, producing a format string is much easier than parsing one; I think it's a minor annoyance. I don't see how we'd cover all the datetime formats, including time zones, without a format string.

Maybe use 16 …

Do we care about supporting things like Decimal or Fixed Size Lists in the future, where we need to specify some semi-arbitrary attributes alongside the dtype? If we do, then we should make sure that whatever specification we decide on here can eventually support those gracefully.

I think we should at least keep the possibility open to support more data types like decimal and nested types (which probably means we need format strings?).
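To make the tuple format concrete, here are a few dtype descriptions a producer might plausibly return. The kind codes and mask rule come from the docstring above; the Arrow C Data Interface format strings (`"l"` = int64, `"g"` = float64, `"b"` = boolean) are standard Arrow, but pairing them this way is an illustrative reading of the spec, not part of the PR:

```python
# (kind, bit-width, format string, endianness)
int64_dtype     = (0, 64, "l", "=")   # kind 0: signed integer, 64 bits
float64_dtype   = (2, 64, "g", "=")   # kind 2: IEEE floating point, 64 bits
bit_mask_dtype  = (20, 1, "b", "=")   # kind 20: boolean, width 1 -> bit mask
byte_mask_dtype = (20, 8, "b", "=")   # boolean, width 8 -> byte mask
                                      # (format string choice here is assumed)
```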
    @property
    def describe_categorical(self) -> Dict[str, Any]:
        """
        If the dtype is categorical, there are two options:

        - There are only values in the data buffer.
        - There is a separate dictionary-style encoding for categorical
          values.

        Raises RuntimeError if the dtype is not categorical.

        Content of the returned dict:

        - "is_ordered" : bool, whether the ordering of dictionary indices is
          semantically meaningful.
        - "is_dictionary" : bool, whether a dictionary-style mapping of
          categorical values to other objects exists.
        - "mapping" : dict, Python-level only (e.g. ``{int: str}``).
          None if not a dictionary-style categorical.

        TBD: are there any other in-memory representations that are needed?
        """
        pass

Is this the actual dictionary mapping? I.e. if I had a column of …

Yes indeed.

I'm a HUGE -1 on this. We'd want our dictionary to just be another column which is on the device, for example.

How do you put a dictionary in a single column? It'd be a dataframe of two columns, …

Take a categorical column of … In the case of cuDF, all of these buffers are on the GPU, and getting a host mapping of …

A note on the above example: …
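For illustration (hypothetical values, not from the PR): a producer storing integer codes in the data buffer, with a Python-level code-to-category mapping, might return:

```python
{
    "is_ordered": False,    # ordering of the codes carries no meaning
    "is_dictionary": True,  # a separate dictionary-style encoding exists
    "mapping": {0: "apple", 1: "banana", 2: "cherry"},  # code -> category
}
```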
    @property
    def describe_null(self) -> Tuple[int, Any]:
        """
        Return the missing value (or "null") representation the column dtype
        uses, as a tuple ``(kind, value)``.

        Kind:

        - 0 : non-nullable
        - 1 : NaN/NaT
        - 2 : sentinel value
        - 3 : bit mask
        - 4 : byte mask

        Value : if kind is "sentinel value", the actual value. None otherwise.
        """
        pass
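Read directly off the spec above, a consumer might encounter tuples like these (the sentinel value itself is hypothetical):

```python
non_nullable   = (0, None)  # column cannot contain missing values
nan_as_null    = (1, None)  # NaN/NaT marks missing entries
sentinel_null  = (2, -1)    # a sentinel value, here -1, encodes "missing"
bit_mask_null  = (3, None)  # validity tracked in a separate bit-mask buffer
byte_mask_null = (4, None)  # validity tracked in a separate byte-mask buffer
```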
    @property
    def null_count(self) -> Optional[int]:
        """
        Number of null elements, if known.

        Note: Arrow uses -1 to indicate "unknown"; None seems cleaner here.
        """
        pass

    def num_chunks(self) -> int:
        """
        Return the number of chunks the column consists of.
        """
        pass

    def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable[Column]:
        """
        Return an iterator yielding the chunks.

        See ``DataFrame.get_chunks`` for details on ``n_chunks``.
        """
        pass

    def get_data_buffer(self) -> Buffer:
        """
        Return the buffer containing the data.
        """
        pass

    def get_mask(self) -> Buffer:
        """
        Return the buffer containing the mask values indicating missing data.

        Raises RuntimeError if the null representation is not a bit or byte
        mask.
        """
        pass

    # NOTE: not needed unless one considers nested dtypes
    # def get_children(self) -> Iterable[Column]:
    #     """
    #     Children columns underneath the column; each object in this
    #     iterator must adhere to the column specification.
    #     """
    #     pass
How are we handling strings without this?

I may be missing some context for this question. NumPy-like fixed-length strings would be handled as a single buffer; for variable-length strings it's unclear to me how you would handle them even with …

For variable-length strings, the way Arrow handles them is by having a Column of bytes (int8 / uint8, for example) and then a Column of integer offsets into the bytes. So for example, say you had a column of strings of: …

This is also how we handle strings in cuDF, and I suspect that generally people will want variable-length strings as opposed to fixed-length strings.

> […]

It has multiple ways of doing this? There's exactly zero specification of this in https://arrow.apache.org/docs/format/CDataInterface.html, and I can't find a separate description either. All I see is that it uses format strings. Would you expect all libraries to handle multiple flavors of variable-length strings? So if the type specifiers say "utf-8 string", then there's some self-describing format for how that's laid out, with multiple child buffers, each of which can have a different dtype?

:( It's like they want people to not understand them. The writing is so terse - there's zero connection to strings in the docs there. Thanks for the explanation, I understand now. Okay, that indeed requires a change from what I have in this PR currently. There are three options: …

I'd be in favor of option 3 😄

You are welcome to open an issue about that ;) Also in favor of option 3 (I would personally just follow Arrow's layout of multiple buffers + children arrays).

I know :) Maybe more thinking about a whole docs section with examples, mapping dataframe concepts to Arrow concepts.

A worry may be the amount of complexity that a consumer is expected to parse. Arbitrarily many buffers and nested child arrays is hugely complex; right now there seems to be only the need for: … So it might make sense to do that the same way as Arrow, but restrict it to that - everything else is either a producer error or an unsupported dtype?

If we take the 3-buffer approach for string columns, we actually don't need a child level for strings, as they would be represented directly by the 3 buffers. If we don't take the 3-buffer approach for string columns and instead go with the children approach, then we'd need 2 child levels to handle a categorical column whose categories are a string column.


class DataFrame:
    """
    A data frame class, with only the methods required by the interchange
    protocol defined.

    A "data frame" represents an ordered collection of named columns.
    A column's "name" must be a unique string.
    Columns may be accessed by name or by position.

    This could be a public data frame class, or an object with the methods
    and attributes defined on this DataFrame class could be returned from the
    ``__dataframe__`` method of a public data frame class in a library
    adhering to the dataframe interchange protocol specification.
    """

    def __dataframe__(self, nan_as_null: bool = False) -> dict:
        """
        Produce a dictionary object following the dataframe protocol spec.
        """
        self._nan_as_null = nan_as_null
        return {
            "dataframe": self,  # DataFrame object adhering to the protocol
            "version": 0,       # Version number of the protocol
        }
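A sketch of the consumer side, to show how this entry point is meant to be used. `from_dataframe` is a hypothetical consumer helper, not part of this PR; `df` is any object from a protocol-supporting library:

```python
def from_dataframe(df):
    """Build a native dataframe from any protocol-supporting object (sketch)."""
    d = df.__dataframe__()  # dict with "dataframe" and "version" keys
    if d["version"] != 0:
        raise ValueError("unsupported protocol version")
    source = d["dataframe"]
    columns = {}
    for name in source.column_names():
        col = source.get_column_by_name(name)
        # A real consumer would dispatch on col.dtype and col.describe_null
        # here, then wrap col.get_data_buffer() (and col.get_mask(), if any).
        columns[name] = col
    return columns
```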
    def num_columns(self) -> int:
        """
        Return the number of columns in the DataFrame.
        """
        pass

    def num_rows(self) -> Optional[int]:
        # TODO: not happy with Optional, but need to flag that it may be
        #       expensive. Why include it if it may be None - what do we
        #       expect consumers to do here?
        """
        Return the number of rows in the DataFrame, if available.
        """
        pass

    def num_chunks(self) -> int:
        """
        Return the number of chunks the DataFrame consists of.
        """
        pass

    def column_names(self) -> Iterable[str]:
        """
        Return an iterator yielding the column names.
        """
        pass

    def get_column(self, i: int) -> Column:
        """
        Return the column at the indicated position.
        """
        pass

    def get_column_by_name(self, name: str) -> Column:
        """
        Return the column whose name is the indicated name.
        """
        pass

    def get_columns(self) -> Iterable[Column]:
        """
        Return an iterator yielding the columns.
        """
        pass

    def select_columns(self, indices: Sequence[int]) -> DataFrame:
        """
        Create a new DataFrame by selecting a subset of columns by index.
        """
        pass

    def select_columns_by_name(self, names: Sequence[str]) -> DataFrame:
        """
        Create a new DataFrame by selecting a subset of columns by name.
        """
        pass

    def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable[DataFrame]:
        """
        Return an iterator yielding the chunks.

        By default (None), yields the chunks that the data is stored as by
        the producer. If given, ``n_chunks`` must be a multiple of
        ``self.num_chunks()``, meaning the producer must subdivide each chunk
        before yielding it.
        """
        pass
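For example, a consumer wanting more parallelism than the stored chunking provides could request a finer subdivision, respecting the multiple-of-``num_chunks()`` rule above (`df_obj` and `process` are placeholders):

```python
# Suppose the producer stores the data as 2 chunks; n_chunks=4 is a valid
# multiple, so each stored chunk gets subdivided once.
for chunk in df_obj.get_chunks(n_chunks=4):
    process(chunk)  # each chunk is itself a DataFrame with the same columns
```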
    @property
    def device(self) -> int:
        """
        Device type the dataframe resides on.

        Uses device type codes matching DLPack:

        - 1 : CPU
        - 2 : CUDA
        - 3 : CPU pinned
        - 4 : OpenCL
        - 7 : Vulkan
        - 8 : Metal
        - 9 : Verilog
        - 10 : ROCm
        """
        pass
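A consumer that only understands host memory might guard on this property, e.g. (a sketch using the DLPack codes listed above):

```python
CPU = 1  # DLPack device type code from the table above

def check_device(df_obj):
    # Refuse device-resident data rather than misinterpreting pointers.
    if df_obj.device != CPU:
        raise NotImplementedError("only CPU-resident dataframes are supported")
```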
If this is an interchange protocol, I do not see value in having a column abstraction. If we want a way to treat a dataframe as an array, we should make that explicit rather than go through a column interface. The Modin implementation does not support a "column" abstraction - rather, a dataframe with 1 column.

Thanks Devin, this is a good discussion to have. As the note 7 lines below says, this is not user-facing API. It seems a useful abstraction for getting down, in the end, to a description of memory. There are multiple reasonable ways to model a dataframe though. What this design does is use 5 concepts in 3 classes:

1. A `Buffer` class. A "buffer" is a contiguous block of memory - this is the only thing that actually maps to a 1-D array in the sense that it could be converted to NumPy, CuPy, et al.
2. A `Column` class. A "column" has a name and a single dtype. It can consist of multiple "chunks". A single chunk of a column (which may be the whole column if `num_chunks == 1`) is modeled as again a `Column` instance, and contains 1 data buffer and (optionally) one "mask" for missing data.
3. A `DataFrame` class. A "data frame" is an ordered collection of columns. It has a single device, and all its rows are the same length. It can consist of multiple "chunks". A single chunk of a data frame is modeled as again a `DataFrame` instance.

These, I think, are all the actual abstractions we use when talking about data frames. It's fairly explicit and easy to understand. An alternative approach would be to do something like what Arrow did - its "buffer" is the same as in this design, but it doesn't use any of the other concepts: its "array" (which is a little like `Column` here, and not at all like the "array" we use in the array API standard) and "children" are more ways of combining buffers than concepts that directly map to data frame concepts. Its approach seems to be to use the minimal number of concepts needed to describe the memory layout of a data frame, and leave it up to its users to map that onto their actual data frame concepts.

Taking into account our previous discussions of this topic, as well as the early prototypes sketched out by @wesm and @kkraus14, I believe we did want something that looked more like a minimal dataframe API with explicitly named methods aimed at library developers. Given that, I think this is at least in the right direction. The main question regarding these concepts that I wasn't sure about is: should a column be allowed to have multiple buffers (in a single chunk), yes or no?

Yes, I understood this. A better wording of my statement would have been "I do not see value in having a column abstraction in an interchange protocol." I guess I do not understand the rationales here and how they map to the interchange protocol. You have given examples, and make a point about applications and users, along with the array API - this does not seem to be about the interchange protocol. Maybe I misunderstood the purpose of this PR based on your comment: it seems you intend to extend this interchange protocol into the developer API we have been talking about. Is that correct? I may have missed a week or two scrambling toward paper/talk deadlines 😄. Perhaps I would have pushed back (or at least forced a more directed discussion) on some things that were decided if I had been there.

Now is a good time to push :) Nothing is set in stone. So what would you do with the methods that are now on `Column` but not on `DataFrame`? Attach them to `DataFrame` and make them raise an exception unless it's a single-column dataframe?

Good point - those rationales were a summary of opinions on wesm/dataframe-protocol#1; I didn't put enough thought into whether they did or did not apply here.

In our last meeting we discussed that. There's no such intention to extend currently, I'd say, although it's a possibility. The mood was more "let's get this interchange protocol done, and then see based on experience and adoption what the right approach is for a more complete API". Personally, I think a developer API may have some similarities, but it will not be a superset of this interchange protocol. For example, I'm not convinced it should have a column abstraction, and it certainly should not have memory-oriented APIs like `get_data_buffer` or `get_mask`.

Though I am one of the guys who push for getting things out, I think we need to be careful when adding new abstractions. I am for a minimally viable protocol in the first version, and I second that adding a Column, and even a Buffer, seems to follow the path of current affairs without critically challenging the need in the first version. This is especially dangerous, as it has a tendency to become another version that tries to please everyone, especially Pandas' accustomed user base. IMHO we need to keep the balance of usefulness and adoption.

Maybe we can call it `Array` instead of `Column`; then we don't have a Column abstraction any more ;-)

GitHub does not have the range of emojis needed to accurately reflect my initial reaction to that :)