Requirements document for the dataframe interchange protocol #35
I'm -1 on building Python-only pieces on top of the C data interface. Ideally, if we're adopting a C data interface, we should work on extending that interface in the ways we need. If we can't extend such a C data interface, maybe we shouldn't be building on top of it to begin with.
There's a lot of good discussion going on related to DLPack and extending it to support asynchronous device execution.
Rather than building on top of the C spec, we could adopt the dtype format specification, which seems complete for our needs and well-designed.
Either way, I'm not really clear on why that interface only talks about arrays and not dataframes; I found that surprising given Arrow's focus. It seemed to me like chunking was thought about and was a main reason for separating `ArrowArray` and `ArrowSchema`, but it's all very implicit. The idea seems to be that both multiple chunks and multiple columns are trivial to implement on top, so there was no need to write C structures for them. Or maybe I'm missing some history here and there are other reasons. @jorisvandenbossche, do you know what the story is?
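For reference, these are the two structs in question, as defined by the Arrow C Data Interface (https://arrow.apache.org/docs/format/CDataInterface.html):

```c
#include <stdint.h>

/* ArrowSchema describes the logical type (as a format string, plus
 * children for nested types); ArrowArray holds the buffers of one
 * contiguous chunk of data. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};
```

Note there is indeed no third struct for a chunked array or a dataframe; the implicit idea is that a single `ArrowSchema` can describe any number of `ArrowArray` values.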
Yep, that's starting to look good. Really, what we need is the union of DLPack and the Arrow C Data Interface, plus support for multiple columns and chunking.
Yes, the basic ABI is only an array and schema interface (`ArrowArray` and `ArrowSchema`), but those can be used to cover the more complex use cases (chunked arrays, dataframes (record batches)). Arrow itself (the C++ library and the bindings in Python and R) already supports using the interface to export/import an Arrow Table or RecordBatch. A very short note about this in the docs is at https://arrow.apache.org/docs/format/CDataInterface.html#record-batches
The array and schema structs are indeed separated to allow sending a single schema for multiple arrays (e.g. to support chunked arrays): https://arrow.apache.org/docs/format/CDataInterface.html#why-two-distinct-structures
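A minimal sketch of what that separation buys: one `ArrowSchema` describing an int64 column, shared across two `ArrowArray` chunks. The struct definitions are repeated from the spec quote above, and the no-op release callbacks are illustrative stand-ins for a real producer's memory management:

```c
#include <stdint.h>
#include <stdio.h>

struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Illustrative no-op release callbacks; a real producer frees its
 * buffers here and marks the struct released by nulling the callback. */
static void release_schema(struct ArrowSchema* s) { s->release = NULL; }
static void release_array(struct ArrowArray* a) { a->release = NULL; }

int main(void) {
  /* One schema for the whole chunked column; "l" is the format
   * string for int64. */
  struct ArrowSchema col = {
      .format = "l", .name = "ints", .release = release_schema};

  /* Two chunks of that same column. A primitive array has two buffers:
   * a validity bitmap (may be NULL when null_count is 0) and the data. */
  int64_t data0[] = {1, 2, 3};
  int64_t data1[] = {4, 5};
  const void* buffers0[] = {NULL, data0};
  const void* buffers1[] = {NULL, data1};

  struct ArrowArray chunks[] = {
      {.length = 3, .n_buffers = 2, .buffers = buffers0, .release = release_array},
      {.length = 2, .n_buffers = 2, .buffers = buffers1, .release = release_array},
  };

  for (int i = 0; i < 2; i++) {
    printf("chunk %d of column '%s': %lld rows\n",
           i, col.name, (long long)chunks[i].length);
    chunks[i].release(&chunks[i]);  /* consumer releases when done */
  }
  col.release(&col);
  return 0;
}
```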
I interpreted "A record batch can be trivially considered as an equivalent struct array with additional top-level metadata" to mean it's about nested dtypes in a single column - that's what a "struct array" in NumPy would be. The example at https://arrow.apache.org/docs/format/CDataInterface.html#exporting-a-struct-float32-utf8-array seems to imply that as well.
Yes, but a StructArray is simply a collection of named fields with child arrays (and those child arrays can again be of any type, including nested types), which is basically the same as a "dataframe" (at least in the way we are defining it here for the interchange protocol).
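To make that concrete, here is a hedged sketch of the schema tree for a two-column dataframe (a float32 column and a utf8 column) exported as a single struct array. The format strings are from the spec ("+s" for struct, "f" for float32, "u" for utf8); the column names and the no-op release callback are illustrative:

```c
#include <stdint.h>
#include <stdio.h>

struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

/* Illustrative no-op; per the spec, a real release callback must also
 * release the children. */
static void release_schema(struct ArrowSchema* s) { s->release = NULL; }

int main(void) {
  /* Each dataframe column becomes a named child schema. */
  struct ArrowSchema floats = {
      .format = "f", .name = "floats", .release = release_schema};
  struct ArrowSchema strings = {
      .format = "u", .name = "strings", .release = release_schema};
  struct ArrowSchema* children[] = {&floats, &strings};

  /* The dataframe (record batch) itself: a struct array, format "+s". */
  struct ArrowSchema df = {
      .format = "+s", .name = "", .n_children = 2,
      .children = children, .release = release_schema};

  for (int64_t i = 0; i < df.n_children; i++)
    printf("column '%s' has format '%s'\n",
           df.children[i]->name, df.children[i]->format);

  df.release(&df);
  return 0;
}
```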
Thanks, that makes sense. That does mean nested/structured dtypes have different memory layouts in Arrow and NumPy, which is worth noting explicitly in the item discussing those dtypes.
Personally, I don't think we should support "structured dtypes" (record arrays) in NumPy's sense (which are not columnar); when considering nested data, we should only consider the Arrow-like notion of nested data types.
(But that's maybe a discussion for later, since that item is currently in the list of non-requirements. It might indeed be worth noting, though, that NumPy's structured dtype is something completely different from Arrow's struct type.)
I'm not a huge fan of NumPy's way of doing this, but structured arrays are columnar. It's more that the definition of "dtype" is different: a NumPy dtype is treated as a single fixed-size unit, and a 1-D array is a contiguous memory block in which elements of that dtype are repeated in adjacent memory locations.
In contrast, a 1-D Arrow array can consist of discontiguous blocks of memory because of Arrow's looser definition of "dtype".
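To illustrate the difference in C terms (a hedged sketch; the record type and field names are arbitrary): a NumPy structured dtype interleaves the fields record by record (array-of-structs), while an Arrow struct array keeps one contiguous child buffer per field (struct-of-arrays):

```c
#include <stdint.h>
#include <stdio.h>

/* NumPy-style structured dtype: a 1-D array is one contiguous block of
 * fixed-size records, so fields of neighbouring elements interleave. */
struct Record { float x; int32_t y; };

int main(void) {
  struct Record numpy_like[3];  /* memory: x0 y0 x1 y1 x2 y2 */

  /* Arrow-style struct array: one contiguous buffer per child field;
   * the buffers need not even be adjacent to each other. */
  float xs[3];                  /* memory: x0 x1 x2 */
  int32_t ys[3];                /* memory: y0 y1 y2 */

  printf("numpy-like stride between records: %zu bytes\n",
         sizeof(numpy_like[0]));                    /* 8 */
  printf("arrow-like strides: %zu (x), %zu (y)\n",
         sizeof(xs[0]), sizeof(ys[0]));             /* 4 and 4 */
  return 0;
}
```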
Yep, true (by "not columnar" I meant that the individual "fields" of the nested array are not columnar).