Add a summary document for the dataframe interchange protocol #30
One key thing is that the Arrow C Data interface relies on providing a deletion / finalization method, similar to DLPack. That hasn't been discussed much yet, but it is something we should iron out for this proposal.
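For reference, the pattern in question: the producer attaches a release callback that the consumer must call exactly once when done with the memory. A rough Python-level sketch of that shape (the class and names here are illustrative, loosely mirroring the C-level `release` member of the Arrow C Data interface, not proposed API):

```python
class ExchangeArray:
    """Toy analogue of the Arrow C Data interface's ArrowArray struct:
    the producer supplies the memory *and* the function that frees it."""

    def __init__(self, buffers, release):
        self.buffers = buffers      # producer-owned memory
        self._release = release     # producer-supplied deleter

    def release(self):
        # The consumer calls this exactly once when it is done; after
        # that, the buffers must not be touched.
        if self._release is not None:
            self._release(self.buffers)
            self.buffers = None
            self._release = None
```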
Hmm, that's a good one. Since we're discussing a Python-level API, a deletion / finalization method seems a bit "foreign" / C-specific. But I agree that it's important. I have to admit I haven't fully figured out what the observable behaviour differences between a deleter method and refcounting are for a Python user - I should write a set of tests for that.
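One such test might look like the toy sketch below (everything here is hypothetical scaffolding, not proposed API): with an explicit deleter the producer can invalidate memory at a well-defined point even while consumers still hold references, whereas with refcounting the data simply lives as long as any reference does.

```python
class ReleasableBuffer:
    """Toy producer buffer with an explicit deleter (hypothetical)."""
    def __init__(self, data):
        self.data = bytearray(data)
        self.released = False

    def release(self):
        self.data = None        # producer invalidates the memory right now
        self.released = True

def test_observable_difference():
    # Deleter: invalidation happens at a well-defined point in time,
    # even though a consumer still holds a reference to the buffer object.
    buf = ReleasableBuffer(b"data")
    consumer_ref = buf
    buf.release()
    assert consumer_ref.data is None    # use-after-release is detectable

    # Refcounting: there is no such point; the data stays alive for as
    # long as *any* reference exists, and release is implicit.
    producer = bytearray(b"data")
    consumer = producer
    del producer
    assert consumer == b"data"          # still valid via the consumer ref
```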
Another question here: is "Python-level" a design requirement? (If so, it should probably be added to the list above.)
For example, DLPack, which is being considered as the exchange protocol for the array standard, is a C-level interface?
(To be clear, I am not very familiar with these aspects. It might also be that you can have a Python interface to a C-level exchange format?)
Added a TODO for desired semantics (deletion/finalization vs. buffer protocol type behaviour) here.
I'd say yes, added to item 1 of the list of design requirements. A C-only interface would probably be asking too much from consumers here, and maximal performance doesn't seem too important compared to having this functionality available in the first place.
After looking into the semantics some more, I came to the conclusion that the Arrow spec doesn't define any particular semantics that matter to Python users. The release callback semantics matter a lot to library implementers (both producer and consumer), but in the end the spec doesn't say anything about whether memory is shared or not. It allows for zero-copy but doesn't mandate it - so the copy/view + mutation ambiguity is similar to the one we had the large discussion about for arrays.
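To make that ambiguity concrete, a small numpy analogy (not protocol code): a consumer that receives an array cannot tell from the object alone whether mutating it affects the producer's data.

```python
import numpy as np

producer = np.arange(4)

view = producer[:]        # zero-copy: shares memory with the producer
copy = producer.copy()    # independent memory

view[0] = 99              # leaks back into the producer's data
copy[1] = -1              # does not
print(producer)           # [99  1  2  3]
```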
This ignores an important aspect brought up in the discussion, I think. One of the arguments for having such dedicated methods is that you might need a numpy array in a specific memory layout (e.g. because your Cython algorithm requires it). Numpy's type support is less rich than that found in dataframe libraries (e.g. no categorical, string, or decimal dtypes), and numpy doesn't support missing values. So as the consumer you might want control over how the conversion is done.
For example, in pandas the `to_numpy` method has an `na_value` keyword to control what value (compatible with the numpy dtype) is used for missing values. This of course doesn't necessarily require a `to_numpy` method, as we might be able to give this kind of control in other ways. But I think having this kind of control is an important use case (at least for compatibility with numpy-based (or general array-based) libraries).
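Concretely, with a pandas nullable column today (existing pandas API, shown here just for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")    # nullable integer column

# The consumer decides how missing values map into the numpy result:
s.to_numpy(dtype="float64", na_value=np.nan)  # array([ 1.,  2., nan])
s.to_numpy(dtype="int64", na_value=-1)        # array([ 1,  2, -1])
```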
That's a good point. At the moment `na_value` is the one thing I see in Pandas. Is there anything else that other dataframe libraries have, or that you know may be needed in the future?
Also the actual `dtype`, I think. For example, if you have a categorical column, do you want to get a "densified" array (e.g. a string array if the categories are strings) or the integer indices? If you have a string column, do you want a numpy str dtype or an object dtype array? If you have a decimal column, do you want a numpy float array or an object array of decimal objects? Etc. Basically, any data type that has no direct mapping to a numpy dtype potentially has multiple options for how to convert it to numpy.
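For the categorical case, both conversions are already expressible in pandas today, which shows why a single default isn't obvious:

```python
import pandas as pd

cat = pd.Series(["a", "b", "a"], dtype="category")

cat.to_numpy()             # densified: array(['a', 'b', 'a'], dtype=object)
cat.cat.codes.to_numpy()   # integer indices: array([0, 1, 0], dtype=int8)
```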
Could we avoid this discussion by solving this in the future? e.g.
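(A purely hypothetical sketch of that idea - none of these names are real API: array libraries consume the interchange column directly, so the conversion choices live on the consumer side instead of in per-library `to_numpy` methods.)

```python
import numpy as np

class ToyColumn:
    """Hypothetical stand-in for a column obtained via __dataframe__()."""
    def __init__(self, values, mask):
        self.values = np.asarray(values)
        self.mask = np.asarray(mask)    # True where a value is missing

def from_dataframe_column(col, *, dtype="float64", na_value=np.nan):
    # Hypothetical consumer-side conversion: the consumer, not the
    # producer, chooses the target dtype and the missing-value fill.
    out = col.values.astype(dtype)
    out[col.mask] = na_value
    return out

col = ToyColumn([1, 2, 0], [False, False, True])
print(from_dataframe_column(col))                              # [ 1.  2. nan]
print(from_dataframe_column(col, dtype="int64", na_value=-1))  # [ 1  2 -1]
```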
This suggestion makes sense I think @maartenbreddels. Longer-term this does seem like the cleaner solution.
This is actually a bit of a side discussion, triggered by the presence of `to_numpy` and `to_arrow` in wesm/dataframe-protocol#1. In that prototype it's actually not connected to `__dataframe__`. And converting one dataframe into another kind of dataframe is a different goal/conversation than "convert a column into an array". We do not have a design requirement for the latter. So I'd say we clarify that in the doc so that this FAQ item has some more context, and then just leave it out.
But often, converting one dataframe into another dataframe will go through converting each column of dataframe 1 to an array, and assembling dataframe 2 from those arrays?
At least, that's how I imagine dataframe exchange working based on the protocol we are discussing (if you want to avoid specific knowledge about the type of dataframe 1) - something like the sketch below.
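A minimal sketch of that flow, assuming hypothetical accessor names on the interchange object (none of this API is settled):

```python
def exchange(src_df, column_to_array, assemble):
    """Hypothetical: build dataframe 2 from dataframe 1 without knowing
    dataframe 1's type, going column-by-column through the protocol.
    `column_to_array` and `assemble` belong to the *consuming* library;
    `get_column_by_name` / `column_names` are assumed accessors."""
    xdf = src_df.__dataframe__()
    arrays = {name: column_to_array(xdf.get_column_by_name(name))
              for name in xdf.column_names()}
    return assemble(arrays)
```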
Yes, that's fair enough - that's completely implicit right now and should probably be made more explicit; that will both make this section make more sense and help when implementing. I'll retract my "this is a different goal" comment and will try to fix this up.