Add variable-length string support #45
* Add a summary document for the dataframe interchange protocol
  Summarizes the various discussions about and goals/non-goals and requirements for the `__dataframe__` data interchange protocol. The intended audience for this document is Consortium members and dataframe library maintainers who may want to support this protocol. The aim is to keep updating this till we have captured all the requirements and answered all the FAQs, so we can actually design the protocol after and verify it meets all our requirements. Closes gh-29
* Process some review comments
* Process a few more review comments.
* Link to Release callback semantics in Arrow C Data Interface docs
* Add design requirements for column selection and df metadata
* Edit the nested/heterogeneous dtypes non-requirement
* Add requirements for chunking and memory layout description
  Also address some smaller review comments.
* Add TBD notes on dataframe-array connection and from_dataframe
  Also add more details on the Arrow C Data Interface.
* Address review comments
* Add details on implementation options
* Add details about the C implementation
* Add an image of the dataframe model and its memory layout.
* Add link to discussion on array-dataframe connection
* Some more updates for review comments
* Update table to indicate Arrow does support categoricals.
* Add section on dtype format strings
* Reflow some lines
* Add a requirement on semantic meaning of NaN/NaT, and timezone detail
* Textual tweak: say columns in a data frame are ordered
* Update requirements document for recent decisions/insights
Add a prototype of the dataframe interchange protocol
Thanks @kgryte! It may be useful to close this PR and resend it against |
This looks very good, thanks @kgryte! I haven't tested it yet, but just reading through the code I have only a couple of comments.
…o variable-length-string-support
@rgommers Will close this and submit against |
You can actually change the target branch by clicking the "Edit" button next to the title, so you don't need to close this PR and open a new one.
I think the separate `get_data_buffer` and `get_offsets` methods might be a bit problematic. Typically you only want to do one pass over the data and create all buffers at once.
I know this is only a dummy implementation, and presumably the created buffers could be cached on the object so the separate methods don't calculate it twice. But it might be useful to think about separate methods vs a single `get_buffers()` (with a specified order).
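To make the single-pass idea concrete, here is a minimal sketch of what a combined `get_buffers()` could look like. It is not the code in this PR: the function signature, the dictionary keys, and the use of plain Python `bytes` and lists instead of real buffer objects are all assumptions made for illustration.

```python
def get_buffers(strings):
    """Hypothetical single-pass builder for a variable-length string column.

    Returns the UTF-8 data buffer, the offsets, and a per-element validity
    list in one traversal, instead of one traversal per buffer.
    """
    chunks = []
    offsets = [0]          # offsets[i]..offsets[i + 1] delimits element i
    validity = []
    position = 0
    for value in strings:
        if isinstance(value, str):
            encoded = value.encode("utf-8")
            chunks.append(encoded)
            position += len(encoded)
            validity.append(True)
        else:
            # Missing value: contributes no bytes, so the offset repeats.
            validity.append(False)
        offsets.append(position)
    return {"data": b"".join(chunks), "offsets": offsets, "validity": validity}


buffers = get_buffers(["a", None, "héllo"])
print(buffers["offsets"])  # [0, 1, 1, 7] ("é" takes two bytes in UTF-8)
```

Returning everything from one call also gives a natural place to fix the buffer order, which is the point raised above.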
Co-authored-by: Joris Van den Bossche <[email protected]>
I initially tried doing that, but doing so added many unrelated changes and muddied this PR. :(
I think that either merging the latest master into this branch or rebasing this branch on top of master should solve that.
This is a fresh port of the changes made to support variable-length strings, done in order to provide a cleaner merge.
Closing this PR out in favor of gh-47.
This PR
`offsets` and `mask` buffers.
`object` dtype. The implementation will need to be updated to accommodate pandas' string extension dtype, which is based on Arrow. Currently, the string extension dtype is considered experimental and subject to change. The `object` dtype is still the default string dtype for backward compatibility.
`__dataframe__` uses a bit array to indicate missing values.
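For readers unfamiliar with bit arrays as missing-value masks, the sketch below packs one bit per element into bytes. The convention used here (a set bit means the value is present, bits numbered from the least-significant bit of each byte) follows the Arrow validity bitmap; whether the prototype in this PR uses the same polarity and bit order is an assumption, and the helper names `pack_validity` and `is_valid` are hypothetical.

```python
def pack_validity(values):
    """Pack one bit per value: 1 = present, 0 = missing (assumed convention)."""
    bitmap = bytearray((len(values) + 7) // 8)   # round up to whole bytes
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)       # least-significant bit first
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Read back the bit for element i."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


bitmap = pack_validity(["a", None, "b"])
print(bitmap)               # b'\x05' (bits 0 and 2 set)
print(is_valid(bitmap, 1))  # False
```

Compared with a byte-per-element mask, this uses one eighth of the memory, at the cost of a shift and a mask on every lookup.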