Skip to content

Add variable-length string support #45

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

kgryte
Copy link
Contributor

@kgryte kgryte commented Jun 24, 2021

This PR

  • adds variable-length string support to the dataframe interchange protocol.
  • modifies the protocol to return a dictionary containing all possible data buffers, rather than separate methods for retrieving data, validity, and offsets buffers, as discussed in consortium meetings.
  • modifies the protocol to return both the buffers and their associated dtypes. This is necessary in order to interpret, e.g., the offsets and mask buffers.
  • updates source code documentation (punctuation, typos, and spelling).
  • manually copies string data to and from byte arrays, as it assumes that strings are stored as object dtype. The implementation will need to be updated to accommodate pandas' string extension dtype which is based on arrow. Currently, the string extension dtype is considered experimental and subject to change. The use of object dtype is still used as the default string dtype for backward compatiblity.
  • uses a byte array mask for indicating missing values in string data buffers. This can be updated to use a bit array for space efficiency, at the cost of additional code complexity.
  • currently does not handle the scenario where a non-pandas dataframe provided to __dataframe__ uses a bit array to indicate missing values.

@kgryte kgryte changed the title WIP: Add variable-length string support Add variable-length string support Jun 25, 2021
@kgryte kgryte requested a review from rgommers June 25, 2021 00:09
rgommers added 3 commits June 25, 2021 21:45
* Add a summary document for the dataframe interchange protocol

Summarizes the various discussions about and goals/non-goals
and requirements for the `__dataframe__` data interchange
protocol.

The intended audience for this document is Consortium members
and dataframe library maintainers who may want to support this
protocol.

The aim is to keep updating this till we have captured all
the requirements and answered all the FAQs, so we can actually
design the protocol after and verify it meets all our requirements.

Closes gh-29

* Process some review comments

* Process a few more review comments.

* Link to Release callback semantics in Arrow C Data Interface docs

* Add design requirements for column selection and df metadata

* Edit the nested/heterogeneous dtypes non-requirement

* Add requirements for chunking and memory layout description

Also address some smaller review comments.

* Add TBD notes on dataframe-array connection and from_dataframe

Also add more details on the Arrow C Data Interface.

* Address review comments

* Add details on implementation options

* Add details about the C implementation

* Add an image of the dataframe model and its memory layout.

* Add link to discussion on array-dataframe connection

* Some more updates for review comments

* Update table to indicate Arrow does support categoricals.

* Add section on dtype format strings

* Reflow some lines

* Add a requirement on semantic meaning of NaN/NaT, and timezone detail

* Textual tweak: say columns in a data frame are ordered

* Update requirements document for recent decisions/insights
Add a prototype of the dataframe interchange protocol
@rgommers rgommers added enhancement New feature or request interchange-protocol labels Jun 25, 2021
@rgommers
Copy link
Member

Thanks @kgryte! It may be useful to close this PR and resend it against main, now that the other PRs it needed are merged.

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good, thanks @kgryte! I haven't tested it yet, but just reading through the code I have only a couple of comments.

@kgryte kgryte changed the base branch from df-interchange-protocol-staging to main June 28, 2021 17:51
@kgryte kgryte changed the base branch from main to df-interchange-protocol-staging June 28, 2021 17:51
@kgryte
Copy link
Contributor Author

kgryte commented Jun 28, 2021

@rgommers Will close this and submit against main in short order.

@jorisvandenbossche
Copy link
Member

Will close this and submit against main in short order.

You can actually change the target branch by clicking the "Edit" button next to title, so then you don't need to close / open a new PR

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the separate get_data_buffer and get_offsets methods might be a bit problematic. Typically you only want to do one pass over the data and create all buffers at once.

I know this is only a dummy implementation, and presumably the created buffers could be cached on the object so the separate methods don't calculate it twice. But might be useful to think about separate methods vs a single get_buffers() (with a specified order)

Co-authored-by: Joris Van den Bossche <[email protected]>
@kgryte
Copy link
Contributor Author

kgryte commented Jul 8, 2021

You can actually change the target branch by clicking the "Edit" button next to title, so then you don't need to close / open a new PR

I initially tried doing that, but doing so added many unrelated changes and muddied this PR. :(

@jorisvandenbossche
Copy link
Member

I initially tried doing that, but doing so added many unrelated changes and muddied this PR. :(

I think that either merging latest master in this branch or rebasing this branch on top of master should solve that

kgryte added a commit that referenced this pull request Jul 19, 2021
This is a fresh port of changes made in order to support variable
length strings in order to provide a cleaner merge.
@kgryte
Copy link
Contributor Author

kgryte commented Jul 19, 2021

Closing this PR out in favor of gh-47.

@kgryte kgryte closed this Jul 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants