Add variable-length string support #47

Merged
merged 4 commits into from
Aug 23, 2021

Conversation

kgryte
Contributor

@kgryte kgryte commented Jul 19, 2021

This PR supersedes gh-45 in order to ensure a cleaner merge and, similar to gh-45,

  • adds variable-length string support to the dataframe interchange protocol.
  • modifies the protocol to return a dictionary containing all possible data buffers, rather than separate methods for retrieving data, validity, and offsets buffers, as discussed in consortium meetings.
  • modifies the protocol to return both the buffers and their associated dtypes. This is necessary in order to interpret, e.g., the offsets and mask buffers.
  • updates source code documentation (punctuation, typos, and spelling).
  • manually copies string data to and from byte arrays, as it assumes that strings are stored as object dtype. The implementation will need to be updated to accommodate pandas' string extension dtype, which is based on Arrow. At the time of this PR, the string extension dtype was considered experimental and subject to change, and object dtype is still the default string dtype for backward compatibility.
  • uses a byte array mask for indicating missing values in string data buffers. This can be updated to use a bit array for space efficiency, at the cost of additional code complexity.
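
To make the buffer layout described above concrete, here is a minimal, standard-library-only sketch of how variable-length strings might be packed into data, offsets, and validity buffers and returned as a dictionary of (buffer, dtype) pairs. It is not the code in this PR: the function names, the dictionary keys, the plain-string dtype labels, and the mask convention (1 = valid, 0 = missing) are all assumptions made for illustration, and a real implementation would presumably work with NumPy arrays rather than `bytes` and `array`.

```python
from array import array


def encode_strings(values):
    """Pack a list of str-or-None into data, offsets, and validity buffers."""
    data = bytearray()         # concatenated UTF-8 bytes of all non-missing strings
    offsets = array("q", [0])  # 64-bit offsets into `data`; len(values) + 1 entries
    validity = bytearray()     # byte mask: 1 = valid, 0 = missing (assumed convention)
    for value in values:
        if value is None:
            validity.append(0)  # missing values contribute no bytes to `data`
        else:
            validity.append(1)
            data.extend(value.encode("utf-8"))
        offsets.append(len(data))
    # Each buffer is paired with a dtype label so a consumer knows how to read it.
    # The keys and string labels here are hypothetical placeholders.
    return {
        "data": (bytes(data), "uint8"),
        "offsets": (offsets, "int64"),
        "validity": (bytes(validity), "uint8"),
    }


def decode_strings(buffers):
    """Rebuild the list of str-or-None from the three buffers."""
    data, _ = buffers["data"]
    offsets, _ = buffers["offsets"]
    validity, _ = buffers["validity"]
    out = []
    for i, valid in enumerate(validity):
        if not valid:
            out.append(None)
        else:
            out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
    return out


def pack_bitmask(validity):
    """Pack a byte mask into a bit mask, least-significant bit first."""
    bits = bytearray((len(validity) + 7) // 8)
    for i, valid in enumerate(validity):
        if valid:
            bits[i // 8] |= 1 << (i % 8)
    return bytes(bits)


buffers = encode_strings(["a", None, "héllo", ""])
roundtrip = decode_strings(buffers)
bitmask = pack_bitmask(buffers["validity"][0])
```

The `pack_bitmask` helper illustrates the trade-off mentioned in the last bullet: a bit mask needs one eighth of the space of a byte mask, but reading and writing it requires the extra index arithmetic shown.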

kgryte added 3 commits July 19, 2021 11:19
This is a fresh port of changes made in order to support variable
length strings in order to provide a cleaner merge.
Member

@rgommers rgommers left a comment

This LGTM, I played with it a little and it all seems to work as expected. The main review comments are on gh-45, and those were all addressed. It's time (well, overdue) to merge this. Thanks @kgryte and all reviewers!

@rgommers rgommers merged commit 916f1af into main Aug 23, 2021
@rgommers rgommers deleted the variable-length-string-support-2 branch August 23, 2021 15:59

2 participants