Skip to content

PR: Add allow_copy flag to interchange protocol #44

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

steff456
Copy link
Member

@steff456 steff456 commented Jun 24, 2021

This PR adds a flag to throw an exception if the export cannot be zero-copy. (e.g. for pandas, possible due to block manager where rows are contiguous and columns are not) .

  • Add allow_zero_copy flag to the DataFrame class.
  • Propagate the flag to the buffer and raise a RuntimeError when it is true
  • Fix test_noncontiguous_columns
  • Make update in the requirements doc

@steff456 steff456 requested a review from rgommers June 24, 2021 05:56
@steff456 steff456 self-assigned this Jun 24, 2021
@rgommers
Copy link
Member

To summarize the options for naming this keyword that were suggested yesterday:

  • allow_copy (your current choice)
  • allow_memory_copy
  • no_copy_only

I like the positive allow_* better than the negative no_*. Happy with either or allow_copy or allow_memory_copy.

Regarding the default value (now True):

  • @maartenbreddels preferred the "performance safety" of False
  • @jorisvandenbossche pointed out that currently strings in Pandas (implemented via object dtype) will always require a copy, but that that in the next Pandas release there will be an Arrow-based variable length string option which can be zero-copy.

I believe that there will be enough cases that will always require a copy (object-strings, columns backed by non-contiguous arrays, non-columnar data storage like for Koalas and Ibis) that using True as the default would be best. And then the very performance-sensitive (= usually more advanced) users can set it to False.

@rgommers rgommers changed the title PR: Add zero copy flag PR: Add allow_copy flag to interchange protocol Jun 25, 2021
@rgommers
Copy link
Member

Make update in the requirements doc

Adding the following to the requirements docs:

We'll also list some things that were discussed but are not requirements:
...
4. Support for strided storage in buffers.
   _Rationale: this is supported by a subset of dataframes only, mainly those
   that use NumPy arrays. In many real-world use cases, strided arrays will
   force a copy at some point, so requiring contiguous memory layout (and hence
   an extra copy at the moment `__dataframe__` is used) is considered a good
   trade-off for reduced implementation complexity._

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some more testing and review, and found some issues. Also this PR is not against main, so I'll close it. Resubmitted as gh-51.


``allow_copy`` is a keyword that defines if the given implementation
is going to support striding buffers. It is optional, and the libraries
do not need to implement it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rephrasing this a bit:

        ``allow_copy`` is a keyword that defines whether or not the library is
        allowed to make a copy of the data. This can for example happen if a
        library supports strided buffers (those would need a copy, because this
        protocol specifies contiguous buffers).


``allow_copy`` is a keyword that defines if the given implementation
is going to support striding buffers. It is optional, and the libraries
do not need to implement it. Currently, if the flag is set to ``True`` it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"False and a copy is needed"

# complexity for libraries that don't support it (e.g. Arrow),
# but would help with numpy-based libraries like Pandas.
raise RuntimeError("Design needs fixing - non-contiguous buffer")
if not allow_copy:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This removes the actual check for striding (not x.strides == (x.dtype.itemsize,):), that needs to be put back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic:

        if not x.strides == (x.dtype.itemsize,):
            # The protocol does not support strided buffers, so a copy is
            # necessary. If that's not allowed, we need to raise an exception.
            if allow_copy:
                x = x.copy()
            else:
                raise RuntimeError("Exports cannot be zero-copy in the case "
                                   "of a non-contiguous buffer")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then adding a test case shows there's a problem - we don't hold on to the memory.

@rgommers rgommers closed this Aug 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants