
Prefer construction via DLPack to costly element-by-element copy #120615


Closed
wants to merge 1 commit

Conversation

wjakob
Contributor

@wjakob wjakob commented Feb 26, 2024

Fixes #120614

Copy of pitch:

Consider the following statement

torch_array = torch.tensor(other_array)

where other_array is an array instance constructed by another array programming framework. If this other framework is principally designed around the DLPack array exchange protocol, something very bad happens.

Basically, PyTorch ignores the DLPack interface (other_array.__dlpack__) altogether. Furthermore, if other_array exposes the sequence protocol (__len__ and __getitem__), PyTorch will perform a brute-force element-wise copy, potentially requiring hundreds of millions of separate PCI Express transactions to copy individual floating point values from the GPU. To the user, it will seem that the application has crashed: nothing happens, and attempting to interrupt the Python kernel doesn't work, since all of this runs on the C++ side.
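To illustrate with a minimal sketch (SequenceOnlyArray is a made-up stand-in for such a framework's array type; with CPU-backed data this merely runs slowly rather than hanging):

```python
import torch

class SequenceOnlyArray:
    # Hypothetical array type exposing only the sequence protocol,
    # the way a third-party framework's array might.
    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        return self._data[i]

arr = SequenceOnlyArray(range(5))
# torch.tensor() falls back to iterating the sequence and copying one
# element at a time; if each __getitem__ had to cross PCI Express, this
# would turn into millions of individual device-to-host transfers.
t = torch.tensor(arr)
print(t)  # tensor([0, 1, 2, 3, 4])
```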

Fortunately PyTorch supports the DLPack protocol to efficiently exchange tensors with other libraries. But it only does so when the user creates the arrays in a sort of awkward way:

from torch.utils.dlpack import from_dlpack
torch_array = from_dlpack(other_array)

It's easy to forget to do this, with extremely unpleasant results. It would be very easy for PyTorch to check whether other_array implements the DLPack protocol and then simply switch to this construction method. I will create a separate PR proposing a prototype of this idea.
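In Python terms, the proposed dispatch could look roughly like this (a sketch only; the actual change would live in PyTorch's C++ tensor-creation path, and tensor_preferring_dlpack is a hypothetical name):

```python
import torch
from torch.utils.dlpack import from_dlpack

def tensor_preferring_dlpack(obj, **kwargs):
    # Prefer the zero-copy DLPack exchange when the producer supports it.
    if hasattr(obj, "__dlpack__") and hasattr(obj, "__dlpack_device__"):
        return from_dlpack(obj)
    # Otherwise, fall back to the existing construction path.
    return torch.tensor(obj, **kwargs)
```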


pytorch-bot bot commented Feb 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120615

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 3b48f6a with merge base b381a43:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Feb 26, 2024

CLA Signed

The committers are authorized under a signed CLA.

@vadimkantorov
Contributor

vadimkantorov commented Feb 26, 2024

Does torch.as_tensor also support __dlpack__ or __array_interface__? related: #58036

@wjakob
Contributor Author

wjakob commented Feb 26, 2024

@vadimkantorov __array_interface__ is a CPU-only API, and I would like to keep this PR focused on the DLPack functionality.

torch.asarray has the same flaw and also does not check for DLPack capabilities of the input object. If the approach I took seems fine to you, I can add similar logic there.

@vadimkantorov
Contributor

vadimkantorov commented Feb 26, 2024

I'm not one of the maintainers/reviewers, so we'll have to see what they have to say :) I also stumbled on this similar issue at the time: torch.as_tensor not supporting __array_interface__/__cuda_array_interface__/__dlpack__/buffer protocol/memoryview inputs...

@wjakob
Contributor Author

wjakob commented Feb 26, 2024

Interestingly, there is some support for __cuda_array_interface__, but this is a legacy API with some rather significant flaws.

@ezyang
Contributor

ezyang commented Feb 27, 2024

For reference, who is implementing __dlpack__ in this case?

@ezyang
Contributor

ezyang commented Feb 27, 2024

@lezcano @rgommers do either of you have an opinion on the relative priority of this wrt __cuda_array_interface__?

This PR in principle seems fine, I think we may need to haggle about where exactly it goes in the priority chain.

Haven't reviewed the code but it needs a test.

@rgommers
Collaborator

This would be great to fix, thanks @wjakob.

torch.asarray has the same flaw and also does not check for DLPack capabilities of the input object. If the approach I took seems fine to you, I can add similar logic there.

+1 to adding this to torch.asarray as well.

This PR in principle seems fine, I think we may need to haggle about where exactly it goes in the priority chain.

In what is probably the most common case (no explicit stream argument, but only a single tensor over the default CUDA stream with regular ints/floats), __dlpack__ and __cuda_array_interface__ (CAI) should behave identically AFAIK. For more complex cases, DLPack learned some of the lessons from CAI and improved over its stream handling. I am unsure if that was fixed in a later CAI update though, or if this is what @wjakob is referring to with "significant flaws". From a browse of the docs (now at v3) there is at least no ROCm stream handling, which DLPack does have.

Another important difference is dtype handling. The dtypes supported by CAI are NumPy type strings, while DLPack uses the DLDataType enums defined in dlpack.h. Where there are differences, DLPack dtypes are more useful to PyTorch, I'd think. The extra NumPy ones (strings, datetimes, object, void) are not interesting, while DLPack supports bfloat16 and lower-precision int/float dtypes.
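The difference is easy to see from Python (a small sketch assuming a recent NumPy and PyTorch on a little-endian machine):

```python
import numpy as np
import torch

# CAI describes dtypes with NumPy type strings:
x = np.ones(4, dtype=np.float32)
print(x.__array_interface__["typestr"])  # '<f4'

# DLPack describes dtypes as (type code, bits, lanes) triples, which
# lets it express dtypes NumPy has no type string for, e.g. bfloat16:
t = torch.ones(4, dtype=torch.bfloat16)
u = torch.from_dlpack(t)  # round-trips losslessly through DLPack
print(u.dtype)            # torch.bfloat16
```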

DLPack is also more general w.r.t. other devices, and being actively evolved. E.g., it just gained a read-only flag, which may become relevant if PyTorch gains read-only/copy-on-write tensors for example. So I'd have a preference for trying DLPack before CAI.

The counter-argument I can think of may be stability. DLPack just tagged 1.0rc1, and 1.0 is ABI-breaking. There's a way to query the version and handle both 0.x and 1.0 appropriately, but PyTorch should then of course support that within 12-18 months or so after 1.0 comes out; otherwise we may end up in a situation where other libraries drop 0.x support while PyTorch doesn't yet support 1.x.
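A rough sketch of the consumer-side negotiation, following the __dlpack__(max_version=...) convention from the array API standard (not PyTorch's actual implementation):

```python
def consume_dlpack(x):
    # x is any producer object implementing the DLPack protocol.
    try:
        # A DLPack >= 1.0 aware producer may return a versioned capsule
        # once the consumer announces the highest version it understands.
        return x.__dlpack__(max_version=(1, 0))
    except TypeError:
        # Pre-1.0 producers don't accept max_version; retry without it
        # and fall back to the unversioned 0.x ABI.
        return x.__dlpack__()
```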

@wjakob
Contributor Author

wjakob commented Feb 27, 2024

@ezyang, @rgommers: Regarding the flaws of __cuda_array_interface__: besides the issue with streams and non-CUDA devices mentioned by @rgommers, let me quote the elephant in the room directly from the CuPy reference:

Warning: __cuda_array_interface__ specifies that the object lifetime must be managed by the user, so it is an undefined behavior if the exported object is destroyed while still in use by the consumer library.

(Yikes.) DLPack basically clarifies the ownership story and makes that part accessible to non-Python consumers/producers that don't have a way of Py_DECREF-ing an object. It may even be worth prioritizing __dlpack__ over __cuda_array_interface__ if a tensor object supports both.
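A small Python illustration of the ownership contract (a sketch; the bookkeeping is done by the deleter and manager context that DLPack's DLManagedTensor carries):

```python
import torch
from torch.utils.dlpack import from_dlpack

producer = torch.arange(3.0)
consumer = from_dlpack(producer)  # shares memory, no copy

# The DLPack exchange holds an owning reference to the producer's
# storage via the manager context, and the consumer invokes the
# deleter when it is done with the data:
del producer
print(consumer)  # still valid: tensor([0., 1., 2.])

# Under __cuda_array_interface__ semantics, the consumer would now be
# reading freed memory unless the user kept the producer alive manually.
```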

@rgommers
Collaborator

Ah yes, lifetime management is a worse issue than the ones I mentioned. Then I'd definitely put DLPack first.

@vadimkantorov
Contributor

Also, DLPack support in PyTorch maybe still has this bug: #43166

Supporting fully user-managed lifetimes (or passing an extra user-provided deleter) is also important for basic interop scenarios like #34646.

Also, if conflicts between several interfaces might arise, is it possible to somehow select the wanted protocol? e.g. by torch.as_tensor(myobj.__dlpack__) or torch.as_tensor(myobj.__array_interface__)?

@wjakob
Contributor Author

wjakob commented Feb 27, 2024

Also, DLPack support in PyTorch maybe still has this bug: #43166

@vadimkantorov Actually, I think I reported the same issue separately (I was unaware of this ticket) as #117273, which has been fixed in the meantime.

@ezyang
Contributor

ezyang commented Feb 28, 2024

Needs test

Contributor

@ezyang ezyang left a comment


just waiting for test

Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Apr 28, 2024
@github-actions github-actions bot closed this May 28, 2024
@ysiraichi
Collaborator

@wjakob Are you still working on this? If not, is it ok if I take over this PR?

@wjakob
Contributor Author

wjakob commented Oct 21, 2024

@ysiraichi please go ahead, I ran out of time to finish this.

pytorchmergebot pushed a commit that referenced this pull request Oct 26, 2024 (#138697)

Fixes #120614
Takes over #120615

In summary, this PR:
- Adds a `__dlpack__` attribute check in the tensor creation path (i.e. [`internal_new_from_data` @ tensor_new.cpp](https://github.com/pytorch/pytorch/blob/cdfe1bffd16bdd28adbe5518038f68e6ac45de8d/torch/csrc/utils/tensor_new.cpp#L266))
    - Creates the tensor by using the DLPack machinery, instead of an element-by-element copy
    - No changes since #120615
- Adds a test, making sure the DLPack machinery is used
    - Wraps a tensor in a fresh `TensorDLPackWrapper` class that implements only the DLPack methods
    - Creates a new tensor from an instance of `TensorDLPackWrapper`
Pull Request resolved: #138697
Approved by: https://github.com/ezyang

Co-authored-by: Wenzel Jakob <[email protected]>
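For illustration, a hypothetical sketch of the shape of such a DLPack-only wrapper (the actual `TensorDLPackWrapper` lives in the PR's test suite):

```python
import torch

class TensorDLPackWrapper:
    # Wraps a tensor but exposes *only* the DLPack protocol, so an
    # element-by-element fallback would fail rather than silently run.
    def __init__(self, tensor):
        self._tensor = tensor

    def __dlpack__(self, **kwargs):
        # Forward any protocol kwargs (e.g. stream) to the real tensor.
        return self._tensor.__dlpack__(**kwargs)

    def __dlpack_device__(self):
        return self._tensor.__dlpack_device__()

wrapped = TensorDLPackWrapper(torch.arange(4.0))
t = torch.from_dlpack(wrapped)  # must go through the DLPack machinery
assert torch.equal(t, torch.arange(4.0))
```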
Successfully merging this pull request may close these issues.

Leverage DLPack-based construction of PyTorch tensors to avoid costly element-by-element copy