StreamFlags::NON_BLOCKING is unsound because of fringe asynchronous memory copy behavior in CUDA #15

Open
RDambrosio016 opened this issue Dec 3, 2021 · 2 comments
Labels
C-cust Category: CPU-side CUDA Driver API I-unsound 💥 Issue: A soundness hole on the CPU or the GPU

Comments

@RDambrosio016
Member

Streams created with NON_BLOCKING exhibit confusing and dangerous behavior with regard to memcpy because of odd CUDA semantics. Per the driver API docs:

For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

Because NON_BLOCKING streams do not synchronize with the null (default) stream, this leads to potential race conditions: work on a NON_BLOCKING stream can run while the DMA to the final destination is still in flight. NVIDIA appears to be aware of this issue, but in the meantime it may be beneficial to implicitly disable NON_BLOCKING, especially since cust does not expose stream-ordered memory allocation.

This appears to be why the add example sometimes does nothing on certain systems.

@RDambrosio016 RDambrosio016 added I-unsound 💥 Issue: A soundness hole on the CPU or the GPU C-cust Category: CPU-side CUDA Driver API labels Dec 3, 2021
@RDambrosio016
Member Author

I have temporarily disabled StreamFlags::NON_BLOCKING in the unreleased version of cust. This should not have a major performance impact, since cust does not expose the null stream anyway. I'll leave this open until NVIDIA gets back to me about this issue.
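
For readers wondering what "temporarily disabled" could mean in practice, here is a minimal sketch of one possible approach: masking the flag out before the stream is created. This is not the actual cust change. The `StreamFlags` type below is a hypothetical stand-in defined only for this example (its bit values mirror the driver constants `CU_STREAM_DEFAULT` = 0x0 and `CU_STREAM_NON_BLOCKING` = 0x1), and `effective_stream_flags` is an invented helper name.

```rust
/// Hypothetical stand-in for cust's `StreamFlags`, defined here so the
/// sketch compiles without the cust crate or a GPU. The bit values mirror
/// the CUDA driver constants CU_STREAM_DEFAULT (0x0) and
/// CU_STREAM_NON_BLOCKING (0x1).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct StreamFlags(u32);

impl StreamFlags {
    /// Stream synchronizes implicitly with the null (default) stream.
    pub const DEFAULT: StreamFlags = StreamFlags(0x0);
    /// Stream does not synchronize implicitly with the null stream.
    pub const NON_BLOCKING: StreamFlags = StreamFlags(0x1);

    /// Raw bit pattern that would be handed to the driver.
    pub fn bits(self) -> u32 {
        self.0
    }

    /// True if every bit set in `other` is also set in `self`.
    pub fn contains(self, other: StreamFlags) -> bool {
        self.0 & other.0 == other.0
    }
}

/// Hypothetical helper (not part of cust): clears the NON_BLOCKING bit so
/// that every stream keeps its implicit synchronization with the null
/// stream, whatever the caller requested. All other bits pass through.
pub fn effective_stream_flags(requested: StreamFlags) -> StreamFlags {
    StreamFlags(requested.bits() & !StreamFlags::NON_BLOCKING.bits())
}
```

With a helper like this, a caller that asks for `StreamFlags::NON_BLOCKING` silently gets `StreamFlags::DEFAULT`, so existing code keeps compiling while the unsound configuration is unreachable.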

@coreylowman

coreylowman commented Sep 13, 2022

@RDambrosio016 could this be solved by using the async versions of the copy functions, which require a stream argument? If I'm understanding correctly, the issue comes from the default memcpy functions using the null stream, but since cust doesn't expose the null stream, all the kernels are on different streams than the copies.
