vLLM Fork: RuntimeError: CUDA error #21
Comments
@terrytangyuan @kfswain Could you please help us with this? Thanks!
We should be ready to switch to the latest vLLM this week; #22 should fix this.
/close This should already be fixed. We switched to the upstream vllm image.
@liu-cong: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
When running the PoC vLLM fork on a `g2-standard-48` machine in GKE and calling the `/v1/completions` API directly (not via the proxy), an internal server error is returned. The vLLM container logs show the CUDA `RuntimeError` referenced in the title.
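For reference, a minimal reproduction of the failing call might look like the sketch below. The service address and model name here are assumptions, not taken from the report; substitute the values from the actual deployment.

```python
import requests

# Hypothetical in-cluster service address; the original report does not
# include it, so replace it with the address of your vLLM Service.
VLLM_URL = "http://vllm-service:8000/v1/completions"

payload = {
    "model": "meta-llama/Llama-2-7b-hf",  # assumed model name, replace as needed
    "prompt": "San Francisco is a",
    "max_tokens": 16,
}

resp = requests.post(VLLM_URL, json=payload, timeout=60)
# Against the forked image this call returned HTTP 500 (internal server
# error); against vllm/vllm-openai the same call succeeds.
print(resp.status_code, resp.text)
```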
When running the non-forked image `vllm/vllm-openai` in the same environment, the API call succeeds.
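Since the failure shows up as a CUDA error in the fork's container logs but not with the upstream image, one way to narrow it down is to probe CUDA directly from inside the failing pod. This is a hypothetical diagnostic, assuming `torch` is importable in the image (it is a dependency of standard vLLM builds):

```python
# Hypothetical diagnostic; run inside the vLLM pod (e.g. via `kubectl exec`)
# to check that the container can initialize CUDA at all.
import torch

print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
    # A tiny allocation forces CUDA context creation and raises a
    # RuntimeError if the runtime/driver setup in the image is broken.
    x = torch.ones(1, device="cuda")
    print("Allocation OK:", bool(x.item() == 1.0))
```

If this also raises, the problem lies in the forked image's CUDA runtime setup rather than in the request-handling path, which would be consistent with the fix of switching to the upstream image.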