System Info
In https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/__init__.py,
we check that the CUDA device capability is at least 7.5.
While that check is necessary for FlashAttention, it leads to an import error on "unsupported" GPUs that should still be able to handle CausalLM inference, such as the V100.
Can't we just turn off FlashAttention for unsupported GPUs instead of raising the error?
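Roughly the kind of gating I have in mind, as a sketch only (the `FLASH_ATTENTION` flag, the `flash_attn` import name, and the log messages are placeholders, not the actual code in `models/__init__.py`):

```python
import logging

import torch

logger = logging.getLogger(__name__)

# Placeholder flag: record whether FlashAttention is usable instead of
# raising at import time, so the loader can fall back to CausalLM.
FLASH_ATTENTION = False

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    # FlashAttention needs compute capability >= 7.5 (Turing or newer);
    # a V100 reports (7, 0) and would take the fallback path.
    if (major, minor) >= (7, 5):
        try:
            import flash_attn  # noqa: F401  -- import name is an assumption

            FLASH_ATTENTION = True
        except ImportError:
            logger.warning(
                "FlashAttention kernels are not installed, falling back to CausalLM"
            )
    else:
        logger.warning(
            "GPU compute capability %d.%d is below 7.5, disabling FlashAttention "
            "and falling back to CausalLM",
            major,
            minor,
        )
```

The model loader could then pick the flash implementation only when the flag is set and use the plain CausalLM path otherwise.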
Thank you!
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Run it on a V100 with a Llama implementation, for example.
Expected behavior
It should run, albeit without FlashAttention.