System Info
In https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/__init__.py,
we check that the CUDA device capability is at least 7.5.
While that check is necessary for FlashAttention, it leads to an import error on "unsupported" GPUs that should still be able to handle CausalLM inference, such as the V100.
Can't we just turn off FlashAttention for unsupported GPUs instead of raising the error?
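Roughly the kind of gating I have in mind, as a sketch only (the `FLASH_ATTENTION` flag, the `flash_attn` import name, and the log messages are placeholders, not the actual code in `models/__init__.py`):

```python
import logging

import torch

logger = logging.getLogger(__name__)

# Placeholder flag: record whether FlashAttention is usable instead of
# raising at import time, so the loader can fall back to CausalLM.
FLASH_ATTENTION = False

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    # FlashAttention needs compute capability >= 7.5 (Turing or newer);
    # a V100 reports (7, 0) and would take the fallback path.
    if (major, minor) >= (7, 5):
        try:
            import flash_attn  # noqa: F401  -- import name is an assumption

            FLASH_ATTENTION = True
        except ImportError:
            logger.warning(
                "FlashAttention kernels are not installed, falling back to CausalLM"
            )
    else:
        logger.warning(
            "GPU compute capability %d.%d is below 7.5, disabling FlashAttention "
            "and falling back to CausalLM",
            major,
            minor,
        )
```

The model loader could then pick the flash implementation only when the flag is set and use the plain CausalLM path otherwise.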
Thank you!
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Run it on a V100 with a Llama implementation, for example.
Expected behavior
It should run, albeit without FlashAttention.