
NotImplementedError: Mistral model requires flash attention v2 #1208

Closed
@9throok

Description


System Info

2023-10-29T17:49:27.617627Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.70.0
Commit sha: 96a982ad8fc232479384476b1596a880697cc1d0
Docker label: N/A
nvidia-smi:
Sun Oct 29 17:49:27 2023       
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 535.86.05              Driver Version: 535.86.05    CUDA Version: 12.2     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  NVIDIA A100 80GB PCIe          On  | 00000000:61:00.0 Off |                    0 |
   | N/A   31C    P0              43W / 300W |      4MiB / 81920MiB |      0%      Default |
   |                                         |                      |             Disabled |
   +-----------------------------------------+----------------------+----------------------+
                                                                                            
   +---------------------------------------------------------------------------------------+
   | Processes:                                                                            |
   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
   |        ID   ID                                                             Usage      |
   |=======================================================================================|
   |  No running processes found                                                           |
   +---------------------------------------------------------------------------------------+
2023-10-29T17:49:27.617675Z  INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: true }
2023-10-29T17:49:27.617820Z  INFO download: text_generation_launcher: Starting download process.
2023-10-29T17:49:30.977268Z  INFO text_generation_launcher: Download file: model.safetensors

2023-10-29T17:49:40.033570Z  INFO text_generation_launcher: Downloaded /root/.cache/huggingface/hub/models--bigscience--bloom-560m/snapshots/ac2ae5fab2ce3f9f40dc79b5ca9f637430d24971/model.safetensors in 0:00:09.

2023-10-29T17:49:40.033696Z  INFO text_generation_launcher: Download: [1/1] -- ETA: 0

2023-10-29T17:49:40.484023Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-10-29T17:49:40.484381Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-29T17:49:42.825695Z  WARN text_generation_launcher: We're not using custom kernels.

2023-10-29T17:49:42.847772Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: No module named 'vllm'

2023-10-29T17:49:42.860140Z  WARN text_generation_launcher: Could not import Mistral model: No module named 'dropout_layer_norm'

2023-10-29T17:49:46.872499Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

2023-10-29T17:49:46.894493Z  INFO shard-manager: text_generation_launcher: Shard ready in 6.408955905s rank=0
2023-10-29T17:49:46.991320Z  INFO text_generation_launcher: Starting Webserver
2023-10-29T17:49:47.281224Z  WARN text_generation_router: router/src/main.rs:349: `--revision` is not set
2023-10-29T17:49:47.281249Z  WARN text_generation_router: router/src/main.rs:350: We strongly advise to set it to a known supported commit.
2023-10-29T17:49:47.621493Z  INFO text_generation_router: router/src/main.rs:371: Serving revision ac2ae5fab2ce3f9f40dc79b5ca9f637430d24971 of model bigscience/bloom-560m
2023-10-29T17:49:47.628143Z  INFO text_generation_router: router/src/main.rs:213: Warming up model
2023-10-29T17:49:48.022232Z  WARN text_generation_router: router/src/main.rs:224: Model does not support automatic max batch total tokens
2023-10-29T17:49:48.022267Z  INFO text_generation_router: router/src/main.rs:246: Setting max batch total tokens to 16000
2023-10-29T17:49:48.022276Z  INFO text_generation_router: router/src/main.rs:247: Connected

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

This is the command I am using to run the server:
text-generation-launcher --model-id HuggingFaceH4/zephyr-7b-beta --port 8080

but it throws this error every time. I have already installed flash-attention, and running `from flash_attn import flash_attn_qkvpacked_func, flash_attn_func` works fine.

I am not sure how to get the model running here. Can anyone help me out?
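A quick way to narrow this down, assuming TGI looks for the compiled `flash_attn_2_cuda` extension rather than the high-level `flash_attn` package (so the plain import succeeding does not guarantee v2 is available), is a sketch like this:

```python
# Sketch: check which flash-attention builds are importable.
# Assumption: flash-attn v2 ships the `flash_attn_2_cuda` extension,
# while v1 ships `flash_attn_cuda`; TGI needs the v2 extension for Mistral.
import importlib.util

try:
    import flash_attn
    print("flash_attn version:", getattr(flash_attn, "__version__", "unknown"))
except ImportError as exc:
    print("flash_attn not importable:", exc)

for ext in ("flash_attn_2_cuda", "flash_attn_cuda"):
    found = importlib.util.find_spec(ext) is not None
    print(f"{ext}: {'found' if found else 'missing'}")
```

If `flash_attn_2_cuda` shows up as missing, rebuilding it with the command from the warning below (`cd server && make install install-flash-attention-v2`) or using the official Docker image would presumably resolve it.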

2023-10-29T19:05:31.928073Z  INFO text_generation_launcher: Args { model_id: "HuggingFaceH4/zephyr-7b-beta", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-10-29T19:05:31.928233Z  INFO download: text_generation_launcher: Starting download process.
2023-10-29T19:05:34.766283Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-10-29T19:05:35.184199Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-10-29T19:05:35.184593Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-29T19:05:37.497817Z  WARN text_generation_launcher: We're not using custom kernels.

2023-10-29T19:05:38.083636Z  WARN text_generation_launcher: Unable to use Flash Attention V2: Flash Attention V2 is not installed.
Use the official Docker image (ghcr.io/huggingface/text-generation-inference:latest) or install flash attention v2 with `cd server && make install install-flash-attention-v2`

2023-10-29T19:05:38.134483Z  WARN text_generation_launcher: Could not import Mistral model: Mistral model requires flash attn v2

2023-10-29T19:05:38.426160Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/workspace/text-generation-inference/server/text_generation_server/cli.py", line 83, in serve
    server.serve(
  File "/workspace/text-generation-inference/server/text_generation_server/server.py", line 207, in serve
    asyncio.run(
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 633, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 600, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1896, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/workspace/text-generation-inference/server/text_generation_server/server.py", line 159, in serve_inner
    model = get_model(
  File "/workspace/text-generation-inference/server/text_generation_server/models/__init__.py", line 259, in get_model
    raise NotImplementedError("Mistral model requires flash attention v2")
NotImplementedError: Mistral model requires flash attention v2

2023-10-29T19:05:38.990912Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/workspace/text-generation-inference/server/text_generation_server/cli.py", line 83, in serve
    server.serve(

  File "/workspace/text-generation-inference/server/text_generation_server/server.py", line 207, in serve
    asyncio.run(

  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/usr/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()

  File "/workspace/text-generation-inference/server/text_generation_server/server.py", line 159, in serve_inner
    model = get_model(

  File "/workspace/text-generation-inference/server/text_generation_server/models/__init__.py", line 259, in get_model
    raise NotImplementedError("Mistral model requires flash attention v2")

NotImplementedError: Mistral model requires flash attention v2
 rank=0
2023-10-29T19:05:39.088464Z ERROR text_generation_launcher: Shard 0 failed to start
2023-10-29T19:05:39.088508Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

Expected behavior

The server is expected to start and serve requests on port 8080.
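Once the launcher logs "Connected", a small smoke test against the standard TGI `/generate` endpoint (port and prompt below are placeholders) should return a generation:

```python
# Sketch: smoke-test a running text-generation-inference server.
# Assumes the server is listening on localhost:8080 as in the launch command.
import json
import urllib.request

payload = json.dumps({
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 20},
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8080/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```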
