
NotImplementedError: Mistral model requires flash attention v2 #1208

Closed
@9throok

Description


System Info

2023-10-29T17:49:27.617627Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.70.0
Commit sha: 96a982ad8fc232479384476b1596a880697cc1d0
Docker label: N/A
nvidia-smi:
Sun Oct 29 17:49:27 2023       
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 535.86.05              Driver Version: 535.86.05    CUDA Version: 12.2     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  NVIDIA A100 80GB PCIe          On  | 00000000:61:00.0 Off |                    0 |
   | N/A   31C    P0              43W / 300W |      4MiB / 81920MiB |      0%      Default |
   |                                         |                      |             Disabled |
   +-----------------------------------------+----------------------+----------------------+
                                                                                            
   +---------------------------------------------------------------------------------------+
   | Processes:                                                                            |
   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
   |        ID   ID                                                             Usage      |
   |=======================================================================================|
   |  No running processes found                                                           |
   +---------------------------------------------------------------------------------------+
2023-10-29T17:49:27.617675Z  INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: true }
2023-10-29T17:49:27.617820Z  INFO download: text_generation_launcher: Starting download process.
2023-10-29T17:49:30.977268Z  INFO text_generation_launcher: Download file: model.safetensors

2023-10-29T17:49:40.033570Z  INFO text_generation_launcher: Downloaded /root/.cache/huggingface/hub/models--bigscience--bloom-560m/snapshots/ac2ae5fab2ce3f9f40dc79b5ca9f637430d24971/model.safetensors in 0:00:09.

2023-10-29T17:49:40.033696Z  INFO text_generation_launcher: Download: [1/1] -- ETA: 0

2023-10-29T17:49:40.484023Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-10-29T17:49:40.484381Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-29T17:49:42.825695Z  WARN text_generation_launcher: We're not using custom kernels.

2023-10-29T17:49:42.847772Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: No module named 'vllm'

2023-10-29T17:49:42.860140Z  WARN text_generation_launcher: Could not import Mistral model: No module named 'dropout_layer_norm'

2023-10-29T17:49:46.872499Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

2023-10-29T17:49:46.894493Z  INFO shard-manager: text_generation_launcher: Shard ready in 6.408955905s rank=0
2023-10-29T17:49:46.991320Z  INFO text_generation_launcher: Starting Webserver
2023-10-29T17:49:47.281224Z  WARN text_generation_router: router/src/main.rs:349: `--revision` is not set
2023-10-29T17:49:47.281249Z  WARN text_generation_router: router/src/main.rs:350: We strongly advise to set it to a known supported commit.
2023-10-29T17:49:47.621493Z  INFO text_generation_router: router/src/main.rs:371: Serving revision ac2ae5fab2ce3f9f40dc79b5ca9f637430d24971 of model bigscience/bloom-560m
2023-10-29T17:49:47.628143Z  INFO text_generation_router: router/src/main.rs:213: Warming up model
2023-10-29T17:49:48.022232Z  WARN text_generation_router: router/src/main.rs:224: Model does not support automatic max batch total tokens
2023-10-29T17:49:48.022267Z  INFO text_generation_router: router/src/main.rs:246: Setting max batch total tokens to 16000
2023-10-29T17:49:48.022276Z  INFO text_generation_router: router/src/main.rs:247: Connected

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

This is the command I am using to run the server:
text-generation-launcher --model-id HuggingFaceH4/zephyr-7b-beta --port 8080

but it throws this error every time. I have already installed flash-attention, and running `from flash_attn import flash_attn_qkvpacked_func, flash_attn_func` works fine.

I am not sure how to get the model running here. Can anyone help me out?
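A quick way to narrow this down, assuming TGI looks for the compiled `flash_attn_2_cuda` extension rather than the high-level `flash_attn` package (so the plain import succeeding does not guarantee v2 is available), is a sketch like this:

```python
# Sketch: check which flash-attention builds are importable.
# Assumption: flash-attn v2 ships the `flash_attn_2_cuda` extension,
# while v1 ships `flash_attn_cuda`; TGI needs the v2 extension for Mistral.
import importlib.util

try:
    import flash_attn
    print("flash_attn version:", getattr(flash_attn, "__version__", "unknown"))
except ImportError as exc:
    print("flash_attn not importable:", exc)

for ext in ("flash_attn_2_cuda", "flash_attn_cuda"):
    found = importlib.util.find_spec(ext) is not None
    print(f"{ext}: {'found' if found else 'missing'}")
```

If `flash_attn_2_cuda` shows up as missing, rebuilding it with the command from the warning below (`cd server && make install install-flash-attention-v2`) or using the official Docker image would presumably resolve it.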

2023-10-29T19:05:31.928073Z  INFO text_generation_launcher: Args { model_id: "HuggingFaceH4/zephyr-7b-beta", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-10-29T19:05:31.928233Z  INFO download: text_generation_launcher: Starting download process.
2023-10-29T19:05:34.766283Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-10-29T19:05:35.184199Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-10-29T19:05:35.184593Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-29T19:05:37.497817Z  WARN text_generation_launcher: We're not using custom kernels.

2023-10-29T19:05:38.083636Z  WARN text_generation_launcher: Unable to use Flash Attention V2: Flash Attention V2 is not installed.
Use the official Docker image (ghcr.io/huggingface/text-generation-inference:latest) or install flash attention v2 with `cd server && make install install-flash-attention-v2`

2023-10-29T19:05:38.134483Z  WARN text_generation_launcher: Could not import Mistral model: Mistral model requires flash attn v2

2023-10-29T19:05:38.426160Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/workspace/text-generation-inference/server/text_generation_server/cli.py", line 83, in serve
    server.serve(
  File "/workspace/text-generation-inference/server/text_generation_server/server.py", line 207, in serve
    asyncio.run(
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 633, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 600, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1896, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/workspace/text-generation-inference/server/text_generation_server/server.py", line 159, in serve_inner
    model = get_model(
  File "/workspace/text-generation-inference/server/text_generation_server/models/__init__.py", line 259, in get_model
    raise NotImplementedError("Mistral model requires flash attention v2")
NotImplementedError: Mistral model requires flash attention v2

2023-10-29T19:05:38.990912Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/workspace/text-generation-inference/server/text_generation_server/cli.py", line 83, in serve
    server.serve(

  File "/workspace/text-generation-inference/server/text_generation_server/server.py", line 207, in serve
    asyncio.run(

  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/usr/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()

  File "/workspace/text-generation-inference/server/text_generation_server/server.py", line 159, in serve_inner
    model = get_model(

  File "/workspace/text-generation-inference/server/text_generation_server/models/__init__.py", line 259, in get_model
    raise NotImplementedError("Mistral model requires flash attention v2")

NotImplementedError: Mistral model requires flash attention v2
 rank=0
2023-10-29T19:05:39.088464Z ERROR text_generation_launcher: Shard 0 failed to start
2023-10-29T19:05:39.088508Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

Expected behavior

The server is expected to start and serve requests on port 8080.
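Once the launcher logs "Connected", a small smoke test against the standard TGI `/generate` endpoint (port and prompt below are placeholders) should return a generation:

```python
# Sketch: smoke-test a running text-generation-inference server.
# Assumes the server is listening on localhost:8080 as in the launch command.
import json
import urllib.request

payload = json.dumps({
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 20},
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8080/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```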
