System Info
2023-10-29T17:49:27.617627Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.70.0
Commit sha: 96a982ad8fc232479384476b1596a880697cc1d0
Docker label: N/A
nvidia-smi:
Sun Oct 29 17:49:27 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:61:00.0 Off | 0 |
| N/A 31C P0 43W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
2023-10-29T17:49:27.617675Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: true }
2023-10-29T17:49:27.617820Z INFO download: text_generation_launcher: Starting download process.
2023-10-29T17:49:30.977268Z INFO text_generation_launcher: Download file: model.safetensors
2023-10-29T17:49:40.033570Z INFO text_generation_launcher: Downloaded /root/.cache/huggingface/hub/models--bigscience--bloom-560m/snapshots/ac2ae5fab2ce3f9f40dc79b5ca9f637430d24971/model.safetensors in 0:00:09.
2023-10-29T17:49:40.033696Z INFO text_generation_launcher: Download: [1/1] -- ETA: 0
2023-10-29T17:49:40.484023Z INFO download: text_generation_launcher: Successfully downloaded weights.
2023-10-29T17:49:40.484381Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-29T17:49:42.825695Z WARN text_generation_launcher: We're not using custom kernels.
2023-10-29T17:49:42.847772Z WARN text_generation_launcher: Could not import Flash Attention enabled models: No module named 'vllm'
2023-10-29T17:49:42.860140Z WARN text_generation_launcher: Could not import Mistral model: No module named 'dropout_layer_norm'
2023-10-29T17:49:46.872499Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2023-10-29T17:49:46.894493Z INFO shard-manager: text_generation_launcher: Shard ready in 6.408955905s rank=0
2023-10-29T17:49:46.991320Z INFO text_generation_launcher: Starting Webserver
2023-10-29T17:49:47.281224Z WARN text_generation_router: router/src/main.rs:349: `--revision` is not set
2023-10-29T17:49:47.281249Z WARN text_generation_router: router/src/main.rs:350: We strongly advise to set it to a known supported commit.
2023-10-29T17:49:47.621493Z INFO text_generation_router: router/src/main.rs:371: Serving revision ac2ae5fab2ce3f9f40dc79b5ca9f637430d24971 of model bigscience/bloom-560m
2023-10-29T17:49:47.628143Z INFO text_generation_router: router/src/main.rs:213: Warming up model
2023-10-29T17:49:48.022232Z WARN text_generation_router: router/src/main.rs:224: Model does not support automatic max batch total tokens
2023-10-29T17:49:48.022267Z INFO text_generation_router: router/src/main.rs:246: Setting max batch total tokens to 16000
2023-10-29T17:49:48.022276Z INFO text_generation_router: router/src/main.rs:247: Connected
Information
- [ ] Docker
- [x] The CLI directly

Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
This is the command that I am using to run the server:

text-generation-launcher --model-id HuggingFaceH4/zephyr-7b-beta --port 8080

but it throws this error every time. I have already installed flash-attention, and running `from flash_attn import flash_attn_qkvpacked_func, flash_attn_func` works well.
I am not sure how to get the model running here. Can anyone help me out?
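Note that the import above also succeeds for flash-attention v1, so it does not by itself prove v2 is installed. A quick sketch of how one might check, assuming the pip package is named `flash-attn` and that `flash_attn_2_cuda` is the v2 CUDA extension module TGI tries to import (the error text below matches that check):

```bash
# Print the installed flash-attn version; Mistral/Zephyr needs a 2.x release
python -c "import flash_attn; print(flash_attn.__version__)"

# Try importing the v2 CUDA extension; if this fails, TGI reports
# "Flash Attention V2 is not installed"
python -c "import flash_attn_2_cuda"
```

The full launcher output follows: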
2023-10-29T19:05:31.928073Z INFO text_generation_launcher: Args { model_id: "HuggingFaceH4/zephyr-7b-beta", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-10-29T19:05:31.928233Z INFO download: text_generation_launcher: Starting download process.
2023-10-29T19:05:34.766283Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2023-10-29T19:05:35.184199Z INFO download: text_generation_launcher: Successfully downloaded weights.
2023-10-29T19:05:35.184593Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-29T19:05:37.497817Z WARN text_generation_launcher: We're not using custom kernels.
2023-10-29T19:05:38.083636Z WARN text_generation_launcher: Unable to use Flash Attention V2: Flash Attention V2 is not installed.
Use the official Docker image (ghcr.io/huggingface/text-generation-inference:latest) or install flash attention v2 with `cd server && make install install-flash-attention-v2`
2023-10-29T19:05:38.134483Z WARN text_generation_launcher: Could not import Mistral model: Mistral model requires flash attn v2
2023-10-29T19:05:38.426160Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/workspace/text-generation-inference/server/text_generation_server/cli.py", line 83, in serve
    server.serve(
  File "/workspace/text-generation-inference/server/text_generation_server/server.py", line 207, in serve
    asyncio.run(
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 633, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 600, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1896, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/workspace/text-generation-inference/server/text_generation_server/server.py", line 159, in serve_inner
    model = get_model(
  File "/workspace/text-generation-inference/server/text_generation_server/models/__init__.py", line 259, in get_model
    raise NotImplementedError("Mistral model requires flash attention v2")
NotImplementedError: Mistral model requires flash attention v2
2023-10-29T19:05:38.990912Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/workspace/text-generation-inference/server/text_generation_server/cli.py", line 83, in serve
    server.serve(
  File "/workspace/text-generation-inference/server/text_generation_server/server.py", line 207, in serve
    asyncio.run(
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/workspace/text-generation-inference/server/text_generation_server/server.py", line 159, in serve_inner
    model = get_model(
  File "/workspace/text-generation-inference/server/text_generation_server/models/__init__.py", line 259, in get_model
    raise NotImplementedError("Mistral model requires flash attention v2")
NotImplementedError: Mistral model requires flash attention v2
rank=0
2023-10-29T19:05:39.088464Z ERROR text_generation_launcher: Shard 0 failed to start
2023-10-29T19:05:39.088508Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
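For reference, the warning earlier in the log already names two ways out. A sketch of both, assuming a working CUDA toolchain for the source build (the Docker invocation follows the project README; the container serves on port 80 internally):

```bash
# Option 1: use the official image, which ships with flash attention v2 prebuilt
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id HuggingFaceH4/zephyr-7b-beta

# Option 2: build the v2 kernels into this source install, as the warning suggests
cd server && make install install-flash-attention-v2
```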
Expected behavior
The server is expected to start and serve requests on port 8080.
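Once the shard starts, a quick smoke test against TGI's `/generate` endpoint would look something like this (the prompt and parameters here are just examples):

```bash
# Ask the running server for a short completion
curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'
```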