Added Abstract Type for Model Server Client #5

Closed · wants to merge 11 commits

3 changes: 3 additions & 0 deletions inference_perf/client/README.md
@@ -0,0 +1,3 @@
# Custom Clients

This directory contains all custom clients, including model server and metrics clients. The directory structure reflects the relationships and commonalities between clients.
14 changes: 14 additions & 0 deletions inference_perf/client/model_servers/README.md
@@ -0,0 +1,14 @@
# Model Server Clients
Contributor:

I would say, we don't need a separate sub module for model server clients right away. We can start with the model server client being a separate file. As we get more clients, we can split them as needed. Otherwise importing all the submodules requires additional work on the part of the caller.
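
For illustration, the caller-side difference the comment is pointing at (module and class names here are hypothetical, not from this PR):

    # With a sub-module per model server client:
    from inference_perf.client.model_servers.vllm import vLLM_Client

    # With a single model server client file:
    from inference_perf.client.model_server_client import vLLM_Client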

Contributor Author:

I can get behind that. Keeping text-to-text and text-to-image servers separate may be less confusing in the long run, especially if benchmarking diffusion models is on our roadmap. The dedicated text-to-text abstract class is there to deduplicate common functions like making requests, since that procedure is the same regardless of model server (create the request, send it, parse the relevant info from the response). See the example in the other comment.


Common functionality appears between model servers with similar input and output types, so the clients are organized accordingly (see the example layout after the list below).

Todo:
- **Text to Text**:
- Naive_transformers
- tensorrt_llm_triton
- sax
- tgi
- vllm
- jetstream
- **Text to Image**:
- Maxdiffusion
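
One possible layout implied by this grouping (directory and file names are illustrative, not fixed by this PR):

    model_servers/
        text_to_text/
            vllm.py
            jetstream.py
            tgi.py
            ...
        text_to_image/
            maxdiffusion.py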
13 changes: 13 additions & 0 deletions inference_perf/client/model_servers/__init__.py
@@ -0,0 +1,13 @@
# Copyright 2025
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
78 changes: 78 additions & 0 deletions inference_perf/client/model_servers/client.py
@@ -0,0 +1,78 @@
from abc import ABC, abstractmethod
from typing import Any
import asyncio
import aiohttp


class ErrorsReport:
    """Tallies the error types observed while making requests."""

    ClientConnectorErrors: int
    TimeoutErrors: int
    ContentTypeErrors: int
    ClientOSErrors: int
    ServerDisconnectedErrors: int
    unknown_errors: int

    def __init__(self) -> None:
        self.ClientConnectorErrors = 0
        self.TimeoutErrors = 0
        self.ContentTypeErrors = 0
        self.ClientOSErrors = 0
        self.ServerDisconnectedErrors = 0
        self.unknown_errors = 0

    def to_dict(self) -> dict[str, int]:
        return {k: v for k, v in self.__dict__.items() if isinstance(v, int)}

    def record_error(self, error: Exception) -> None:
        """Classify an exception, increment its counter, and log it."""
        if isinstance(error, aiohttp.client_exceptions.ClientConnectorError):
            self.ClientConnectorErrors += 1
            print(f"ClientConnectorError: {error}")
        elif isinstance(error, asyncio.TimeoutError):
            self.TimeoutErrors += 1
            print(f"TimeoutError: {error}")
        elif isinstance(error, aiohttp.client_exceptions.ContentTypeError):
            self.ContentTypeErrors += 1
            print(f"ContentTypeError: {error}")
        elif isinstance(error, aiohttp.client_exceptions.ClientOSError):
            self.ClientOSErrors += 1
            print(f"ClientOSError: {error}")
        elif isinstance(error, aiohttp.client_exceptions.ServerDisconnectedError):
            self.ServerDisconnectedErrors += 1
            print(f"ServerDisconnectedError: {error}")
        else:
            self.unknown_errors += 1
            print(f"Unknown error: {error}")

    def append_report(self, report: "ErrorsReport") -> None:
        """Merge another report's counts into this one."""
        self.ClientConnectorErrors += report.ClientConnectorErrors
        self.TimeoutErrors += report.TimeoutErrors
        self.ContentTypeErrors += report.ContentTypeErrors
        self.ClientOSErrors += report.ClientOSErrors
        self.ServerDisconnectedErrors += report.ServerDisconnectedErrors
        self.unknown_errors += report.unknown_errors


class Model_Server_Client(ABC):
    # The client will collect a summary of all observed errors
    Errors: ErrorsReport

    @abstractmethod
    def summary(self) -> Any:
        """
        Returns summary data derived from all inputs and outputs. The summary
        depends on the client's input and output data types, so subclasses
        should implement this at the data-type level (e.g., text-to-text,
        text-to-image).
        """
        pass

    @abstractmethod
    def request(self, *args: Any, **kwargs: Any) -> Any:
Contributor:

I would like to see a concrete implementation alongside this for a model server. vLLM is a good starting point so we can see how this looks in practice and if we need to change / add to this interface.

Contributor Author:

This is in progress because I agree; the idea is to do something like this for the text-to-text class:

    async def request(
        self, api_url: str, prompt: str, settings: Text_To_Text_Request_Settings
    ) -> Response | Exception:
        request: Request = self.build_request(prompt, settings)
        ttft: float = 0.0
        start_time: float = time.perf_counter()
        timeout = aiohttp.ClientTimeout(total=10000)
        async with aiohttp.ClientSession(timeout=timeout, trust_env=True) as session:
            try:
                async with session.post(api_url, **request, ssl=False) as response:
                    if settings["streaming"]:
                        # Drain the stream, recording the time to first token.
                        async for chunk_bytes in response.content.iter_chunks():
                            chunk_bytes = chunk_bytes[0].strip()
                            if not chunk_bytes:
                                continue
                            timestamp = time.perf_counter()
                            if ttft == 0.0:
                                ttft = timestamp - start_time
                        standardized_response = self.parse_response(response, settings)
                        standardized_response["time_to_first_token"] = ttft
                        return standardized_response
                    else:
                        return self.parse_response(response, settings)
            except Exception as e:
                self.Errors.record_error(e)
                return e

    @abstractmethod
    def build_request(
        self, prompt: str, settings: Text_To_Text_Request_Settings
    ) -> Request:
        """
        Request headers and bodies depend on the specific model server.
        """
        pass

    @abstractmethod
    def parse_response(
        self, response: aiohttp.ClientResponse, settings: Text_To_Text_Request_Settings
    ) -> Response:
        """
        Model server responses are not standardized, so each client parses its own.
        """
        pass

That way, this is all we would need for a vLLM client:

    class vLLM_Client(Text_To_Text_Model_Server_Client):

        def build_request(
            self, prompt: str, settings: Text_To_Text_Request_Settings
        ) -> Request:
            # Only keys that session.post accepts as keyword arguments.
            return {
                "headers": {"User-Agent": "Test Client"},
                "json": {
                    "prompt": prompt,
                    "use_beam_search": settings["use_beam_search"],
                    "temperature": 0.0,
                    "max_tokens": settings["output_len"],
                    "stream": settings["streaming"],
                },
            }

        def parse_response(
            self, response: aiohttp.ClientResponse, settings: Text_To_Text_Request_Settings
        ) -> Response:
            res: List[Any] = []  # response["choices"]
            output_token_ids = self.tokenizer(res[0]["text"]).input_ids
            return {
                "num_output_tokens": len(output_token_ids),
                "request_duration": 0.0,
                "time_to_first_token": None,
            }

Similar for Jetstream:

    class Jetstream_Client(Text_To_Text_Model_Server_Client):

        def build_request(
            self, prompt: str, settings: Text_To_Text_Request_Settings
        ) -> Request:
            return {
                "json": {
                    "prompt": prompt,
                    "max_tokens": settings["output_len"],
                }
            }

        def parse_response(
            self, response: aiohttp.ClientResponse, settings: Text_To_Text_Request_Settings
        ) -> Response:
            res: List[Any] = []  # response["response"]
            output_token_ids = self.tokenizer(res).input_ids
            return {
                "num_output_tokens": len(output_token_ids),
                "request_duration": 0.0,
                "time_to_first_token": None,
            }
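
For reference, a minimal sketch of the settings and response types the examples above assume (hypothetical; this PR does not define them):

    from typing import Any, Optional, TypedDict

    # Request is whatever build_request returns and session.post accepts as
    # keyword arguments, e.g. {"headers": ..., "json": ...}.
    Request = dict[str, Any]

    class Text_To_Text_Request_Settings(TypedDict):
        # Only the fields referenced by the sketches above.
        streaming: bool
        use_beam_search: bool
        output_len: int

    class Response(TypedDict):
        num_output_tokens: int
        request_duration: float
        time_to_first_token: Optional[float]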

"""
This is the method loadgen should use to make requests to a model server
"""
pass

    @abstractmethod
    def list_model_server_metrics(self) -> list[str]:
        """
        Returns a list of model server metrics of interest.
        """
        pass
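
For context, a minimal sketch of how a load generator might use ErrorsReport (the aggregation loop is assumed usage, not part of this PR):

    import asyncio

    from inference_perf.client.model_servers.client import ErrorsReport

    # Record an error observed during a request.
    report = ErrorsReport()
    try:
        raise asyncio.TimeoutError("request timed out")
    except Exception as e:
        report.record_error(e)  # increments TimeoutErrors and logs it

    # Merge per-worker reports into a single summary.
    total = ErrorsReport()
    for worker_report in [report]:  # e.g. one report per concurrent worker
        total.append_report(worker_report)
    print(total.to_dict())  # {"ClientConnectorErrors": 0, "TimeoutErrors": 1, ...}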