Benchmark feature fixes tests #3

makungaj1 · 2024-05-01T19:58:51Z

Issue #, if available:

Description of changes:

Unit tests
Customize the Python LRU cache so that a result is cached only when the output is complete
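A minimal sketch of the caching idea, assuming a decorator that takes a completeness check. The names `lru_cache_if_complete`, `is_complete`, `get_metrics`, and `state` are hypothetical and are not taken from the SDK. The rate value is illustrative only. The sketch is not thread-safe and supports only positional, hashable arguments.

```python
import functools
from collections import OrderedDict


def lru_cache_if_complete(is_complete, maxsize=128):
    """Cache a function's result only when is_complete(result) is true.

    Hypothetical helper for illustration; not the SDK's implementation.
    """

    def decorator(func):
        cache = OrderedDict()

        @functools.wraps(func)
        def wrapper(*args):
            if args in cache:
                cache.move_to_end(args)  # mark as most recently used
                return cache[args]
            result = func(*args)
            if is_complete(result):
                cache[args] = result
                if len(cache) > maxsize:
                    cache.popitem(last=False)  # evict least recently used
            return result

        wrapper.cache = cache  # exposed so callers can inspect or clear it
        return wrapper

    return decorator


# Simulated environment: the pricing permission may be granted later.
state = {"pricing_allowed": False, "calls": 0}


@lru_cache_if_complete(is_complete=lambda m: m["instance_rate"] is not None)
def get_metrics(instance_type):
    state["calls"] += 1
    rate = 38.67 if state["pricing_allowed"] else None  # illustrative value
    return {"instance_type": instance_type, "instance_rate": rate}
```

With this shape, a call made before the permission exists returns a partial result that is not stored, so a later call with the same arguments recomputes and picks up the rate without clearing any state.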

Testing done:
Notebook

-- Without an identity-based policy that allows the pricing:GetProducts action

from sagemaker.jumpstart.model import JumpStartModel

jumpstart_model = JumpStartModel(model_id="meta-textgeneration-llama-3-70b")
jumpstart_model.list_deployment_configs()

Instance rate metrics will be omitted. Reason: User: arn:aws:sts::312206380606:assumed-role/AmazonSageMaker-ExecutionRole-20230707T131628/SageMaker is not authorized to perform: pricing:GetProducts because no identity-based policy allows the pricing:GetProducts action
[{'DeploymentConfigName': 'tgi',
  'DeploymentArgs': {'ImageUri': '875423407011.dkr.ecr.us-west-2.amazonaws.com/suzuka-beta:djl-serving-v3.2',
   'ModelData': {'S3DataSource': {'S3Uri': 's3://jumpstart-private-cache-prod-us-west-2/meta-textgeneration/meta-textgeneration-llama-3-70b/artifacts/inference-prepack/v1.0.0/',
     'S3DataType': 'S3Prefix',
     'CompressionType': 'None'}},
   'Environment': {'SAGEMAKER_PROGRAM': 'inference.py',
    'ENDPOINT_SERVER_TIMEOUT': '3600',
    'MODEL_CACHE_ROOT': '/opt/ml/model',
    'SAGEMAKER_ENV': '1',
    'HF_MODEL_ID': '/opt/ml/model',
    'SERVING_LOAD_MODELS': 'test::MPI=/opt/ml/model',
    'OPTION_MODEL_ID': '/opt/ml/model',
    'OPTION_SPECULATIVE_DRAFT_MODEL': 'sagemaker',
    'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
    'OPTION_MAX_ROLLING_BATCH_SIZE': '64',
    'OPTION_ROLLING_BATCH': 'lmi-dist',
    'OPTION_GPU_MEMORY_UTILIZATION': '0.85',
    'SAGEMAKER_MODEL_SERVER_WORKERS': '1'},
   'InstanceType': 'ml.p4d.24xlarge',
   'ComputeResourceRequirements': {'MinMemoryRequiredInMb': 589824,
    'NumberOfAcceleratorDevicesRequired': 8},
   'ModelDataDownloadTimeout': 1200,
   'ContainerStartupHealthCheckTimeout': 1200},
  'AccelerationConfigs': None,
  'BenchmarkMetrics': None},
 {'DeploymentConfigName': 'lmi',
  'DeploymentArgs': {'ImageUri': '875423407011.dkr.ecr.us-west-2.amazonaws.com/suzuka-beta:djl-serving-v3.2',
   'ModelData': {'S3DataSource': {'S3Uri': 's3://jumpstart-private-cache-prod-us-west-2/meta-textgeneration/meta-textgeneration-llama-3-70b/artifacts/inference-prepack/v1.0.0/',
     'S3DataType': 'S3Prefix',
     'CompressionType': 'None'}},
   'Environment': {'SAGEMAKER_PROGRAM': 'inference.py',
    'ENDPOINT_SERVER_TIMEOUT': '3600',
    'MODEL_CACHE_ROOT': '/opt/ml/model',
    'SAGEMAKER_ENV': '1',
    'HF_MODEL_ID': '/opt/ml/model',
    'SERVING_LOAD_MODELS': 'test::MPI=/opt/ml/model',
    'OPTION_MODEL_ID': '/opt/ml/model',
    'OPTION_SPECULATIVE_DRAFT_MODEL': 'sagemaker',
    'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
    'OPTION_MAX_ROLLING_BATCH_SIZE': '64',
    'OPTION_ROLLING_BATCH': 'lmi-dist',
    'OPTION_GPU_MEMORY_UTILIZATION': '0.85',
    'SAGEMAKER_MODEL_SERVER_WORKERS': '1'},
   'InstanceType': 'ml.p4d.24xlarge',
   'ComputeResourceRequirements': {'MinMemoryRequiredInMb': 589824,
    'NumberOfAcceleratorDevicesRequired': 8},
   'ModelDataDownloadTimeout': 1200,
   'ContainerStartupHealthCheckTimeout': 1200},
  'AccelerationConfigs': None,
  'BenchmarkMetrics': {'ml.p4d.24xlarge': [{'name': 'Min Latency',
     'value': '26',
     'unit': 'ms/token'},
    {'name': 'Max Throughput', 'value': '983', 'unit': 'tokens/sec'},
    {'name': 'Max Concurrency', 'value': '256', 'unit': 'count'}]}},
 {'DeploymentConfigName': 'lmi-accelerated',
  'DeploymentArgs': {'ImageUri': '875423407011.dkr.ecr.us-west-2.amazonaws.com/suzuka-beta:djl-serving-v3.2',
   'ModelData': {'S3DataSource': {'S3Uri': 's3://jumpstart-private-cache-prod-us-west-2/meta-textgeneration/meta-textgeneration-llama-3-70b/artifacts/inference-prepack/v1.0.0/',
     'S3DataType': 'S3Prefix',
     'CompressionType': 'None'}},
   'Environment': {'SAGEMAKER_PROGRAM': 'inference.py',
    'ENDPOINT_SERVER_TIMEOUT': '3600',
    'MODEL_CACHE_ROOT': '/opt/ml/model',
    'SAGEMAKER_ENV': '1',
    'HF_MODEL_ID': '/opt/ml/model',
    'SERVING_LOAD_MODELS': 'test::MPI=/opt/ml/model',
    'OPTION_MODEL_ID': '/opt/ml/model',
    'OPTION_SPECULATIVE_DRAFT_MODEL': 'sagemaker',
    'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
    'OPTION_MAX_ROLLING_BATCH_SIZE': '64',
    'OPTION_ROLLING_BATCH': 'lmi-dist',
    'OPTION_GPU_MEMORY_UTILIZATION': '0.85',
    'SAGEMAKER_MODEL_SERVER_WORKERS': '1'},
   'InstanceType': 'ml.p4d.24xlarge',
   'ComputeResourceRequirements': {'MinMemoryRequiredInMb': 589824,
    'NumberOfAcceleratorDevicesRequired': 8},
   'ModelDataDownloadTimeout': 1200,
   'ContainerStartupHealthCheckTimeout': 1200},
  'AccelerationConfigs': None,
  'BenchmarkMetrics': {'ml.p4d.24xlarge': [{'name': 'Min Latency',
     'value': '10',
     'unit': 'ms/token'},
    {'name': 'Max Throughput', 'value': '635', 'unit': 'tokens/sec'},
    {'name': 'Max Concurrency', 'value': '64', 'unit': 'count'}]}}]

jumpstart_model.display_benchmark_metrics()

Instance rate metrics will be omitted. Reason: User: arn:aws:sts::312206380606:assumed-role/AmazonSageMaker-ExecutionRole-20230707T131628/SageMaker is not authorized to perform: pricing:GetProducts because no identity-based policy allows the pricing:GetProducts action
| Instance Type             | Config Name     |   Min Latency (ms/token) |   Max Throughput (tokens/sec) |   Max Concurrency (count) |
|:--------------------------|:----------------|-------------------------:|------------------------------:|--------------------------:|
| ml.g5.48xlarge            | lmi             |                       57 |                            60 |                         8 |
| ml.p4d.24xlarge           | lmi             |                       26 |                           983 |                       256 |
| ml.g5.48xlarge            | lmi-accelerated |                       33 |                            43 |                         8 |
| ml.p4d.24xlarge (Default) | lmi-accelerated |                       10 |                           635 |                        64 |

-- With an identity-based policy that allows the pricing:GetProducts action: there is no need to restart the kernel. The incomplete output was not cached, so the next call fetches the instance rates, and the complete output is then cached for future requests.

NOTE: After adding the policy, allow at least 10 seconds for it to propagate.
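For reference, a sketch of what such an identity-based policy could look like, written as a Python dict and serialized to JSON. This is an assumption about a typical minimal policy, not the exact policy used for this test; `"Resource": "*"` is assumed here, and the variable names are hypothetical.

```python
import json

# Assumed minimal policy document that allows the pricing:GetProducts action.
pricing_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "pricing:GetProducts",
            "Resource": "*",
        }
    ],
}

# JSON text that could be attached to the execution role as an inline policy.
pricing_policy_json = json.dumps(pricing_policy, indent=2)
print(pricing_policy_json)
```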

jumpstart_model.display_benchmark_metrics()

| Instance Type             | Config Name     |   Min Latency (ms/token) |   Max Throughput (tokens/sec) |   Max Concurrency (count) |   Instance Rate (USD/Hrs) |
|:--------------------------|:----------------|-------------------------:|------------------------------:|--------------------------:|--------------------------:|
| ml.g5.48xlarge            | lmi             |                       57 |                            60 |                         8 |                     20.36 |
| ml.p4d.24xlarge           | lmi             |                       26 |                           983 |                       256 |                     38.67 |
| ml.g5.48xlarge            | lmi-accelerated |                       33 |                            43 |                         8 |                     20.36 |
| ml.p4d.24xlarge (Default) | lmi-accelerated |                       10 |                           635 |                        64 |                     38.67 |
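The two tables above differ only in the Instance Rate column. A rough sketch of how rows like these could be derived from the `list_deployment_configs()` output, adding the rate column only when rates are available, is shown below. The function `benchmark_rows` and the `instance_rates` mapping are hypothetical and do not reflect how the SDK builds the table.

```python
def benchmark_rows(configs, instance_rates=None):
    """Flatten deployment configs into one row per (config, instance type).

    Hypothetical helper for illustration; not the SDK's implementation.
    """
    rows = []
    for config in configs:
        # Configs without benchmark data (BenchmarkMetrics is None) add no rows.
        metrics_by_instance = config.get("BenchmarkMetrics") or {}
        for instance_type, metrics in metrics_by_instance.items():
            row = {
                "Instance Type": instance_type,
                "Config Name": config["DeploymentConfigName"],
            }
            for metric in metrics:
                row[f"{metric['name']} ({metric['unit']})"] = metric["value"]
            # Add the rate column only when a rate is known for this instance.
            if instance_rates and instance_type in instance_rates:
                row["Instance Rate (USD/Hrs)"] = instance_rates[instance_type]
            rows.append(row)
    return rows


# Abbreviated sample shaped like the output shown earlier.
sample_configs = [
    {"DeploymentConfigName": "tgi", "BenchmarkMetrics": None},
    {
        "DeploymentConfigName": "lmi",
        "BenchmarkMetrics": {
            "ml.p4d.24xlarge": [
                {"name": "Min Latency", "value": "26", "unit": "ms/token"},
                {"name": "Max Throughput", "value": "983", "unit": "tokens/sec"},
            ]
        },
    },
]

rows_without_rates = benchmark_rows(sample_configs)
rows_with_rates = benchmark_rows(sample_configs, {"ml.p4d.24xlarge": 38.67})
```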

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

[x] I have read the CONTRIBUTING doc
[x] I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
[x] I used the commit message format described in CONTRIBUTING
[x] I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
[x] I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

[x] I have added tests that prove my fix is effective or that my feature works (if appropriate)
[x] I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
[x] I have checked that my tests are not configured for a specific region or account (if appropriate)
[x] I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Jonathan Makunga added 11 commits May 1, 2024 11:01

Unit tests

055cf4d

Custom lru

5547922

Custom lru

3a71233

Custom lru

0697473

Custom lru

4358dad

Custom lru

7f715e4

Custom lru

7c72895

Custom lru

707a23e

Custom lru

071e9c4

Custom lru

d122b06

Custom lru

01568fd

makungaj1 marked this pull request as ready for review May 1, 2024 20:08

makungaj1 self-assigned this May 1, 2024

qiyunz reviewed May 1, 2024

Jonathan Makunga added 16 commits May 1, 2024 15:42

Refactoring

10430ea

Debug

296cc79

Config ranking

73118f6

Debug

a569871

Debug

c2cfff2

Debug

360033f

Debug

ce2b50b

Debug

8eab467

Ranking

e923392

Ranking-Debug

42f9434

Ranking-Debug

a276253

Ranking-Debug

a6bd660

Ranking-Debug

6421100

Ranking-Debug

a4c58be

Ranking-Debug

9c99144

Debug

8e1cb87

Jonathan Makunga added 4 commits May 2, 2024 08:49

Debug

3aa1f16

Debug

a9d4a7e

Debug

477e272

Refactoring

2c363af

makungaj1 merged commit c05e919 into benchmark-feature-fixes May 2, 2024

makungaj1 deleted the benchmark-feature-fixes-tests branch May 2, 2024 17:31
