Pytorch deployment failing with unexpected errors #752

Closed
carlomazzaferro opened this issue Apr 12, 2019 · 6 comments
@carlomazzaferro

System Information

  • Framework: Pytorch
  • Framework Version: 1.0.0
  • Python Version: 3
  • CPU or GPU: CPU
  • Python SDK Version: 1.18.4
  • Are you using a custom image: No

Describe the problem

Model deployment fails with cryptic errors. See the logs below. The command issued to deploy the model is the following:

import os

from sagemaker.pytorch import PyTorchModel

# ROLE and sm_session are defined elsewhere in the deployment script
MODEL_PATH = 's3://sagemaker-us-east-2-971148336196/improved-ner-training-0-25-0/output/model.tar.gz'

MODEL_NAME = 'improved-ner-model-model-' + os.environ['ENVIRONMENT']
ENDPOINT_NAME = 'improved-ner-model-sagemaker-endpoint-' + os.environ['ENVIRONMENT']

DEPLOY_INSTANCE = 'ml.m5.large'

model = PyTorchModel(model_data=MODEL_PATH, role=ROLE, entry_point='train_model.py',
                     sagemaker_session=sm_session, py_version='py3', framework_version='1.0.0',
                     name=ENDPOINT_NAME)

model.deploy(initial_instance_count=1, instance_type=DEPLOY_INSTANCE, endpoint_name=ENDPOINT_NAME)

The model is publicly available here:
https://s3.us-east-2.amazonaws.com/sagemaker-us-east-2-971148336196/improved-ner-training-0-25-0/output/model.tar.gz

It contains a directory called flair, which contains the final-model.pt file.

The (relevant) part of the train_model.py script is the following:

def model_fn(model_dir):
    f_out = os.path.join(model_dir, 'flair')
    m = SequenceTagger.load_from_file(os.path.join(f_out, 'final-model.pt'))
    return m

def input_fn(request_body, request_content_type):
    if request_content_type.lower() != 'application/json':
        raise ValueError('Content type must be application/json')

    # The body arrives as serialized JSON, so parse it before looking up keys
    # (needs `import json` at the top of the script)
    payload = json.loads(request_body)
    if 'sentence' not in payload:
        raise ValueError('Request must be JSON formatted with key: sentence')
    return payload['sentence']


def predict_fn(input_data, model):
    return model.predict(input_data)


if __name__ == "__main__":
    args, _ = parse_args()

    flair_out = os.path.join(args.model_dir, 'flair')
    trainer(flair_out)  # This trains a model using flair.trainer.ModelTrainer

    model = SequenceTagger.load_from_file(os.path.join(flair_out, 'final-model.pt'))
    # create example sentence
    sentence = Sentence('I love Berlin')

    # predict tags and print
    model.predict(sentence)

Minimal repro / logs

The CloudWatch logs are very opaque. One of the errors is the following:

sagemaker_containers._errors.ClientError: [Errno 30] Read-only file system: '/opt/ml/model/flair/final-model.pt'

Then, much later, these errors pop up:

Processing /opt/ml/code
Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/pip-req-tracker-27gca9by/35241637574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3'
You are using pip version 18.1, however version 19.0.3 is available.
[2019-04-12 03:49:26 +0000] [25] [ERROR] Error handling request /ping

Any ideas of what is actually causing the error, or some other steps to take to make it easier to debug?

@carlomazzaferro
Author

Got to the bottom of it, and fixed it with quite an unsavory hack:

def model_fn(model_dir):
    # os.system reports failure through its exit status, not by raising an exception
    status = os.system('cp /opt/ml/model/final-model.pt /etc')
    if status != 0:
        print("os.system('cp /opt/ml/model/final-model.pt /etc') failed with exit status: ", status)

    return SequenceTagger.load_from_file('/etc/final-model.pt')

Apparently the /opt/ml/model directory is mounted on a separate, read-only filesystem, so it is not writable. It needs to be writable because of the specific way flair loads models. See here.
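
A somewhat less invasive variant of this workaround could copy the model into a fresh temporary directory rather than /etc. This is only a sketch: copy_to_writable_dir is a hypothetical helper (not part of flair or the SageMaker SDK), and it assumes that a writable copy of the file is all the loader needs, as the read-only error suggests.

```python
import os
import shutil
import tempfile


def copy_to_writable_dir(src_path):
    """Copy src_path into a new temporary directory and return the new path.

    Hypothetical helper: it assumes the loader only needs the file to live
    in a writable location.
    """
    writable_dir = tempfile.mkdtemp(prefix='model-')
    dst_path = os.path.join(writable_dir, os.path.basename(src_path))
    shutil.copy(src_path, dst_path)
    return dst_path
```

Inside model_fn, the returned path would be passed to SequenceTagger.load_from_file in place of the hard-coded /etc/final-model.pt.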

I'll close this for now, but I'd really like to know if there is a more efficient way of testing the deployments locally. Granted, the docker containers are made available for Pytorch and Tensorflow, but there is no clear documentation on the steps required to build the images with the required entry_point script, model_data, etc. locally and run the inference container as it would on the SageMaker environment.
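
Short of running the full container, the handler functions can at least be exercised in plain Python before deploying. The sketch below rests on assumptions: run_handlers and the stub functions are hypothetical, and the call order (model_fn once, then input_fn and predict_fn per request) is a simplification of what the serving container does. It will not catch filesystem or packaging problems like the ones above.

```python
import json


def run_handlers(model_fn, input_fn, predict_fn, model_dir, body,
                 content_type='application/json'):
    """Call the three handlers in the order a request would trigger them."""
    model = model_fn(model_dir)            # normally done once at startup
    data = input_fn(body, content_type)    # deserialize the request body
    return predict_fn(data, model)         # run inference


# Stub handlers standing in for the real ones in train_model.py.
def stub_model_fn(model_dir):
    return str.upper                       # a fake "model" that upper-cases text


def stub_input_fn(request_body, request_content_type):
    if request_content_type.lower() != 'application/json':
        raise ValueError('Content type must be application/json')
    return json.loads(request_body)['sentence']


def stub_predict_fn(input_data, model):
    return model(input_data)


result = run_handlers(stub_model_fn, stub_input_fn, stub_predict_fn,
                      '/tmp/model', json.dumps({'sentence': 'I love Berlin'}))
print(result)
```

Swapping the stubs for the real functions imported from the entry-point script would turn this into a quick smoke test of the request path.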

@ShaneRyan1977

Hi Carlo,

I found your post: #752
I'm wondering if you could answer one thing. Did you have any issues getting flair installed? In the endpoint docker image? What version of flair are you using?

My issue is that I keep getting "No module named 'flair'". Your model.tar.gz file is no longer available. Did that have something else inside it other than the pt or pth file?

Thanks in advance! I am completely stumped.

Here is the output from my Cloudwatch
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
[2019-06-13 18:43:31 +0000] [21] [INFO] Starting gunicorn 19.9.0
[2019-06-13 18:43:31 +0000] [21] [INFO] Listening at: unix:/tmp/gunicorn.sock (21)
[2019-06-13 18:43:31 +0000] [21] [INFO] Using worker: gevent
[2019-06-13 18:43:31 +0000] [28] [INFO] Booting worker with pid: 28
[2019-06-13 18:43:31 +0000] [29] [INFO] Booting worker with pid: 29
[2019-06-13 18:43:31 +0000] [33] [INFO] Booting worker with pid: 33
[2019-06-13 18:43:31 +0000] [37] [INFO] Booting worker with pid: 37
Processing /opt/ml/code
Building wheels for collected packages: mnist
  Running setup.py bdist_wheel for mnist: started
  Running setup.py bdist_wheel for mnist: finished with status 'done'
  Stored in directory: /tmp/pip-ephem-wheel-cache-ufs3gl8c/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
Successfully built mnist
Installing collected packages: mnist
Successfully installed mnist-1.0.0
You are using pip version 18.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
[2019-06-13 18:44:37 +0000] [28] [ERROR] Error handling request /ping
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/sagemaker_containers/_functions.py", line 84, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/mnist.py", line 159, in model_fn
    model.load_state_dict(torch.load(f))
  File "/usr/local/lib/python3.5/dist-packages/torch/serialization.py", line 303, in load
    return _load(f,
ImportError: No module named 'flair'

@carlomazzaferro
Author

carlomazzaferro commented Jun 13, 2019

Hey @ShaneRyan1977, AFAIK there are a couple of ways, one a bit hackier than the other. The first is to create a custom Docker image:

# For more information on creating a Dockerfile
# https://docs.docker.com/compose/gettingstarted/#step-2-create-a-dockerfile
# https://github.com/awslabs/amazon-sagemaker-examples/master/advanced_functionality/pytorch_extending_our_containers/pytorch_extending_our_containers.ipynb
# SageMaker PyTorch image

ARG REGION

# These are the base images for pytorch-sagemaker 
FROM 520713654638.dkr.ecr.${REGION}.amazonaws.com/sagemaker-pytorch:1.0.0-cpu-py3  

RUN pip install flair          # Install flair, and whatever else you need

ENV PATH="/opt/ml/code:${PATH}"

# /opt/ml and all subdirectories are utilized by SageMaker, we use the /code subdirectory to store our user code.
COPY /ner /opt/ml/code

# this environment variable is used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

You can then push that image to a private ECR repo with something like this:

algorithm_name=sagemaker-pytorch-ner
account=$(aws sts get-caller-identity --query Account --output text)

echo "ACCOUNT: ${account}"

region='us-east-1'

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)

# Get the login command from ECR in order to pull down the SageMaker PyTorch image
$(aws ecr get-login --registry-ids 520713654638 --region ${region} --no-include-email)

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${algorithm_name} . --build-arg REGION=${region}
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

Then, when training, you can tell the PyTorch estimator to use your image through its image_name parameter: https://sagemaker.readthedocs.io/en/stable/sagemaker.pytorch.html#pytorch-estimator

Alternatively, you can add a couple of lines at the very top of your training code:

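import subprocess
import sys

# Install flair at startup by running pip as a subprocess; pip's internal API
# (pip._internal) is not stable across pip versions
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'flair'])

import flair
import torch
# etc.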

Lastly, you can also use a bash file as your entry point. That bash script can run pip commands and then call python train.py at the end. There are some gotchas with the bash approach, though, and I haven't had much success with it.
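
For the install-at-startup approach, a slightly more defensive sketch is to install only when the package is missing. ensure_package is a hypothetical helper name. It shells out to python -m pip rather than calling pip's internal API, and it assumes the container can reach a package index.

```python
import importlib.util
import subprocess
import sys


def ensure_package(module_name, pip_name=None):
    """Install pip_name (default: module_name) only if module_name is not importable.

    Returns True if an install was attempted, False if the module was already present.
    """
    if importlib.util.find_spec(module_name) is not None:
        return False
    subprocess.check_call(
        [sys.executable, '-m', 'pip', 'install', pip_name or module_name])
    return True
```

Calling ensure_package('flair') ahead of the import flair line would skip the install on any worker where the package is already present.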

@ShaneRyan1977

Thanks for the incredibly quick reply. But I'm having a problem with deploying a model to an endpoint rather than training.

I actually trained the model using flair in a JupyterLab. Then I uploaded the model.tar.gz to s3.

I am trying to do something almost identical to your original code but deploying a trained model to an endpoint.

But I keep getting the errors I pasted.
I created a custom docker image with python 3.6 installed and python 3.5 not installed. But for some reason the code that runs in "/opt/ml/code" keeps installing python3.5 and then I get an error in Cloudwatch :
ImportError: No module named 'flair'

I am completely stumped. I am also amazed that you didn't have to build a custom image for deploying your endpoint!
Clearly something happens when the model is decompressed in:

@ShaneRyan1977

The reason Python 3.5 is an issue is that flair 0.4.2 doesn't support it (it needs Python 3.6), and flair 0.4.2 is what I trained my model with.
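
Since the interpreter version is the crux here, a guard at the top of the inference script could make the mismatch fail with an explicit message rather than surfacing later as a missing module. A minimal sketch: require_python is a hypothetical helper, and the 3.6 floor is taken from the flair 0.4.2 requirement mentioned above.

```python
import sys


def require_python(min_version, current=None):
    """Raise RuntimeError if the interpreter is older than min_version.

    current defaults to the running interpreter; it is a parameter only so
    the check can be exercised against other versions.
    """
    current = tuple(current if current is not None else sys.version_info[:2])
    if current < tuple(min_version):
        raise RuntimeError(
            'Python %d.%d or newer is required, found %d.%d'
            % (tuple(min_version)[:2] + current[:2]))
    return current
```

Placing require_python((3, 6)) before any flair import would show in the CloudWatch logs which interpreter the endpoint actually picked up.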

@carlomazzaferro
Author

The same applies here: you can either define your own Docker image for deployment or add the pip install step to your main inference script. You can create a model instance from a pre-trained model using just this class: https://sagemaker.readthedocs.io/en/stable/sagemaker.pytorch.html#pytorch-model

Namely:

model = PyTorchModel(model_data='s3://path/to/model.tar.gz', image_name='ecr_image_with_flair', ...)

Then you can model.deploy() as you normally would.

I haven't tested this in some time, I actually solved the issue by deploying models with terraform (which I'd advise you to look into, as it solves many problems as far as repeatable deployments go). See here: https://www.terraform.io/docs/providers/aws/r/sagemaker_model.html

claytonparnell pushed a commit to claytonparnell/sagemaker-python-sdk that referenced this issue Dec 16, 2022
* feature: Add experiment plus Run class (aws#691)

* feature: Add Experiment helper classes (aws#646)

* feature: Add Experiment helper classes

feature: Add helper class _RunEnvironment

* change: Change sleep retry to backoff retry for get TC

* minor fixes in backoff retry

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add helper classes and methods for Run class (aws#660)

* feature: Add helper classes and methods for Run class

* Add Parent class to address comment

* fix docstyle check

* Add arg docstrings in _helper

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add Experiment Run class (aws#651)

Co-authored-by: Dewen Qi <[email protected]>

* change: Add integ tests for Run (aws#673)

Co-authored-by: Dewen Qi <[email protected]>

* Update run log metric to use MetricsManager (aws#678)

* Update run.log_metric to use _MetricsManager

* fix several metrics issues

* Add doc strings to metrics.py

Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dewen Qi <[email protected]>

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>

* change: Simplify exp plus integ test configuration (aws#694)

Co-authored-by: Dewen Qi <[email protected]>

* feature: add RunName to expeirment_config (aws#696)

* change: Update Run init and add Run load and _RunContext (aws#707)

* change: Update Run init and add Run load

Add exp name and run group name to load and address comments

* Address nit comments

Co-authored-by: Dewen Qi <[email protected]>

* fix: Fix run name uniqueness issue (aws#730)

Co-authored-by: Dewen Qi <[email protected]>

* change: Update integ tests for Exp Plus M1 changes (aws#741)

Co-authored-by: Dewen Qi <[email protected]>

* add metrics client to session object (aws#745)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* change: Add integ test for using Run in Transform Job (aws#749)

Co-authored-by: Dewen Qi <[email protected]>

* Add async metrics sink (aws#739)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* use metrics client provided by session (aws#754)

* fix flaky metrics test (aws#753)

* change: Change Run.init and Run.load to constructor and module method respectively (aws#752)

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add latest metric service model (aws#757)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* fix: lowercase run name (aws#767)

* Change: Minimize use of lower case tc name (aws#769)

* change: Clean up test resources to remove model files (aws#756)

* change: Clean up test resources to remove model files

* fix: Change experiment enums to upper case

* change: Upgrade boto3 and update test to validate mixed case name

* fix: Update as per latest botocore release and backend change

Co-authored-by: Dewen Qi <[email protected]>

* lowercase trial component name (aws#776)

* change: Expose sagemaker experiment doc strings

* fix: Fix exp name mixed case in issue

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Yifei Zhu <[email protected]>
mufaddal-rohawala pushed a commit to mufaddal-rohawala/sagemaker-python-sdk that referenced this issue Dec 19, 2022
* feature: Add experiment plus Run class (aws#691)

* feature: Add Experiment helper classes (aws#646)

* feature: Add Experiment helper classes

feature: Add helper class _RunEnvironment

* change: Change sleep retry to backoff retry for get TC

* minor fixes in backoff retry

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add helper classes and methods for Run class (aws#660)

* feature: Add helper classes and methods for Run class

* Add Parent class to address comment

* fix docstyle check

* Add arg docstrings in _helper

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add Experiment Run class (aws#651)

Co-authored-by: Dewen Qi <[email protected]>

* change: Add integ tests for Run (aws#673)

Co-authored-by: Dewen Qi <[email protected]>

* Update run log metric to use MetricsManager (aws#678)

* Update run.log_metric to use _MetricsManager

* fix several metrics issues

* Add doc strings to metrics.py

Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dewen Qi <[email protected]>

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>

* change: Simplify exp plus integ test configuration (aws#694)

Co-authored-by: Dewen Qi <[email protected]>

* feature: add RunName to expeirment_config (aws#696)

* change: Update Run init and add Run load and _RunContext (aws#707)

* change: Update Run init and add Run load

Add exp name and run group name to load and address comments

* Address nit comments

Co-authored-by: Dewen Qi <[email protected]>

* fix: Fix run name uniqueness issue (aws#730)

Co-authored-by: Dewen Qi <[email protected]>

* change: Update integ tests for Exp Plus M1 changes (aws#741)

Co-authored-by: Dewen Qi <[email protected]>

* add metrics client to session object (aws#745)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* change: Add integ test for using Run in Transform Job (aws#749)

Co-authored-by: Dewen Qi <[email protected]>

* Add async metrics sink (aws#739)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* use metrics client provided by session (aws#754)

* fix flaky metrics test (aws#753)

* change: Change Run.init and Run.load to constructor and module method respectively (aws#752)

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add latest metric service model (aws#757)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* fix: lowercase run name (aws#767)

* Change: Minimize use of lower case tc name (aws#769)

* change: Clean up test resources to remove model files (aws#756)

* change: Clean up test resources to remove model files

* fix: Change experiment enums to upper case

* change: Upgrade boto3 and update test to validate mixed case name

* fix: Update as per latest botocore release and backend change

Co-authored-by: Dewen Qi <[email protected]>

* lowercase trial component name (aws#776)

* change: Expose sagemaker experiment doc strings

* fix: Fix exp name mixed case in issue

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Yifei Zhu <[email protected]>
mufaddal-rohawala pushed a commit that referenced this issue Dec 20, 2022
* feature: Add experiment plus Run class (#691)

* feature: Add Experiment helper classes (#646)

* feature: Add Experiment helper classes

feature: Add helper class _RunEnvironment

* change: Change sleep retry to backoff retry for get TC

* minor fixes in backoff retry

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add helper classes and methods for Run class (#660)

* feature: Add helper classes and methods for Run class

* Add Parent class to address comment

* fix docstyle check

* Add arg docstrings in _helper

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add Experiment Run class (#651)

Co-authored-by: Dewen Qi <[email protected]>

* change: Add integ tests for Run (#673)

Co-authored-by: Dewen Qi <[email protected]>

* Update run log metric to use MetricsManager (#678)

* Update run.log_metric to use _MetricsManager

* fix several metrics issues

* Add doc strings to metrics.py

Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dewen Qi <[email protected]>

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>

* change: Simplify exp plus integ test configuration (#694)

Co-authored-by: Dewen Qi <[email protected]>

* feature: add RunName to expeirment_config (#696)

* change: Update Run init and add Run load and _RunContext (#707)

* change: Update Run init and add Run load

Add exp name and run group name to load and address comments

* Address nit comments

Co-authored-by: Dewen Qi <[email protected]>

* fix: Fix run name uniqueness issue (#730)

Co-authored-by: Dewen Qi <[email protected]>

* change: Update integ tests for Exp Plus M1 changes (#741)

Co-authored-by: Dewen Qi <[email protected]>

* add metrics client to session object (#745)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* change: Add integ test for using Run in Transform Job (#749)

Co-authored-by: Dewen Qi <[email protected]>

* Add async metrics sink (#739)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* use metrics client provided by session (#754)

* fix flaky metrics test (#753)

* change: Change Run.init and Run.load to constructor and module method respectively (#752)

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add latest metric service model (#757)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* fix: lowercase run name (#767)

* Change: Minimize use of lower case tc name (#769)

* change: Clean up test resources to remove model files (#756)

* change: Clean up test resources to remove model files

* fix: Change experiment enums to upper case

* change: Upgrade boto3 and update test to validate mixed case name

* fix: Update as per latest botocore release and backend change

Co-authored-by: Dewen Qi <[email protected]>

* lowercase trial component name (#776)

* change: Expose sagemaker experiment doc strings

* fix: Fix exp name mixed case in issue

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Yifei Zhu <[email protected]>
JoseJuan98 pushed a commit to JoseJuan98/sagemaker-python-sdk that referenced this issue Mar 4, 2023
* feature: Add experiment plus Run class (aws#691)

* feature: Add Experiment helper classes (aws#646)

* feature: Add Experiment helper classes

feature: Add helper class _RunEnvironment

* change: Change sleep retry to backoff retry for get TC

* minor fixes in backoff retry

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add helper classes and methods for Run class (aws#660)

* feature: Add helper classes and methods for Run class

* Add Parent class to address comment

* fix docstyle check

* Add arg docstrings in _helper

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add Experiment Run class (aws#651)

Co-authored-by: Dewen Qi <[email protected]>

* change: Add integ tests for Run (aws#673)

Co-authored-by: Dewen Qi <[email protected]>

* Update run log metric to use MetricsManager (aws#678)

* Update run.log_metric to use _MetricsManager

* fix several metrics issues

* Add doc strings to metrics.py

Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dewen Qi <[email protected]>

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>

* change: Simplify exp plus integ test configuration (aws#694)

Co-authored-by: Dewen Qi <[email protected]>

* feature: add RunName to expeirment_config (aws#696)

* change: Update Run init and add Run load and _RunContext (aws#707)

* change: Update Run init and add Run load

Add exp name and run group name to load and address comments

* Address nit comments

Co-authored-by: Dewen Qi <[email protected]>

* fix: Fix run name uniqueness issue (aws#730)

Co-authored-by: Dewen Qi <[email protected]>

* change: Update integ tests for Exp Plus M1 changes (aws#741)

Co-authored-by: Dewen Qi <[email protected]>

* add metrics client to session object (aws#745)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* change: Add integ test for using Run in Transform Job (aws#749)

Co-authored-by: Dewen Qi <[email protected]>

* Add async metrics sink (aws#739)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* use metrics client provided by session (aws#754)

* fix flaky metrics test (aws#753)

* change: Change Run.init and Run.load to constructor and module method respectively (aws#752)

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add latest metric service model (aws#757)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* fix: lowercase run name (aws#767)

* Change: Minimize use of lower case tc name (aws#769)

* change: Clean up test resources to remove model files (aws#756)

* change: Clean up test resources to remove model files

* fix: Change experiment enums to upper case

* change: Upgrade boto3 and update test to validate mixed case name

* fix: Update as per latest botocore release and backend change

Co-authored-by: Dewen Qi <[email protected]>

* lowercase trial component name (aws#776)

* change: Expose sagemaker experiment doc strings

* fix: Fix exp name mixed case in issue

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Yifei Zhu <[email protected]>
JoseJuan98 pushed a commit to JoseJuan98/sagemaker-python-sdk that referenced this issue Mar 4, 2023
* feature: Add experiment plus Run class (aws#691)

* feature: Add Experiment helper classes (aws#646)

* feature: Add Experiment helper classes

feature: Add helper class _RunEnvironment

* change: Change sleep retry to backoff retry for get TC

* minor fixes in backoff retry

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add helper classes and methods for Run class (aws#660)

* feature: Add helper classes and methods for Run class

* Add Parent class to address comment

* fix docstyle check

* Add arg docstrings in _helper

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add Experiment Run class (aws#651)

Co-authored-by: Dewen Qi <[email protected]>

* change: Add integ tests for Run (aws#673)

Co-authored-by: Dewen Qi <[email protected]>

* Update run log metric to use MetricsManager (aws#678)

* Update run.log_metric to use _MetricsManager

* fix several metrics issues

* Add doc strings to metrics.py

Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dewen Qi <[email protected]>

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>

* change: Simplify exp plus integ test configuration (aws#694)

Co-authored-by: Dewen Qi <[email protected]>

* feature: add RunName to expeirment_config (aws#696)

* change: Update Run init and add Run load and _RunContext (aws#707)

* change: Update Run init and add Run load

Add exp name and run group name to load and address comments

* Address nit comments

Co-authored-by: Dewen Qi <[email protected]>

* fix: Fix run name uniqueness issue (aws#730)

Co-authored-by: Dewen Qi <[email protected]>

* change: Update integ tests for Exp Plus M1 changes (aws#741)

Co-authored-by: Dewen Qi <[email protected]>

* add metrics client to session object (aws#745)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* change: Add integ test for using Run in Transform Job (aws#749)

Co-authored-by: Dewen Qi <[email protected]>

* Add async metrics sink (aws#739)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* use metrics client provided by session (aws#754)

* fix flaky metrics test (aws#753)

* change: Change Run.init and Run.load to constructor and module method respectively (aws#752)

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add latest metric service model (aws#757)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* fix: lowercase run name (aws#767)

* Change: Minimize use of lower case tc name (aws#769)

* change: Clean up test resources to remove model files (aws#756)

* change: Clean up test resources to remove model files

* fix: Change experiment enums to upper case

* change: Upgrade boto3 and update test to validate mixed case name

* fix: Update as per latest botocore release and backend change

Co-authored-by: Dewen Qi <[email protected]>

* lowercase trial component name (aws#776)

* change: Expose sagemaker experiment doc strings

* fix: Fix exp name mixed case in issue

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Yifei Zhu <[email protected]>
nmadan pushed a commit to nmadan/sagemaker-python-sdk that referenced this issue Apr 18, 2023