sagemaker job failing in transformation step #753

Closed
NEIA20 opened this issue Apr 16, 2019 · 17 comments
@NEIA20

NEIA20 commented Apr 16, 2019

Please fill out the form below.

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): Tensorflow
  • Framework Version: 1.12.0
  • Python Version: Python 3.6.8
  • CPU or GPU: CPU
  • Python SDK Version: 1.18.13
  • Are you using a custom image: No

Describe the problem

We're using sagemaker to parallelize a tensorflow job. We create a model using tensorflow. Training completes successfully. When the job moves on to transformation, it fails with an error: “Unable to get response from algorithm.”

Minimal repro / logs

Stack trace:

  File "/home/abexecutor/ml-conversion/dcs-analytics/ML_models/sage_maker_job_runner.py", line 160, in _predict
    transformer.wait()
  File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/transformer.py", line 135, in wait
    self.latest_transform_job.wait()
  File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/transformer.py", line 209, in wait
    self.sagemaker_session.wait_for_transform_job(self.job_name)
  File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/session.py", line 886, in wait_for_transform_job
    self._check_job_status(job, desc, 'TransformJobStatus')
  File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/session.py", line 908, in _check_job_status
    raise ValueError('Error for {} {}: {} Reason: {}'.format(job_type, job, status, reason))
ValueError: Error for Transform job sagemaker-tensorflow-2019-04-16-20-36-36-284: Failed Reason: AlgorithmError: See job logs for more information

Transformation logs:

2019-04-16T20:39:55.259:[sagemaker logs]: MaxConcurrentTransforms=100, MaxPayloadInMB=1, BatchStrategy=SINGLE_RECORD

#011at java.lang.Thread.run(Thread.java:748)
2019-04-16T20:40:23.253:[sagemaker logs]: normalizedetl/ml-clicks-lookalike/campaign-1863917904632580551/1555446017448/test/testing_data_encoded_mean_aligned_0_2_1.csv: Unable to get response from algorithm
#011at java.lang.Thread.run(Thread.java:748)
2019-04-16T20:40:23.253:[sagemaker logs]: normalizedetl/ml-clicks-lookalike/campaign-1863917904632580551/1555446017448/test/testing_data_encoded_mean_aligned_0_2_1.csv: Unable to get response from algorithm
#011at java.lang.Thread.run(Thread.java:748)
2019-04-16T20:40:23.289:[sagemaker logs]: normalizedetl/ml-clicks-lookalike/campaign-1863917904632580551/1555446017448/test/testing_data_encoded_mean_aligned_0_2_1.csv: Unable to get response from algorithm
  • Exact command to reproduce:

Tensorflow model settings:

estimator = TensorFlow(entry_point='ML_models/tensorflow_entry_point.py',
                       role=ROLE,
                       framework_version='1.12.0',
                       training_steps=1000,
                       evaluation_steps=100,
                       train_instance_count=2,
                       train_instance_type='ml.m5.xlarge')
estimator.output_path = f's3://JOB_PATH/model/TRAIN_TABLE_NAME'

validation_prefix = f's3://JOB_PATH/validation/TRAIN_TABLE_NAME'

s3_eval_train = sagemaker.s3_input(
    s3_data=validation_prefix,
    content_type='csv',
    distribution='ShardedByS3Key')
    
input_prefix = f's3://JOB_PATH/train/TRAIN_TABLE_NAME'

s3_input_train = sagemaker.s3_input(
    s3_data=input_prefix,
    content_type='csv',
    distribution='ShardedByS3Key')
    
estimator.fit({'train': s3_input_train, 'validation': s3_eval_train})

Transformation settings:

transformer = estimator.transformer(instance_count=10,
                                    instance_type='ml.c5.2xlarge',
                                    strategy='SingleRecord',
                                    assemble_with='Line',
                                    max_payload=1,
                                    max_concurrent_transforms=100)

transformer.output_path = f's3://JOB_PATH/predictions/PREDICTION_TABLE_NAME'

transformer.transform(f's3://JOB_PATH/test/PREDICTION_TABLE_NAME',
                              content_type='text/csv',
                              split_type='Line')
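
The ValueError above only says "See job logs for more information", so the first step is usually to pull the failure reason and the container log lines for the job. A minimal sketch of that, assuming boto3 clients for SageMaker and CloudWatch Logs, and assuming the transform logs live in the `/aws/sagemaker/TransformJobs` log group with stream names prefixed by the job name. The helper name `summarize_transform_failure` is hypothetical and not part of the SageMaker SDK.

```python
def summarize_transform_failure(sm_client, logs_client, job_name, max_events=50):
    """Return (status, failure_reason, log_lines) for a batch transform job."""
    desc = sm_client.describe_transform_job(TransformJobName=job_name)
    status = desc["TransformJobStatus"]
    reason = desc.get("FailureReason", "")

    # Assumption: transform containers log to this CloudWatch group, and
    # each stream name starts with the transform job name.
    group = "/aws/sagemaker/TransformJobs"
    streams = logs_client.describe_log_streams(
        logGroupName=group, logStreamNamePrefix=job_name
    )["logStreams"]

    lines = []
    for stream in streams:
        events = logs_client.get_log_events(
            logGroupName=group,
            logStreamName=stream["logStreamName"],
            limit=max_events,
        )["events"]
        lines.extend(event["message"] for event in events)
    return status, reason, lines


# Example call (needs AWS credentials, so it is left commented out):
# import boto3
# status, reason, lines = summarize_transform_failure(
#     boto3.client("sagemaker"), boto3.client("logs"),
#     "sagemaker-tensorflow-2019-04-16-20-36-36-284")
```

Passing the clients in as arguments keeps the helper easy to exercise with stand-in objects before pointing it at a real account.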
@laurenyu
Contributor

hi @NEIA20, thanks for using SageMaker! Could you provide your entry point code and data (or something similar if your code/data is private) for us to reproduce the error?

@NEIA20
Author

NEIA20 commented Apr 17, 2019

Hi @laurenyu, thanks for getting back to me

Here's some of our tensorflow entry point code:

import os
import numpy as np
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.ERROR)

INPUT_TENSOR_NAME = 'inputs'

# Disable MKL to get a better performance for this model.
os.environ['TF_DISABLE_MKL'] = '1'
os.environ['TF_DISABLE_POOL_ALLOCATOR'] = '1'

BASE_PATH = '/opt/ml/input/data/'
TRAIN_PATH = BASE_PATH + 'train'
VALIDATION_PATH = BASE_PATH + 'validation'

def serving_input_fn(params):
    training_dir = '/opt/ml/input/data/train/'
    feature_spec = {INPUT_TENSOR_NAME: tf.FixedLenFeature(dtype=tf.float32, shape=get_shape(training_dir))}
    return tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)()

def eval_input_fn(training_dir, params):
    """Returns input function that would feed the model during evaluation"""
    if os.path.exists(VALIDATION_PATH):
        os.system('cat /opt/ml/input/data/validation/*.csv > /opt/ml/input/data/validation/all_validate.csv')
        return _generate_input_fn(VALIDATION_PATH, 'all_validate.csv')
    else:
        first_file = os.listdir(TRAIN_PATH)[0]
        return _generate_input_fn(TRAIN_PATH, first_file)


def _generate_input_fn(training_dir, training_filename):
    training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=os.path.join(training_dir, training_filename),
        target_dtype=np.float32,
        features_dtype=np.float32,
        target_column=0
    )

    return tf.estimator.inputs.numpy_input_fn(
        x={INPUT_TENSOR_NAME: np.array(training_set.data)},
        y=np.array(training_set.target),
        num_epochs=8200,
        batch_size=512,
        shuffle=True
    )()

And here's one of our testing files:

testing_data_encoded_mean_aligned_0_0_1.csv.zip

@laurenyu
Contributor

@NEIA20 thanks for including all of that! would you also be able to include either the training code or a model artifact? (trying to run this on my end)

@NEIA20
Author

NEIA20 commented Apr 18, 2019

sure thing :)
Here's the model artifact:

model.tar.gz

Let me know if you need anything else

@laurenyu
Contributor

laurenyu commented Apr 18, 2019

@NEIA20 thanks! the batch transform job succeeded for me (ran it twice just to be sure).

here's what I ran (should be very similar to what you posted):

import sagemaker
from sagemaker.tensorflow import TensorFlowModel

ROLE = 'role-name'  # replaced with dummy string
JOB_PATH = 's3-bucket-and-path'  # replaced with dummy string
PREDICTION_TABLE_NAME = 'predict'

model = TensorFlowModel(f's3://{JOB_PATH}/model.tar.gz',
                        ROLE,
                        'entry.py',
                        sagemaker_session=sagemaker.session.Session())

transformer = model.transformer(instance_count=10,
                                instance_type='ml.c5.2xlarge',
                                strategy='SingleRecord',
                                assemble_with='Line',
                                max_payload=1,
                                max_concurrent_transforms=100)

transformer.output_path = f's3://{JOB_PATH}/predictions/{PREDICTION_TABLE_NAME}'

transformer.transform(f's3://{JOB_PATH}/testing_data_encoded_mean_aligned_0_0_1.csv',
                      content_type='text/csv',
                      split_type='Line')

transformer.wait()

and I copied your entry point code above into a file called entry.py.

Are you consistently encountering the error?

@NEIA20
Author

NEIA20 commented Apr 19, 2019

@laurenyu thanks for trying it out. Yes we are consistently getting the error.

I thought I had given you everything relevant but perhaps it's an issue in our output_fn? Here's the rest of our tensorflow_entry_point.py file:

import os
import numpy as np
import tensorflow as tf
import pandas as pd
import sys
from google.protobuf.json_format import MessageToJson
import json

tf.logging.set_verbosity(tf.logging.ERROR)

INPUT_TENSOR_NAME = 'inputs'

# Disable MKL to get a better performance for this model.
os.environ['TF_DISABLE_MKL'] = '1'
os.environ['TF_DISABLE_POOL_ALLOCATOR'] = '1'

BASE_PATH = '/opt/ml/input/data/'
TRAIN_PATH = BASE_PATH + 'train'
VALIDATION_PATH = BASE_PATH + 'validation'


def get_shape(directory):
    filename0 = os.listdir(directory)[0]
    shape_size = pd.read_csv(directory+filename0, nrows=1).shape[1] - 1
    return shape_size


def estimator_fn(run_config, params):
    training_dir = '/opt/ml/input/data/train/'
    feature_columns = [tf.feature_column.numeric_column(INPUT_TENSOR_NAME, shape=get_shape(training_dir))]
    return tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                      hidden_units=[380, 180],
                                      n_classes=2,
                                      config=run_config)


def serving_input_fn(params):
    training_dir = '/opt/ml/input/data/train/'
    feature_spec = {INPUT_TENSOR_NAME: tf.FixedLenFeature(dtype=tf.float32, shape=get_shape(training_dir))}
    return tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)()


def train_input_fn(training_dir, params):
    """Returns input function that would feed the model during training"""
    for root, dirs, files in os.walk(BASE_PATH):
        print('Current directory ' + root)
        print(files)
    os.system('cat /opt/ml/input/data/train/*.csv > /opt/ml/input/data/train/all_train.csv')
    return _generate_input_fn(TRAIN_PATH, 'all_train.csv')


def eval_input_fn(training_dir, params):
    """Returns input function that would feed the model during evaluation"""
    if os.path.exists(VALIDATION_PATH):
        os.system('cat /opt/ml/input/data/validation/*.csv > /opt/ml/input/data/validation/all_validate.csv')
        return _generate_input_fn(VALIDATION_PATH, 'all_validate.csv')
    else:
        first_file = os.listdir(TRAIN_PATH)[0]
        return _generate_input_fn(TRAIN_PATH, first_file)


def _generate_input_fn(training_dir, training_filename):
    training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=os.path.join(training_dir, training_filename),
        target_dtype=np.float32,
        features_dtype=np.float32,
        target_column=0)

    return tf.estimator.inputs.numpy_input_fn(
        x={INPUT_TENSOR_NAME: np.array(training_set.data)},
        y=np.array(training_set.target),
        num_epochs=8200,
        batch_size=512,
        shuffle=True)()


def output_fn(prediction_result, accepts):
    as_str = MessageToJson(prediction_result)
    try:
        result = json.loads(as_str)
        classes = result['result']['classifications'][0]['classes']
        score_0 = classes[0].get('score', 0.0)
        score_1 = classes[1].get('score', 0.0)
        formatted = "[{score_0}, {score_1}]".format(score_0=score_0, score_1=score_1)
    except KeyError:
        formatted = "[1.0, 0.0]"
        # sys.stderr.write works on both Python 2 and Python 3 ("print >>" is Python 2 only)
        sys.stderr.write('Error while parsing {}\n'.format(prediction_result))
    return formatted
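
One detail of the output_fn above is that it only catches KeyError, so a result with fewer than two classes (IndexError) or an unexpected type would raise instead of falling back. Whether that is related to the failure in this thread is not established. A minimal sketch of a more defensive formatter, operating on the already-parsed dict; the helper name `format_scores` is hypothetical and the result layout is taken from the code above.

```python
def format_scores(result, default="[1.0, 0.0]"):
    """Format the two class scores of a classification result as '[s0, s1]'.

    `result` is the dict obtained from json.loads(MessageToJson(...)).
    Any missing or malformed part of the structure yields `default`.
    """
    try:
        classes = result["result"]["classifications"][0]["classes"]
        # A class entry may omit 'score'; treat a missing score as 0.0,
        # as the original output_fn does.
        score_0 = classes[0].get("score", 0.0)
        score_1 = classes[1].get("score", 0.0)
    except (KeyError, IndexError, TypeError, AttributeError):
        return default
    return "[{}, {}]".format(score_0, score_1)
```

Inside output_fn this would be called as `format_scores(json.loads(MessageToJson(prediction_result)))`, keeping the protobuf handling separate from the formatting so the latter can be tested without a container.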

@laurenyu
Contributor

I created a new batch transform job based on the full entry point code, and it still succeeded for me. I'm going to reach out to the relevant service team to see if they have any insight. thanks for your patience!

@andremoeller
Contributor

Hey @NEIA20 ,

It looks like the algorithm isn't responding to some of the requests. Could you try reducing max_concurrent_transforms to 8? Thank you.
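
For reference, a minimal sketch of what that change looks like against the transformer settings posted earlier in the thread. Only `max_concurrent_transforms` differs from the original values; the helper name `reduced_concurrency_settings` is hypothetical.

```python
def reduced_concurrency_settings(max_concurrent_transforms=8):
    """Keyword arguments for estimator.transformer(...) with lower concurrency.

    All values except max_concurrent_transforms are the ones from the
    original post; 8 is the value suggested in the comment above.
    """
    return {
        "instance_count": 10,
        "instance_type": "ml.c5.2xlarge",
        "strategy": "SingleRecord",
        "assemble_with": "Line",
        "max_payload": 1,
        "max_concurrent_transforms": max_concurrent_transforms,
    }


# Usage (requires a trained estimator, so it is left commented out):
# transformer = estimator.transformer(**reduced_concurrency_settings())
```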

@NEIA20
Author

NEIA20 commented Apr 25, 2019

hi @andremoeller @laurenyu - sorry for the late response.

We had previously tried lowering max_concurrent_transforms to 10 and still got the error. I tried again with 8 just now and the same error is still there unfortunately

@NEIA20
Author

NEIA20 commented Apr 30, 2019

hi @andremoeller @laurenyu, any updates on what might be the issue?

@andremoeller
Contributor

Hey @NEIA20 ,

We're actively looking into this with your most recent Transform Job. It seems like we aren't handling errors in certain cases correctly, but we don't have a fix yet. We will keep this issue updated, and when we know when we expect to have a fix deployed, we will let you know. Thanks!

@NEIA20
Author

NEIA20 commented May 1, 2019

thanks @andremoeller, glad to hear it. If you need any more information from us just let me know

@andremoeller
Contributor

Hi @NEIA20 ,

It seems like the problem is both that the container isn't handling multiple requests well, and that we aren't handling certain types of errors. There's a newer TensorFlow Serving container that you can deploy to with endpoint_type 'tensorflow-serving' which we expect to work better in this case, if your goal is to get predictions from CSV data -- the docs are located here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst

Would using the newer TensorFlow serving container satisfy your use case?

Thank you.

@NEIA20
Author

NEIA20 commented May 13, 2019

hi @andremoeller - the thing is that we're not generating predictions using a SageMaker Endpoint, we're using batch transform instead

@andremoeller
Contributor

Hi @NEIA20

Got it -- that doc is missing some information relevant to Batch Transform, but it's still possible to use Batch with the newer TensorFlow Serving container. If you have already trained a model, you can use the new TensorFlow Serving container like this:

from sagemaker.tensorflow.serving import Model

model = Model(model_data='s3://../model.tar.gz', role=role, ...)
transformer = model.transformer(...)
transformer.transform(...)
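
The elided arguments above can be filled in from the settings used earlier in this thread. A minimal sketch under that assumption: `run_batch_transform` is a hypothetical helper, it takes any object exposing the `transformer(...)` method shown above, and the default concurrency of 8 is the value suggested earlier rather than a verified fix.

```python
def run_batch_transform(model, input_uri, output_path,
                        instance_count=10,
                        instance_type="ml.c5.2xlarge",
                        max_concurrent_transforms=8):
    """Create a transformer from `model`, run it on `input_uri`, and wait."""
    transformer = model.transformer(
        instance_count=instance_count,
        instance_type=instance_type,
        strategy="SingleRecord",
        assemble_with="Line",
        max_payload=1,
        max_concurrent_transforms=max_concurrent_transforms,
    )
    # Same pattern as the original post: set the output location afterwards.
    transformer.output_path = output_path
    transformer.transform(input_uri, content_type="text/csv", split_type="Line")
    transformer.wait()  # raises if the transform job ends in a failed state
    return transformer


# Usage with the TensorFlow Serving model class from the comment above
# (MODEL_DATA, ROLE, INPUT_URI and OUTPUT_URI are placeholders):
# from sagemaker.tensorflow.serving import Model
# model = Model(model_data=MODEL_DATA, role=ROLE)
# run_batch_transform(model, INPUT_URI, OUTPUT_URI)
```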

@jesterhazy
Contributor

It looks like the question has been solved, so I am closing this issue.

@AlexJonesNLP

I'm having this issue as well. The "solution" @andremoeller offered is unhelpful.

qidewenwhen pushed a commit to qidewenwhen/sagemaker-python-sdk that referenced this issue Dec 13, 2022
qidewenwhen pushed a commit to qidewenwhen/sagemaker-python-sdk that referenced this issue Dec 14, 2022
navinsoni pushed a commit that referenced this issue Dec 14, 2022
* feature: Add experiment plus Run class (#691)

* feature: Add Experiment helper classes (#646)

* feature: Add Experiment helper classes

feature: Add helper class _RunEnvironment

* change: Change sleep retry to backoff retry for get TC

* minor fixes in backoff retry

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add helper classes and methods for Run class (#660)

* feature: Add helper classes and methods for Run class

* Add Parent class to address comment

* fix docstyle check

* Add arg docstrings in _helper

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add Experiment Run class (#651)

Co-authored-by: Dewen Qi <[email protected]>

* change: Add integ tests for Run (#673)

Co-authored-by: Dewen Qi <[email protected]>

* Update run log metric to use MetricsManager (#678)

* Update run.log_metric to use _MetricsManager

* fix several metrics issues

* Add doc strings to metrics.py

Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dewen Qi <[email protected]>

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>

* change: Simplify exp plus integ test configuration (#694)

Co-authored-by: Dewen Qi <[email protected]>

* feature: add RunName to expeirment_config (#696)

* change: Update Run init and add Run load and _RunContext (#707)

* change: Update Run init and add Run load

Add exp name and run group name to load and address comments

* Address nit comments

Co-authored-by: Dewen Qi <[email protected]>

* fix: Fix run name uniqueness issue (#730)

Co-authored-by: Dewen Qi <[email protected]>

* change: Update integ tests for Exp Plus M1 changes (#741)

Co-authored-by: Dewen Qi <[email protected]>

* add metrics client to session object (#745)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* change: Add integ test for using Run in Transform Job (#749)

Co-authored-by: Dewen Qi <[email protected]>

* Add async metrics sink (#739)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* use metrics client provided by session (#754)

* fix flaky metrics test (#753)

* change: Change Run.init and Run.load to constructor and module method respectively (#752)

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add latest metric service model (#757)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* fix: lowercase run name (#767)

* Change: Minimize use of lower case tc name (#769)

* change: Clean up test resources to remove model files (#756)

* change: Clean up test resources to remove model files

* fix: Change experiment enums to upper case

* change: Upgrade boto3 and update test to validate mixed case name

* fix: Update as per latest botocore release and backend change

Co-authored-by: Dewen Qi <[email protected]>

* lowercase trial component name (#776)

* change: Expose sagemaker experiment doc strings

* fix: Fix exp name mixed case in issue

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Yifei Zhu <[email protected]>
claytonparnell pushed a commit to claytonparnell/sagemaker-python-sdk that referenced this issue Dec 16, 2022
mufaddal-rohawala pushed a commit to mufaddal-rohawala/sagemaker-python-sdk that referenced this issue Dec 19, 2022
mufaddal-rohawala pushed a commit that referenced this issue Dec 20, 2022
JoseJuan98 pushed a commit to JoseJuan98/sagemaker-python-sdk that referenced this issue Mar 4, 2023
* feature: Add experiment plus Run class (aws#691)

* feature: Add Experiment helper classes (aws#646)

* feature: Add Experiment helper classes

feature: Add helper class _RunEnvironment

* change: Change sleep retry to backoff retry for get TC

* minor fixes in backoff retry

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add helper classes and methods for Run class (aws#660)

* feature: Add helper classes and methods for Run class

* Add Parent class to address comment

* fix docstyle check

* Add arg docstrings in _helper

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add Experiment Run class (aws#651)

Co-authored-by: Dewen Qi <[email protected]>

* change: Add integ tests for Run (aws#673)

Co-authored-by: Dewen Qi <[email protected]>

* Update run log metric to use MetricsManager (aws#678)

* Update run.log_metric to use _MetricsManager

* fix several metrics issues

* Add doc strings to metrics.py

Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dewen Qi <[email protected]>

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>

* change: Simplify exp plus integ test configuration (aws#694)

Co-authored-by: Dewen Qi <[email protected]>

* feature: add RunName to expeirment_config (aws#696)

* change: Update Run init and add Run load and _RunContext (aws#707)

* change: Update Run init and add Run load

Add exp name and run group name to load and address comments

* Address nit comments

Co-authored-by: Dewen Qi <[email protected]>

* fix: Fix run name uniqueness issue (aws#730)

Co-authored-by: Dewen Qi <[email protected]>

* change: Update integ tests for Exp Plus M1 changes (aws#741)

Co-authored-by: Dewen Qi <[email protected]>

* add metrics client to session object (aws#745)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* change: Add integ test for using Run in Transform Job (aws#749)

Co-authored-by: Dewen Qi <[email protected]>

* Add async metrics sink (aws#739)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* use metrics client provided by session (aws#754)

* fix flaky metrics test (aws#753)

* change: Change Run.init and Run.load to constructor and module method respectively (aws#752)

Co-authored-by: Dewen Qi <[email protected]>

* feature: Add latest metric service model (aws#757)

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>

* fix: lowercase run name (aws#767)

* Change: Minimize use of lower case tc name (aws#769)

* change: Clean up test resources to remove model files (aws#756)

* change: Clean up test resources to remove model files

* fix: Change experiment enums to upper case

* change: Upgrade boto3 and update test to validate mixed case name

* fix: Update as per latest botocore release and backend change

Co-authored-by: Dewen Qi <[email protected]>

* lowercase trial component name (aws#776)

* change: Expose sagemaker experiment doc strings

* fix: Fix exp name mixed case in issue

Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Yifei Zhu <[email protected]>
JoseJuan98 pushed a commit to JoseJuan98/sagemaker-python-sdk that referenced this issue Mar 4, 2023
* feature: Add experiment plus Run class (aws#691)

nmadan pushed a commit to nmadan/sagemaker-python-sdk that referenced this issue Apr 18, 2023
* feature: Add experiment plus Run class (aws#691)
