sagemaker job failing in transformation step #753

NEIA20 · 2019-04-16T22:17:43Z

System Information

Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): Tensorflow
Framework Version: 1.12.0
Python Version: Python 3.6.8
CPU or GPU: CPU
Python SDK Version: 1.18.13
Are you using a custom image: No

Describe the problem

We're using sagemaker to parallelize a tensorflow job. We create a model using tensorflow. Training completes successfully. When the job moves on to transformation, it fails with an error: “Unable to get response from algorithm.”

Minimal repro / logs

Stack trace:

  File "/home/abexecutor/ml-conversion/dcs-analytics/ML_models/sage_maker_job_runner.py", line 160, in _predict
    transformer.wait()
  File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/transformer.py", line 135, in wait
    self.latest_transform_job.wait()
  File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/transformer.py", line 209, in wait
    self.sagemaker_session.wait_for_transform_job(self.job_name)
  File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/session.py", line 886, in wait_for_transform_job
    self._check_job_status(job, desc, 'TransformJobStatus')
  File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/session.py", line 908, in _check_job_status
    raise ValueError('Error for {} {}: {} Reason: {}'.format(job_type, job, status, reason))
ValueError: Error for Transform job sagemaker-tensorflow-2019-04-16-20-36-36-284: Failed Reason: AlgorithmError: See job logs for more information

Transformation logs:

2019-04-16T20:39:55.259:[sagemaker logs]: MaxConcurrentTransforms=100, MaxPayloadInMB=1, BatchStrategy=SINGLE_RECORD

    at java.lang.Thread.run(Thread.java:748)
2019-04-16T20:40:23.253:[sagemaker logs]: normalizedetl/ml-clicks-lookalike/campaign-1863917904632580551/1555446017448/test/testing_data_encoded_mean_aligned_0_2_1.csv: Unable to get response from algorithm
    at java.lang.Thread.run(Thread.java:748)
2019-04-16T20:40:23.253:[sagemaker logs]: normalizedetl/ml-clicks-lookalike/campaign-1863917904632580551/1555446017448/test/testing_data_encoded_mean_aligned_0_2_1.csv: Unable to get response from algorithm
    at java.lang.Thread.run(Thread.java:748)
2019-04-16T20:40:23.289:[sagemaker logs]: normalizedetl/ml-clicks-lookalike/campaign-1863917904632580551/1555446017448/test/testing_data_encoded_mean_aligned_0_2_1.csv: Unable to get response from algorithm

Exact command to reproduce:

Tensorflow model settings:

estimator = TensorFlow(entry_point='ML_models/tensorflow_entry_point.py',
                       role=ROLE,
                       framework_version='1.12.0',
                       training_steps=1000,
                       evaluation_steps=100,
                       train_instance_count=2,
                       train_instance_type='ml.m5.xlarge')
estimator.output_path = f's3://JOB_PATH/model/TRAIN_TABLE_NAME'

validation_prefix = f's3://JOB_PATH/validation/TRAIN_TABLE_NAME'

s3_eval_train = sagemaker.s3_input(
    s3_data=validation_prefix,
    content_type='csv',
    distribution='ShardedByS3Key')
    
input_prefix = f's3://JOB_PATH/train/TRAIN_TABLE_NAME'

s3_input_train = sagemaker.s3_input(
    s3_data=input_prefix,
    content_type='csv',
    distribution='ShardedByS3Key')
    
estimator.fit({'train': s3_input_train, 'validation': s3_eval_train})

Transformation settings:

transformer = estimator.transformer(instance_count=10,
                                    instance_type='ml.c5.2xlarge',
                                    strategy='SingleRecord',
                                    assemble_with='Line',
                                    max_payload=1,
                                    max_concurrent_transforms=100)

transformer.output_path = f's3://JOB_PATH/predictions/PREDICTION_TABLE_NAME'

transformer.transform(f's3://JOB_PATH/test/PREDICTION_TABLE_NAME',
                              content_type='text/csv',
                              split_type='Line')
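
Since the failure reason only says "See job logs for more information", it may help to pull the matching lines out of CloudWatch directly. The following is a minimal sketch, not part of the original report: the helper name `failed_records` is hypothetical, and it assumes the transform job writes to the `/aws/sagemaker/TransformJobs` log group with log streams prefixed by the job name.

```python
def failed_records(job_name, logs_client=None,
                   marker="Unable to get response from algorithm"):
    """Return log messages for a transform job that contain `marker`.

    `failed_records` is a hypothetical helper. The log group name and the
    job-name stream prefix are assumptions about how the job logs are laid out.
    """
    if logs_client is None:
        import boto3  # imported lazily so a stub client can be passed in tests
        logs_client = boto3.client("logs")

    kwargs = {
        "logGroupName": "/aws/sagemaker/TransformJobs",
        "logStreamNamePrefix": job_name,
    }
    messages = []
    while True:
        page = logs_client.filter_log_events(**kwargs)
        messages.extend(
            event["message"]
            for event in page.get("events", [])
            if marker in event["message"]
        )
        token = page.get("nextToken")
        if not token:
            return messages
        # Keep paging until CloudWatch stops returning a continuation token.
        kwargs["nextToken"] = token
```

For the job above this would be called as `failed_records('sagemaker-tensorflow-2019-04-16-20-36-36-284')`, which should list every input file that hit the error.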

laurenyu · 2019-04-17T00:13:23Z

hi @NEIA20, thanks for using SageMaker! Could you provide your entry point code and data (or something similar if your code/data is private) for us to reproduce the error?

NEIA20 · 2019-04-17T18:26:33Z

Hi @laurenyu, thanks for getting back to me

Here's some of our tensorflow entry point code:

import os
import numpy as np
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.ERROR)

INPUT_TENSOR_NAME = 'inputs'

# Disable MKL to get better performance for this model.
os.environ['TF_DISABLE_MKL'] = '1'
os.environ['TF_DISABLE_POOL_ALLOCATOR'] = '1'

BASE_PATH = '/opt/ml/input/data/'
TRAIN_PATH = BASE_PATH + 'train'
VALIDATION_PATH = BASE_PATH + 'validation'

def serving_input_fn(params):
    training_dir = '/opt/ml/input/data/train/'
    feature_spec = {INPUT_TENSOR_NAME: tf.FixedLenFeature(dtype=tf.float32, shape=get_shape(training_dir))}
    return tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)()

def eval_input_fn(training_dir, params):
    """Returns input function that would feed the model during evaluation"""
    if os.path.exists(VALIDATION_PATH):
        os.system('cat /opt/ml/input/data/validation/*.csv > /opt/ml/input/data/validation/all_validate.csv')
        return _generate_input_fn(VALIDATION_PATH, 'all_validate.csv')
    else:
        first_file = os.listdir(TRAIN_PATH)[0]
        return _generate_input_fn(TRAIN_PATH, first_file)


def _generate_input_fn(training_dir, training_filename):
    training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=os.path.join(training_dir, training_filename),
        target_dtype=np.float32,
        features_dtype=np.float32,
        target_column=0
    )

    return tf.estimator.inputs.numpy_input_fn(
        x={INPUT_TENSOR_NAME: np.array(training_set.data)},
        y=np.array(training_set.target),
        num_epochs=8200,
        batch_size=512,
        shuffle=True
    )()

And here's one of our testing files:

testing_data_encoded_mean_aligned_0_0_1.csv.zip

laurenyu · 2019-04-17T20:04:49Z

@NEIA20 thanks for including all of that! would you also be able to include either the training code or a model artifact? (trying to run this on my end)

NEIA20 · 2019-04-18T17:54:09Z

sure thing :)
Here's the model artifact:

model.tar.gz

Let me know if you need anything else

laurenyu · 2019-04-18T21:37:03Z

@NEIA20 thanks! the batch transform job succeeded for me (ran it twice just to be sure).

here's what I ran (should be very similar to what you posted):

import sagemaker
from sagemaker.tensorflow import TensorFlowModel

ROLE = 'role-name'  # replaced with dummy string
JOB_PATH = 's3-bucket-and-path'  # replaced with dummy string
PREDICTION_TABLE_NAME = 'predict'

model = TensorFlowModel(f's3://{JOB_PATH}/model.tar.gz',
                        ROLE,
                        'entry.py',
                        sagemaker_session=sagemaker.session.Session())

transformer = model.transformer(instance_count=10,
                                instance_type='ml.c5.2xlarge',
                                strategy='SingleRecord',
                                assemble_with='Line',
                                max_payload=1,
                                max_concurrent_transforms=100)

transformer.output_path = f's3://{JOB_PATH}/predictions/{PREDICTION_TABLE_NAME}'

transformer.transform(f's3://{JOB_PATH}/testing_data_encoded_mean_aligned_0_0_1.csv',
                      content_type='text/csv',
                      split_type='Line')

transformer.wait()

and I copied your entry point code above into a file called entry.py.

Are you consistently encountering the error?

NEIA20 · 2019-04-19T17:19:20Z

@laurenyu thanks for trying it out. Yes we are consistently getting the error.

I thought I had given you everything relevant but perhaps it's an issue in our output_fn? Here's the rest of our tensorflow_entry_point.py file:

import os
import numpy as np
import tensorflow as tf
import pandas as pd
import sys
from google.protobuf.json_format import MessageToJson
import json

tf.logging.set_verbosity(tf.logging.ERROR)

INPUT_TENSOR_NAME = 'inputs'

# Disable MKL to get better performance for this model.
os.environ['TF_DISABLE_MKL'] = '1'
os.environ['TF_DISABLE_POOL_ALLOCATOR'] = '1'

BASE_PATH = '/opt/ml/input/data/'
TRAIN_PATH = BASE_PATH + 'train'
VALIDATION_PATH = BASE_PATH + 'validation'


def get_shape(directory):
    filename0 = os.listdir(directory)[0]
    shape_size = pd.read_csv(directory+filename0, nrows=1).shape[1] - 1
    return shape_size


def estimator_fn(run_config, params):
    training_dir = '/opt/ml/input/data/train/'
    feature_columns = [tf.feature_column.numeric_column(INPUT_TENSOR_NAME, shape=get_shape(training_dir))]
    return tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                      hidden_units=[380, 180],
                                      n_classes=2,
                                      config=run_config)


def serving_input_fn(params):
    training_dir = '/opt/ml/input/data/train/'
    feature_spec = {INPUT_TENSOR_NAME: tf.FixedLenFeature(dtype=tf.float32, shape=get_shape(training_dir))}
    return tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)()


def train_input_fn(training_dir, params):
    """Returns input function that would feed the model during training"""
    for root, dirs, files in os.walk(BASE_PATH):
        print('Current directory ' + root)
        print(files)
    os.system('cat /opt/ml/input/data/train/*.csv > /opt/ml/input/data/train/all_train.csv')
    return _generate_input_fn(TRAIN_PATH, 'all_train.csv')


def eval_input_fn(training_dir, params):
    """Returns input function that would feed the model during evaluation"""
    if os.path.exists(VALIDATION_PATH):
        os.system('cat /opt/ml/input/data/validation/*.csv > /opt/ml/input/data/validation/all_validate.csv')
        return _generate_input_fn(VALIDATION_PATH, 'all_validate.csv')
    else:
        first_file = os.listdir(TRAIN_PATH)[0]
        return _generate_input_fn(TRAIN_PATH, first_file)


def _generate_input_fn(training_dir, training_filename):
    training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=os.path.join(training_dir, training_filename),
        target_dtype=np.float32,
        features_dtype=np.float32,
        target_column=0)

    return tf.estimator.inputs.numpy_input_fn(
        x={INPUT_TENSOR_NAME: np.array(training_set.data)},
        y=np.array(training_set.target),
        num_epochs=8200,
        batch_size=512,
        shuffle=True)()


def output_fn(prediction_result, accepts):
    as_str = MessageToJson(prediction_result)
    try:
        result = json.loads(as_str)
        classes = result['result']['classifications'][0]['classes']
        score_0 = classes[0].get('score', 0.0)
        score_1 = classes[1].get('score', 0.0)
        formatted = "[{score_0}, {score_1}]".format(score_0=score_0, score_1=score_1)
    except KeyError:
        formatted = "[1.0, 0.0]"
        print >> sys.stderr, 'Error while parsing', str(prediction_result)
    return formatted
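
To rule out `output_fn` locally, the parsing step can be exercised without a container or protobuf. This is only a sketch under assumptions: `format_scores` is a hypothetical helper that takes the already-parsed JSON dict, it writes to stderr with `sys.stderr.write` so it runs on both Python 2 and 3, and it additionally catches `IndexError`, which the original does not.

```python
import json
import sys


def format_scores(result, default="[1.0, 0.0]"):
    """Format the two class scores of a parsed prediction as '[s0, s1]'.

    `result` is assumed to be the dict obtained from
    json.loads(MessageToJson(prediction_result)) in output_fn.
    """
    try:
        classes = result["result"]["classifications"][0]["classes"]
        # A missing 'score' key is treated as 0.0, as in the original output_fn.
        score_0 = classes[0].get("score", 0.0)
        score_1 = classes[1].get("score", 0.0)
    except (KeyError, IndexError):
        sys.stderr.write("Error while parsing {}\n".format(json.dumps(result)))
        return default
    return "[{}, {}]".format(score_0, score_1)
```

Calling it with a hand-written dict in the same shape gives a quick check that a malformed response falls back to the default string instead of raising.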

laurenyu · 2019-04-22T19:13:47Z

I created a new batch transform job based on the full entry point code, and it still succeeded for me. I'm going to reach out to the relevant service team to see if they have any insight. thanks for your patience!

andremoeller · 2019-04-22T23:57:59Z

Hey @NEIA20 ,

It looks like the algorithm isn't responding to some of the requests. Could you try reducing max_concurrent_transforms to 8? Thank you.
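
For reference, the suggested change amounts to the settings below. This is a sketch only: `transformer_settings` is a hypothetical helper, and every value other than `max_concurrent_transforms` is copied from the transformation settings posted earlier in this thread.

```python
def transformer_settings(max_concurrent_transforms=8):
    """Keyword arguments for estimator.transformer(...) with lower concurrency.

    Hypothetical helper; all other values mirror the settings posted above.
    """
    return {
        "instance_count": 10,
        "instance_type": "ml.c5.2xlarge",
        "strategy": "SingleRecord",
        "assemble_with": "Line",
        "max_payload": 1,
        "max_concurrent_transforms": max_concurrent_transforms,
    }
```

It would be used as `transformer = estimator.transformer(**transformer_settings())`, leaving the rest of the job unchanged.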

NEIA20 · 2019-04-25T15:57:32Z

hi @andremoeller @laurenyu - sorry for the late response.

We had previously tried lowering max_concurrent_transforms to 10 and still got the error. I tried again with 8 just now and the same error is still there unfortunately

NEIA20 · 2019-04-30T13:53:44Z

hi @andremoeller @laurenyu, any updates on what might be the issue?

andremoeller · 2019-04-30T23:42:57Z

Hey @NEIA20 ,

We're actively looking into this with your most recent Transform Job. It seems like we aren't handling errors in certain cases correctly, but we don't have a fix yet. We will keep this issue updated, and when we know when we expect to have a fix deployed, we will let you know. Thanks!

NEIA20 · 2019-05-01T14:16:07Z

thanks @andremoeller, glad to hear it. If you need any more information from us just let me know

andremoeller · 2019-05-07T23:58:22Z

Hi @NEIA20 ,

It seems like the problem is both that the container isn't handling multiple requests well, and we aren't handling certain types of errors. There's a newer TensorFlow Serving container that you can deploy to with endpoint_type 'tensorflow-serving' which we expect to work better in this case, if your goal is to get predictions from CSV data -- the docs are located here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst

Would using the newer TensorFlow serving container satisfy your use case?

Thank you.

NEIA20 · 2019-05-13T14:51:58Z

hi @andremoeller - the thing is that we're not generating predictions using a SageMaker Endpoint, we're using batch transform instead

andremoeller · 2019-05-14T22:46:37Z

Hi @NEIA20

Got it -- that doc is missing some information relevant to Batch Transform, but it's still possible to use Batch with the newer TensorFlow Serving container. If you already have trained a model, you can use the new TensorFlow Serving container like this:

from sagemaker.tensorflow.serving import Model

model = Model(model_data='s3://../model.tar.gz', role= role, ...)
transformer = model.transformer(...)
transformer.transform(...)
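
Filled in with the settings used earlier in this thread, the snippet above might look like the following. This is a rough sketch rather than a confirmed fix: `run_batch_transform` and its `model_cls` parameter are hypothetical (the parameter exists only so the function can be tried with a stand-in class), the transformer arguments are copied from the original post with `max_concurrent_transforms` lowered to 8, and whether the existing model artifact works unchanged with the newer container is not established here.

```python
def run_batch_transform(model_data, role, input_s3, output_s3, model_cls=None):
    """Run a batch transform with the TensorFlow Serving container.

    Hypothetical helper. `model_cls` defaults to the Model class named in the
    comment above and is only a parameter so a stand-in can be injected.
    """
    if model_cls is None:
        from sagemaker.tensorflow.serving import Model as model_cls

    model = model_cls(model_data=model_data, role=role)

    # Settings copied from the original post, except for the lower concurrency.
    transformer = model.transformer(
        instance_count=10,
        instance_type="ml.c5.2xlarge",
        strategy="SingleRecord",
        assemble_with="Line",
        max_payload=1,
        max_concurrent_transforms=8,
        output_path=output_s3,
    )
    transformer.transform(input_s3, content_type="text/csv", split_type="Line")
    transformer.wait()  # blocks until the job finishes, raising on failure
    return transformer
```

Because no entry point script is passed in this sketch, the custom `output_fn` shown earlier would not be applied to the predictions.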

jesterhazy · 2019-05-20T19:47:21Z

It looks like the question has been solved, so I am closing this issue.

AlexJonesNLP · 2022-07-27T22:01:36Z

I'm having this issue as well. The "solution" @andremoeller offered is unhelpful.

laurenyu added the type: question label Apr 17, 2019

jesterhazy closed this as completed May 20, 2019

qidewenwhen pushed a commit to qidewenwhen/sagemaker-python-sdk that referenced this issue Dec 13, 2022

fix flaky metrics test (aws#753)

2111929
