add sagemaker cli #32

Merged: 11 commits merged into aws:master on Jan 17, 2018

Conversation

jesterhazy
Contributor

add a sagemaker command that allows very simple model training or hosting using the mxnet or tensorflow frameworks.

example usage:

sagemaker train --mx --data ./data --script train.py # train an mxnet model in sagemaker
sagemaker host --mx --data ./data --script host.py # host an mxnet model in sagemaker

jesterhazy requested review from owen-t and mvsusp on December 26, 2017
@winstonaws
Contributor

Just initial thoughts - I'm hesitant to add this in without a good idea of what it's for, since it's a lot more API surface. Do you have clear use cases in mind?

The main advantages to me seem to be bash-friendliness - so situations where you don't want to be working in a notebook environment - and doing a lot of defaulting. I buy that the first exists, but I'm not sure we understand those cases well enough yet. For the second, the parameters which aren't defaulted in the python SDK (the hardware configuration) were pretty explicitly chosen, so I don't think we should add more defaulting for the CLI.

@owen-t
Contributor

owen-t commented Jan 5, 2018

This is just for mxnet and tensorflow. Do we only imagine this being used for running frameworks, and if so, should we flag it as such? Should we auto-detect frameworks and provide a framework list? Do we want to support 1P algorithms or custom containers?

@jesterhazy
Contributor Author

yes, it's really just for mxnet and tf at this point. although it would not be too hard to extend to 1p stuff and custom frameworks. the cli args already hint at support for that, but i don't have test cases for other frameworks, so it seems better to defer. perhaps the related cli args (--image) should be removed or raise 'unsupported' errors?

i'm not sure what the value of autodetecting frameworks is... the target user is someone who is using a specific framework, has data and code for that framework, and wants an instant way to launch a job/endpoint. also, whatever we could come up with for autodetection would not know about custom frameworks. an easy alternative would be to list the known frameworks that are part of the sagemaker sdk in the usage info, but it would need to be kept updated.

Contributor

winstonaws left a comment

  • The "tensorflow only" and "mxnet only" arguments seem like they'll get confusing if we keep expanding (both in # of frameworks and # of specific arguments to each framework).
    I like what you already have with the subparsers for "train" and "host". What if we add another layer for framework?
    We could make this work similarly to the AWS CLI - they have "aws s3 ls", we can have "sagemaker mxnet train". Then we can also give more specific guidance for each combination in help messages.
    Open to other suggestions, but I think we should come up with a scheme that allows for easy extensibility going forward (rough sketch after this list).
  • Unit tests on cli/main.py, cli/train.py, and cli/host.py, please.
  • What do you think about adding short versions (e.g. -r for --role-name) for each of the options?
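
A rough sketch of the layered layout being suggested (parser wiring only; names are illustrative, not the PR's actual code):

import argparse

# illustrative sketch: a second subparser layer per framework, similar to how
# the AWS CLI nests "aws s3 ls"
parser = argparse.ArgumentParser(prog='sagemaker')
frameworks = parser.add_subparsers(dest='framework')

for name in ('mxnet', 'tensorflow'):
    framework_parser = frameworks.add_parser(name)
    actions = framework_parser.add_subparsers(dest='action')
    actions.add_parser('train', help='train a {} model in SageMaker'.format(name))
    actions.add_parser('host', help='host a {} model in SageMaker'.format(name))

# e.g. `sagemaker mxnet train ...` or `sagemaker tensorflow host ...`
args = parser.parse_args()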

image_mtx.add_argument('--mx', help='use an MXNet container image', action='store_true')

# path to data and script files
common_parser.add_argument('--data', help='path to training data or model files', type=str, default='./data')
Contributor

I'm dubious about the defaulting for data, script, and hyperparameters.json - this seems like nonstandard behavior for CLIs, and it's harder for the users to understand what they need to do if they don't specify the arguments but don't have the appropriate files.

Contributor Author

one of the main goals for this cli is to enable easy iteration, so sensible defaults seem like the right thing to do.

Contributor

What's the relative cost of typing out the extra parameters compared to the cost of understanding how to write the command initially? I think it's common to start out only specifying the parameters that are marked as required, and currently the error message isn't very helpful if those files don't exist. Can we at least improve those error messages?

The other issue with defaulting is hardware - we made the choice explicitly to not default those in the Python SDK to reduce the chances of over-charging or under-provisioning, since there's no one size fits all for all problems. What makes that not apply to this CLI?

Contributor Author

@tomfaulhaber i think you were party to the initial discussion about defaulting. how do you feel about it in this context?

I think count is obvious: 1 (and we should do that in the SDK also - I disagree with not defaulting it)

Instance type is trickier, but I do feel like we should default it. I would go for ml.c4.xlarge here. Again, if it were up to me, I would add this default in the SDK, though I admit that it's way less obvious than the instance count.

@jesterhazy pointed out that ml.m4.xlarge is free-tier eligible, so we should go with that.
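
In argparse terms, the defaults under discussion would look roughly like this (the --instance-count flag shown here is an assumption; the --instance-type line mirrors the diff further down):

import argparse

# sketch only: the instance defaults being discussed (count of 1, free-tier-eligible type)
common_parser = argparse.ArgumentParser(add_help=False)
instance_group = common_parser.add_argument_group('instance settings')
instance_group.add_argument('--instance-type', type=str, help='instance type', default='ml.m4.xlarge')
instance_group.add_argument('--instance-count', type=int, help='instance count', default=1)  # assumed flag name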

# path to data and script files
common_parser.add_argument('--data', help='path to training data or model files', type=str, default='./data')
common_parser.add_argument('--script', help='path to script', type=str, default='./script.py')
common_parser.add_argument('--job-name', help='job or endpoint name', type=str, default=None)
Contributor

If this is going to be common to both, can this be "name" instead?

Contributor Author

name could refer to anything, so is too vague. i could go with separate job-name, endpoint-name params though.

Contributor

Sounds good.
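
A minimal sketch of that split (the train/host parser wiring here is illustrative, not the PR's actual code):

import argparse

# illustrative only: each action gets its own name flag instead of a shared --job-name
parser = argparse.ArgumentParser(prog='sagemaker')
actions = parser.add_subparsers(dest='action')
train_parser = actions.add_parser('train')
train_parser.add_argument('--job-name', help='training job name', type=str, default=None)
host_parser = actions.add_parser('host')
host_parser.add_argument('--endpoint-name', help='endpoint name', type=str, default=None)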

common_parser.add_argument('--data', help='path to training data or model files', type=str, default='./data')
common_parser.add_argument('--script', help='path to script', type=str, default='./script.py')
common_parser.add_argument('--job-name', help='job or endpoint name', type=str, default=None)
common_parser.add_argument('--bucket-name', help='S3 bucket', type=str, default=None)
Contributor

The help message should explain what this is used for.

default='AmazonSageMakerFullAccess')

instance_group = common_parser.add_argument_group('instance settings')
instance_group.add_argument('--instance-type', type=str, help='instance type', default='ml.m4.xlarge')
Contributor

See my earlier comment about defaulting instance type and count.

common_parser.add_argument('--script', help='path to script', type=str, default='./script.py')
common_parser.add_argument('--job-name', help='job or endpoint name', type=str, default=None)
common_parser.add_argument('--bucket-name', help='S3 bucket', type=str, default=None)
common_parser.add_argument('--role-name', help='SageMaker execution role name', type=str,
Contributor

Will this fail slow if the customer doesn't have this role already? That could be frustrating.

Contributor Author

it does fail slow. it is frustrating. it is the same behavior as the rest of the python sdk, so it should probably be fixed somewhere deeper in the sdk's shared modules.
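
(For illustration only, not something this PR does: a fail-fast check could look roughly like the following, using the standard boto3 IAM client.)

import boto3
from botocore.exceptions import ClientError

def resolve_role_arn(role_name):
    # hypothetical fail-fast check: resolve the execution role before uploading
    # data or creating a job, so a missing role fails immediately
    try:
        return boto3.client('iam').get_role(RoleName=role_name)['Role']['Arn']
    except ClientError as error:
        raise ValueError('SageMaker execution role {!r} not usable: {}'.format(role_name, error))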

Contributor

Yes, but we default it here, and the Python SDK doesn't. It's also defaulted to something that isn't created automatically for them anywhere. This one I feel more strongly about removing the defaulting for.

Contributor Author

removed default role name

data_url = self.upload_training_data()
estimator = self.create_estimator()
estimator.fit(data_url)
logger.info(' code location: {}'.format(estimator.uploaded_code.s3_prefix))
Contributor

Nit: extra leading space

Also, maybe this should be debug level rather than info? The user typically won't care much about this.
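
i.e. presumably something along the lines of:

logger.debug('code location: {}'.format(estimator.uploaded_code.s3_prefix))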


with tarfile.open(archive, mode='w:gz') as t:
t.add(src, arcname=arcname)
t.close()
Contributor

Extraneous? (because of context manager)
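
(i.e. the with block already closes the archive, so presumably just:)

with tarfile.open(archive, mode='w:gz') as t:
    t.add(src, arcname=arcname)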

Contributor Author

fixed

def start(self):
model_url = self.upload_model()
model = self.create_model(model_url)
predictor = model.deploy(initial_instance_count=self.instance_count,
Contributor

Does this print out the endpoint name at the end?

It might also be cool to print out an example command for invoking the endpoint using the aws cli. I don't know which of our frameworks will accept something cli-typable by default, though.

Contributor Author

jesterhazy commented Jan 14, 2018

  1. it prints out the model name and endpoint name when it starts. with default log levels, the only output between start and end is the polling progress output (e.g. -----!). this seems sufficient?

  2. agree this would be great, but there doesn't appear to be any support for invoke-endpoint in awscli right now (i have 1.14.24)

Contributor

It's under "aws runtime.sagemaker invoke-endpoint". That said, I'm okay leaving this to a future PR.

@jesterhazy
Contributor Author

jesterhazy commented Jan 14, 2018

@winstonaws I'm open to reorganizing the command args to sagemaker mxnet train... etc., but what would be the command for generic frameworks? or 1p algos if/when those are added? it doesn't need to be final, and I won't implement it now, but we need to be able to at least imagine a scheme that makes sense. i think it would be odd to make people enter commands like sagemaker generic train

also, which is the better hierarchy: sagemaker <algo> <action> or sagemaker <action> <algo>? the latter seems more natural to me, but individual users are likely to stick with one framework (at least during a particular task), so something like sagemaker mxnet ... would make it convenient to create command aliases.

@jesterhazy
Contributor Author

@winstonaws I can look at adding short aliases for args... current commands mostly have unique initial letters. but this may bite you in future.

@jesterhazy
Contributor Author

addressed PR comments:

  • command args reorganized to sagemaker mxnet train..., sagemaker tensorflow host... etc.
  • unit tests
  • formatting issues

@jesterhazy
Contributor Author

@winstonaws @owen-t

@winstonaws
Contributor

Aside from getting Tom to chime in about the defaulting, one more thing -

We found that just typing "sagemaker" doesn't bring up the help info as it should in python 3 (works fine in python 2 for some reason). Same goes for "sagemaker mxnet".

@jesterhazy
Contributor Author

fixed the py2/3 usage issue in the last commit. also added copyright notices to the top of each file.
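
One common way to get that behavior on both versions (a sketch, not necessarily what the commit does) is to mark the subparsers as required, since Python 3 treats them as optional by default:

import argparse

parser = argparse.ArgumentParser(prog='sagemaker')
subparsers = parser.add_subparsers(dest='framework')
subparsers.required = True  # py3 makes subcommands optional; this forces usage output when one is missing
subparsers.add_parser('mxnet')
subparsers.add_parser('tensorflow')

args = parser.parse_args()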

@winstonaws
Contributor

I see that the Travis unit tests passed for both py2 and py3, but when running it locally in a new py2 virtualenv, "sagemaker mxnet host --role-name SageMakerRole" is not working for me. I get:

$ sagemaker mxnet host --role-name SageMakerRole
usage: sagemaker [-h] [--log-level LOG_LEVEL]
[--botocore-log-level BOTOCORE_LOG_LEVEL]
{mxnet,tensorflow} ...

Can you check it on your computer? If it works there then I'll chalk it up to something weird with my setup.

Also, can you update the .sh files in your cli/examples/train and /host directories to use the new syntax?

@jesterhazy
Contributor Author

@winstonaws i think this is ready to go now.

  • sagemaker mxnet host ... works for me; your command was probably picking up an outdated install (i.e. try pip install -U . again).
  • updated example .sh scripts

winstonaws merged commit e93eff6 into aws:master on Jan 17, 2018
ragavvenkatesan pushed a commit that referenced this pull request Jan 30, 2018
* add sagemaker cli (#32)

* add sagemaker cli

* remove unnecessary close

* address PR comments

* tidy up imports

* fix imports, flake8 errors

* improve help message for bucket-name

* remove default role name

* fix log-level and py3 tests, add copyright

* update cli example scripts

* Add documentation about BYO Models (#47)

* Add test for BYO estimator using Factorization Machines algorithm as an example. (#50)

* Support multi-part uploads (#45)

* Update TensorFlow examples following API change (#44)

* Add data_type to hyperparameters (#54)

When we describe a training job the data type of the hyper parameters is
lost because we use a dict[str, str]. This adds a new field to
Hyperparameter so that we can convert the datatypes at runtime.

instead of validating with isinstance(), we cast the hp value to the type it
is meant to be. This enforces a "strongly typed" value. When we
deserialize from the API string responses it becomes easier to deal with
too.
ragavvenkatesan mentioned this pull request Jan 30, 2018
laurenyu pushed a commit to laurenyu/sagemaker-python-sdk that referenced this pull request May 31, 2018
apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this pull request Nov 15, 2018
Removed: copies of moved directories