add sagemaker cli #32

Merged: 11 commits merged into aws:master on Jan 17, 2018

Conversation

jesterhazy
Contributor

add a sagemaker command that allows very simple model training or hosting using the mxnet or tensorflow frameworks.

example usage:

sagemaker train --mx --data ./data --script train.py # train an mxnet model in sagemaker
sagemaker host --mx --data ./data --script host.py # host an mxnet model in sagemaker

jesterhazy requested review from owen-t and mvsusp on December 26, 2017
@winstonaws
Contributor

Just initial thoughts - I'm hesitant to add this in without a good idea of what it's for, since it's a lot more API surface. Do you have clear use cases in mind?

The main advantages to me seem to be bash-friendliness - so situations where you don't want to be working in a notebook environment - and doing a lot of defaulting. I buy that the first exists, but I'm not sure we understand those cases well enough yet. For the second, the parameters which aren't defaulted in the python SDK (the hardware configuration) were pretty explicitly chosen, so I don't think we should add more defaulting for the CLI.

@owen-t
Contributor

owen-t commented Jan 5, 2018

This is just for mxnet and tensorflow. Do we only imagine this being used for running frameworks, and if so, should we flag it as such? Should we auto-detect frameworks and provide a framework list? Do we want to support 1P algorithms or custom containers?

@jesterhazy
Contributor Author

yes, it's really just for mxnet and tf at this point. although it would not be too hard to extend to 1p stuff and custom frameworks. the cli args already hint at support for that, but i don't have test cases for other frameworks, so it seems better to defer. perhaps the related cli args (--image) should be removed or raise 'unsupported' errors?

i'm not sure what the value of autodetecting frameworks is... the target user is someone who is using a specific framework, has data and code for that framework, and wants an instant way to launch a job/endpoint. also, whatever we could come up with for autodetection would not know about custom frameworks. an easy alternative would be to list the known frameworks that are part of the sagemaker sdk in the usage info, but it would need to be kept updated.

Contributor

winstonaws left a comment

  • The "tensorflow only" and "mxnet only" arguments seem like they'll get confusing if we keep expanding (both in # of frameworks and # of specific arguments to each framework).
    I like what you already have with the subparsers for "train" and "host". What if we add another layer for framework?
    We could make this work similarly to the AWS CLI - they have "aws s3 ls", we can have "sagemaker mxnet train". Then we can also give more specific guidance for each combination in help messages.
    Open to other suggestions, but I think we should come up with a scheme that allows for easy extensibility going forward (rough sketch after this list).
  • Unit tests on cli/main.py, cli/train.py, and cli/host.py, please.
  • What do you think about adding short versions (e.g. -r for --role-name) for each of the options?
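
A rough sketch of the layered layout being suggested (parser wiring only; names are illustrative, not the PR's actual code):

import argparse

# illustrative sketch: a second subparser layer per framework, similar to how
# the AWS CLI nests "aws s3 ls"
parser = argparse.ArgumentParser(prog='sagemaker')
frameworks = parser.add_subparsers(dest='framework')

for name in ('mxnet', 'tensorflow'):
    framework_parser = frameworks.add_parser(name)
    actions = framework_parser.add_subparsers(dest='action')
    actions.add_parser('train', help='train a {} model in SageMaker'.format(name))
    actions.add_parser('host', help='host a {} model in SageMaker'.format(name))

# e.g. `sagemaker mxnet train ...` or `sagemaker tensorflow host ...`
args = parser.parse_args()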

image_mtx.add_argument('--mx', help='use an MXNet container image', action='store_true')

# path to data and script files
common_parser.add_argument('--data', help='path to training data or model files', type=str, default='./data')
Contributor

I'm dubious about the defaulting for data, script, and hyperparameters.json - this seems like nonstandard behavior for CLIs, and it's harder for the users to understand what they need to do if they don't specify the arguments but don't have the appropriate files.

Contributor Author

one of the main goals for this cli is to enable easy iteration, so sensible defaults seem like the right thing to do.

Contributor

What's the relative cost of typing out the extra parameters compared to the cost of understanding how to write the command initially? I think it's common to start out only specifying the parameters that are marked as required, and currently the error message isn't very helpful if those files don't exist. Can we at least improve those error messages?

The other issue with defaulting is hardware - we made the choice explicitly to not default those in the Python SDK to reduce the chances of over-charging or under-provisioning, since there's no one size fits all for all problems. What makes that not apply to this CLI?

Contributor Author

@tomfaulhaber i think you were party to the initial discussion about defaulting. how do you feel about it in this context?

I think count is obvious: 1 (and we should do that in the SDK also - I disagree with not defaulting it)

Instance type is trickier, but I do feel like we should default it. I would go for ml.c4.xlarge here. Again, if it were up to me, I would add this default in the SDK, though I admit that it's way less obvious than the instance count.

@jesterhazy pointed out that ml.m4.xlarge is free-tier eligible, so we should go with that.
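
In argparse terms, the defaults under discussion would look roughly like this (the --instance-count flag shown here is an assumption; the --instance-type line mirrors the diff further down):

import argparse

# sketch only: the instance defaults being discussed (count of 1, free-tier-eligible type)
common_parser = argparse.ArgumentParser(add_help=False)
instance_group = common_parser.add_argument_group('instance settings')
instance_group.add_argument('--instance-type', type=str, help='instance type', default='ml.m4.xlarge')
instance_group.add_argument('--instance-count', type=int, help='instance count', default=1)  # assumed flag name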

# path to data and script files
common_parser.add_argument('--data', help='path to training data or model files', type=str, default='./data')
common_parser.add_argument('--script', help='path to script', type=str, default='./script.py')
common_parser.add_argument('--job-name', help='job or endpoint name', type=str, default=None)
Contributor

If this is going to be common to both, can this be "name" instead?

Contributor Author

name could refer to anything, so is too vague. i could go with separate job-name, endpoint-name params though.

Contributor

Sounds good.
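
A minimal sketch of that split (the train/host parser wiring here is illustrative, not the PR's actual code):

import argparse

# illustrative only: each action gets its own name flag instead of a shared --job-name
parser = argparse.ArgumentParser(prog='sagemaker')
actions = parser.add_subparsers(dest='action')
train_parser = actions.add_parser('train')
train_parser.add_argument('--job-name', help='training job name', type=str, default=None)
host_parser = actions.add_parser('host')
host_parser.add_argument('--endpoint-name', help='endpoint name', type=str, default=None)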

common_parser.add_argument('--data', help='path to training data or model files', type=str, default='./data')
common_parser.add_argument('--script', help='path to script', type=str, default='./script.py')
common_parser.add_argument('--job-name', help='job or endpoint name', type=str, default=None)
common_parser.add_argument('--bucket-name', help='S3 bucket', type=str, default=None)
Contributor

The help message should explain what this is used for.

default='AmazonSageMakerFullAccess')

instance_group = common_parser.add_argument_group('instance settings')
instance_group.add_argument('--instance-type', type=str, help='instance type', default='ml.m4.xlarge')
Contributor

See my earlier comment about defaulting instance type and count.

common_parser.add_argument('--script', help='path to script', type=str, default='./script.py')
common_parser.add_argument('--job-name', help='job or endpoint name', type=str, default=None)
common_parser.add_argument('--bucket-name', help='S3 bucket', type=str, default=None)
common_parser.add_argument('--role-name', help='SageMaker execution role name', type=str,
Contributor

Will this fail slow if the customer doesn't have this role already? That could be frustrating.

Contributor Author

it does fail slow. it is frustrating. it is the same behavior as the rest of the python sdk, so it should probably be fixed somewhere deeper in the sdk's shared modules.
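
(For illustration only, not something this PR does: a fail-fast check could look roughly like the following, using the standard boto3 IAM client.)

import boto3
from botocore.exceptions import ClientError

def resolve_role_arn(role_name):
    # hypothetical fail-fast check: resolve the execution role before uploading
    # data or creating a job, so a missing role fails immediately
    try:
        return boto3.client('iam').get_role(RoleName=role_name)['Role']['Arn']
    except ClientError as error:
        raise ValueError('SageMaker execution role {!r} not usable: {}'.format(role_name, error))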

Contributor

Yes, but we default it here, and the Python SDK doesn't. It's also defaulted to something that isn't created automatically for them anywhere. This one I feel more strongly about removing the defaulting for.

Contributor Author

removed default role name

data_url = self.upload_training_data()
estimator = self.create_estimator()
estimator.fit(data_url)
logger.info(' code location: {}'.format(estimator.uploaded_code.s3_prefix))
Contributor

Nit: extra leading space

Also, maybe this should be debug level rather than info? The user typically won't care much about this.
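
i.e. presumably something along the lines of:

logger.debug('code location: {}'.format(estimator.uploaded_code.s3_prefix))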


with tarfile.open(archive, mode='w:gz') as t:
t.add(src, arcname=arcname)
t.close()
Contributor

Extraneous? (because of context manager)
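
(i.e. the with block already closes the archive, so presumably just:)

with tarfile.open(archive, mode='w:gz') as t:
    t.add(src, arcname=arcname)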

Contributor Author

fixed

def start(self):
model_url = self.upload_model()
model = self.create_model(model_url)
predictor = model.deploy(initial_instance_count=self.instance_count,
Contributor

Does this print out the endpoint name at the end?

It might also be cool to print out an example command for invoking the endpoint using the aws cli. I don't know which of our frameworks will accept something cli-typable by default, though.

Contributor Author

jesterhazy commented Jan 14, 2018

  1. it prints out the model name and endpoint name when it starts. with default log levels, the only output between start and end is the polling progress output (e.g. -----!). this seems sufficient?

  2. agree this would be great, but there doesn't appear to be any support for invoke-endpoint in awscli right now (i have 1.14.24)

Contributor

It's under "aws runtime.sagemaker invoke-endpoint". That said, I'm okay leaving this to a future PR.

@jesterhazy
Contributor Author

jesterhazy commented Jan 14, 2018

@winstonaws I'm open to reorganizing the command args to sagemaker mxnet train... etc., but what would be the command for generic frameworks? or 1p algos if/when those are added? it doesn't need to be final, and I won't implement it now, but we need to be able to at least imagine a scheme that makes sense. i think it would be odd to make people enter commands like sagemaker generic train

also, which is the better hierarchy: sagemaker <algo> <action> or sagemaker <action> <algo>? the latter seems more natural to me, but individual users are likely to stick with one framework (at least during a particular task), so something like sagemaker mxnet ... would make it convenient to create command aliases.

@jesterhazy
Contributor Author

@winstonaws I can look at adding short aliases for args... current commands mostly have unique initial letters. but this may bite you in future.

@jesterhazy
Contributor Author

addressed PR comments:

  • command args reorganized to sagemaker mxnet train..., sagemaker tensorflow host... etc.
  • unit tests
  • formatting issues

@jesterhazy
Contributor Author

@winstonaws @owen-t

@winstonaws
Contributor

Aside from getting Tom to chime in about the defaulting, one more thing -

We found that just typing "sagemaker" doesn't bring up the help info as it should in python 3 (works fine in python 2 for some reason). Same goes for "sagemaker mxnet".

@jesterhazy
Contributor Author

fixed the py2/3 usage issue in the last commit. also added copyright notices to the top of each file.
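
One common way to get that behavior on both versions (a sketch, not necessarily what the commit does) is to mark the subparsers as required, since Python 3 treats them as optional by default:

import argparse

parser = argparse.ArgumentParser(prog='sagemaker')
subparsers = parser.add_subparsers(dest='framework')
subparsers.required = True  # py3 makes subcommands optional; this forces usage output when one is missing
subparsers.add_parser('mxnet')
subparsers.add_parser('tensorflow')

args = parser.parse_args()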

@winstonaws
Contributor

I see that the Travis unit tests passed for both py2 and py3, but when running it locally in a new py2 virtualenv, "sagemaker mxnet host --role-name SageMakerRole" is not working for me. I get:

$ sagemaker mxnet host --role-name SageMakerRole
usage: sagemaker [-h] [--log-level LOG_LEVEL]
[--botocore-log-level BOTOCORE_LOG_LEVEL]
{mxnet,tensorflow} ...

Can you check it on your computer? If it works there then I'll chalk it up to something weird with my setup.

Also, can you update the .sh files in your cli/examples/train and /host directories to use the new syntax?

@jesterhazy
Contributor Author

@winstonaws i think this is ready to go now.

  • sagemaker mxnet host ... works for me; your command was probably picking up an outdated install (i.e. try pip install -U . again).
  • updated example .sh scripts

winstonaws merged commit e93eff6 into aws:master on Jan 17, 2018
ragavvenkatesan pushed a commit that referenced this pull request Jan 30, 2018
* add sagemaker cli (#32)

* add sagemaker cli

* remove unnecessary close

* address PR comments

* tidy up imports

* fix imports, flake8 errors

* improve help message for bucket-name

* remove default role name

* fix log-level and py3 tests, add copyright

* update cli example scripts

* Add documentation about BYO Models (#47)

* Add test for BYO estimator using Factorization Machines algorithm as an example. (#50)

* Support multi-part uploads (#45)

* Update TensorFlow examples following API change (#44)

* Add data_type to hyperparameters (#54)

When we describe a training job the data type of the hyper parameters is
lost because we use a dict[str, str]. This adds a new field to
Hyperparameter so that we can convert the datatypes at runtime.

instead of validating with isinstance(), we cast the hp value to the type it
is meant to be. This enforces a "strongly typed" value. When we
deserialize from the API string responses it becomes easier to deal with
too.
ragavvenkatesan mentioned this pull request Jan 30, 2018
laurenyu pushed a commit to laurenyu/sagemaker-python-sdk that referenced this pull request May 31, 2018
apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this pull request Nov 15, 2018
Removed: copies of moved directories