-
Notifications
You must be signed in to change notification settings - Fork 1.2k
add sagemaker cli #32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Just initial thoughts - I'm hesitant to add this in without a good idea of what it's for, since it's a lot more API surface. Do you have clear use cases in mind? The main advantages to me seem to be being bash-friendly - so situations where you don't want to be working in a notebook environment - and doing a lot of defaulting. I buy that the first exists, but I'm not sure we understand those cases well enough yet. For the second, the parameters which aren't defaulted in the python SDK (the hardware configuration) were pretty explicitly chosen, so I don't think we should add more defaulting for the CLI. |
This is just for mxnet and tensorflow. Do we only imagine this is going to be for running frameworks, should we flag this as such? Should we auto detect and include frameworks and provide a framework list? Do we want to support 1P algorithms or custom containers? |
yes, it's really just for mxnet and tf at this point. although it would not be too hard to extend to 1p stuff and custom frameworks. the cli args already hint at support for that, but i don't have test cases for other frameworks, so it seems better to defer. perhaps the related cli args (--image) should be removed or raise 'unsupported' errors? i'm not sure what the value for autodetecting frameworks is... target user is someone who is using a specific framework, has data and code for that framework, and wants an instant way to launch a job/endpoint. also, whatever we could come up with for autodetecting would not know about custom frameworks. an easy alternative would just be to list the known frameworks that are part of the sagemaker sdk in the usage info, but it would need to be kept updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The "tensorflow only" and "mxnet only" arguments seem like they'll get confusing if we keep expanding (both in # of frameworks and # of specific arguments to each framework).
I like what you already have with the subparsers for "train" and "host". What if we add another layer for framework?
We could make this work similarly to the AWS CLI - they have "aws s3 ls", we can have "sagemaker mxnet train". Then we can also give more specific guidance for each combination in help messages.
Open to other suggestions, but I think we should come up with a scheme that allows for easy extensibility going forward. - Unit tests on cli/main.py, cli/train.py, and cli/host.py, please.
- What do you think about adding short versions (e.g. -r for --role-name) for each of the options?
src/sagemaker/cli/main.py
Outdated
image_mtx.add_argument('--mx', help='use an MXNet container image', action='store_true') | ||
|
||
# path to data and script files | ||
common_parser.add_argument('--data', help='path to training data or model files', type=str, default='./data') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm dubious about the defaulting for data, script, and hyperparameters.json - this seems like nonstandard behavior for CLIs, and it's harder for the users to understand what they need to do if they don't specify the arguments but don't have the appropriate files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one of the main goals for this cli is to enable easy iteration, so sensible defaults seems like the right thing to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the cost relatively of typing out the extra parameters compared to the cost of understanding how to write the command initially? I think it's common to start out only specifying the parameters that are marked as required, and currently the error message isn't very helpful if those files don't exist. Can we at least improve those error messages?
The other issue with defaulting is hardware - we made the choice explicitly to not default those in the Python SDK to reduce the chances of over-charging or under-provisioning, since there's no one size fits all for all problems. What makes that not apply to this CLI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomfaulhaber i think you were party to the initial discussion about defaulting. how do you feel about it in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think count is obvious: 1 (and we should do that in the SDK also - I disagree with not defaulting it)
Instance type is trickier, but I do feel like we should default it. I would go for ml.c4.xlarge here. Again, if it were up to me, I would add this default in the SDK, though I admit that it's way less obvious than the instance count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jesterhazy pointed out that the ml.m4.xlarge are free tier eligible, so we should go with that
src/sagemaker/cli/main.py
Outdated
# path to data and script files | ||
common_parser.add_argument('--data', help='path to training data or model files', type=str, default='./data') | ||
common_parser.add_argument('--script', help='path to script', type=str, default='./script.py') | ||
common_parser.add_argument('--job-name', help='job or endpoint name', type=str, default=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is going to be common to both, can this be "name" instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name could refer to anything, so is too vague. i could go with separate job-name, endpoint-name params though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
src/sagemaker/cli/main.py
Outdated
common_parser.add_argument('--data', help='path to training data or model files', type=str, default='./data') | ||
common_parser.add_argument('--script', help='path to script', type=str, default='./script.py') | ||
common_parser.add_argument('--job-name', help='job or endpoint name', type=str, default=None) | ||
common_parser.add_argument('--bucket-name', help='S3 bucket', type=str, default=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The help message should explain what this is used for.
src/sagemaker/cli/main.py
Outdated
default='AmazonSageMakerFullAccess') | ||
|
||
instance_group = common_parser.add_argument_group('instance settings') | ||
instance_group.add_argument('--instance-type', type=str, help='instance type', default='ml.m4.xlarge') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my earlier comment about defaulting instance type and count.
src/sagemaker/cli/main.py
Outdated
common_parser.add_argument('--script', help='path to script', type=str, default='./script.py') | ||
common_parser.add_argument('--job-name', help='job or endpoint name', type=str, default=None) | ||
common_parser.add_argument('--bucket-name', help='S3 bucket', type=str, default=None) | ||
common_parser.add_argument('--role-name', help='SageMaker execution role name', type=str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this fail slow if the customer doesn't have this role already? That could be frustrating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does fail slow. it is frustrating. it is same behavior as rest of python sdk, so should probably be fixed somewhere deeper into the sdk's shared modules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but we default it here, and the Python SDK doesn't. It's also defaulted to something that isn't created automatically for them anywhere. This one I feel more strongly about removing the defaulting for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed default role name
src/sagemaker/cli/train.py
Outdated
data_url = self.upload_training_data() | ||
estimator = self.create_estimator() | ||
estimator.fit(data_url) | ||
logger.info(' code location: {}'.format(estimator.uploaded_code.s3_prefix)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: extra leading space
Also, maybe this should be debug level rather than info? The user typically won't care much about this.
src/sagemaker/cli/host.py
Outdated
|
||
with tarfile.open(archive, mode='w:gz') as t: | ||
t.add(src, arcname=arcname) | ||
t.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extraneous? (because of context manager)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/sagemaker/cli/host.py
Outdated
def start(self): | ||
model_url = self.upload_model() | ||
model = self.create_model(model_url) | ||
predictor = model.deploy(initial_instance_count=self.instance_count, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this print out the endpoint name at the end?
It might also be cool to print out an example command for invoking the endpoint using the aws cli. I don't know which of our frameworks will accept something cli-typable by default, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
it prints out the model name and endpoint name when it starts. with default log levels, the only output between start and end is the polling progress output (e.g.
-----!
). this seems sufficient? -
agree this would be great, but there doesn't appear to be any support for invoke-endpoint in awscli right now (i have 1.14.24)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's under "aws runtime.sagemaker invoke-endpoint". That said, I'm okay leaving this to a future PR.
@winstonaws I'm open to reorganizing the command args to also, which is the better hierarchy? |
@winstonaws I can look at adding short aliases for args... current commands mostly have unique initial letters. but this may bite you in future. |
addressed PR comments:
|
Aside from getting Tom to chime in about the defaulting, one more thing - We found that just typing "sagemaker" doesn't bring up the help info as it should in python 3 (works fine in python 2 for some reason). Same goes for "sagemaker mxnet". |
fixed the py2/3 usage issue in the last commit. also added copyright notices to top of each file. |
I see that the Travis unit tests passed for both py2 and py3, but when running it locally in a new py2 virtualenv, "sagemaker mxnet host --role-name SageMakerRole" is not working for me. I get: $ sagemaker mxnet host --role-name SageMakerRole Can you check it on your computer? If it works there then I'll chalk it up to something weird with my setup. Also, can you update the .sh files in your cli/examples/train and /host directories to use the new syntax? |
@winstonaws i think this is ready to go now.
|
* add sagemaker cli (#32) * add sagemaker cli * remove unnecessary close * address PR comments * tidy up imports * fix imports, flake8 errors * improve help message for bucket-name * remove default role name * fix log-level and py3 tests, add copyright * update cli example scripts * Add documentation about BYO Models (#47) * Add test for BYO estimator using Factorization Machines algorithm as an example. (#50) * Support multi-part uploads (#45) * Update TensorFlow examples following API change (#44) * Add data_type to hyperparameters (#54) When we describe a training job the data type of the hyper parameters is lost because we use a dict[str, str]. This adds a new field to Hyperparameter so that we can convert the datatypes at runtime. instead of validating with isinstance(), we cast the hp value to the type it is meant to be. This enforces a "strongly typed" value. When we deserialize from the API string responses it becomes easier to deal with too.
Removed: copies of moved directories
add
sagemaker
command that allows very simple model training or hosting using the mxnet or tensorflow frameworks.example usage:
sagemaker train --mx --data ./data --script train.py # train an mxnet model in sagemaker
sagemaker host --mx --data ./data --script host.py # host an mxnet model in sagemaker