Add wrapper for LDA. #56


Merged
merged 11 commits into from
Jan 31, 2018

Conversation

@lukmis (Contributor) commented Jan 24, 2018

Add wrapper class for LDA algorithm.

Include unit tests for the new class.
Include an integration test for the new class.

Add static method to construct RecordSet for existing files in S3.

Bump up version and update CHANGELOG with changes since last version.

@lukmis requested a review from iquintero, January 25, 2018 22:36
@@ -47,7 +49,7 @@ def __init__(self, role, train_instance_count, train_instance_type, data_locatio
         self.data_location = data_location

     def train_image(self):
-        return registry(self.sagemaker_session.boto_region_name) + "/" + type(self).repo
+        return registry(self.sagemaker_session.boto_region_name, type(self).__name__) + "/" + type(self).repo
Contributor

The type(self).__name__ here isn't ideal, since it will break if the LDA class is subclassed. (I think the type(self).repo approach we're already using isn't ideal either, but at least it still works in that case.)

Changing the new algorithm parameter to take in the class instead of the name may be an improvement. Open to other suggestions as well.

Contributor Author

Good observation. I'm not sure about the scenario for subclassing these wrapper classes, but in any case there are two options: either add the new class to the registry() mapping (just as you would for any new class) or override train_image in the subclass.

Contributor

I'm talking about end users subclassing them in their own code. There's no super obvious case for doing so, but it's also not wrong to do so, and all else being equal there shouldn't be unexpected things that break when it happens.

The cost to make this work for subclassing isn't big IMO - you can modify the registry() method to accept a class instead of a string, and then use issubclass (https://docs.python.org/3/library/functions.html#issubclass) instead of direct equality checks.

Contributor Author

The implementation that passes the class in runs into a circular-dependency issue, so let's use the same approach as with the repo name.
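The issubclass suggestion above (ultimately dropped because of the circular import) could be sketched roughly like this. The mapping, account ID, and class names are illustrative stand-ins, not the SDK's actual code:

```python
# Illustrative sketch of a registry() keyed on classes rather than name
# strings. The account ID and class names are hypothetical stand-ins.

class LDA(object):
    pass

class MyLDA(LDA):  # an end user's subclass
    pass

# Hypothetical mapping from base algorithm class to ECR account.
ALGORITHM_ACCOUNTS = {
    LDA: '123456789012',
}

def registry(region_name, algorithm_class):
    # issubclass() lets user-defined subclasses resolve to the
    # registry entry of the algorithm they extend.
    for cls, account in ALGORITHM_ACCOUNTS.items():
        if issubclass(algorithm_class, cls):
            return '{}.dkr.ecr.{}.amazonaws.com'.format(account, region_name)
    raise ValueError('Unknown algorithm class: {}'.format(algorithm_class))

print(registry('us-west-2', MyLDA))
```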

@@ -152,6 +154,61 @@ def __repr__(self):
         """Return an unambiguous representation of this RecordSet"""
         return str((RecordSet, self.__dict__))

+    @staticmethod
+    def from_s3(data_path, num_records, feature_dim, channel='train'):
Contributor

I don't think this is needed. EASE already has the logic to list files in an S3 prefix and create its internal manifest file - that's what happens when you use its S3Prefix mode and just pass it a prefix. We shouldn't duplicate that logic here.

So, you can already create a RecordSet object and set the s3_data_type constructor argument to 'S3Prefix' to get this functionality. (Or if you just pass an S3 URI string as input to .fit(), you'll get it as well.)

Contributor Author

Indeed, no need to duplicate. I'll be addressing your comment below about the test, and this will likely go away.
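To illustrate the reviewer's point, here is a stand-in RecordSet that approximates (but does not reproduce) the SDK's constructor, showing that pointing at an S3 prefix is just a constructor argument:

```python
# Stand-in for the SDK's RecordSet, approximating its constructor to
# illustrate the S3Prefix point; check the real class for the exact API.

class RecordSet(object):
    def __init__(self, s3_data, num_records, feature_dim,
                 s3_data_type='ManifestFile', channel='train'):
        self.s3_data = s3_data
        self.num_records = num_records
        self.feature_dim = feature_dim
        self.s3_data_type = s3_data_type
        self.channel = channel

# No from_s3() helper needed: 'S3Prefix' tells the service to list all
# objects under the prefix itself.
records = RecordSet('s3://my-bucket/lda-data/', num_records=100,
                    feature_dim=25, s3_data_type='S3Prefix')
print(records.s3_data_type)
```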

repo = 'lda:1'

num_topics = hp('num_topics', (isint, gt(0)), 'An integer greater than zero')
alpha0 = hp('alpha0', isnumber, "A float value")
Contributor

alpha0 = hp('alpha0', isnumber, "A float value")
max_restarts = hp('max_restarts', (isint, gt(0)), 'An integer greater than zero')
max_iterations = hp('max_iterations', (isint, gt(0)), 'An integer greater than zero')
tol = hp('tol', (isnumber, gt(0)), "A positive float")
Contributor

Use single quotes for consistency (also applies to line 27).

the inference code might use the IAM role, if accessing AWS resources.
train_instance_type (str): Type of EC2 instance to use for training, for example, 'ml.c4.xlarge'.
num_topics (int): The number of topics for LDA to find within the data.
alpha0 (float): Initial guess for the concentration parameter
Contributor

Indicate that these hps are optional.

Contributor Author

ok

"""

# this algorithm only supports single instance training
super(LDA, self).__init__(role, 1, train_instance_type, **kwargs)
Contributor

Add a link to the docs for this.

It also indicates that it only supports CPU instances for training. That seems like it would be good to validate.

Contributor Author

The training job will not fail even if it isn't started on a CPU instance, so I think we shouldn't fail too fast here. I'll add a comment/link.

Contributor

Yea, that makes sense. I misunderstood.


def fit(self, records, mini_batch_size, **kwargs):
# mini_batch_size is required
if mini_batch_size is None:
Contributor

This check isn't needed since mini_batch_size isn't an optional parameter.

Contributor Author

Since it is a required parameter, we must fail if someone calls fit(records, None).

Contributor

I personally think this kind of validation is generally not worth the code clutter, since Python's support for optional parameters as a first-class feature means that people don't randomly pass None to things and expect it to work. However this is pretty minor and it can add value in some cases so I'm okay with leaving this if you feel strongly.

Contributor Author

Simplified the checks

# mini_batch_size is required
if mini_batch_size is None:
raise ValueError("mini_batch_size must be set")
if not isinstance(mini_batch_size, int) or mini_batch_size < 1:
Contributor

I don't think this check should go here - mini_batch_size is a shared concept across all the 1P algorithms, so it'd be better to put it in the base class if we want it at all.

Contributor Author

That's not entirely true. Some algorithms have it as optional (e.g. FM), and some don't have it at all (e.g. XGBoost). Since it is algorithm-dependent, we need to validate it here.

Contributor

mini_batch_size is a parameter in the fit() method of the base class of all 1P algorithms: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/amazon/amazon_estimator.py#L67

It takes either an int or None; None represents the cases where it's not required. Validating that it's an int if it's not None will apply for all cases.

Contributor Author

You're right - eventually the code will check the type, so I'm leaving just the value check.
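The simplified check this thread converges on might look roughly like this (a trimmed, hypothetical sketch, not the actual diff):

```python
# Hypothetical, trimmed version of the validation the thread converges
# on: keep only the value check; the type is enforced downstream by the
# base class when it serializes the parameter.

def validate_mini_batch_size(mini_batch_size):
    # None short-circuits before the comparison, so fit(records, None)
    # still fails fast with a clear message.
    if mini_batch_size is None or mini_batch_size < 1:
        raise ValueError('mini_batch_size must be a positive integer')
    return mini_batch_size

print(validate_mini_batch_size(100))
```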

sagemaker_session=sagemaker_session, base_job_name='test-lda')

# upload data and prepare the set
data_location_key = "integ-test-data/lda-" + sagemaker_timestamp()
Contributor

You should be able to do something like "lda.record_set(...)" instead of uploading to S3 separately, right?

Contributor Author

"lda.record_set(...)" works with numpy arrays but not with binary files.

With respect to your other comment about "from_s3": I'll see if it can be nicely refactored into one call.

HP_TRAIN_CALL.update({'hyperparameters': STRINGIFIED_HYPERPARAMS})


def test_call_fit(sagemaker_session):
Contributor

I don't think we should write unit tests that test functionality past the point where the LDA class hands off to the base class. The base class should already have test coverage on this functionality (and if it doesn't, we should add it there).

We should really just be (comprehensively) unit testing that the new code we wrote works correctly, and not more.

Contributor Author

I agree with that principle. These unit tests verify the logic inside fit() at this level (validation of mini_batch_size). This particular one tests that fit() delegates to the base implementation correctly; it's easiest to check how it calls the service, since that's where we intercept calls.

Contributor

You should be able to use patching to replace the fit() method of the base class with a Mock object, and verify that parameters are passed correctly to the mock. That would be much more direct and concise (and thus maintainable).

Contributor Author

Yeah - this seems cleaner; changing it.
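The patching approach the reviewer suggests might look like this in a unit test (the class names here are stand-ins, not the SDK's real classes):

```python
# Sketch of patching the base class's fit() with a Mock to verify
# delegation directly, instead of intercepting service calls.
# Class names are stand-ins, not the SDK's real classes.

from unittest import mock

class AmazonAlgorithmEstimatorBase(object):
    def fit(self, records, mini_batch_size=None, **kwargs):
        raise RuntimeError('should be mocked out in unit tests')

class LDA(AmazonAlgorithmEstimatorBase):
    def fit(self, records, mini_batch_size, **kwargs):
        if mini_batch_size is None or mini_batch_size < 1:
            raise ValueError('mini_batch_size must be a positive integer')
        super(LDA, self).fit(records, mini_batch_size, **kwargs)

with mock.patch.object(AmazonAlgorithmEstimatorBase, 'fit') as base_fit:
    LDA().fit('records', 10)
    # The patched method records the call without self.
    base_fit.assert_called_once_with('records', 10)
print('delegation verified')
```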

repo = 'lda:1'

num_topics = hp('num_topics', gt(0), 'An integer greater than zero', int)
alpha0 = hp('alpha0', (), "A float value", float)
Contributor

Should also be gt(0)? (Basing this off: https://docs.aws.amazon.com/sagemaker/latest/dg/lda_hyperparameters.html )

Also, use single quotes for consistency (here and also line 30).

Contributor Author

Good catch!
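In the spirit of the hp() signature shown above, a minimal descriptor with a gt(0) validator and a data type could be sketched like this. This is illustrative only; the SDK's real Hyperparameter class differs:

```python
# Minimal sketch of an hp() descriptor that casts to a data type and
# runs validators, in the spirit of the diff above. Illustrative only.

def gt(minimum):
    return lambda value: value > minimum

class hp(object):
    def __init__(self, name, validators, message, data_type):
        self.name = name
        self.validators = validators if isinstance(validators, tuple) else (validators,)
        self.message = message
        self.data_type = data_type

    def __set__(self, obj, value):
        value = self.data_type(value)  # cast rather than isinstance-check (aws#54)
        if not all(validate(value) for validate in self.validators):
            raise ValueError('{}: expected {}'.format(self.name, self.message))
        obj.__dict__[self.name] = value

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name)

class LDA(object):
    num_topics = hp('num_topics', gt(0), 'An integer greater than zero', int)
    alpha0 = hp('alpha0', gt(0), 'A positive float', float)

lda = LDA()
lda.alpha0 = 1.5
print(lda.alpha0)  # -> 1.5
```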

@@ -21,7 +21,8 @@

class FactorizationMachines(AmazonAlgorithmEstimatorBase):

repo = 'factorization-machines:1'
alg_name = 'factorization-machines'
Contributor

I would still name these "repo_name" and "repo_tag" or similar, since that's what they are first and foremost.

@@ -194,7 +195,8 @@ class FactorizationMachinesModel(Model):

     def __init__(self, model_data, role, sagemaker_session=None):
         sagemaker_session = sagemaker_session or Session()
-        image = registry(sagemaker_session.boto_session.region_name) + "/" + FactorizationMachines.repo
+        repo = '{}:{}'.format(FactorizationMachines.alg_name, FactorizationMachines.alg_version)
+        image = registry(sagemaker_session.boto_session.region_name) + "/" + repo
Contributor

Use .format() here as well for consistency?

@@ -259,14 +223,14 @@ def upload_numpy_to_s3_shards(num_shards, s3, bucket, key_prefix, array, labels=

def registry(region_name, algorithm=None):
Contributor

Maybe change "algorithm" to "algorithm_repo_name" or similar?

@@ -127,6 +126,26 @@ def record_set(self, train, labels=None, channel="train"):
logger.debug("Created manifest file {}".format(manifest_s3_file))
return RecordSet(manifest_s3_file, num_records=train.shape[0], feature_dim=train.shape[1], channel=channel)

def record_set_from_local_files(self, data_path, num_records, feature_dim, channel="train"):
Contributor

Unit tests please.

"""Build a :class:`~RecordSet` by pointing to local files.

Args:
data_path (string): Path to local file to be uploaded for training.
Contributor

Need more specificity either here or in the overall comment. The user needs to know whether it works on single files, directories, or both, etc.

@lukmis (Contributor Author) commented Jan 30, 2018

Addressed the comments.

@winstonaws (Contributor) previously approved these changes Jan 30, 2018

Thanks!

@lukmis lukmis merged commit 354ded3 into aws:master Jan 31, 2018
jalabort added a commit to hudl/sagemaker-python-sdk that referenced this pull request Mar 1, 2018
* Add data_type to hyperparameters (aws#54)

When we describe a training job, the data type of the hyperparameters is lost because we use a dict[str, str]. This adds a new field to Hyperparameter so that we can convert the data types at runtime.

Instead of validating with isinstance(), we cast the hp value to the type it is meant to be. This enforces a "strongly typed" value and also makes the string responses easier to deal with when we deserialize from the API.

* Add wrapper for LDA. (aws#56)

Update CHANGELOG and bump the version number.

* Add support for async fit() (aws#59)

When calling fit(wait=False) it returns immediately. The training job will carry on even if the process exits. By using attach(), the estimator can be retrieved by providing the training job name.

_prepare_init_params_from_job_description() is now a classmethod instead of a static method. Each class is responsible for implementing its own logic to convert a training job description into arguments that can be passed to its own __init__().

* Fix Estimator role expansion (aws#68)

Instead of manually constructing the role ARN, use the IAM boto client
to do it. This properly expands service-roles and regular roles.

* Add FM and LDA to the documentation. (aws#66)

* Fix description of an argument of sagemaker.session.train (aws#69)

* Fix description of an argument of sagemaker.session.train

'input_config' should be an array which has channel objects.

* Add a link to the botocore docs

* Use 'list' instead of 'array' in the description

* Add ntm algorithm with doc, unit tests, integ tests (aws#73)

* JSON serializer: predictor.predict accepts dictionaries (aws#62)

Add support for serializing python dictionaries to json
Add prediction with dictionary in tf iris integ test

* Fixing timeouts for PCA async integration test. (aws#78)

Execute tf_cifar test without logs to eliminate delay to detect that job has finished.

* Fixes in LinearLearner and unit tests addition. (aws#77)

* Print out billable seconds after training completes (aws#30)

* Added: print out billable seconds after training completes

* Fixed: test_session.py to pass unit tests

* Fixed: removed offending tzlocal()

* Use sagemaker_timestamp when creating endpoint names in integration tests. (aws#81)

* Support TensorFlow-1.5.0 and MXNet-1.0.0  (aws#82)

* Update .gitignore to ignore pytest_cache.

* Support TensorFlow-1.5.0 and MXNet-1.0.0

* Update and refactor tests. Add tests for fw_utils.

* Fix typo.

* Update changelog for 1.1.0 (aws#85)
apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this pull request Nov 15, 2018