Add wrapper for LDA. #56


Merged
merged 11 commits into from
Jan 31, 2018

Conversation

@lukmis (Contributor) commented Jan 24, 2018

Add wrapper class for LDA algorithm.

Include unit tests for the new class.
Include an integration test for the new class.

Add static method to construct RecordSet for existing files in S3.

Bump up version and update CHANGELOG with changes since last version.

@lukmis requested a review from iquintero, January 25, 2018 22:36
@@ -47,7 +49,7 @@ def __init__(self, role, train_instance_count, train_instance_type, data_locatio
         self.data_location = data_location

     def train_image(self):
-        return registry(self.sagemaker_session.boto_region_name) + "/" + type(self).repo
+        return registry(self.sagemaker_session.boto_region_name, type(self).__name__) + "/" + type(self).repo
Contributor

The type(self).__name__ here isn't ideal, since it will break if the LDA class is subclassed. (I think the type(self).repo approach we're already using isn't ideal either, but at least it still works in that case.)

Changing the new algorithm parameter to take in the class instead of the name may be an improvement. Open to other suggestions as well.

Contributor Author

Good observation. I'm not sure about the scenario for subclassing these wrapper classes, but in any case there are two options: either add the new class to the registry() mapping (just as you would for any new class) or override train_image in the subclass.

Contributor

I'm talking about end users subclassing them in their own code. There's no super obvious case for doing so, but it's also not wrong to do so, and all else being equal there shouldn't be unexpected things that break when it happens.

The cost to make this work for subclassing isn't big IMO - you can modify the registry() method to accept a class instead of a string, and then use issubclass (https://docs.python.org/3/library/functions.html#issubclass) instead of direct equality checks.

Contributor Author

The implementation that passes the class in runs into a circular-dependency issue, so let's use the same approach as with the repo name.
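The issubclass suggestion above (ultimately dropped because of the circular import) could be sketched roughly like this. The mapping, account ID, and class names are illustrative stand-ins, not the SDK's actual code:

```python
# Illustrative sketch of a registry() keyed on classes rather than name
# strings. The account ID and class names are hypothetical stand-ins.

class LDA(object):
    pass

class MyLDA(LDA):  # an end user's subclass
    pass

# Hypothetical mapping from base algorithm class to ECR account.
ALGORITHM_ACCOUNTS = {
    LDA: '123456789012',
}

def registry(region_name, algorithm_class):
    # issubclass() lets user-defined subclasses resolve to the
    # registry entry of the algorithm they extend.
    for cls, account in ALGORITHM_ACCOUNTS.items():
        if issubclass(algorithm_class, cls):
            return '{}.dkr.ecr.{}.amazonaws.com'.format(account, region_name)
    raise ValueError('Unknown algorithm class: {}'.format(algorithm_class))

print(registry('us-west-2', MyLDA))
```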

@@ -152,6 +154,61 @@ def __repr__(self):
         """Return an unambiguous representation of this RecordSet"""
         return str((RecordSet, self.__dict__))

+    @staticmethod
+    def from_s3(data_path, num_records, feature_dim, channel='train'):
Contributor

I don't think this is needed. EASE already has the logic to list files in an S3 prefix and create its internal manifest file - that's what happens when you use its S3Prefix mode and just pass it a prefix. We shouldn't duplicate that logic here.

So, you can already create a RecordSet object and set the s3_data_type constructor argument to 'S3Prefix' to get this functionality. (Or if you just pass an S3 URI string as input to .fit(), you'll get it as well.)

Contributor Author

Indeed, no need to duplicate. I'll be addressing your comment below about the test, and this will likely go away.
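To illustrate the reviewer's point, here is a stand-in RecordSet that approximates (but does not reproduce) the SDK's constructor, showing that pointing at an S3 prefix is just a constructor argument:

```python
# Stand-in for the SDK's RecordSet, approximating its constructor to
# illustrate the S3Prefix point; check the real class for the exact API.

class RecordSet(object):
    def __init__(self, s3_data, num_records, feature_dim,
                 s3_data_type='ManifestFile', channel='train'):
        self.s3_data = s3_data
        self.num_records = num_records
        self.feature_dim = feature_dim
        self.s3_data_type = s3_data_type
        self.channel = channel

# No from_s3() helper needed: 'S3Prefix' tells the service to list all
# objects under the prefix itself.
records = RecordSet('s3://my-bucket/lda-data/', num_records=100,
                    feature_dim=25, s3_data_type='S3Prefix')
print(records.s3_data_type)
```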

repo = 'lda:1'

num_topics = hp('num_topics', (isint, gt(0)), 'An integer greater than zero')
alpha0 = hp('alpha0', isnumber, "A float value")
Contributor

alpha0 = hp('alpha0', isnumber, "A float value")
max_restarts = hp('max_restarts', (isint, gt(0)), 'An integer greater than zero')
max_iterations = hp('max_iterations', (isint, gt(0)), 'An integer greater than zero')
tol = hp('tol', (isnumber, gt(0)), "A positive float")
Contributor

Use single quotes for consistency (also applies to line 27).

the inference code might use the IAM role, if accessing AWS resources.
train_instance_type (str): Type of EC2 instance to use for training, for example, 'ml.c4.xlarge'.
num_topics (int): The number of topics for LDA to find within the data.
alpha0 (float): Initial guess for the concentration parameter
Contributor

Indicate that these hps are optional.

Contributor Author

ok

"""

# this algorithm only supports single instance training
super(LDA, self).__init__(role, 1, train_instance_type, **kwargs)
Contributor

Add a link to the docs for this.

It also indicates that it only supports CPU instances for training. That seems like it would be good to validate.

Contributor Author

The training job will not fail even if it isn't started on a CPU instance, so I think we shouldn't fail too fast here. I'll add a comment/link.

Contributor

Yea, that makes sense. I misunderstood.


def fit(self, records, mini_batch_size, **kwargs):
# mini_batch_size is required
if mini_batch_size is None:
Contributor

This check isn't needed since mini_batch_size isn't an optional parameter.

Contributor Author

Since it is a required parameter, we must fail if someone calls fit(records, None).

Contributor

I personally think this kind of validation is generally not worth the code clutter, since Python's support for optional parameters as a first-class feature means that people don't randomly pass None to things and expect it to work. However this is pretty minor and it can add value in some cases so I'm okay with leaving this if you feel strongly.

Contributor Author

Simplified the checks

# mini_batch_size is required
if mini_batch_size is None:
raise ValueError("mini_batch_size must be set")
if not isinstance(mini_batch_size, int) or mini_batch_size < 1:
Contributor

I don't think this check should go here - mini_batch_size is a shared concept across all the 1P algorithms, so it'd be better to put it in the base class if we want it at all.

Contributor Author

That's not entirely true. Some algorithms have it as optional (e.g. FM), and some don't have it at all (e.g. XGBoost). Since it is algorithm-dependent, we need to validate it here.

Contributor

mini_batch_size is a parameter in the fit() method of the base class of all 1P algorithms: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/amazon/amazon_estimator.py#L67

It takes either an int or None; None represents the cases where it's not required. Validating that it's an int if it's not None will apply for all cases.

Contributor Author

You're right - eventually the code will check the type, so I'm leaving just the value check.
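The simplified check this thread converges on might look roughly like this (a trimmed, hypothetical sketch, not the actual diff):

```python
# Hypothetical, trimmed version of the validation the thread converges
# on: keep only the value check; the type is enforced downstream by the
# base class when it serializes the parameter.

def validate_mini_batch_size(mini_batch_size):
    # None short-circuits before the comparison, so fit(records, None)
    # still fails fast with a clear message.
    if mini_batch_size is None or mini_batch_size < 1:
        raise ValueError('mini_batch_size must be a positive integer')
    return mini_batch_size

print(validate_mini_batch_size(100))
```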

sagemaker_session=sagemaker_session, base_job_name='test-lda')

# upload data and prepare the set
data_location_key = "integ-test-data/lda-" + sagemaker_timestamp()
Contributor

You should be able to do something like "lda.record_set(...)" instead of uploading to S3 separately, right?

Contributor Author

"lda.record_set(...)" works with numpy arrays but not with binary files.

With respect to your other comment about "from_s3": I'll see if it can be nicely refactored into one call.

HP_TRAIN_CALL.update({'hyperparameters': STRINGIFIED_HYPERPARAMS})


def test_call_fit(sagemaker_session):
Contributor

I don't think we should write unit tests that test functionality past the point where the LDA class hands off to the base class. The base class should already have test coverage on this functionality (and if it doesn't, we should add it there).

We should really just be (comprehensively) unit testing that the new code we wrote works correctly, and not more.

Contributor Author

I agree with that principle. These unit tests verify the logic inside fit() at this level (validation of mini_batch_size). This particular one tests that fit() delegates to the base implementation correctly; it's easiest to check how it calls the service, since that's where we intercept calls.

Contributor

You should be able to use patching to replace the fit() method of the base class with a Mock object, and verify that parameters are passed correctly to the mock. That would be much more direct and concise (and thus maintainable).

Contributor Author

Yeah - this seems cleaner; changing it.
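The patching approach the reviewer suggests might look like this in a unit test (the class names here are stand-ins, not the SDK's real classes):

```python
# Sketch of patching the base class's fit() with a Mock to verify
# delegation directly, instead of intercepting service calls.
# Class names are stand-ins, not the SDK's real classes.

from unittest import mock

class AmazonAlgorithmEstimatorBase(object):
    def fit(self, records, mini_batch_size=None, **kwargs):
        raise RuntimeError('should be mocked out in unit tests')

class LDA(AmazonAlgorithmEstimatorBase):
    def fit(self, records, mini_batch_size, **kwargs):
        if mini_batch_size is None or mini_batch_size < 1:
            raise ValueError('mini_batch_size must be a positive integer')
        super(LDA, self).fit(records, mini_batch_size, **kwargs)

with mock.patch.object(AmazonAlgorithmEstimatorBase, 'fit') as base_fit:
    LDA().fit('records', 10)
    # The patched method records the call without self.
    base_fit.assert_called_once_with('records', 10)
print('delegation verified')
```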

repo = 'lda:1'

num_topics = hp('num_topics', gt(0), 'An integer greater than zero', int)
alpha0 = hp('alpha0', (), "A float value", float)
Contributor

Should also be gt(0)? (Basing this off: https://docs.aws.amazon.com/sagemaker/latest/dg/lda_hyperparameters.html )

Also, use single quotes for consistency (here and also line 30).

Contributor Author

Good catch!
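In the spirit of the hp() signature shown above, a minimal descriptor with a gt(0) validator and a data type could be sketched like this. This is illustrative only; the SDK's real Hyperparameter class differs:

```python
# Minimal sketch of an hp() descriptor that casts to a data type and
# runs validators, in the spirit of the diff above. Illustrative only.

def gt(minimum):
    return lambda value: value > minimum

class hp(object):
    def __init__(self, name, validators, message, data_type):
        self.name = name
        self.validators = validators if isinstance(validators, tuple) else (validators,)
        self.message = message
        self.data_type = data_type

    def __set__(self, obj, value):
        value = self.data_type(value)  # cast rather than isinstance-check (aws#54)
        if not all(validate(value) for validate in self.validators):
            raise ValueError('{}: expected {}'.format(self.name, self.message))
        obj.__dict__[self.name] = value

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name)

class LDA(object):
    num_topics = hp('num_topics', gt(0), 'An integer greater than zero', int)
    alpha0 = hp('alpha0', gt(0), 'A positive float', float)

lda = LDA()
lda.alpha0 = 1.5
print(lda.alpha0)  # -> 1.5
```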

@@ -21,7 +21,8 @@

class FactorizationMachines(AmazonAlgorithmEstimatorBase):

repo = 'factorization-machines:1'
alg_name = 'factorization-machines'
Contributor

I would still name these "repo_name" and "repo_tag" or similar, since that's what they are first and foremost.

@@ -194,7 +195,8 @@ class FactorizationMachinesModel(Model):

     def __init__(self, model_data, role, sagemaker_session=None):
         sagemaker_session = sagemaker_session or Session()
-        image = registry(sagemaker_session.boto_session.region_name) + "/" + FactorizationMachines.repo
+        repo = '{}:{}'.format(FactorizationMachines.alg_name, FactorizationMachines.alg_version)
+        image = registry(sagemaker_session.boto_session.region_name) + "/" + repo
Contributor

Use .format() here as well for consistency?

@@ -259,14 +223,14 @@ def upload_numpy_to_s3_shards(num_shards, s3, bucket, key_prefix, array, labels=

def registry(region_name, algorithm=None):
Contributor

Maybe change "algorithm" to "algorithm_repo_name" or similar?

@@ -127,6 +126,26 @@ def record_set(self, train, labels=None, channel="train"):
logger.debug("Created manifest file {}".format(manifest_s3_file))
return RecordSet(manifest_s3_file, num_records=train.shape[0], feature_dim=train.shape[1], channel=channel)

def record_set_from_local_files(self, data_path, num_records, feature_dim, channel="train"):
Contributor

Unit tests please.

"""Build a :class:`~RecordSet` by pointing to local files.

Args:
data_path (string): Path to local file to be uploaded for training.
Contributor

Need more specificity either here or in the overall comment. The user needs to know whether it works on single files, directories, or both, etc.

@lukmis (Contributor Author) commented Jan 30, 2018

Addressed the comments.

@winstonaws (Contributor) previously approved these changes Jan 30, 2018

Thanks!

@lukmis lukmis merged commit 354ded3 into aws:master Jan 31, 2018
jalabort added a commit to hudl/sagemaker-python-sdk that referenced this pull request Mar 1, 2018
* Add data_type to hyperparameters (aws#54)

When we describe a training job, the data type of the hyperparameters is lost because we use a dict[str, str]. This adds a new field to Hyperparameter so that we can convert the data types at runtime.

Instead of validating with isinstance(), we cast the hp value to the type it is meant to be. This enforces a "strongly typed" value and also makes the string responses easier to deal with when we deserialize from the API.

* Add wrapper for LDA. (aws#56)

Update CHANGELOG and bump the version number.

* Add support for async fit() (aws#59)

When calling fit(wait=False) it returns immediately. The training job will carry on even if the process exits. By using attach(), the estimator can be retrieved by providing the training job name.

_prepare_init_params_from_job_description() is now a classmethod instead of a static method. Each class is responsible for implementing its own logic to convert a training job description into arguments that can be passed to its own __init__().

* Fix Estimator role expansion (aws#68)

Instead of manually constructing the role ARN, use the IAM boto client
to do it. This properly expands service-roles and regular roles.

* Add FM and LDA to the documentation. (aws#66)

* Fix description of an argument of sagemaker.session.train (aws#69)

* Fix description of an argument of sagemaker.session.train

'input_config' should be an array which has channel objects.

* Add a link to the botocore docs

* Use 'list' instead of 'array' in the description

* Add ntm algorithm with doc, unit tests, integ tests (aws#73)

* JSON serializer: predictor.predict accepts dictionaries (aws#62)

Add support for serializing python dictionaries to json
Add prediction with dictionary in tf iris integ test

* Fixing timeouts for PCA async integration test. (aws#78)

Execute tf_cifar test without logs to eliminate delay to detect that job has finished.

* Fixes in LinearLearner and unit tests addition. (aws#77)

* Print out billable seconds after training completes (aws#30)

* Added: print out billable seconds after training completes

* Fixed: test_session.py to pass unit tests

* Fixed: removed offending tzlocal()

* Use sagemaker_timestamp when creating endpoint names in integration tests. (aws#81)

* Support TensorFlow-1.5.0 and MXNet-1.0.0  (aws#82)

* Update .gitignore to ignore pytest_cache.

* Support TensorFlow-1.5.0 and MXNet-1.0.0

* Update and refactor tests. Add tests for fw_utils.

* Fix typo.

* Update changelog for 1.1.0 (aws#85)
apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this pull request Nov 15, 2018