Can't use record_set() to create data for RCF "test" channel #2925


Closed
phschimm opened this issue Feb 9, 2022 · 2 comments

phschimm commented Feb 9, 2022

Describe the bug
The method sagemaker.RandomCutForest.record_set() can't be used to create a RecordSet for the "test" channel of the RCF algorithm.

To reproduce
Configure a RandomCutForest estimator and try fitting it to data ingested via record_set(..., channel='test'):

from sagemaker import RandomCutForest

rcf = RandomCutForest(
    role=execution_role,
    instance_count=1,
    instance_type='ml.m5.large',
    data_location=f's3://{bucket}/{prefix}/',
    output_path=f's3://{bucket}/{prefix}/output',
    num_samples_per_tree=512,
    num_trees=50,
    base_job_name=base_job_name,
    eval_metrics=['accuracy', 'precision_recall_fscore']
)

test_set = rcf.record_set(
    features,
    labels=labels,
    channel='test'  # <- this is what breaks
)

rcf.fit(test_set)

Expected behavior
A RecordSet returned by record_set(..., channel='test') should have "S3DataDistributionType": "FullyReplicated".
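
For illustration, the distribution type the SDK actually requests can be inspected like this (a sketch; records_s3_input() and TrainingInput.config are SDK internals and may differ between versions):

# Sketch: inspect the channel config the SDK builds for this RecordSet.
channel_input = test_set.records_s3_input()
print(channel_input.config["DataSource"]["S3DataSource"]["S3DataDistributionType"])
# prints 'ShardedByS3Key', but the RCF test channel only accepts 'FullyReplicated'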

Screenshots or logs


Docker entrypoint called with argument(s): train
Running default environment configuration script
[02/09/2022 18:27:59 INFO 140001573062464] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-conf.json: {'num_samples_per_tree': 256, 'num_trees': 100, 'force_dense': 'true', 'eval_metrics': ['accuracy', 'precision_recall_fscore'], 'epochs': 1, 'mini_batch_size': 1000, '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999}
[02/09/2022 18:27:59 INFO 140001573062464] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {'num_trees': '563', 'num_samples_per_tree': '125', 'feature_dim': '71', '_tuning_objective_metric': 'test:f1', 'eval_metrics': '["accuracy", "precision_recall_fscore"]', 'mini_batch_size': '1000'}
[02/09/2022 18:27:59 INFO 140001573062464] Final configuration: {'num_samples_per_tree': '125', 'num_trees': '563', 'force_dense': 'true', 'eval_metrics': '["accuracy", "precision_recall_fscore"]', 'epochs': 1, 'mini_batch_size': '1000', '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': 'test:f1', '_ftp_port': 8999, 'feature_dim': '71'}
[02/09/2022 18:27:59 ERROR 140001573062464] Customer Error: Unable to initialize the algorithm. Failed to validate input data configuration. (caused by ValidationError)
Caused by: 'ShardedByS3Key' is not one of ['FullyReplicated']
Failed validating 'enum' in schema['properties']['test']['properties']['S3DistributionType']:
    {'enum': ['FullyReplicated'], 'type': 'string'}
On instance['test']['S3DistributionType']:
    'ShardedByS3Key'

System information

  • SageMaker Python SDK version: 2.72.2
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Random Cut Forest
  • Framework version: v1
  • Python version: 3.7.10
  • CPU or GPU: CPU (instance_type='ml.m5.large')
  • Custom Docker image (Y/N): N

Additional context
This value is hardcoded in the RecordSet class that record_set() returns, in its records_s3_input() method:

return TrainingInput(
    self.s3_data, distribution="ShardedByS3Key", s3_data_type=self.s3_data_type
)
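
One possible workaround I can think of (an untested sketch; it assumes the distribution is only set in the records_s3_input() method shown above) is to subclass RecordSet and request FullyReplicated explicitly:

from sagemaker.amazon.amazon_estimator import RecordSet
from sagemaker.inputs import TrainingInput

class FullyReplicatedRecordSet(RecordSet):
    # Identical to the base class except for the distribution type,
    # which the RCF "test" channel schema requires to be FullyReplicated.
    def records_s3_input(self):
        return TrainingInput(
            self.s3_data,
            distribution="FullyReplicated",
            s3_data_type=self.s3_data_type,
        )

# Upload via record_set() as usual, then rewrap the uploaded data:
uploaded = rcf.record_set(features, labels=labels, channel='test')
test_set = FullyReplicatedRecordSet(
    uploaded.s3_data,
    uploaded.num_records,
    uploaded.feature_dim,
    s3_data_type=uploaded.s3_data_type,
    channel=uploaded.channel,
)

If I read the SDK right, fit() builds each channel via data_channel(), which calls records_s3_input(), so the override should take effect.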

@mufaddal-rohawala @jeniyat or anyone else: In the meantime, is there any other way to create a RecordSet for RCF from Numpy data?

phschimm (Author) commented:
I've posted a more detailed investigation about this problem on StackOverflow:
https://stackoverflow.com/questions/71053554/why-can-random-cut-forests-record-set-method-for-data-conversion-upload-not

Can someone identify which SDK version was used in that post?

If I had that information, I could downgrade my notebook instance, execute my experiments, and get the quality metrics I need.

natbukowski commented:
Hello, I am also experiencing this issue. Is there any known workaround for this problem?
