Can't use record_set() to create data for RCF "test" channel #2925


Closed
phschimm opened this issue Feb 9, 2022 · 2 comments

phschimm commented Feb 9, 2022

Describe the bug
The method sagemaker.RandomCutForest.record_set() can't be used to create a RecordSet for the "test" channel of the RCF algorithm.

To reproduce
Configure a RandomCutForest estimator and try fitting it to data ingested via record_set(..., channel='test'):

from sagemaker import RandomCutForest

rcf = RandomCutForest(
    role=execution_role,
    instance_count=1,
    instance_type='ml.m5.large',
    data_location=f's3://{bucket}/{prefix}/',
    output_path=f's3://{bucket}/{prefix}/output',
    num_samples_per_tree=512,
    num_trees=50,
    base_job_name=base_job_name,
    eval_metrics=['accuracy', 'precision_recall_fscore']
)

test_set = rcf.record_set(
    features,
    labels=labels,
    channel='test'  # <- this is what breaks
)

rcf.fit(test_set)

Expected behavior
A RecordSet returned by record_set(..., channel='test') should have "S3DataDistributionType": "FullyReplicated".
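
For illustration, the distribution type the SDK actually requests can be inspected like this (a sketch; records_s3_input() and TrainingInput.config are SDK internals and may differ between versions):

# Sketch: inspect the channel config the SDK builds for this RecordSet.
channel_input = test_set.records_s3_input()
print(channel_input.config["DataSource"]["S3DataSource"]["S3DataDistributionType"])
# prints 'ShardedByS3Key', but the RCF test channel only accepts 'FullyReplicated'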

Screenshots or logs


Docker entrypoint called with argument(s): train
Running default environment configuration script
[02/09/2022 18:27:59 INFO 140001573062464] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-conf.json: {'num_samples_per_tree': 256, 'num_trees': 100, 'force_dense': 'true', 'eval_metrics': ['accuracy', 'precision_recall_fscore'], 'epochs': 1, 'mini_batch_size': 1000, '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999}
[02/09/2022 18:27:59 INFO 140001573062464] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {'num_trees': '563', 'num_samples_per_tree': '125', 'feature_dim': '71', '_tuning_objective_metric': 'test:f1', 'eval_metrics': '["accuracy", "precision_recall_fscore"]', 'mini_batch_size': '1000'}
[02/09/2022 18:27:59 INFO 140001573062464] Final configuration: {'num_samples_per_tree': '125', 'num_trees': '563', 'force_dense': 'true', 'eval_metrics': '["accuracy", "precision_recall_fscore"]', 'epochs': 1, 'mini_batch_size': '1000', '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': 'test:f1', '_ftp_port': 8999, 'feature_dim': '71'}
[02/09/2022 18:27:59 ERROR 140001573062464] Customer Error: Unable to initialize the algorithm. Failed to validate input data configuration. (caused by ValidationError)
Caused by: 'ShardedByS3Key' is not one of ['FullyReplicated']
Failed validating 'enum' in schema['properties']['test']['properties']['S3DistributionType']:
    {'enum': ['FullyReplicated'], 'type': 'string'}
On instance['test']['S3DistributionType']:
    'ShardedByS3Key'

System information

  • SageMaker Python SDK version: 2.72.2
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Random Cut Forest
  • Framework version: v1
  • Python version: 3.7.10
  • CPU or GPU: CPU (instance_type='ml.m5.large')
  • Custom Docker image (Y/N): N

Additional context
This value is hardcoded in the RecordSet class that record_set() returns, in its records_s3_input() method:

return TrainingInput(
    self.s3_data, distribution="ShardedByS3Key", s3_data_type=self.s3_data_type
)
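
One possible workaround I can think of (an untested sketch; it assumes the distribution is only set in the records_s3_input() method shown above) is to subclass RecordSet and request FullyReplicated explicitly:

from sagemaker.amazon.amazon_estimator import RecordSet
from sagemaker.inputs import TrainingInput

class FullyReplicatedRecordSet(RecordSet):
    # Identical to the base class except for the distribution type,
    # which the RCF "test" channel schema requires to be FullyReplicated.
    def records_s3_input(self):
        return TrainingInput(
            self.s3_data,
            distribution="FullyReplicated",
            s3_data_type=self.s3_data_type,
        )

# Upload via record_set() as usual, then rewrap the uploaded data:
uploaded = rcf.record_set(features, labels=labels, channel='test')
test_set = FullyReplicatedRecordSet(
    uploaded.s3_data,
    uploaded.num_records,
    uploaded.feature_dim,
    s3_data_type=uploaded.s3_data_type,
    channel=uploaded.channel,
)

If I read the SDK right, fit() builds each channel via data_channel(), which calls records_s3_input(), so the override should take effect.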

@mufaddal-rohawala @jeniyat or anyone else: In the meantime, is there any other way to create a RecordSet for RCF from Numpy data?

phschimm (Author) commented:
I've posted a more detailed investigation about this problem on StackOverflow:
https://stackoverflow.com/questions/71053554/why-can-random-cut-forests-record-set-method-for-data-conversion-upload-not

Can someone identify which SDK version was used in that post?

If I had that information, I could downgrade my notebook instance, execute my experiments, and get the quality metrics I need.

natbukowski commented:
Hello, I am also experiencing this issue. Is there any known workaround for this problem?
