
Passing Pipeline Variable for entry_point while using XGBoost Estimator in script mode fails #3078


Closed
rohangpatil opened this issue Apr 23, 2022 · 12 comments
Labels
component: pipelines (Relates to the SageMaker Pipeline Platform), type: bug

Comments

@rohangpatil

rohangpatil commented Apr 23, 2022

Describe the bug
Passing a Pipeline Variable (e.g. an S3 URL) for entry_point while using the XGBoost Estimator in script mode fails to generate a pipeline definition, because the estimator tries to parse the variable and fails at URL parsing.

To reproduce

model_output_path = ParameterString(name="ModelOutputPath")

xgb_script_mode_estimator = XGBoost(
    entry_point=script_path,
    framework_version="1.5-1",  # Note: framework_version is mandatory
    role=role,
    instance_count=training_instance_count,
    instance_type=training_instance_type,
    output_path=model_output_path,
    hyperparameters={
        "eval_metric": eval_metric,
        "min_child_weight": min_child_weight,
    },
    checkpoint_s3_uri=checkpoint_path,
)

train_input = TrainingInput(
    s3_processed_training_samples_url, content_type=content_type
)
validation_input = TrainingInput(
    s3_processed_testing_samples_url, content_type=content_type
)

step_train = TrainingStep(
    name="TrainModel",
    estimator=xgb_script_mode_estimator,
    inputs={"train": train_input, "validation": validation_input},
)
pipeline = get_pipeline(
    region=region,
    role=role,
    default_bucket=default_bucket,
    model_package_group_name=model_package_group_name,
    pipeline_name=pipeline_name
)

import json
json.loads(pipeline.definition())

generates the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-7d78ecac635c> in <module>
      1 import json
----> 2 json.loads(pipeline.definition())

/opt/conda/lib/python3.7/site-packages/sagemaker/workflow/pipeline.py in definition(self)
    299     def definition(self) -> str:
    300         """Converts a request structure to string representation for workflow service calls."""
--> 301         request_dict = self.to_request()
    302         request_dict["PipelineExperimentConfig"] = interpolate(
    303             request_dict["PipelineExperimentConfig"], {}, {}

/opt/conda/lib/python3.7/site-packages/sagemaker/workflow/pipeline.py in to_request(self)
     89             if self.pipeline_experiment_config is not None
     90             else None,
---> 91             "Steps": list_to_request(self.steps),
     92         }
     93 

/opt/conda/lib/python3.7/site-packages/sagemaker/workflow/utilities.py in list_to_request(entities)
     40     for entity in entities:
     41         if isinstance(entity, Entity):
---> 42             request_dicts.append(entity.to_request())
     43         elif isinstance(entity, StepCollection):
     44             request_dicts.extend(entity.request_dicts())

/opt/conda/lib/python3.7/site-packages/sagemaker/workflow/steps.py in to_request(self)
    324     def to_request(self) -> RequestType:
    325         """Updates the request dictionary with cache configuration."""
--> 326         request_dict = super().to_request()
    327         if self.cache_config:
    328             request_dict.update(self.cache_config.config)

/opt/conda/lib/python3.7/site-packages/sagemaker/workflow/steps.py in to_request(self)
    212     def to_request(self) -> RequestType:
    213         """Gets the request structure for `ConfigurableRetryStep`."""
--> 214         step_dict = super().to_request()
    215         if self.retry_policies:
    216             step_dict["RetryPolicies"] = self._resolve_retry_policy(self.retry_policies)

/opt/conda/lib/python3.7/site-packages/sagemaker/workflow/steps.py in to_request(self)
    101             "Name": self.name,
    102             "Type": self.step_type.value,
--> 103             "Arguments": self.arguments,
    104         }
    105         if self.depends_on:

/opt/conda/lib/python3.7/site-packages/sagemaker/workflow/steps.py in arguments(self)
    306         """
    307 
--> 308         self.estimator._prepare_for_training(self.job_name)
    309         train_args = _TrainingJob._get_train_args(
    310             self.estimator, self.inputs, experiment_config=dict()

/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in _prepare_for_training(self, job_name)
   2660                 constructor if applicable.
   2661         """
-> 2662         super(Framework, self)._prepare_for_training(job_name=job_name)
   2663 
   2664         self._validate_and_set_debugger_configs()

/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in _prepare_for_training(self, job_name)
    659                 self.code_uri = self.uploaded_code.s3_prefix
    660             else:
--> 661                 self.uploaded_code = self._stage_user_code_in_s3()
    662                 code_dir = self.uploaded_code.s3_prefix
    663                 script = self.uploaded_code.script_name

/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in _stage_user_code_in_s3(self)
    699             kms_key = None
    700         elif self.code_location is None:
--> 701             code_bucket, _ = parse_s3_url(self.output_path)
    702             code_s3_prefix = "{}/{}".format(self._current_job_name, "source")
    703             kms_key = self.output_kms_key

/opt/conda/lib/python3.7/site-packages/sagemaker/s3.py in parse_s3_url(url)
     37     parsed_url = urlparse(url)
     38     if parsed_url.scheme != "s3":
---> 39         raise ValueError("Expecting 's3' scheme, got: {} in {}.".format(parsed_url.scheme, url))
     40     return parsed_url.netloc, parsed_url.path.lstrip("/")
     41 

/opt/conda/lib/python3.7/site-packages/sagemaker/workflow/entities.py in __str__(self)
     79     def __str__(self):
     80         """Override built-in String function for PipelineVariable"""
---> 81         raise TypeError("Pipeline variables do not support __str__ operation.")
     82 
     83     def __int__(self):

TypeError: Pipeline variables do not support __str__ operation.

Expected behavior
The pipeline definition should be generated successfully.

Screenshots or logs
N/A

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.87.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): XGBoost
  • Framework version: 1.5-1
  • Python version: 3.7.10
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N


@staubhp added the component: pipelines label on Apr 23, 2022
@jonsnowseven

Let me just add that this happens with the TensorFlow estimator as well.

@rob-burrus

rob-burrus commented Apr 29, 2022

I'm seeing this error as well, for any pipeline parameter. Even just a basic example in a Jupyter notebook:

x = ParameterString(name="something", default_value="hello")
str(x)

File .venv/lib/python3.8/site-packages/sagemaker/workflow/entities.py:81, in PipelineVariable.__str__(self)
     79 def __str__(self):
     80     """Override built-in String function for PipelineVariable"""
---> 81     raise TypeError("Pipeline variables do not support __str__ operation.")
TypeError: Pipeline variables do not support __str__ operation.

@jafebabe

Started seeing this issue today after I upgraded the SageMaker Python SDK.

@phall1

phall1 commented May 5, 2022

+1 on this. Using f-strings in ProcessingOutput parameters raised the same TypeError.

@jerrypeng7773
Contributor

jerrypeng7773 commented May 9, 2022

Thanks for using SageMaker Model Building Pipelines. The SageMaker Estimator expects entry_point to point to a local Python source file, which means the path needs to be available at compile time, so entry_point parameterization won't work. We have a PR under review to throw an exception when a parameterized entry_point is given.

https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html

The absolute or relative path to the local Python source file that should be executed as the entry point to training. (Default: None). If source_dir is specified, then entry_point must point to a file located at the root of source_dir. If ‘git_config’ is provided, ‘entry_point’ should be a relative location to the Python source file in the Git repo.
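
For illustration, a minimal sketch of the expected usage, where entry_point is a plain local path string (the script name, source_dir, and role ARN below are hypothetical):

from sagemaker.xgboost.estimator import XGBoost

role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical execution role ARN

# entry_point must be a local file path known at pipeline-definition time,
# not a ParameterString or other pipeline variable.
xgb_estimator = XGBoost(
    entry_point="train.py",   # hypothetical local script
    source_dir="code",        # hypothetical local directory containing train.py
    framework_version="1.5-1",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)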

@jerrypeng7773
Contributor

jerrypeng7773 commented May 9, 2022

+1 on this. Using f-strings in ProcessingOutput parameters raised the same TypeError.

Sorry, this is a known limitation: Python built-in functions cannot be applied to pipeline parameters directly. You can use .expr to print the parameter.

In addition, we are preparing comprehensive readthedocs documentation listing all of these limitations.
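
A minimal sketch of inspecting a parameter without calling str() on it (the parameter name and default value below are hypothetical):

from sagemaker.workflow.parameters import ParameterString

param = ParameterString(name="ModelOutputPath", default_value="s3://my-bucket/output")

# str(param) raises TypeError; .expr shows the parameter's JSON representation instead.
print(param.expr)  # {'Get': 'Parameters.ModelOutputPath'}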

@rohangpatil
Author

Thanks for using SageMaker Model Building Pipelines. The SageMaker Estimator expects entry_point to point to a local Python source file, which means the path needs to be available at compile time, so entry_point parameterization won't work. We have a PR under review to throw an exception when a parameterized entry_point is given.

https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html

The absolute or relative path to the local Python source file that should be executed as the entry point to training. (Default: None). If source_dir is specified, then entry_point must point to a file located at the root of source_dir. If ‘git_config’ is provided, ‘entry_point’ should be a relative location to the Python source file in the Git repo.

This is not happening because of the entry_point itself. Rather, when an entry point is specified, the pipeline tries to upload the code to an S3 location during pipeline initialization (rather than at run time). As you can see at line 701 of /opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py, in _stage_user_code_in_s3(self), it tries to upload the code to a location that is specified using a PipelineParameter, which cannot be resolved during pipeline initialization. This only happens when entry_point is specified on the estimator, not in other scenarios.
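
Based on that code path (the fallback to output_path only runs when code_location is None), one possible workaround is to give the estimator a static code_location, so code staging never has to parse a pipeline variable. A sketch using the repro code above, with a hypothetical bucket and prefix:

xgb_script_mode_estimator = XGBoost(
    entry_point=script_path,
    framework_version="1.5-1",
    role=role,
    instance_count=training_instance_count,
    instance_type=training_instance_type,
    # Keep the parameterized output_path for the model artifacts...
    output_path=model_output_path,
    # ...but stage the training code under a static, compile-time S3 prefix so
    # _stage_user_code_in_s3 never calls parse_s3_url on a pipeline variable.
    code_location="s3://my-bucket/code",  # hypothetical static prefix
    hyperparameters={
        "eval_metric": eval_metric,
        "min_child_weight": min_child_weight,
    },
    checkpoint_s3_uri=checkpoint_path,
)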

@rypoll

rypoll commented May 26, 2022

Does anyone know how to bypass this error? I'm trying to build a model but this error is stopping me completely in my tracks.

I'm getting the same error here and I've tried many things but to no avail:

data_quality_check_config = DataQualityCheckConfig(
    baseline_dataset=f"{processed_data_path}/data-quality/",
    dataset_format=DatasetFormat.csv(
        header=True,
        output_columns_position="START"
    ),
    output_s3_uri=data_check_output_path,
)

Where processed_data_path is an S3 URI.

@jessieweiyi

Does anyone know how to bypass this error? I'm trying to build a model but this error is stopping me completely in my tracks.

I'm getting the same error here and I've tried many things but to no avail:

data_quality_check_config = DataQualityCheckConfig(
    baseline_dataset=f"{processed_data_path}/data-quality/",
    dataset_format=DatasetFormat.csv(
        header=True,
        output_columns_position="START"
    ),
    output_s3_uri=data_check_output_path,
)

Where processed_data_path is an S3 URI.

Hi @rypoll, have you tried Join from sagemaker.workflow.functions?

@jerrypeng7773
Contributor

jerrypeng7773 commented May 26, 2022

Thanks @jessieweiyi. Yes, Python built-in functions cannot be applied to pipeline variables.

Instead, do this:

from sagemaker.workflow.functions import Join

data_quality_check_config = DataQualityCheckConfig(
    baseline_dataset=Join(on="/", values=[processed_data_path, "data-quality"]),
    dataset_format=DatasetFormat.csv(
        header=True,
        output_columns_position="START"
    ),
    output_s3_uri=data_check_output_path,
)
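
Unlike an f-string, a Join is carried into the pipeline definition and resolved at execution time. A rough sketch of what its expression looks like, assuming processed_data_path is a ParameterString (the exact serialized form may differ slightly between SDK versions):

from sagemaker.workflow.functions import Join
from sagemaker.workflow.parameters import ParameterString

processed_data_path = ParameterString(name="ProcessedDataPath")
baseline = Join(on="/", values=[processed_data_path, "data-quality"])

print(baseline.expr)
# {'Std:Join': {'On': '/', 'Values': [{'Get': 'Parameters.ProcessedDataPath'}, 'data-quality']}}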

@jerrypeng7773
Contributor

@rohangpatil, PR 3111 should fix this issue. Can you try it out and let us know?

@mouhannadali

Hi @jessieweiyi, I have the same problem. I am doing the following:

input_path = "s3://bucket/key/"
inputs = TrainingInput(s3_data=input_path)

step_train = TrainingStep(
    name="MyTrainingStep",
    estimator=estimator,
    inputs=inputs,
)

pipeline_name = f"TF2Workflow"

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
                training_instance_type, 
                training_instance_count,
               ],
    steps=[step_train],
    sagemaker_session=sess
)

Running pipeline.definition() results in the error:
Pipeline variables do not support __str__ operation. Please use .to_string() to convert it to string type in execution time or use .expr to translate it to Json for display purpose in Python SDK.

Any idea?
