ClientError when executing SM Pipeline with multiple Spark processing steps #3477
Comments
Looks like the error is not really about the multiple jobs having the same input name. @jpcpereira Downgrading the version of sagemaker to `2.117.0` might fix the issue.
Hi @wsykala, Ok, that might be the case indeed, but then I would think that the fix that we implemented would not make it work:

```python
processing_step_args = spark_processor.run(
    submit_app=f"scripts/{submit_app}",
    job_name=spark_processor.base_job_name,
    arguments=[str(i) for i in spark_arguments],
    inputs=[
        ProcessingInput(
            source="scripts",
            input_name="scripts",
            destination="/opt/ml/processing/input/code/scripts",
        ),
        ProcessingInput(
            source=spark_config_path,
            input_name=f"conf-{step_name}",
            destination="/opt/ml/processing/input/conf",
        ),
    ],
    # configuration=spark_config,
)
```

This runs successfully with SageMaker 2.117.0.
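For completeness, a minimal sketch of how the file behind `spark_config_path` could be produced; the file name, directory, and Spark properties below are assumptions for the example, not values from the original thread:

```python
import json
import os

# Hypothetical Spark configuration in the list-of-classifications format
# accepted by SageMaker Spark processors.
spark_config = [
    {
        "Classification": "spark-defaults",
        "Properties": {"spark.executor.memory": "2g"},  # illustrative property
    }
]

# Write the config locally; the ProcessingInput above uploads it so the
# container sees it under /opt/ml/processing/input/conf.
os.makedirs("conf", exist_ok=True)
spark_config_path = "conf/configuration.json"
with open(spark_config_path, "w") as f:
    json.dump(spark_config, f)
```

Because the input name is derived per step (`conf-{step_name}`), each job gets a unique input name and the collision described below is avoided.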
@jpcpereira I reproduced your example - created a dummy pipeline with multiple Spark steps - and it ran without the error. My guess is that somewhere in your code you are accessing the step arguments more than once, which adds a duplicate `conf` input.
Hi @wsykala, Ok I see, so we just really need to pass the Spark config directly as a `ProcessingInput`. Thanks!
Hi @wsykala and @jpcpereira, thanks for the info in this thread. I will leave a little more context here for this issue: it is due to idempotency problems with certain Processor subclasses, and the root issue is being tracked here. The configuration input is another example of this. In the meantime, as you stated, removing the `configuration` argument and passing the Spark config as a regular `ProcessingInput` is a valid workaround.
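A quick way to check whether a generated pipeline definition is affected is to scan it for reused input names. This is a hypothetical diagnostic, assuming `pipeline` is the `sagemaker.workflow.pipeline.Pipeline` object being built:

```python
import json
from collections import Counter

# Parse the definition that would be sent to SageMaker and collect every
# ProcessingInput name, per step.
definition = json.loads(pipeline.definition())
all_names = []
for step in definition["Steps"]:
    for proc_input in step.get("Arguments", {}).get("ProcessingInputs", []):
        all_names.append((step["Name"], proc_input["InputName"]))

# Flag any input name that appears more than once across the pipeline,
# e.g. the "conf" input added for each Spark configuration.
counts = Counter(name for _, name in all_names)
for step_name, name in all_names:
    if counts[name] > 1:
        print(f"Step {step_name}: input name '{name}' is reused")
```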
Hi @wsykala and @jpcpereira, a new sagemaker package version containing the fix has been released. Please give it a try.
Hi @brockwade633 & @wsykala, we have tested running the SageMaker Pipeline with the Processing jobs' definition that was failing before and confirmed that the fix works. We are now able to pass the Spark configuration without issues.
Describe the bug
When executing a SageMaker Pipeline that includes several PySpark Processing steps, a `ClientError` occurs at the start of the pipeline execution. This seems to be due to the fact that the Spark settings passed to the `PySparkProcessor` via the `configuration` argument are handled by the SageMaker Python SDK as a `ProcessingInput`, which uses `conf` as the input name for each and every `ProcessingInput` created for each of the Spark Processing jobs. This leads to the `ClientError` below, as the input names have to be globally unique.
To reproduce
Run a `PySparkProcessor` making use of the `configuration` argument to pass Spark configurations such as the following:
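An illustrative sketch of such a setup (the role ARN, session, instance settings, script path, and Spark properties are assumptions for the example, not values recovered from the original report):

```python
from sagemaker.spark.processing import PySparkProcessor
from sagemaker.workflow.pipeline_context import PipelineSession

pipeline_session = PipelineSession()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder ARN

# Illustrative Spark configuration; classifications and properties are
# assumptions, not the reporter's original values.
spark_config = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.executor.memory": "4g",
            "spark.executor.cores": "2",
        },
    }
]

spark_processor = PySparkProcessor(
    base_job_name="spark-step",
    framework_version="3.1",
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    sagemaker_session=pipeline_session,
)

# Passing the config via `configuration` makes the SDK attach a
# ProcessingInput named "conf" to the step behind the scenes.
step_args = spark_processor.run(
    submit_app="scripts/job.py",
    configuration=spark_config,
)
```

Building two or more `ProcessingStep`s this way and executing the pipeline triggers the error below.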
Expected behavior
SageMaker Pipeline with multiple Spark Processing steps executes successfully.
Screenshots or logs
The following `ClientError` is displayed in the SageMaker Studio UI: the `ProcessingInput` created by SageMaker in multiple jobs always has the same input name (`conf`).
System information
A description of your system.
Additional context
Let me know if you need further information about the reported behavior.