Accessing argument property in ProcessingStep adds new code ProcessingInput every time #3484

Closed
wsykala opened this issue Nov 23, 2022 · 7 comments
Labels: component: pipelines (Relates to the SageMaker Pipeline Platform), type: bug


wsykala commented Nov 23, 2022

Describe the bug

Every time you access the arguments property of a ProcessingStep, a new code input is added to the ProcessingInputs. Additionally, each time the property is accessed, a new sourcedir.tar.gz file is uploaded to an S3 bucket.

If you remove the custom inputs argument from the example below, the code input is not added, although the sourcedir.tar.gz file is still uploaded to S3.

Because input and output names must be unique for a job, this behaviour can result in the error below:

ClientError: Failed to invoke sagemaker:CreateProcessingJob. Error Details: Input and Output names must be globally unique: [code]

To reproduce

from sagemaker.sklearn import SKLearn
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.processing import (
    FrameworkProcessor,
    ProcessingInput,
    ProcessingOutput,
)
from sagemaker.workflow.pipeline_context import PipelineSession


bucket = ''   # Pass your bucket
code_path = f's3://{bucket}/code'
session = PipelineSession(default_bucket=bucket)

proc = FrameworkProcessor(
    estimator_cls=SKLearn,
    sagemaker_session=session,
    framework_version='1.0-1',
    role=session.get_caller_identity_arn(),
    instance_type='ml.m5.xlarge',
    instance_count=1,
    code_location=code_path,
)
step_args = proc.run(
    code='split.py',
    inputs=[
        ProcessingInput(
            source=f's3://{bucket}',
            destination='/opt/ml/processing/input',
        ),
    ],
    source_dir='proc/'
)
proc_step = ProcessingStep(
    name='Ingest',
    step_args=step_args
)

def _print_inputs(args):
    d = {}
    for arg in args:
        name = arg['InputName']
        d[name] = d.get(name, 0) + 1
    print(d)

_print_inputs(proc_step.arguments['ProcessingInputs'])
_print_inputs(proc_step.arguments['ProcessingInputs'])
_print_inputs(proc_step.arguments['ProcessingInputs'])

Expected behavior
The number of inputs should stay at 3, and the code should be uploaded to S3 only once. This was the case in previous versions of sagemaker (<=2.115.0).

# Output trimmed for readability (print statements from sagemaker are not shown)
Job Name:  <your-job-name>
{'input-1': 1, 'code': 1, 'entrypoint': 1}
{'input-1': 1, 'code': 1, 'entrypoint': 1}
{'input-1': 1, 'code': 1, 'entrypoint': 1}

Screenshots or logs

Instead, the number of code inputs increases by one every time the arguments property is accessed. Moreover, the code is uploaded to S3 on every access.

# Output trimmed for readability (print statements from sagemaker are not shown)
Job Name:  <your-job-name-1>
{'input-1': 1, 'code': 1, 'entrypoint': 1}
Job Name:  <your-job-name-2>
{'input-1': 1, 'code': 2, 'entrypoint': 1}
Job Name:  <your-job-name-3>
{'input-1': 1, 'code': 3, 'entrypoint': 1}
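
The duplicated name is what triggers the "Input and Output names must be globally unique" error, so it can help to check for repeats before submitting anything. Below is a minimal sketch of such a check. The helper name `duplicate_input_names` is hypothetical (it is not part of the sagemaker SDK), and the sample list is hand-written to mirror the third output above rather than taken from a real run.

```python
from collections import Counter


def duplicate_input_names(processing_inputs):
    """Return {name: count} for every InputName that appears more than once."""
    counts = Counter(item["InputName"] for item in processing_inputs)
    return {name: n for name, n in counts.items() if n > 1}


# Hand-written sample shaped like the third output above (not real SDK output).
sample = [
    {"InputName": "input-1"},
    {"InputName": "code"},
    {"InputName": "code"},
    {"InputName": "code"},
    {"InputName": "entrypoint"},
]

print(duplicate_input_names(sample))
```

With the reproduction script, the same helper could be called on `proc_step.arguments['ProcessingInputs']`, keeping in mind that on the affected versions each such access itself adds another code input.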

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.116.0, 2.117.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): SKLearn (N/A - it looks like an error in ProcessingStep class.)
  • Framework version: 1.0-1
  • Python version: 3.7.10
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

Additional context
This may be the same bug as #3477.
It does not happen in versions lower than 2.116.0, which suggests that the behaviour was introduced with the addition of the context object in PipelineSession.

Contributor

brockwade633 commented Nov 28, 2022

Hi @wsykala, thanks for the information. This behavior is the result of an idempotency problem involving a few ProcessingStep subclasses. The issue is being tracked here, where there are more details, including workarounds and a link to the PR that fixes the issue. I will keep this issue updated with the latest info.
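
For readers unfamiliar with the term, an idempotency problem of this kind can be illustrated with a toy model. The two classes below are hypothetical stand-ins, not the SDK's code, and the thread does not show the actual cause or fix inside sagemaker. The sketch only contrasts a property that mutates stored state on every read with one that builds its result from a copy.

```python
class LeakyStep:
    """Toy stand-in: reading `arguments` mutates the stored input list."""

    def __init__(self, inputs):
        self._inputs = list(inputs)

    @property
    def arguments(self):
        # Side effect: the stored list grows on every access.
        self._inputs.append({"InputName": "code"})
        return {"ProcessingInputs": list(self._inputs)}


class IdempotentStep:
    """Toy stand-in: reading `arguments` leaves the stored state untouched."""

    def __init__(self, inputs):
        self._inputs = list(inputs)

    @property
    def arguments(self):
        # Work on a copy so repeated reads give the same result.
        inputs = list(self._inputs)
        inputs.append({"InputName": "code"})
        return {"ProcessingInputs": inputs}


leaky = LeakyStep([{"InputName": "input-1"}])
stable = IdempotentStep([{"InputName": "input-1"}])

leaky_sizes = [len(leaky.arguments["ProcessingInputs"]) for _ in range(3)]
stable_sizes = [len(stable.arguments["ProcessingInputs"]) for _ in range(3)]

print(leaky_sizes)   # grows with each read
print(stable_sizes)  # stays constant
```

The first list grows by one per read, which matches the symptom reported in this issue; the second stays constant, which is the behaviour the reporter expected.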

@brockwade633 brockwade633 added the component: pipelines Relates to the SageMaker Pipeline Platform label Nov 28, 2022
@brockwade633 brockwade633 self-assigned this Nov 28, 2022
Author

wsykala commented Dec 1, 2022

Hello @brockwade633, thanks for the reply. I didn't realize that there was already an issue open for that bug, which is why I created a new one. It's good to see that a fix is already in the pipeline.

@shenshaoyong

I ran into the same issue. Is there a roadmap for resolving it? Thank you.

@brockwade633
Contributor

Hi @shenshaoyong, thank you for your patience. I will update this thread as soon as I have more information to share. In the meantime, there are a couple of workarounds discussed in the other issue mentioned above.

@brockwade633
Contributor

Hi @wsykala and @shenshaoyong, a new sagemaker package version (2.120.0) is available that includes the fix for the idempotency issues. Are you able to confirm that you are no longer seeing the duplicate input behavior? Thanks.
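
A quick way to tell whether an environment falls in the affected range is to compare the installed version against the numbers mentioned in this thread. The sketch below is only that: `parse_version` and `is_affected` are hypothetical helpers, the lower bound (2.116.0) comes from the original report and the upper bound (2.120.0) from the fix announcement above, and the status of 2.118.x and 2.119.x is inferred rather than stated here. It also assumes a plain numeric `X.Y.Z` version string.

```python
def parse_version(text):
    """Turn 'X.Y.Z' into a tuple of ints (assumes a plain numeric release string)."""
    return tuple(int(part) for part in text.split(".")[:3])


def is_affected(version_text):
    """True if the version lies in the range this thread reports as buggy."""
    first_bad = (2, 116, 0)   # first version reported with the bug
    first_good = (2, 120, 0)  # version announced as containing the fix
    return first_bad <= parse_version(version_text) < first_good


for candidate in ("2.115.0", "2.116.0", "2.117.0", "2.120.0"):
    print(candidate, is_affected(candidate))
```

In a real environment the string could come from `sagemaker.__version__`, for example `is_affected(sagemaker.__version__)`.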

@jerrypeng7773
Contributor

Closing this issue; feel free to re-open.

Author

wsykala commented Dec 20, 2022

> Hi @wsykala and @shenshaoyong, a new sagemaker package version (2.120.0) is available that includes the fix for the idempotency issues. Are you able to confirm that you are no longer seeing the duplicate input behavior? Thanks.

Hey @brockwade633. Sorry for the delay in answering. I confirm that the new version fixes this issue. Thanks for taking care of it so quickly.
@jerrypeng7773 Thanks for closing the issue!
