
Restore SKLearn FrameworkProcessor via _normalize_args #2633


Closed
wants to merge 109 commits into from

Conversation

athewsey
Collaborator

@athewsey athewsey commented Sep 14, 2021

Issue #, if available: #2656

Description of changes:

An alternative proposal to #2564 for restoring FrameworkProcessor to SKLearnProcessor and fixing how FrameworkProcessor works with SageMaker Pipelines (which currently seems broken).

  • Standardize on existing Processor._normalize_args() (as used by ProcessingStep) as the interface point to handle user and extra code inputs
    • Integrate FrameworkProcessor's handling of user code and framework code upload into this function, instead of hooking in at run() or get_run_args()
    • Support plain (non-tarballed and no requirements.txt) source_dir functionality even in ScriptProcessor
  • Migrate FrameworkProcessor entrypoint from /bin/bash to POSIX-compliant /bin/sh, because some lightweight Linux distributions optimized for containers (e.g. Alpine) may not install bash by default: This should be a more flexible option for bringing custom containers with FrameworkProcessor
  • Make some minor improvements to the documentation for FrameworkProcessor: Explicitly mentioning source_dir on the Processing user guide page, and ensuring *Processor classes get listed in framework API directories.
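The source_dir flow these changes build on can be illustrated with a small self-contained sketch — `bundle_source_dir`, the file names, and paths here are hypothetical stand-ins, not the SDK's actual helpers:

```python
import os
import tarfile
import tempfile

def bundle_source_dir(source_dir: str, out_path: str) -> list:
    """Bundle a user code folder into a sourcedir.tar.gz-style archive.

    Hypothetical helper mirroring how user code (entry script plus
    requirements.txt) is tarballed as a code input to the job.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        for name in sorted(os.listdir(source_dir)):
            tar.add(os.path.join(source_dir, name), arcname=name)
    # Re-open read-only to report what the job container will see.
    with tarfile.open(out_path, "r:gz") as tar:
        return sorted(tar.getnames())

# Example: a folder containing an entry script and its dependencies file.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for fname in ("preprocess.py", "requirements.txt"):
        with open(os.path.join(src, fname), "w") as f:
            f.write("# placeholder\n")
    members = bundle_source_dir(src, os.path.join(dst, "sourcedir.tar.gz"))

print(members)  # ['preprocess.py', 'requirements.txt']
```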

Additional context & design:

  • feature: processors that support multiple Python files, requirements.txt, and dependencies. #2251 added support for Processing Jobs to consume folders of code (instead of single files) and install user code dependencies at runtime (via requirements.txt in the folder) similarly to what users are used to with Framework-based training jobs.
    • User's submitted code/folder is bundled to a tarball on S3 as a code input to the processing job
    • A small shell script is uploaded to S3 as an entrypoint input and overrides the entrypoint of the job: The shell script extracts the bundle and installs requirements.txt, then kicks off the user's actual requested entrypoint/script.
  • ...But it seemed to introduce bugs in existing SM Pipelines where processing jobs would try to execute Python scripts as shell scripts (e.g. import: command not found) - so specifically SKLearnProcessor was rolled back to its old implementation by fix: revert #2251 changes for sklearn processor #2560.
  • Seeking to fix this symptom, Revert #2560 #2564 has been in review for a while to restore SKLearnProcessor's framework functionality by changing FrameworkProcessor to use a python entrypoint instead of shell script.
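The runner-script mechanism described above could be sketched as follows; the generator function, paths, and exact commands are illustrative assumptions, not the SDK's actual generated script:

```python
def generate_runner_script(entrypoint_command: str, user_script: str) -> str:
    """Build a POSIX /bin/sh runner script (simplified sketch).

    Mirrors the mechanism described above: extract the code bundle,
    install requirements.txt if present, then exec the user's script.
    Paths and commands are illustrative, not the SDK's real values.
    """
    return "\n".join([
        "#!/bin/sh",
        "set -e",  # abort on any failure
        "cd /opt/ml/processing/input/code/",
        "tar -xzf sourcedir.tar.gz",
        # Install user dependencies only when the bundle ships them:
        "if [ -f requirements.txt ]; then pip install -r requirements.txt; fi",
        f"exec {entrypoint_command} {user_script}",
        "",
    ])

script = generate_runner_script("python3", "preprocess.py")
print(script.splitlines()[0])  # '#!/bin/sh'
```

Note the `#!/bin/sh` shebang rather than `/bin/bash`, matching the portability change this PR makes.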

From my investigations, it seems like #2564 proposes to patch a symptom of the issue rather than the root cause: I've not seen any reports that the entrypoint script being shell-based is an issue per se (e.g. no complaints about custom containers not having /bin/bash installed) - the problem is the inconsistency of trying to use the command for the (shell-script) framework entrypoint against the file for the user's (Python) script.

In fact, switching to a Python-based framework entrypoint script would:

  • Risk masking the issue by making the user's code run OK but the requirements.txt be ignored
  • Introduce a hard dependency on Python in the target container that could make life harder for any users wanting to derive new FrameworkProcessors for other languages e.g. R, Julia, etc.

In this alternative solution, we integrate FrameworkProcessor more closely with ScriptProcessor by overriding parent private methods (including _normalize_args, _include_code_in_inputs, etc) rather than adding extra new methods hooked into specific events as before.

This allows FrameworkProcessor to work correctly with SM Pipelines without changing the API between them, with the trade-off that several closely coupled overrides are needed with some code duplication: Every new parameter a child class adds to run() (e.g. git_config, dependencies, etc) needs to be pushed through a few internal methods where the only difference is the number of args transmitted.
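A stripped-down sketch of this override pattern, using a minimal stand-in base class (everything here except the `_normalize_args` name is hypothetical, and the actual S3 upload is elided):

```python
class Processor:
    """Minimal stand-in for the SDK base class (illustrative only)."""

    def _normalize_args(self, inputs=None, code=None, **kwargs):
        inputs = list(inputs or [])
        if code is not None:
            inputs.append({"InputName": "code", "Source": code})
        return {"inputs": inputs, **kwargs}


class FrameworkProcessor(Processor):
    """Hooks code bundling into _normalize_args instead of run()."""

    def _normalize_args(self, inputs=None, code=None, source_dir=None, **kwargs):
        if source_dir is not None:
            # Bundle source_dir to a tarball and add the sh runner as an
            # extra entrypoint input (upload itself elided in this sketch).
            inputs = list(inputs or [])
            inputs.append({"InputName": "entrypoint", "Source": "runproc.sh"})
            code = f"{source_dir}/sourcedir.tar.gz"
        return super()._normalize_args(inputs=inputs, code=code, **kwargs)


args = FrameworkProcessor()._normalize_args(code="main.py", source_dir="src")
print([i["InputName"] for i in args["inputs"]])  # ['entrypoint', 'code']
```

Because ProcessingStep already calls `_normalize_args`, pipelines pick up the framework behaviour without any API change — but each new `run()` parameter must be threaded through every override in the chain, which is the duplication trade-off noted above.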

Alternatives I think could include:

  • Creating a new standard API on Processor similar to _normalize_args for external classes needing to generate the actual job arguments; having existing Processor/ScriptProcessor pass this through to _normalize_args, and having FrameworkProcessor add in the additional required hook for preparing code bundle.
  • Refactoring Processor.get_run_args() to be this API - but I think this would be a breaking change
  • A more significant refactor of the hierarchy of internal methods in *Processor, which could accidentally risk breaking changes, or reveal that not much could be improved without them

Testing done:

Some minor additions to test coverage (including checking ScriptProcessor with a folder works), but mostly captured under the scope of existing tests - just tweaked arguments for some diversity between providing a script or a full folder + requirements.txt to SKLearnProcessor.

Verified the fix against the pipelines sample notebook that was known to be broken before.

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the CONTRIBUTING doc
  • I used the commit message format described in CONTRIBUTING
  • I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • I have updated any necessary documentation, including READMEs and API docs (if appropriate)
    • Improved presentation of framework processing classes in the API doc and SageMaker Processing user guide.

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
    • Tweaked existing test configurations to try and get good coverage of possible axes
    • Added a unit test to specifically validate SKLearnProcessor (and by extension FrameworkProcessor) behaves as expected by pipelines ProcessingStep
  • I have checked that my tests are not configured for a specific region or account (if appropriate)
  • I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

verdimrc and others added 30 commits March 29, 2021 14:25
This is to conform to the existing style adopted in the sagemaker
python sdk.
…nt=1

This fixes specific bugs with SKLearnEstimator, which raises an
exception on instance_count > 1.
Add comment for the SKLearnProcessor multiple instances fix and
remove the now-unused function parameter.

Merge remote-tracking branch 'verdi/pr-framework-processor-round-02' into feat/fw-processor
Swap mandatory 's3_prefix' param in FrameworkProcessors for optional
'code_location' param, in keeping with Framework estimators. Separate
out the 'code' from 'entrypoint' processing inputs to avoid having
one input channel inside another in the container (input/payload).
Update existing unit & integration tests to work with the new code.
Swap s3_prefix for code_location and separate code/entrypoint
R1720: Unnecessary "else" after "raise" (no-else-raise)
FrameworkProcessor now accepts a 'command' param like the
ScriptProcessor class did previously, and tries to maintain a similar
API by forwarding it to the parent class, while patching the parent's
container run command generation to override it. Also adds initial
unit tests for multi-instance and overridden commands.
Restore 'command' for FrameworkProcessors
@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 0fd7f1d
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Ensure Processor classes are listed in framework API ToCs, and add
a brief introduction to usage in the SageMaker Processing guide.
@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 5fc171b
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


Add a unit test to check SKLearn FrameworkProcessor behaves as
ProcessingStep expects to avoid previous regression on
SKLearnProcessor breaking SageMaker Pipelines examples.
Some lightweight Linux distributions oriented towards containers
(e.g. Alpine) might not include bash by default: The POSIX-compliant
/bin/sh shell is more portable and a less stringent requirement on
custom containers to work with FrameworkProcessor. Also expanded
scope of tests.
@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 2903f39
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


@athewsey athewsey changed the title [DRAFT] Restore SKLearn FrameworkProcessor via _normalize_args Restore SKLearn FrameworkProcessor via _normalize_args Sep 21, 2021
@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: d326c9f
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


@verdimrc
Contributor

verdimrc commented Sep 22, 2021

A note on alternative (1): although this PR removes FrameworkProcessor.get_run_args() (as it's unused), in practice get_run_args() could simply be an alias to _normalize_args(). Right now, only ProcessingStep directly uses the private method, so re-introducing the public API would only require changing that sole caller.
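Under that suggestion, the alias could be as small as the following sketch (the simplified `Processor` stand-in and its parameters are illustrative, not the SDK's real signatures):

```python
class Processor:
    """Minimal stand-in for the SDK base class (illustrative only)."""

    def _normalize_args(self, inputs=None, outputs=None, code=None):
        return {"inputs": inputs or [], "outputs": outputs or [], "code": code}

    def get_run_args(self, inputs=None, outputs=None, code=None):
        """Public alias so callers like ProcessingStep need not touch
        the private method."""
        return self._normalize_args(inputs=inputs, outputs=outputs, code=code)

run_args = Processor().get_run_args(code="evaluate.py")
print(run_args["code"])  # 'evaluate.py'
```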

@verdimrc verdimrc mentioned this pull request Sep 22, 2021
Maintain API compatibility with the base Processor class
@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 0c07814
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


@athewsey athewsey marked this pull request as draft September 29, 2021 05:15
@athewsey
Collaborator Author

athewsey commented Oct 1, 2021

(Pushed to draft in favour of #2664)

@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 00d86ba
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


Member

@ahsan-z-khan ahsan-z-khan left a comment


@athewsey
Is this related to #2664 ? Are these dependent on each other? OR can we close this?

@vikas0203 vikas0203 requested review from a team, trajanikant and claytonparnell and removed request for a team November 23, 2022 21:39
@trajanikant
Contributor

@athewsey
This has been open since 2021. Kindly close this or share an update related to this.

@mufaddal-rohawala
Member

Closing as this has been in draft state for too long! Please reopen if the change is still relevant.
