
Restore SKLearn FrameworkProcessor via _normalize_args #2633


Closed
wants to merge 109 commits into from

Conversation

athewsey
Collaborator

@athewsey athewsey commented Sep 14, 2021

Issue #, if available: #2656

Description of changes:

An alternative proposal to #2564 for restoring FrameworkProcessor to SKLearnProcessor and fixing how FrameworkProcessor works with SageMaker Pipelines (which currently seems broken).

  • Standardize on existing Processor._normalize_args() (as used by ProcessingStep) as the interface point to handle user and extra code inputs
    • Integrate FrameworkProcessor's handling of user code and framework code upload into this function, instead of hooking in at run() or get_run_args()
    • Support plain (non-tarballed and no requirements.txt) source_dir functionality even in ScriptProcessor
  • Migrate FrameworkProcessor entrypoint from /bin/bash to POSIX-compliant /bin/sh, because some lightweight Linux distributions optimized for containers (e.g. Alpine) may not install bash by default: This should be a more flexible option for bringing custom containers with FrameworkProcessor
  • Make some minor improvements to the documentation for FrameworkProcessor: Explicitly mentioning source_dir on the Processing user guide page, and ensuring *Processor classes get listed in framework API directories.
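The source_dir flow these changes build on can be illustrated with a small self-contained sketch — `bundle_source_dir`, the file names, and paths here are hypothetical stand-ins, not the SDK's actual helpers:

```python
import os
import tarfile
import tempfile

def bundle_source_dir(source_dir: str, out_path: str) -> list:
    """Bundle a user code folder into a sourcedir.tar.gz-style archive.

    Hypothetical helper mirroring how user code (entry script plus
    requirements.txt) is tarballed as a code input to the job.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        for name in sorted(os.listdir(source_dir)):
            tar.add(os.path.join(source_dir, name), arcname=name)
    # Re-open read-only to report what the job container will see.
    with tarfile.open(out_path, "r:gz") as tar:
        return sorted(tar.getnames())

# Example: a folder containing an entry script and its dependencies file.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for fname in ("preprocess.py", "requirements.txt"):
        with open(os.path.join(src, fname), "w") as f:
            f.write("# placeholder\n")
    members = bundle_source_dir(src, os.path.join(dst, "sourcedir.tar.gz"))

print(members)  # ['preprocess.py', 'requirements.txt']
```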

Additional context & design:

  • feature: processors that support multiple Python files, requirements.txt, and dependencies. #2251 added support for Processing Jobs to consume folders of code (instead of single files) and install user code dependencies at runtime (via requirements.txt in the folder) similarly to what users are used to with Framework-based training jobs.
    • User's submitted code/folder is bundled to a tarball on S3 as a code input to the processing job
    • A small shell script is uploaded to S3 as an entrypoint input and overrides the entrypoint of the job: The shell script extracts the bundle and installs requirements.txt, then kicks off the user's actual requested entrypoint/script.
  • ...But it seemed to introduce bugs in existing SM Pipelines where processing jobs would try to execute Python scripts as shell scripts (e.g. import: command not found) - so specifically SKLearnProcessor was rolled back to its old implementation by fix: revert #2251 changes for sklearn processor #2560.
  • Seeking to fix this symptom, Revert #2560 #2564 has been in review for a while to restore SKLearnProcessor's framework functionality by changing FrameworkProcessor to use a python entrypoint instead of shell script.
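The runner-script mechanism described above could be sketched as follows; the generator function, paths, and exact commands are illustrative assumptions, not the SDK's actual generated script:

```python
def generate_runner_script(entrypoint_command: str, user_script: str) -> str:
    """Build a POSIX /bin/sh runner script (simplified sketch).

    Mirrors the mechanism described above: extract the code bundle,
    install requirements.txt if present, then exec the user's script.
    Paths and commands are illustrative, not the SDK's real values.
    """
    return "\n".join([
        "#!/bin/sh",
        "set -e",  # abort on any failure
        "cd /opt/ml/processing/input/code/",
        "tar -xzf sourcedir.tar.gz",
        # Install user dependencies only when the bundle ships them:
        "if [ -f requirements.txt ]; then pip install -r requirements.txt; fi",
        f"exec {entrypoint_command} {user_script}",
        "",
    ])

script = generate_runner_script("python3", "preprocess.py")
print(script.splitlines()[0])  # '#!/bin/sh'
```

Note the `#!/bin/sh` shebang rather than `/bin/bash`, matching the portability change this PR makes.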

From my investigations, it seems like #2564 proposes to patch a symptom of the issue rather than the root cause: I've not seen any reports that the entrypoint script being shell-based is an issue per se (e.g. no complaints about custom containers not having /bin/bash installed) - the problem is the inconsistency of trying to use the command for the (shell-script) framework entrypoint against the file for the user's (Python) script.

In fact, switching to a Python-based framework entrypoint script would:

  • Risk masking the issue by making the user's code run OK but the requirements.txt be ignored
  • Introduce a hard dependency on Python in the target container that could make life harder for any users wanting to derive new FrameworkProcessors for other languages e.g. R, Julia, etc.

In this alternative solution, we integrate FrameworkProcessor more closely with ScriptProcessor by overriding parent private methods (including _normalize_args, _include_code_in_inputs, etc) rather than adding extra new methods hooked into specific events as before.

This allows FrameworkProcessor to work correctly with SM Pipelines without changing the API between them, with the trade-off that several closely coupled overrides are needed with some code duplication: Every new parameter a child class adds to run() (e.g. git_config, dependencies, etc) needs to be pushed through a few internal methods where the only difference is the number of args transmitted.
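A stripped-down sketch of this override pattern, using a minimal stand-in base class (everything here except the `_normalize_args` name is hypothetical, and the actual S3 upload is elided):

```python
class Processor:
    """Minimal stand-in for the SDK base class (illustrative only)."""

    def _normalize_args(self, inputs=None, code=None, **kwargs):
        inputs = list(inputs or [])
        if code is not None:
            inputs.append({"InputName": "code", "Source": code})
        return {"inputs": inputs, **kwargs}


class FrameworkProcessor(Processor):
    """Hooks code bundling into _normalize_args instead of run()."""

    def _normalize_args(self, inputs=None, code=None, source_dir=None, **kwargs):
        if source_dir is not None:
            # Bundle source_dir to a tarball and add the sh runner as an
            # extra entrypoint input (upload itself elided in this sketch).
            inputs = list(inputs or [])
            inputs.append({"InputName": "entrypoint", "Source": "runproc.sh"})
            code = f"{source_dir}/sourcedir.tar.gz"
        return super()._normalize_args(inputs=inputs, code=code, **kwargs)


args = FrameworkProcessor()._normalize_args(code="main.py", source_dir="src")
print([i["InputName"] for i in args["inputs"]])  # ['entrypoint', 'code']
```

Because ProcessingStep already calls `_normalize_args`, pipelines pick up the framework behaviour without any API change — but each new `run()` parameter must be threaded through every override in the chain, which is the duplication trade-off noted above.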

Alternatives I think could include:

  • Creating a new standard API on Processor similar to _normalize_args for external classes needing to generate the actual job arguments; having existing Processor/ScriptProcessor pass this through to _normalize_args, and having FrameworkProcessor add in the additional required hook for preparing code bundle.
  • Refactoring Processor.get_run_args() to be this API - but I think this would be a breaking change
  • A more significant refactor of the hierarchy of internal methods in *Processor, which could accidentally risk breaking changes, or reveal that not much could be improved without them

Testing done:

Some minor additions to test coverage (including checking ScriptProcessor with a folder works), but mostly captured under the scope of existing tests - just tweaked arguments for some diversity between providing a script or a full folder + requirements.txt to SKLearnProcessor.

Verified the fix against the pipelines sample notebook that was known to be broken before.

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the CONTRIBUTING doc
  • I used the commit message format described in CONTRIBUTING
  • I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • I have updated any necessary documentation, including READMEs and API docs (if appropriate)
    • Improved presentation of framework processing classes in the API doc and SageMaker Processing user guide.

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
    • Tweaked existing test configurations to try and get good coverage of possible axes
    • Added a unit test to specifically validate SKLearnProcessor (and by extension FrameworkProcessor) behaves as expected by pipelines ProcessingStep
  • I have checked that my tests are not configured for a specific region or account (if appropriate)
  • I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

verdimrc and others added 30 commits March 29, 2021 14:25
This is to conform to the existing style adopted in the sagemaker
python sdk.
…nt=1

This fixes specific bugs with SKLearnEstimator, which raises an
exception on instance_count > 1.
Add comment for the SKLearnProcessor multiple instances fix and
remove the now-unused function parameter.

Merge remote-tracking branch 'verdi/pr-framework-processor-round-02' into feat/fw-processor
Swap mandatory 's3_prefix' param in FrameworkProcessors for optional
'code_location' param, in keeping with Framework estimators. Separate
out the 'code' from 'entrypoint' processing inputs to avoid having
one input channel inside another in the container (input/payload).
Update existing unit & integration tests to work with the new code.
Swap s3_prefix for code_location and separate code/entrypoint
R1720: Unnecessary "else" after "raise" (no-else-raise)
FrameworkProcessor now accepts a 'command' param like the
ScriptProcessor class did previously, and tries to maintain a similar
API by forwarding it to the parent class, while patching the parent's
container run command generation to override it. Also adds initial
unit tests for multi-instance and overridden commands.
Restore 'command' for FrameworkProcessors
@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 0fd7f1d
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Ensure Processor classes are listed in framework API ToCs, and add
a brief introduction to usage in the SageMaker Processing guide.
@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 5fc171b
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


Add a unit test to check SKLearn FrameworkProcessor behaves as
ProcessingStep expects to avoid previous regression on
SKLearnProcessor breaking SageMaker Pipelines examples.
Some lightweight Linux distributions oriented towards containers
(e.g. Alpine) might not include bash by default: The POSIX-compliant
/bin/sh shell is more portable and a less stringent requirement on
custom containers to work with FrameworkProcessor. Also expanded
scope of tests.
@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 2903f39
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


@athewsey athewsey changed the title [DRAFT] Restore SKLearn FrameworkProcessor via _normalize_args Restore SKLearn FrameworkProcessor via _normalize_args Sep 21, 2021
@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: d326c9f
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


@verdimrc
Contributor

verdimrc commented Sep 22, 2021

A note on alternative (1): although this PR removes FrameworkProcessor.get_run_args() (as it's unused), in practice get_run_args() could simply be an alias to _normalize_args(). Right now, only ProcessingStep directly uses the private method, so re-introducing the public API would only require changing that sole caller.
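Under that suggestion, the alias could be as small as the following sketch (the simplified `Processor` stand-in and its parameters are illustrative, not the SDK's real signatures):

```python
class Processor:
    """Minimal stand-in for the SDK base class (illustrative only)."""

    def _normalize_args(self, inputs=None, outputs=None, code=None):
        return {"inputs": inputs or [], "outputs": outputs or [], "code": code}

    def get_run_args(self, inputs=None, outputs=None, code=None):
        """Public alias so callers like ProcessingStep need not touch
        the private method."""
        return self._normalize_args(inputs=inputs, outputs=outputs, code=code)

run_args = Processor().get_run_args(code="evaluate.py")
print(run_args["code"])  # 'evaluate.py'
```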

@verdimrc verdimrc mentioned this pull request Sep 22, 2021
Maintain API compatibility with the base Processor class
@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 0c07814
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


@athewsey athewsey marked this pull request as draft September 29, 2021 05:15
@athewsey
Collaborator Author

athewsey commented Oct 1, 2021

(Pushed to draft in favour of #2664)

@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 00d86ba
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


Member

@ahsan-z-khan ahsan-z-khan left a comment


@athewsey
Is this related to #2664 ? Are these dependent on each other? OR can we close this?

@vikas0203 vikas0203 requested review from a team, trajanikant and claytonparnell and removed request for a team November 23, 2022 21:39
@trajanikant
Contributor

@athewsey
This has been open since 2021. Kindly close this or share an update related to this.

@mufaddal-rohawala
Member

Closing as this has been in draft state for too long! Please reopen if the change is still relevant.
