"FastFile" for Processing Job Input #3962

Closed
lorenzwalthert opened this issue Jun 27, 2023 · 11 comments
Labels
component: pipelines Relates to the SageMaker Pipeline Platform type: feature request

Comments

@lorenzwalthert
lorenzwalthert commented Jun 27, 2023

Describe the feature you'd like

I'd like "FastFile" to be an available option for s3_input_mode in sagemaker.processing.ProcessingInput, in addition to "File" and "Pipe". This input mode has been available for TrainingInput since 2021 and greatly improves speed (-82%) according to an AWS blog post.

How would this feature be used? Please describe.

To speed up processing jobs compared to downloading all data, and to allow complex filtering of files before accessing them.

Describe alternatives you've considered

Other methods like

  • Downloading relevant files as part of the training job with sagemaker.s3.S3Downloader(). Problem: I can't shard by S3 key and have to build my own sharding logic.
  • Using S3Prefix as the s3_data_type in sagemaker.processing.ProcessingInput to filter by prefix. Problem: Some data can't be easily filtered by prefix and needs more complex pattern matching.
  • Using a ManifestFile.
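
A minimal sketch of the ManifestFile route, assuming the manifest layout is a JSON list whose first entry is a `{"prefix": ...}` object followed by keys relative to that prefix. The helper name `build_manifest`, the bucket name, and the sample keys are hypothetical and only illustrate the regex-based filtering that a plain prefix can't express:

```python
import json
import re


def build_manifest(bucket, keys, pattern):
    """Return manifest JSON text listing only the keys that match `pattern`.

    `bucket`, `keys`, and `pattern` are illustrative inputs; in practice the
    keys would come from listing the bucket.
    """
    regex = re.compile(pattern)
    selected = [key for key in keys if regex.search(key)]
    # The first entry carries the common prefix; the remaining entries are
    # object keys relative to that prefix.
    manifest = [{"prefix": f"s3://{bucket}/"}] + selected
    return json.dumps(manifest)


# Hypothetical object keys; keep only the CSV files under raw/.
keys = ["raw/2023/a.csv", "raw/2023/b.json", "tmp/c.csv"]
manifest_text = build_manifest("my-bucket", keys, r"^raw/.*\.csv$")
print(manifest_text)
```

The resulting text would then be uploaded to S3 and its URI passed as the ProcessingInput source with s3_data_type set to "ManifestFile".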

Additional context

I know this is not an SDK topic as long as the underlying APIs don't provide that functionality, but I don't know where else to file the feature request.

@trungleduc trungleduc added type: feature request component: pipelines Relates to the SageMaker Pipeline Platform labels Sep 21, 2023
@lorenzwalthert
Author

I think this is actually low prio, since I was able to use ManifestFile to achieve my goal as well.

@martinRenou
Collaborator

As far as I can see, there is no validation in the code that would prevent using "FastFile" instead of "File" for s3_input_mode. Have you seen an error when trying?

The only thing that seems to be missing is a mention of "FastFile" in the docstring (e.g. https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/processing.py#L1245). I will open a PR for that.

@martinRenou
Collaborator

Closing as answered, and the docstring was fixed. Feel free to reopen if you think the issue still stands.

@lorenzwalthert
Author

lorenzwalthert commented Dec 8, 2023

Thanks @martinRenou.

@julestalloen

It seems this doesn't actually work as the underlying API does not support it:

S3InputMode
Whether to use File or Pipe input mode. In File mode, Amazon SageMaker copies the data from the input source onto the local ML storage volume before starting your processing container. This is the most commonly used input mode. In Pipe mode, Amazon SageMaker streams input data from the source directly to your processing container into named pipes without using the ML storage volume.

Type: String
Valid Values: Pipe | File
Required: No

https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingS3Input.html

I also get the following error when trying to use FastFile for my ProcessingInput:

botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreatePipeline operation: Unable to parse pipeline definition. Model Validation failed: Value 'FastFile' for 'ProcessingS3InputMode' failed to satisfy enum value set: [Pipe, File]
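
Until the API accepts it, one way to surface this earlier is a client-side check before the pipeline or job is created. This is only a sketch: the tuple below hardcodes the two values from the API reference quoted above, and `check_processing_input_mode` is a hypothetical helper, not part of the SageMaker Python SDK:

```python
# Values listed as valid in the ProcessingS3Input API reference quoted above.
# Hardcoded assumption: update this if the service adds new modes.
PROCESSING_S3_INPUT_MODES = ("Pipe", "File")


def check_processing_input_mode(mode):
    """Return `mode` unchanged, or raise ValueError if the API would reject it."""
    if mode not in PROCESSING_S3_INPUT_MODES:
        allowed = ", ".join(PROCESSING_S3_INPUT_MODES)
        raise ValueError(
            f"s3_input_mode={mode!r} is not accepted for processing inputs; "
            f"expected one of: {allowed}"
        )
    return mode


try:
    check_processing_input_mode("FastFile")
except ValueError as err:
    print(err)
```

Calling it on the value intended for s3_input_mode fails locally with a readable message instead of a ValidationException from the service.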

@lorenzwalthert
Author

lorenzwalthert commented Dec 19, 2023

To clarify, I did not try the suggested solution by Martin.

@julestalloen

@lorenzwalthert Thanks for the clarification. Could you reopen this issue? If not, I'll create a new one.

@lorenzwalthert
Author

It seems I can't reopen it.

@martinRenou martinRenou reopened this Dec 22, 2023
@ShwetaSingh801
Collaborator

Hi @lorenzwalthert,

Thanks for using SageMaker and taking the time to suggest ways to improve the SageMaker Python SDK. We have added your feature request to our backlog and may consider including it in a future SDK version. I will go ahead and close the issue now; please let me know if you have any more feedback or questions.

Best,
Shweta

@jbschiratti

jbschiratti commented Apr 17, 2024

@ShwetaSingh801 I recently had the same error:

ClientError: An error occurred (ValidationException) when calling the CreateProcessingJob operation: 1 validation error detected: Value 'FastFile' at 'processingInputs.1.member.s3Input.s3InputMode' failed to satisfy constraint: Member must satisfy enum value set: [Pipe, File]

I can provide a MWE if needed (using sagemaker==2.215.0). The ProcessingInput doc is misleading as it states that one could instantiate a ProcessingInput with s3_input_mode='FastFile'.

@martinRenou Can you please comment on this?

@jbschiratti

Also tagging @mohanasudhan and @akrishna1995 on this as they approved #4311 (without adding/updating tests?!).

6 participants