
Intermittent failures on Processing job that depends on ProcessingOutput in Continuous mode #1339


Closed
ram-nadella opened this issue Mar 9, 2020 · 4 comments

Comments

@ram-nadella

Describe the bug
The processing job fails intermittently when using ProcessingOutput in Continuous mode, where the destination S3 path is used within the same job to load data into Redshift. We use sqlalchemy to run the COPY command on Redshift. Logs are below.

We have had successful runs of this processing job, but it sometimes fails with an error that an S3 prefix is not present. The same container image was used for all runs.

To reproduce
Create a processing job that uses ProcessingOutput with the output mode set to Continuous (instead of the default EndOfJob) and try to use the S3 file path within the same job. In our case, we load data from the file into Redshift.
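
Roughly, the setup looks like this (a minimal sketch; the role ARN, image URI, instance type, and bucket are placeholders for our real values):

```python
from sagemaker.processing import Processor, ProcessingOutput

# Placeholders: substitute a real execution role, custom image, and bucket.
etl_processor = Processor(
    role='<execution-role-arn>',
    image_uri='<account>.dkr.ecr.<region>.amazonaws.com/<custom-image>:latest',
    instance_count=1,
    instance_type='ml.m5.xlarge',
)

po = ProcessingOutput(
    source='/opt/ml/processing/output',      # local dir the container writes to
    destination='s3://sagemaker-<bucket>/',  # S3 prefix the output is synced to
    output_name='processing-output',
    s3_upload_mode='Continuous',             # instead of the default 'EndOfJob'
)

etl_processor.run(outputs=[po])
```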

Expected behavior
In Continuous mode, we expect the S3 path to be accessible and ready as soon as the file write to the local path completes.

Also, the exact behaviour of Continuous mode in ProcessingOutput appears to be undocumented.

Screenshots or logs

Truncated log lines (full logs below):

psycopg2.errors.InternalError_: The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist

...

 ---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-7-9e65ad912afd> in <module>
      4     s3_upload_mode='Continuous')
      5
----> 6 etl_processor.run(outputs=[po])

...

UnexpectedStatusException: Error for Processing job <base-name>-2020-03-09-19-01-48-893: Failed. Reason: AlgorithmError: See job logs for more information

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 1.50.10
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): N/A (custom container)
  • Framework version: N/A
  • Python version: 3.7
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y

Additional context

Traceback (most recent call last):
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
    context)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 509, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.InternalError_: The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
DETAIL:
  -----------------------------------------------
  error:  The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
  code:      8001
  context:
  query:     54196251
  location:  s3_utility.cpp:604
  process:   padbmaster [pid=26108]
  -----------------------------------------------



The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/<custom-processing-code-file>.py", line 339, in <module>
    res_redshift_load = conn.execution_options(autocommit=True).execute(sql_bulk_load)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 942, in execute
    return self._execute_text(object, multiparams, params)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1104, in _execute_text
    statement, parameters
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1200, in _execute_context
    context)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1413, in _handle_dbapi_exception
    exc_info
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 265, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 248, in reraise
    raise value.with_traceback(tb)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
    context)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 509, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.InternalError: (psycopg2.errors.InternalError_) The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
DETAIL:
  -----------------------------------------------
  error:  The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
  code:      8001
  context:
  query:     54196251
  location:  s3_utility.cpp:604
  process:   padbmaster [pid=26108]
  -----------------------------------------------

 [SQL: "copy <table-name> from 's3://sagemaker-<bucket>/<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' iam_role 'arn:aws:iam::<iam-role>' format as CSV;"] (Background on this error at: http://sqlalche.me/e/2j85)



 ---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-7-9e65ad912afd> in <module>
      4     s3_upload_mode='Continuous')
      5
----> 6 etl_processor.run(outputs=[po])

~/code/trfx/ml-workflow-test/venv/lib/python3.7/site-packages/sagemaker/processing.py in run(self, inputs, outputs, arguments, wait, logs, job_name, experiment_config)
    162         self.jobs.append(self.latest_job)
    163         if wait:
--> 164             self.latest_job.wait(logs=logs)
    165
    166     def _generate_current_job_name(self, job_name=None):

~/code/trfx/ml-workflow-test/venv/lib/python3.7/site-packages/sagemaker/processing.py in wait(self, logs)
    721         """
    722         if logs:
--> 723             self.sagemaker_session.logs_for_processing_job(self.job_name, wait=True)
    724         else:
    725             self.sagemaker_session.wait_for_processing_job(self.job_name)

~/code/trfx/ml-workflow-test/venv/lib/python3.7/site-packages/sagemaker/session.py in logs_for_processing_job(self, job_name, wait, poll)
   3109
   3110         if wait:
-> 3111             self._check_job_status(job_name, description, "ProcessingJobStatus")
   3112             if dot:
   3113                 print()

~/code/trfx/ml-workflow-test/venv/lib/python3.7/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   2613                 ),
   2614                 allowed_statuses=["Completed", "Stopped"],
-> 2615                 actual_status=status,
   2616             )
   2617

UnexpectedStatusException: Error for Processing job <job-base-name>-2020-03-09-19-01-48-893: Failed. Reason: AlgorithmError: See job logs for more information

@laurenyu
Contributor

Sorry for the delayed response here. We've passed this along to the SageMaker Processing team to see if they have any insight. Thanks for your patience!

@jigsaw004
Contributor

Hi @ram-nadella - I am from the SageMaker Processing team. Apologies that you ran into this error.

> Create a processing job that uses ProcessingOutput with the output mode set to Continuous (instead of the default EndOfJob) and try to use the S3 file path within the same job. In our case, we load data from the file into Redshift.

Could you please clarify what you mean by "try to use the S3 file path in the same job"? It would be helpful for us to debug the issue if you could provide the ARN of the SageMaker Processing Job that experienced this error.

> Also, the exact behaviour of Continuous mode in ProcessingOutput appears to be undocumented.

Setting S3UploadMode to Continuous means that the output of the Processing job is uploaded continuously to the S3 destination path specified in the config. For instance, if your Processing container produces output artifacts continuously while the job is in progress and you want them uploaded to S3 in near real time, you would specify this mode.
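
Concretely, in the Python SDK the mode is selected via the `s3_upload_mode` parameter of `ProcessingOutput` (the default is `EndOfJob`):

```python
# EndOfJob (default): artifacts under `source` are uploaded once, when the job ends.
# Continuous: artifacts are uploaded in near real time as they are written.
ProcessingOutput(source='/opt/ml/processing/output', s3_upload_mode='Continuous')
```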

Thanks for your feedback regarding the unclear documentation. We've noted this and are working towards improving it.

@ram-nadella
Author

Hi @jigsaw004

The logic in our processing job looks something like this:

  • Run some Python code that generates/transforms data using pandas
  • Generate a .csv file and write it to the processing output dir on the container
  • Use the S3 file path to run a Redshift COPY command that loads the contents of the CSV into a Redshift table

The last step is where we are seeing intermittent failures; a rough sketch of it is below.
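
In code, the container logic looks roughly like this (`build_features` is a hypothetical stand-in for our pandas code; paths, credentials, and table name are placeholders matching the redacted logs):

```python
import pandas as pd
from sqlalchemy import create_engine

local_path = '/opt/ml/processing/output/<file-name>.csv'
s3_path = 's3://sagemaker-<bucket>/<job-base-name>/output/processing-output/<file-name>.csv'

df = build_features()  # hypothetical: returns a pandas DataFrame
df.to_csv(local_path, index=False)

engine = create_engine('postgresql+psycopg2://<user>:<password>@<redshift-host>:5439/<db>')
sql_bulk_load = (
    f"copy <table-name> from '{s3_path}' "
    "iam_role 'arn:aws:iam::<iam-role>' format as CSV;"
)
with engine.connect() as conn:
    # The intermittent failure happens here: in Continuous mode the CSV upload
    # may not have finished by the time the COPY runs.
    res_redshift_load = conn.execution_options(autocommit=True).execute(sql_bulk_load)
```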

My original issue description has more details of both specifics about the steps and the actual error message. Copied the relevant parts below:

> The processing job fails intermittently when using ProcessingOutput in Continuous mode, where the destination S3 path is used within the same job to load data into Redshift. We use sqlalchemy to run the COPY command on Redshift.

sqlalchemy.exc.InternalError: (psycopg2.errors.InternalError_) The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
DETAIL:
  -----------------------------------------------
  error:  The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
  code:      8001
  context:
  query:     54196251
  location:  s3_utility.cpp:604
  process:   padbmaster [pid=26108]
  -----------------------------------------------

 [SQL: "copy <table-name> from 's3://sagemaker-<bucket>/<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' iam_role 'arn:aws:iam::<iam-role>' format as CSV;"] (Background on this error at: http://sqlalche.me/e/2j85)

Thanks for the clarification regarding Continuous mode. I think it's the near real-time aspect of Continuous mode that is affecting us here: we want to generate output to S3 and also load the contents of the generated file into a database (Redshift) within the same job.

@jigsaw004
Contributor

Thanks for the comment, @ram-nadella, much appreciated! :)

I saw your initial post and the clarification above, and I now have a better sense of what the problem could be.

> In Continuous mode, we expect the S3 path to be accessible and ready as soon as the file write to the local path completes.

SageMaker tries to upload the contents written to disk in near real time. Note that the upload to S3 may experience intermittent delays and/or retries, which govern the overall time between the write to the local disk and the file becoming available in S3.

If you would like to work around this, could I ask that you add retry logic in your Processing container to wait for the file to be available in S3, i.e. for the S3 prefix to become valid? Since performance is a critical criterion here, it would make sense for the retry to poll as aggressively as possible; a sketch of one way to do this follows.
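
For example, one minimal version of that retry uses boto3's built-in `object_exists` waiter, which polls HeadObject until the key appears (the bucket and key below are placeholders; derive them from the job's ProcessingOutput destination):

```python
import boto3
from botocore.exceptions import WaiterError

s3 = boto3.client('s3')

# Placeholders: derive these from the job's ProcessingOutput destination.
bucket = 'sagemaker-<bucket>'
key = '<job-base-name>/output/processing-output/<file-name>.csv'

waiter = s3.get_waiter('object_exists')
try:
    # Poll HeadObject every 2 seconds, up to 60 attempts (~2 minutes total).
    waiter.wait(Bucket=bucket, Key=key, WaiterConfig={'Delay': 2, 'MaxAttempts': 60})
except WaiterError:
    raise RuntimeError(f'Timed out waiting for s3://{bucket}/{key} to appear')

# Only once the object is visible, issue the Redshift COPY against the S3 path.
```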

If you see intermittent failures even with the retry logic (or if you had already implemented such logic before filing this issue), could you please provide the Job ARNs so that we can investigate?

@aws locked and limited conversation to collaborators on May 20, 2021

This issue was moved to a discussion.

You can continue the conversation there.
