
Research: Streaming jobs with spark-k8s #119


Closed

adwk67 opened this issue Aug 18, 2022 · 2 comments


adwk67 commented Aug 18, 2022

As a user I want to have a plan/outline/overview of how to write streaming jobs that Spark executes in k8s.

This is a research ticket: as a result we will be able to

  • understand the pros and cons
  • estimate the implementation effort
  • define specific implementation ticket(s)
adwk67 self-assigned this Aug 19, 2022

adwk67 commented Aug 19, 2022

Set the Spark configuration option spark.kubernetes.submission.waitAppCompletion to false.
See https://spark.apache.org/docs/latest/running-on-kubernetes.html:

In cluster mode, whether to wait for the application to finish before exiting the launcher process. When changed to false, the launcher has a "fire-and-forget" behavior when launching the Spark job.

Test with a streaming job that does not terminate, e.g. hdfs_wordcount, which counts the words in a target directory in a loop (sketched below).
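
For reference, the bundled example is essentially the following (condensed from the Spark examples; the exact file contents may vary between versions):

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# A streaming context with a 1-second batch interval.
sc = SparkContext(appName="PythonStreamingHDFSWordCount")
ssc = StreamingContext(sc, 1)

# Watch the directory passed as the first argument for new text files,
# count the words in each batch and print the counts.
lines = ssc.textFileStream(sys.argv[1])
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

# Run until terminated externally.
ssc.start()
ssc.awaitTermination()

Define the job in a file pyspark-streaming.yaml: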

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-streaming
  namespace: default
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.1.0
  mode: cluster
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/streaming/hdfs_wordcount.py
  args:
    - "/tmp2"
  sparkConf:
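    # let the launcher exit as soon as the job has been handed off ("fire-and-forget")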
    spark.kubernetes.submission.waitAppCompletion: "false"
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
  executor:
    cores: 1
    instances: 3
    memory: "512m"

Applying this file (kubectl apply -f pyspark-streaming.yaml) will start a driver and executors that run continually:

[screenshot: driver and executor pods in the Running state]

The job has been started as "fire-and-forget" but can be stopped and removed with kubectl delete -f pyspark-streaming.yaml.


adwk67 commented Aug 19, 2022

A further note on the Spark configuration option spark.kubernetes.submission.waitAppCompletion: when set to false, the initiating job completes as soon as the spark-submit command has been handed off to the driver, meaning that the operator can no longer track the job status. Tracking is then in the hands of the application developer. Our recommendation would be to set it to true (whereby job, driver and executors run continuously until terminated) and to configure logging accordingly (so that e.g. the job pod does not emit unnecessary log output).
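
Since the operator no longer tracks the driver once the launcher has exited, status checks fall to the developer. A minimal sketch of checking the driver pod directly, assuming the kubernetes Python client and Spark's standard spark-role=driver pod label:

from kubernetes import client, config

# Load credentials from the local kubeconfig (use config.load_incluster_config()
# instead when running inside the cluster).
config.load_kube_config()
v1 = client.CoreV1Api()

# Spark on Kubernetes labels driver pods with spark-role=driver.
drivers = v1.list_namespaced_pod("default", label_selector="spark-role=driver")
for pod in drivers.items:
    print(pod.metadata.name, pod.status.phase)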
