
Research: Streaming jobs with spark-k8s #119


Closed

adwk67 opened this issue Aug 18, 2022 · 2 comments


adwk67 commented Aug 18, 2022

As a user I want to have a plan/outline/overview of how to write streaming jobs that Spark executes in k8s.

This is a research ticket: as a result we will be able to

  • understand the pros and cons
  • estimate the implementation effort
  • define specific implementation ticket(s)
adwk67 self-assigned this Aug 19, 2022

adwk67 commented Aug 19, 2022

Set the Spark configuration option spark.kubernetes.submission.waitAppCompletion to false.
See https://spark.apache.org/docs/latest/running-on-kubernetes.html:

In cluster mode, whether to wait for the application to finish before exiting the launcher process. When changed to false, the launcher has a "fire-and-forget" behavior when launching the Spark job.

Test with a streaming job that does not terminate, e.g. hdfs_wordcount, which counts the words in a target directory in a loop (sketched below).
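
For reference, the bundled example is essentially the following (condensed from the Spark examples; the exact file contents may vary between versions):

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# A streaming context with a 1-second batch interval.
sc = SparkContext(appName="PythonStreamingHDFSWordCount")
ssc = StreamingContext(sc, 1)

# Watch the directory passed as the first argument for new text files,
# count the words in each batch and print the counts.
lines = ssc.textFileStream(sys.argv[1])
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

# Run until terminated externally.
ssc.start()
ssc.awaitTermination()

Define the job in a file pyspark-streaming.yaml: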

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-streaming
  namespace: default
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.1.0
  mode: cluster
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/streaming/hdfs_wordcount.py
  args:
    - "/tmp2"
  sparkConf:
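    # let the launcher exit as soon as the job has been handed off ("fire-and-forget")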
    spark.kubernetes.submission.waitAppCompletion: "false"
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
  executor:
    cores: 1
    instances: 3
    memory: "512m"

Applying this file (kubectl apply -f pyspark-streaming.yaml) will start a driver and executors that run continually:

[screenshot: driver and executor pods in the Running state]

The job has been started as "fire-and-forget" but can be stopped and removed with kubectl delete -f pyspark-streaming.yaml.


adwk67 commented Aug 19, 2022

A further note on the Spark configuration option spark.kubernetes.submission.waitAppCompletion: when set to false, the initiating job completes as soon as the spark-submit command has been handed off to the driver, meaning that the operator can no longer track the job status. Tracking is then in the hands of the application developer. Our recommendation would be to set it to true (whereby job, driver and executors run continuously until terminated) and to configure logging accordingly (so that e.g. the job pod does not emit unnecessary log output).
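
Since the operator no longer tracks the driver once the launcher has exited, status checks fall to the developer. A minimal sketch of checking the driver pod directly, assuming the kubernetes Python client and Spark's standard spark-role=driver pod label:

from kubernetes import client, config

# Load credentials from the local kubeconfig (use config.load_incluster_config()
# instead when running inside the cluster).
config.load_kube_config()
v1 = client.CoreV1Api()

# Spark on Kubernetes labels driver pods with spark-role=driver.
drivers = v1.list_namespaced_pod("default", label_selector="spark-role=driver")
for pod in drivers.items:
    print(pod.metadata.name, pod.status.phase)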
