Compute resources and number of executors are not set #161

Closed
sbernauer opened this issue Oct 12, 2022 · 2 comments

sbernauer (Member) commented Oct 12, 2022

Affected version

nightly

Current and expected behavior

Configuring the following compute resources

      job:
        resources:
          cpu:
            min: 100m
            max: "1"
          memory:
            limit: 1Gi
      driver:
        resources:
          cpu:
            min: "1"
            max: "2"
          memory:
            limit: 2Gi
        volumeMounts:
          - name: script
            mountPath: /stackable/spark/jobs
      executor:
        instances: 5
        resources:
          cpu:
            min: "4"
            max: "6"
          memory:
            limit: 8Gi
        volumeMounts:
          - name: script
            mountPath: /stackable/spark/jobs

I would have expected the following settings (I manually overrode them to get it working):

      sparkConf:
        spark.executor.instances: "5"
        spark.driver.cores: "2"
        spark.driver.memory: "2g"
        spark.kubernetes.driver.request.cores: "1"
        spark.kubernetes.driver.limit.cores: "2"
        spark.executor.cores: "6"
        spark.executor.memory: "8g"
        spark.kubernetes.executor.request.cores: "4"
        spark.kubernetes.executor.limit.cores: "6"

Whatever is currently done is way off: e.g. it only spawns 2 executors, executor CPU and memory seem to be the default values, and the executor thinks it has a single core.

Possible solution

Did some prototyping in #160.
In a nutshell, we need to set all of the above-mentioned values (they can be used as a test case):

  • spark.executor.instances obviously
  • spark.driver.memory and spark.executor.memory need to be derived from the memory limit and converted to the Spark memory convention. Special caution is needed: if a user specifies e.g. 8Gi for an executor, Spark will add overhead on top and the resulting Pods will end up with a memory limit of 11468Mi. We want to subtract that overhead and do the calculations (and/or set overhead configurations) so that the executor Pods get the requested 8Gi memory limit.
  • spark.kubernetes.driver.(request|limit).cores and spark.kubernetes.executor.(request|limit).cores can be mapped 1:1 from the CRD, I guess
  • spark.driver.cores and spark.executor.cores need to be derived from the CPU limit: converted to a floating-point number, rounded up, and given as a whole positive number (see the sketch below)
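
A minimal sketch of how the instance and CPU related parts of this mapping could look (not the code from #160; quantity parsing is simplified and all helper names are made up):

    use std::collections::BTreeMap;

    // Parse a Kubernetes CPU quantity ("100m", "1", "2") into a number of cores.
    // Simplified: only plain numbers and the "m" (milli-CPU) suffix are handled.
    fn parse_cpu(q: &str) -> f64 {
        match q.strip_suffix('m') {
            Some(millis) => millis.parse::<f64>().unwrap_or(0.0) / 1000.0,
            None => q.parse::<f64>().unwrap_or(0.0),
        }
    }

    // Derive the instance and core related sparkConf entries from the CRD resources.
    // `driver_cpu` and `executor_cpu` are the (min, max) quantities from the CRD.
    fn cpu_spark_conf(
        instances: u32,
        driver_cpu: (&str, &str),
        executor_cpu: (&str, &str),
    ) -> BTreeMap<String, String> {
        let mut conf = BTreeMap::new();
        conf.insert("spark.executor.instances".to_string(), instances.to_string());

        // request/limit cores map 1:1 from the CRD min/max ...
        conf.insert("spark.kubernetes.driver.request.cores".to_string(), driver_cpu.0.to_string());
        conf.insert("spark.kubernetes.driver.limit.cores".to_string(), driver_cpu.1.to_string());
        conf.insert("spark.kubernetes.executor.request.cores".to_string(), executor_cpu.0.to_string());
        conf.insert("spark.kubernetes.executor.limit.cores".to_string(), executor_cpu.1.to_string());

        // ... while spark.driver.cores / spark.executor.cores must be whole positive
        // numbers, so the CPU limit is rounded up.
        conf.insert("spark.driver.cores".to_string(), (parse_cpu(driver_cpu.1).ceil() as u32).to_string());
        conf.insert("spark.executor.cores".to_string(), (parse_cpu(executor_cpu.1).ceil() as u32).to_string());
        conf
    }

    fn main() {
        // Values from the example above: 5 executors, driver cpu 1-2, executor cpu 4-6.
        for (key, value) in cpu_spark_conf(5, ("1", "2"), ("4", "6")) {
            println!("{key}: \"{value}\"");
        }
    }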

Additional context

No response

Environment

No response

Would you like to work on fixing this bug?

No response

sbernauer changed the title from "Compute resource are not set" to "Compute resources and number of executors are not set" Oct 12, 2022
adwk67 (Member) commented Oct 17, 2022

Regarding spark.executor.instances: the code setting this via spark.conf was incorrectly removed in the PR (by me :().

adwk67 self-assigned this Oct 17, 2022
adwk67 (Member) commented Oct 17, 2022

Suggested settings:

Cores

  • it should not be necessary to set spark.driver.cores or spark.executor.cores if the following are set explicitly:
    • spark.kubernetes.driver.request.cores: overrides spark.driver.cores
    • spark.kubernetes.driver.limit.cores: hard limit for driver cores
    • spark.kubernetes.executor.request.cores: the docs state that this is distinct from spark.executor.cores (though it overrides it)
    • spark.kubernetes.executor.limit.cores: hard limit

Memory

Driver
  • sum of spark.driver.memory and spark.driver.memoryOverhead
    • where spark.driver.memoryOverhead = spark.driver.memory * spark.driver.memoryOverheadFactor, min. of 384Mi
    • overhead factor defaults to 0.1 for JVM jobs and 0.4 for non-JVM jobs (e.g. pyspark)
    • the CRD setting should reflect the total expected usage, so if 8GB is specified, we would have
      • JVM jobs: 7.27 GB heap (spark.driver.memory) + 0.73 GB non-heap (spark.driver.memoryOverhead) = 8 GB total
      • non-JVM: 5.71 GB heap + 2.29 GB non-heap = 8 GB total

i.e. for the driver memory settings everything can be derived from the resource limit definitions, as long as all memory values are set explicitly. We should add a note in the docs that it is possible to define resource limits and spark.conf settings that are not compatible with each other.
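
For illustration, a rough sketch of that derivation for the driver, assuming only heap plus overhead are in play (made-up helper, bytes as u64, not the actual implementation):

    // Minimum overhead Spark enforces: 384 MiB.
    const MIN_OVERHEAD: u64 = 384 * 1024 * 1024;

    // Split a total memory limit into (spark.driver.memory, spark.driver.memoryOverhead)
    // so that heap + overhead == limit, for a given overhead factor (0.1 JVM, 0.4 non-JVM).
    fn split_driver_memory(limit: u64, overhead_factor: f64) -> (u64, u64) {
        let mut heap = (limit as f64 / (1.0 + overhead_factor)) as u64;
        let mut overhead = limit - heap;
        // Respect the 384 MiB minimum overhead.
        if overhead < MIN_OVERHEAD && limit > MIN_OVERHEAD {
            overhead = MIN_OVERHEAD;
            heap = limit - overhead;
        }
        (heap, overhead)
    }

    fn main() {
        let gib: u64 = 1024 * 1024 * 1024;
        let (heap, overhead) = split_driver_memory(8 * gib, 0.1);
        // Roughly the 7.27 GB heap + 0.73 GB overhead split from the list above.
        println!("heap: {} MiB, overhead: {} MiB", heap / (1024 * 1024), overhead / (1024 * 1024));
    }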

Executor
  • sum of spark.executor.memory, spark.executor.memoryOverhead, spark.executor.pyspark.memory and spark.memory.offHeap.size
  • as above, the CRD should reflect the total expected usage
  • setting heap and non-heap memory such that their sum equals the resource limit cannot prevent the user from additionally setting spark.conf values that are incompatible with it

If resource limits are defined, then these will be used to set memory, and the user should be aware that other configuration values may directly assign more memory than is allowed; i.e. either use resource limits and omit spark.conf-related settings, or vice versa.
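
A sketch of the executor decomposition described above, under the assumption that unset components default to 0 and the 384 MiB minimum also applies to the executor overhead (hypothetical helper, not the actual implementation):

    // Derive spark.executor.memory from the pod memory limit (all values in bytes), assuming
    //   limit = executor.memory + memoryOverhead + pyspark.memory + offHeap.size
    // with memoryOverhead = max(overhead_factor * executor.memory, 384 MiB).
    fn executor_heap_from_limit(limit: u64, pyspark: u64, off_heap: u64, overhead_factor: f64) -> u64 {
        let min_overhead: u64 = 384 * 1024 * 1024;
        let remaining = limit.saturating_sub(pyspark + off_heap);
        let heap = (remaining as f64 / (1.0 + overhead_factor)) as u64;
        if remaining - heap < min_overhead {
            // The factor-based overhead would fall below the minimum; reserve the fixed 384 MiB instead.
            remaining.saturating_sub(min_overhead)
        } else {
            heap
        }
    }

    fn main() {
        let gib: u64 = 1024 * 1024 * 1024;
        // 8Gi limit, no pyspark or off-heap memory configured, JVM overhead factor 0.1.
        println!("{} MiB", executor_heap_from_limit(8 * gib, 0, 0, 0.1) / (1024 * 1024));
    }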

Maybe use a Spark streaming job (such as the one shown here) for an integration test so that the actual pods can be inspected.

sbernauer self-assigned this Oct 20, 2022
bors bot closed this as completed in 3425391 Oct 20, 2022