Compute resources and number of executors are not set #161

Closed
sbernauer opened this issue Oct 12, 2022 · 2 comments

sbernauer (Member) commented Oct 12, 2022

Affected version

nightly

Current and expected behavior

Configuring the following compute resources

      job:
        resources:
          cpu:
            min: 100m
            max: "1"
          memory:
            limit: 1Gi
      driver:
        resources:
          cpu:
            min: "1"
            max: "2"
          memory:
            limit: 2Gi
        volumeMounts:
          - name: script
            mountPath: /stackable/spark/jobs
      executor:
        instances: 5
        resources:
          cpu:
            min: "4"
            max: "6"
          memory:
            limit: 8Gi
        volumeMounts:
          - name: script
            mountPath: /stackable/spark/jobs

I would have expected the following settings (I manually overrode them to get it working):

      sparkConf:
        spark.executor.instances: "5"
        spark.driver.cores: "2"
        spark.driver.memory: "2g"
        spark.kubernetes.driver.request.cores: "1"
        spark.kubernetes.driver.limit.cores: "2"
        spark.executor.cores: "6"
        spark.executor.memory: "8g"
        spark.kubernetes.executor.request.cores: "4"
        spark.kubernetes.executor.limit.cores: "6"

Whatever is currently done is way off: e.g. it only spawns 2 executors, executor CPU and memory seem to be the default values, and the executor thinks it has a single core.

Possible solution

Did some prototyping in #160.
In a nutshell, we need to set all of the above-mentioned values (they can be used as a test case):

  • spark.executor.instances obviously
  • spark.driver.memory and spark.executor.memory need to be derived from the memory limit and converted to the Spark memory convention. Special caution is needed: if a user specifies e.g. 8Gi for an executor, Spark will add overhead on top and the resulting Pods will end up with a memory limit of 11468Mi. We want to subtract that overhead and do the calculations (and/or set overhead configurations) so that the executor Pods get the requested 8Gi memory limit.
  • spark.kubernetes.driver.(request|limit).cores and spark.kubernetes.executor.(request|limit).cores can be mapped 1:1 from the CRD, I guess
  • spark.driver.cores and spark.executor.cores need to be derived from the CPU limit: converted to a floating-point number, rounded up, and given as a whole positive number (see the sketch below)
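
A minimal sketch of how the instance and CPU related parts of this mapping could look (not the code from #160; quantity parsing is simplified and all helper names are made up):

    use std::collections::BTreeMap;

    // Parse a Kubernetes CPU quantity ("100m", "1", "2") into a number of cores.
    // Simplified: only plain numbers and the "m" (milli-CPU) suffix are handled.
    fn parse_cpu(q: &str) -> f64 {
        match q.strip_suffix('m') {
            Some(millis) => millis.parse::<f64>().unwrap_or(0.0) / 1000.0,
            None => q.parse::<f64>().unwrap_or(0.0),
        }
    }

    // Derive the instance and core related sparkConf entries from the CRD resources.
    // `driver_cpu` and `executor_cpu` are the (min, max) quantities from the CRD.
    fn cpu_spark_conf(
        instances: u32,
        driver_cpu: (&str, &str),
        executor_cpu: (&str, &str),
    ) -> BTreeMap<String, String> {
        let mut conf = BTreeMap::new();
        conf.insert("spark.executor.instances".to_string(), instances.to_string());

        // request/limit cores map 1:1 from the CRD min/max ...
        conf.insert("spark.kubernetes.driver.request.cores".to_string(), driver_cpu.0.to_string());
        conf.insert("spark.kubernetes.driver.limit.cores".to_string(), driver_cpu.1.to_string());
        conf.insert("spark.kubernetes.executor.request.cores".to_string(), executor_cpu.0.to_string());
        conf.insert("spark.kubernetes.executor.limit.cores".to_string(), executor_cpu.1.to_string());

        // ... while spark.driver.cores / spark.executor.cores must be whole positive
        // numbers, so the CPU limit is rounded up.
        conf.insert("spark.driver.cores".to_string(), (parse_cpu(driver_cpu.1).ceil() as u32).to_string());
        conf.insert("spark.executor.cores".to_string(), (parse_cpu(executor_cpu.1).ceil() as u32).to_string());
        conf
    }

    fn main() {
        // Values from the example above: 5 executors, driver cpu 1-2, executor cpu 4-6.
        for (key, value) in cpu_spark_conf(5, ("1", "2"), ("4", "6")) {
            println!("{key}: \"{value}\"");
        }
    }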

Additional context

No response

Environment

No response

Would you like to work on fixing this bug?

No response

sbernauer changed the title from "Compute resource are not set" to "Compute resources and number of executors are not set" Oct 12, 2022
adwk67 (Member) commented Oct 17, 2022

Regarding spark.executor.instances: the code setting this via spark.conf was incorrectly removed in the PR (by me :().

adwk67 self-assigned this Oct 17, 2022
adwk67 (Member) commented Oct 17, 2022

Suggested settings:

Cores

  • it should not be necessary to set spark.driver.cores or spark.executor.cores if the following are set explicitly:
    • spark.kubernetes.driver.request.cores: overrides spark.driver.cores
    • spark.kubernetes.driver.limit.cores: hard limit for driver cores
    • spark.kubernetes.executor.request.cores: the docs state that this is distinct from spark.executor.cores (though it overrides it)
    • spark.kubernetes.executor.limit.cores: hard limit

Memory

Driver
  • sum of spark.driver.memory and spark.driver.memoryOverhead
    • where spark.driver.memoryOverhead = spark.driver.memory * spark.driver.memoryOverheadFactor, min. of 384Mi
    • overhead factor defaults to 0.1 for JVM jobs and 0.4 for non-JVM jobs (e.g. pyspark)
    • the CRD setting should reflect the total expected usage, so if 8GB is specified, we would have
      • JVM jobs: 7.27 GB heap (spark.driver.memory) + 0.73 GB non-heap (spark.driver.memoryOverhead) = 8 GB total
      • non-JVM: 5.71 GB heap + 2.29 GB non-heap = 8 GB total

i.e. for the driver memory settings everything can be derived from the resource limit definitions, as long as all memory values are set explicitly. We should add a note in the docs that it is possible to define resource limits and spark.conf settings that are not compatible with each other.
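
For illustration, a rough sketch of that derivation for the driver, assuming only heap plus overhead are in play (made-up helper, bytes as u64, not the actual implementation):

    // Minimum overhead Spark enforces: 384 MiB.
    const MIN_OVERHEAD: u64 = 384 * 1024 * 1024;

    // Split a total memory limit into (spark.driver.memory, spark.driver.memoryOverhead)
    // so that heap + overhead == limit, for a given overhead factor (0.1 JVM, 0.4 non-JVM).
    fn split_driver_memory(limit: u64, overhead_factor: f64) -> (u64, u64) {
        let mut heap = (limit as f64 / (1.0 + overhead_factor)) as u64;
        let mut overhead = limit - heap;
        // Respect the 384 MiB minimum overhead.
        if overhead < MIN_OVERHEAD && limit > MIN_OVERHEAD {
            overhead = MIN_OVERHEAD;
            heap = limit - overhead;
        }
        (heap, overhead)
    }

    fn main() {
        let gib: u64 = 1024 * 1024 * 1024;
        let (heap, overhead) = split_driver_memory(8 * gib, 0.1);
        // Roughly the 7.27 GB heap + 0.73 GB overhead split from the list above.
        println!("heap: {} MiB, overhead: {} MiB", heap / (1024 * 1024), overhead / (1024 * 1024));
    }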

Executor
  • sum of spark.executor.memory, spark.executor.memoryOverhead, spark.executor.pyspark.memory and spark.memory.offHeap.size
  • as above, the CRD should reflect the total expected usage
  • setting heap and non-heap memory such that their sum equals the resource limit cannot prevent the user from additionally setting spark.conf values that are incompatible with it

If resource limits are defined, then these will be used to set memory, and the user should be aware that other configuration values may directly assign more memory than is allowed; i.e. either use resource limits and omit spark.conf-related settings, or vice versa.
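
A sketch of the executor decomposition described above, under the assumption that unset components default to 0 and the 384 MiB minimum also applies to the executor overhead (hypothetical helper, not the actual implementation):

    // Derive spark.executor.memory from the pod memory limit (all values in bytes), assuming
    //   limit = executor.memory + memoryOverhead + pyspark.memory + offHeap.size
    // with memoryOverhead = max(overhead_factor * executor.memory, 384 MiB).
    fn executor_heap_from_limit(limit: u64, pyspark: u64, off_heap: u64, overhead_factor: f64) -> u64 {
        let min_overhead: u64 = 384 * 1024 * 1024;
        let remaining = limit.saturating_sub(pyspark + off_heap);
        let heap = (remaining as f64 / (1.0 + overhead_factor)) as u64;
        if remaining - heap < min_overhead {
            // The factor-based overhead would fall below the minimum; reserve the fixed 384 MiB instead.
            remaining.saturating_sub(min_overhead)
        } else {
            heap
        }
    }

    fn main() {
        let gib: u64 = 1024 * 1024 * 1024;
        // 8Gi limit, no pyspark or off-heap memory configured, JVM overhead factor 0.1.
        println!("{} MiB", executor_heap_from_limit(8 * gib, 0, 0, 0.1) / (1024 * 1024));
    }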

Maybe use a Spark streaming job (such as the one shown here) for an integration test so that the actual pods can be inspected.

sbernauer self-assigned this Oct 20, 2022
bors bot closed this as completed in 3425391 Oct 20, 2022