
Implement resource limits and requests for Spark pods #128

Closed
6 of 8 tasks
Tracked by #241
fhennig opened this issue Aug 31, 2022 · 10 comments

@fhennig
Contributor

fhennig commented Aug 31, 2022

In other products we're introducing a common resource limit configuration. This ticket is about evaluating whether it makes sense to use it in this operator too; if it does, the ticket also covers the implementation.

Part of this epic stackabletech/issues#241

Update

The spark-k8s operator hands management of the driver and executor pods off to Spark itself; there are CRD fields for this purpose which are used to construct the relevant spark-submit arguments, and this will remain unchanged. Currently no resources are specified for the initiating Job pod, so this ticket will cover that. Specifically, since the operator does not follow the role pattern of the other products, the struct will be a top-level field (directly under .spec): the resources defined there will be passed on to the ContainerBuilder used for the job. A sketch of what this could look like follows.
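
For illustration only, a hypothetical sketch of such a top-level field (the field names are assumed from the operator-rs common resources convention, not a final CRD schema):

```yaml
# Hypothetical SparkApplication snippet (abridged). The resources struct sits
# directly under .spec because this operator has no roles/role groups.
# Field names follow the operator-rs common resources convention and are
# assumptions here, not the final schema.
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  resources:
    cpu:
      min: "300m"   # container CPU request
      max: "1200m"  # container CPU limit
    memory:
      limit: "1Gi"  # container memory limit (and, by convention, request)
```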

Acceptance criteria

  • Resource requests and limits are configurable in CRD using the common structs from operator-rs
  • Resource requests and limits are configured for the initial Kubernetes Job pod (only)
  • Resource requests and limits are configured in the product (e.g. "-Xmx" etc. for Java based images)
  • Adapt/Add integration tests to specify and verify the correct amount of resources
  • Adapt/Add examples
  • Adapt documentation: New section in usage.adoc with product specific information and link to common shared resources concept
  • Optional: Use sensible defaults for each role (if reasonable and applicable) and document accordingly in usage.adoc
  • Feature Tracker is updated
@maltesander maltesander changed the title from "Implement resource limits for Spark pods" to "Implement resource limits and requests for Spark pods" Aug 31, 2022
@lfrancke lfrancke moved this to Idea/Proposal in Stackable Engineering Sep 15, 2022
@lfrancke lfrancke moved this from Idea/Proposal to Refinement Acceptance: Waiting for in Stackable Engineering Sep 15, 2022
@lfrancke lfrancke moved this from Refinement Acceptance: Waiting for to Refinement: Waiting for in Stackable Engineering Sep 15, 2022
@lfrancke
Member

I left this in the refinement column because I vaguely remember someone saying that this is already implemented for Spark and might just need a different implementation? To be honest I'm not sure and I might be wrong.

@adwk67
Member

adwk67 commented Sep 15, 2022

Cores, core limit, and memory can be specified for driver and executor pods (but not for the initiating job): see e.g. https://github.com/stackabletech/spark-k8s-operator/blob/main/examples/ny-tlc-report-image.yaml#L36-L42
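
For reference, the linked example sets the driver and executor resources roughly like this (abridged here; the file at the URL above is authoritative):

```yaml
driver:
  cores: 1
  coreLimit: "1200m"
  memory: "512m"
executor:
  cores: 1
  instances: 3
  memory: "512m"
```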

@lfrancke
Member

Thank you! I need some more input though I'm afraid :)

Is there still anything to be done to close this ticket or can it already be closed as done?

  1. Do we need this for the initiating job?
  2. Because it was implemented earlier than the others: Does it follow the CRD standard of the other operators and does it use the same implementation etc.?

@adwk67
Member

adwk67 commented Sep 15, 2022

  1. I would say no, as the initiating pod is relatively short-lived and gets removed after a few minutes.
  2. The implementation for the driver/executors is a little different, as we use the resource definitions to add Spark configs to a spark-submit command rather than building those pods directly (see the sketch below). If we were to implement this for the initiating pod it would also be a little different, as we don't have any roles/role groups as such (so the resources field in the CRD could be defined the same way, but added at the top level, directly under .spec).

@razvan what do you think?
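
To illustrate the point in 2. above: the driver/executor resource fields are not used to build pods directly, but are translated into spark-submit properties. A rough sketch of that mapping, using standard Spark property names (the operator's exact translation may differ):

```yaml
driver:
  cores: 1           # -> --conf spark.driver.cores=1
  coreLimit: "1200m" # -> --conf spark.kubernetes.driver.limit.cores=1200m
  memory: "512m"     # -> --conf spark.driver.memory=512m
executor:
  cores: 1           # -> --conf spark.executor.cores=1
  instances: 3       # -> --conf spark.executor.instances=3
  memory: "512m"     # -> --conf spark.executor.memory=512m
```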

@adwk67
Member

adwk67 commented Sep 19, 2022

Having thought about this, I think it is worthwhile to implement this for the job Pod (the implementation would follow the standard pattern of introducing a struct that is defined in the CRD and passed through to the ContainerBuilder), but to leave the driver/executor management to Spark (as is currently done).

@lfrancke
Member

Great, thank you!
In that case I'll leave it in the Refinement column for someone to update the ticket accordingly.

@adwk67 adwk67 moved this from Refinement: Waiting for to Refinement: In Progress in Stackable Engineering Sep 19, 2022
@adwk67
Member

adwk67 commented Sep 19, 2022

> In that case I'll leave it in the Refinement column for someone to update the ticket accordingly.

I have updated the issue description accordingly.

@adwk67 adwk67 moved this from Refinement: In Progress to Refinement Acceptance: Waiting for in Stackable Engineering Sep 19, 2022
@lfrancke lfrancke moved this from Refinement Acceptance: Waiting for to Refinement Acceptance: Done in Stackable Engineering Sep 19, 2022
@sbernauer sbernauer moved this from Refinement Acceptance: Done to Development: Waiting for in Stackable Engineering Sep 21, 2022
@adwk67 adwk67 moved this from Development: Waiting for to Development: In Progress in Stackable Engineering Sep 21, 2022
@adwk67 adwk67 self-assigned this Sep 21, 2022
@razvan
Member

razvan commented Sep 23, 2022

Sorry for having missed this, and yes, I agree with @adwk67 that the initiating spark-submit Pod should specify resource limits.

I'm not sure it's a good idea to make these limits configurable by the user.

@adwk67 adwk67 moved this from Development: In Progress to Development: In Review in Stackable Engineering Sep 29, 2022
@bors bors bot closed this as completed in 38698ad Sep 29, 2022
@adwk67 adwk67 moved this from Development: In Review to Acceptance: Waiting for in Stackable Engineering Sep 29, 2022
@lfrancke lfrancke moved this from Acceptance: Waiting for to Acceptance: In Progress in Stackable Engineering Sep 30, 2022
@lfrancke lfrancke self-assigned this Sep 30, 2022
@lfrancke
Member

I will update the Feature Tracker; could you please check whether all other boxes can be ticked? @razvan @adwk67

@adwk67
Member

adwk67 commented Oct 4, 2022

All other boxes have been ticked, except:

  • "only implement for Job pod": the common struct has now been used for all pods (job, driver and executor) to keep things consistent
  • "product limits": Spark sets e.g. overall memory based on the memory limit plus a memory overhead factor (spark.kubernetes.memoryOverheadFactor), which defaults to 10%; see the worked example below

@lfrancke lfrancke moved this from Acceptance: In Progress to Done in Stackable Engineering Oct 4, 2022