fix: ensure a spark application can only be submitted once #460

Merged on Sep 16, 2024 (12 commits)

@razvan razvan commented Sep 12, 2024

Description

Fixes #457

Kubernetes garbage-collects the Jobs of Spark applications after ttlSecondsAfterFinished (currently 10 minutes), but the application objects live forever (or until the user deletes them).
If a reconciliation is triggered on an app that has no child Job, the operator will submit the Job again.

The fix is to use the application's status field as a guard in the reconciliation loop: an application with a non-empty status is skipped.
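
A minimal sketch of the guard idea, not the operator's actual code: the types `SparkApplication`, `SparkApplicationStatus` and `Outcome`, the `reconcile` signature, and the `phase` value are all hypothetical stand-ins chosen for illustration. It only shows how a non-empty status can prevent a second submission even when the child Job no longer exists.

```rust
// Hypothetical, simplified stand-ins for the real CRD types (assumption).
#[derive(Debug, Clone, PartialEq)]
struct SparkApplicationStatus {
    phase: String,
}

#[derive(Debug, Clone)]
struct SparkApplication {
    name: String,
    // `None` until the application has been submitted once.
    status: Option<SparkApplicationStatus>,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Submitted,
    Skipped,
}

// Illustrative reconcile step: the status field is the only guard.
// Whether a child Job still exists is deliberately not consulted.
fn reconcile(app: &mut SparkApplication) -> Outcome {
    if app.status.is_some() {
        println!(
            "Skip reconciling SparkApplication [{}] with non empty status",
            app.name
        );
        return Outcome::Skipped;
    }

    // Placeholder for building and submitting the Job (omitted here).
    // Recording a status marks the application as already submitted.
    app.status = Some(SparkApplicationStatus {
        phase: "Unknown".to_string(), // hypothetical initial value
    });
    Outcome::Submitted
}

fn main() {
    let mut app = SparkApplication {
        name: "spark-pi-s3-1".to_string(),
        status: None,
    };

    // First reconciliation submits the Job.
    assert_eq!(reconcile(&mut app), Outcome::Submitted);

    // A later reconciliation (for example after an operator restart, with
    // the Job already cleaned up) is skipped because the status is set.
    assert_eq!(reconcile(&mut app), Outcome::Skipped);
}
```

In this sketch the decision depends only on the application object itself, so it survives both Job cleanup and operator restarts.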

I tested it as follows:

  1. I ran the smoke integration test with --skip-delete.
  2. After the test finished successfully, I deleted the Spark pi Job object.
  3. I deleted the operator pod (to simulate a restart).
  4. The Job was not submitted again, and the operator logs show:
2024-09-12T13:14:51.650689Z  INFO app_controller:reconciling object{object.ref=SparkApplication.v1alpha1.spark.stackable.tech/spark-pi-s3-1.kuttl-test-meet-martin object.reason=object updated}: stackable_spark_k8s_operator::spark_k8s_controller: Skip reconciling SparkApplication [spark-pi-s3-1] with non empty status

🟢 CI: https://testing.stackable.tech/view/02%20Operator%20Tests%20(custom)/job/spark-k8s-operator-it-custom/4/

Definition of Done Checklist

  • Not all of these items are applicable to all PRs, the author should update this template to only leave the boxes in that are relevant
  • Please make sure all these things are done and tick the boxes
# Author
- [x] Changes are OpenShift compatible
- [x] CRD changes approved
- [x] CRD documentation for all fields, following the [style guide](https://docs.stackable.tech/home/nightly/contributor/docs/style-guide).
- [x] Helm chart can be installed and deployed operator works
- [x] Integration tests passed (for non trivial changes)
- [x] Changes need to be "offline" compatible
# Reviewer
- [x] Code contains useful comments
- [x] Code contains useful logging statements
- [x] (Integration-)Test cases added
- [x] Documentation added or updated. Follows the [style guide](https://docs.stackable.tech/home/nightly/contributor/docs/style-guide).
- [x] Changelog updated
- [x] Cargo.toml only contains references to git tags (not specific commits or branches)
# Acceptance
- [ ] Feature Tracker has been updated
- [ ] Proper release label has been added
- [ ] [Roadmap](https://github.com/orgs/stackabletech/projects/25/views/1) has been updated

@razvan razvan requested a review from a team September 12, 2024 12:44
@razvan razvan marked this pull request as ready for review September 12, 2024 15:33
@razvan razvan enabled auto-merge September 13, 2024 10:50
@razvan razvan self-assigned this Sep 13, 2024
@sbernauer sbernauer self-requested a review September 16, 2024 08:45
@sbernauer sbernauer left a comment

Functionality LGTM, only some style questions

@razvan razvan requested a review from sbernauer September 16, 2024 11:55
sbernauer previously approved these changes Sep 16, 2024
@sbernauer sbernauer left a comment

Thanks! Can you please run a Jenkins custom test before merge?

razvan commented Sep 16, 2024

@sbernauer sbernauer left a comment

Thanks!

@razvan razvan added this pull request to the merge queue Sep 16, 2024
Merged via the queue into main with commit 9cd61dd Sep 16, 2024
31 checks passed
@razvan razvan deleted the fix/457 branch September 16, 2024 12:51
Development

Successfully merging this pull request may close these issues.

operator resubmits all spark applications after restart no matter what their status is
3 participants