Manual Scale Test for Data and Control Plane Separation #3011

Closed
Tracked by #1508
mpstefan opened this issue Jan 13, 2025 · 6 comments
Assignees
Labels
refined Requirements are refined and the issue is ready to be implemented. size/medium Estimated to be completed within a week tests Pull requests that update tests
Milestone

Comments

@mpstefan
Member

mpstefan commented Jan 13, 2025

As a maintainer of NGF
I want to test scaling to ~1000 Agent connections to our control plane
So that we can ensure our control plane does not get overwhelmed with connections when deployed with a highly scaled data plane.

Acceptance

  • Manual tests needed:
    • Test scaling NGF to 1000 Gateways to ensure the control plane is not overwhelmed by Agent connections.
    • Test that, with the control plane scaled to 2 or more pods, the control plane is not overwhelmed when the 1000 data plane pods switch leaders.
  • Record breaking points if applicable.
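
A minimal sketch of how breaking points could be recorded during a manual run. It assumes pod state is captured with `kubectl get pods -o json` at each scale step. The helper names `count_running` and `first_breaking_point` are hypothetical and are not part of NGF.

```python
import json


def count_running(pod_list_json: str) -> int:
    """Count pods whose status.phase is "Running" in `kubectl get pods -o json` output."""
    items = json.loads(pod_list_json).get("items", [])
    return sum(1 for pod in items if pod.get("status", {}).get("phase") == "Running")


def first_breaking_point(observations):
    """Return the lowest requested replica count at which fewer pods were
    running than requested, or None if every step reached its target.

    `observations` is an iterable of (requested, running) pairs, one per
    scale step (hypothetical format, chosen for this sketch).
    """
    for requested, running in sorted(observations):
        if running < requested:
            return requested
    return None
```

Each scale step would contribute one `(requested, running)` pair, with `running` taken from `count_running` once the pod count stops changing.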
@mpstefan added the tests and size/medium labels Jan 13, 2025
@mpstefan added this to the v2.0.0 milestone Jan 13, 2025
@mpstefan added the refined label Jan 13, 2025
@salonichf5 self-assigned this Feb 11, 2025
@salonichf5 moved this from 🆕 New to 🏗 In Progress in NGINX Gateway Fabric Feb 13, 2025
@salonichf5 removed their assignment Feb 13, 2025
@salonichf5 moved this from 🏗 In Progress to 🆕 New in NGINX Gateway Fabric Feb 13, 2025
@salonichf5
Contributor

Unassigning myself from this story until we figure out the rollback restart issue of pods during leader election.

@salonichf5 self-assigned this Feb 13, 2025
@salonichf5 moved this from 🆕 New to 🏗 In Progress in NGINX Gateway Fabric Feb 13, 2025
@salonichf5
Contributor

I have collected the results in this report.

General overview: with one control plane instance, I was able to scale to 700 NGINX instances without crashes. When I tried scaling to 1000 pods with 20 control plane instances, all NGINX pods terminated. Only 913 pods reached the Running state before they all crashed.

The report notes some concerning error messages related to deadlocks while acquiring leases.

Manual Scale Tests for Control Plane and Data Plane.docx

@github-project-automation (bot) moved this from 🏗 In Progress to ✅ Done in NGINX Gateway Fabric Feb 19, 2025
@sjberman
Collaborator

Found some issues with locks in the code. I am fixing them, which should help with scaling.

@sindhushiv
Collaborator

@sjberman, do we need to run the scale test again after the fixes are done?

@sjberman
Collaborator

@sindhushiv, already did this manually. Results look much better. No issues seen.

@salonichf5
Contributor

@sjberman @sindhushiv, let me know if you need any help from my side.

Projects
Status: Done
Development

No branches or pull requests

4 participants