Skip to content

🐛 Refactor certificate watcher to use polling, instead of fsnotify #3020

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 26, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 26 additions & 0 deletions pkg/certwatcher/certwatcher.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ import (
kerrors "k8s.io/apimachinery/pkg/util/errors"
"k8s.io/apimachinery/pkg/util/sets"
"k8s.io/apimachinery/pkg/util/wait"

"sigs.k8s.io/controller-runtime/pkg/certwatcher/metrics"
logf "sigs.k8s.io/controller-runtime/pkg/internal/log"
)
Expand Down Expand Up @@ -120,8 +121,31 @@ func (cw *CertWatcher) Start(ctx context.Context) error {
return cw.watcher.Close()
}

func (cw *CertWatcher) ensureAllFilesAreWatched() {
watchList := sets.New(cw.watcher.WatchList()...)
difference := sets.New(cw.certPath, cw.keyPath).Difference(watchList)
if difference.Len() == 0 {
return
}

for _, missingWatchPath := range difference.UnsortedList() {
log.V(1).Info("re-adding missing watch", "path", missingWatchPath)
if err := cw.watcher.Add(missingWatchPath); err != nil {
log.Error(err, "failed to add watch", "path", missingWatchPath)
return
}
}

log.V(1).Info("all files are watched again", "list", cw.watcher.WatchList())

if err := cw.ReadCertificate(); err != nil {
log.Error(err, "error re-reading certificate")
}
}

// Watch reads events from the watcher's channel and reacts to changes.
func (cw *CertWatcher) Watch() {
watcherHealthTimer := time.NewTicker(time.Second)
for {
select {
case event, ok := <-cw.watcher.Events:
Expand All @@ -139,6 +163,8 @@ func (cw *CertWatcher) Watch() {
}

log.Error(err, "certificate watch error")
case <-watcherHealthTimer.C:
cw.ensureAllFilesAreWatched()
Copy link
Member

@alvaroaleman alvaroaleman Nov 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Permanently polling once per second seems pretty aggressive. Can we not watch the dir intead if the file gets removed and re-add the filewatch once it exists again?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sidenote: wondering at this point if we should generally avoid relying of fs events, and rather say poll every N seconds and reload the certificates if the hash changed or something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also make the polling configurable, and default it to something reasonable where both old (loaded) and new certificates are supposed to be valid

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not watch the dir intead if the file gets removed and re-add the filewatch once it exists again?

It does not watch the dir, it just tries to reinstall the notify watch that fails if the file is not present.
So technically, it is the same as "waiting until the file exists again" as the first step of addWatch is os.Lstat(name).

The other option is to watch the directory events instead of file events and parse all events for the file names but it:

  1. Increases the number of events to parse and handle
  2. Does not solve the case when the directory would be removed and recreated again (or have mount issues, whatever).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sidenote: wondering at this point if we should generally avoid relying of fs events, and rather say poll every N seconds and reload the certificates if the hash changed or something?

It makes sense, and in some of our internal projects, we rely on this behaviour instead. The downside is that you need to find a balance between polling interval, certificate expiration time, and number of file reads in this case; so ideally this param should also be configurable across all usages of certwatcher (webhook server and metrics server at least)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the main point is with the aggressive 1 second polling. I like vinces idea of just re-reading the file every 10 seconds and caching in between, that is also what client-go does for the SA token

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestions.
@alvaroaleman @vincepri I updated the PR with full removal of fsnotify from certwatcher and having a simple read/cache/repeat ticker loop. Please check and leave your opinions.

}
}
}
Expand Down
32 changes: 30 additions & 2 deletions pkg/certwatcher/certwatcher_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/prometheus/client_golang/prometheus/testutil"

"sigs.k8s.io/controller-runtime/pkg/certwatcher"
"sigs.k8s.io/controller-runtime/pkg/certwatcher/metrics"
)
Expand Down Expand Up @@ -113,7 +114,7 @@ var _ = Describe("CertWatcher", func() {
Eventually(func() bool {
secondcert, _ := watcher.GetCertificate(nil)
first := firstcert.PrivateKey.(*rsa.PrivateKey)
return first.Equal(secondcert.PrivateKey)
return first.Equal(secondcert.PrivateKey) || firstcert.Leaf.SerialNumber == secondcert.Leaf.SerialNumber
}).ShouldNot(BeTrue())

ctxCancel()
Expand Down Expand Up @@ -143,14 +144,41 @@ var _ = Describe("CertWatcher", func() {
Eventually(func() bool {
secondcert, _ := watcher.GetCertificate(nil)
first := firstcert.PrivateKey.(*rsa.PrivateKey)
return first.Equal(secondcert.PrivateKey)
return first.Equal(secondcert.PrivateKey) || firstcert.Leaf.SerialNumber == secondcert.Leaf.SerialNumber
}).ShouldNot(BeTrue())

ctxCancel()
Eventually(doneCh, "4s").Should(BeClosed())
Expect(called.Load()).To(BeNumerically(">=", 1))
})

It("should reload currentCert after move out", func() {
doneCh := startWatcher()
called := atomic.Int64{}
watcher.RegisterCallback(func(crt tls.Certificate) {
called.Add(1)
Expect(crt.Certificate).ToNot(BeEmpty())
})

firstcert, _ := watcher.GetCertificate(nil)

Expect(os.Rename(certPath, certPath+".old")).To(Succeed())
Expect(os.Rename(keyPath, keyPath+".old")).To(Succeed())

err := writeCerts(certPath, keyPath, "192.168.0.3")
Expect(err).ToNot(HaveOccurred())

Eventually(func() bool {
secondcert, _ := watcher.GetCertificate(nil)
first := firstcert.PrivateKey.(*rsa.PrivateKey)
return first.Equal(secondcert.PrivateKey) || firstcert.Leaf.SerialNumber == secondcert.Leaf.SerialNumber
}, "10s", "1s").ShouldNot(BeTrue())

ctxCancel()
Eventually(doneCh, "4s").Should(BeClosed())
Expect(called.Load()).To(BeNumerically(">=", 1))
})

Context("prometheus metric read_certificate_total", func() {
var readCertificateTotalBefore float64
var readCertificateErrorsBefore float64
Expand Down