-
Notifications
You must be signed in to change notification settings - Fork 655
improvement health-checker #539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @smileusd. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@Random-Liu PTAL |
@abansal4032 Hey archit, can you take a look at this change? |
/assign @@abansal4032 |
/ok-to-test |
/test pull-npd-e2e-test |
config/health-checker-kubelet.json
Outdated
@@ -25,6 +25,7 @@ | |||
"--component=kubelet", | |||
"--enable-repair=true", | |||
"--cooldown-time=1m", | |||
"--loopback-time=5m", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe let the default config use the default loopback time to keep the same behavior as it was before this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default loopback time is zero, so if not this set. It is the same behavior as it was before. Here is a example to ues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set to 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
pkg/healthchecker/health_checker.go
Outdated
logPatternHealthy, err := logPatternHealthCheck(hc.systemdService, hc.logPatternsToCheck) | ||
uptime, err := hc.uptimeFunc() | ||
if err != nil { | ||
return false, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will return the service as unhealthy in case of parsing errors in uptime. Might lead to unnecessary restarts of the service. We can log the error here and return true as we are not sure if the service was unhealthy at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
name: machine-id | ||
readOnly: true | ||
- mountPath: /run/systemd/system | ||
name: systemd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were you able to run the deployment and get the health signals without errors using this configuration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this deployment has been ran in our productions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack.
/lgtm |
@abansal4032: changing LGTM is restricted to collaborators In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
pkg/healthchecker/health_checker.go
Outdated
@@ -82,7 +82,8 @@ func getUptimeFunc(service string) func() (time.Duration, error) { | |||
if len(val) < 2 { | |||
return time.Duration(0), errors.New("could not parse the service uptime time correctly") | |||
} | |||
t, err := time.Parse(types.UptimeTimeLayout, val[1]) | |||
timeEndIndex := strings.LastIndex(val[1], " ") | |||
t, err := time.Parse(types.UptimeTimeLayout, val[1][:timeEndIndex]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, if you run NPD in a container, it may not be in the same time zone with the host, in that case would this cause any problem?
What is the problem if we don't make this change? I don't quite get it after reading the PR description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The resoult for the cmd out is like this:
InactiveExitTimestamp=Wed 2021-04-14 01:11:39 -07
But types.UptimeTimeLayout is UptimeTimeLayout = "Mon 2006-01-02 15:04:05"
It is failed to parse time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UptimeTimeLayout
was "Mon 2006-01-02 15:04:05 UTC"
before your change.
And in your PR description, the systemctl output was:
[root@kubernetes-master-505-pkshx-4589367 software]# systemctl show kubelet --property=InactiveExitTimestamp
InactiveExitTimestamp=Wed 2021-03-10 06:05:51 PST
And on my node it is also:
$ systemctl show docker --property=InactiveExitTimestamp
InactiveExitTimestamp=Thu 2021-04-15 17:20:37 PDT
Why would the parsing fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rollback to old version and retry on my node, it is still wrong and reture error:
I0425 22:54:26.604844 18356 plugin.go:87] Add check result {Rule:0xc4200bd9d0 ExitStatus:2 Message:systemctl show: Sun 2021-04-25 11:22:50 PDT
error checking kubelet health: pars}
Then I code a test case like this:
import (
"fmt"
"time"
)
const UptimeTimeLayout = "Mon 2006-01-02 15:04:05 UTC"
func main() {
strs := []string{
"Wed 2021-03-10 03:39:13 PST",
"Sun 2021-04-25 11:22:50 PDT",
}
t, err := time.Parse(UptimeTimeLayout, strs[0])
if err != nil {
fmt.Println("error: ", err)
}
t, err = time.Parse(UptimeTimeLayout, strs[1])
if err != nil {
fmt.Println("error: ", err)
}
fmt.Println(t.String())
}
It is also wrong:
error: parsing time "Wed 2021-03-10 03:39:13 PST" as "Mon 2006-01-02 15:04:05 UTC": cannot parse "PST" as " UTC"
error: parsing time "Sun 2021-04-25 11:22:50 PDT" as "Mon 2006-01-02 15:04:05 UTC": cannot parse "PDT" as " UTC"
0001-01-01 00:00:00 +0000 UTC
Exiting.
I think Golang don't recognize UTC is a time zoom, but only a suffix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. In that case, we probably need a better fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
package main
import (
"fmt"
"time"
)
const UptimeTimeLayout = "Mon 2006-01-02 15:04:05 MST"
func main() {
strs := []string{
"Wed 2021-03-10 03:39:13 PST",
"Sun 2021-04-25 11:22:50 PDT",
}
t, err := time.Parse(UptimeTimeLayout, strs[0])
if err != nil {
fmt.Println("error: ", err)
}
fmt.Println(t.String())
t, err = time.Parse(UptimeTimeLayout, strs[1])
if err != nil {
fmt.Println("error: ", err)
}
fmt.Println(t.String())
}
Can you change it to MST?
When parsing a time with a zone abbreviation like MST, if the zone abbreviation has a defined offset in the current location, then that offset is used.
Hi @smileusd Can you rebase your change and fix the timestamp format? We plan to cut a release this week. :) |
Sorry to late see this. |
/test pull-npd-e2e-test |
@Random-Liu @abansal4032 PTAL |
@Random-Liu Any updates? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abansal4032, Random-Liu, smileusd The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/kind bug
add loopbacktime to reduce time of journalctl call
If use start time of kubelet as the journalctl, it could be take too much time to call.
I add loopkbacktime to set loop back time for journalctl, if not set or set = 0, it uses start time directly, else it use loopkbacktime.
fix timelayout for all timezones
Different nodes usally use different time zone:
[root@kubernetes-master-505-pkshx-4589367 software]# systemctl show kubelet --property=InactiveExitTimestamp InactiveExitTimestamp=Wed 2021-03-10 06:05:51 PST
I think npd doesn't need to care about the timezone if can ensure timezone is agreement with runtime/kubelet and journald.
this point has been fixed by #558
Many users reflect that can not use health-checker because use the wrong specs: !439, !540, !482. I thinks this commit can fix these issues.