Skip to content

improvement health-checker #539

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 25, 2021
Merged

Conversation

smileusd
Copy link
Contributor

@smileusd smileusd commented Mar 10, 2021

/kind bug

  1. add loopbacktime to reduce time of journalctl call
    If use start time of kubelet as the journalctl, it could be take too much time to call.
    I add loopkbacktime to set loop back time for journalctl, if not set or set = 0, it uses start time directly, else it use loopkbacktime.

  2. fix timelayout for all timezones
    Different nodes usally use different time zone:
    [root@kubernetes-master-505-pkshx-4589367 software]# systemctl show kubelet --property=InactiveExitTimestamp InactiveExitTimestamp=Wed 2021-03-10 06:05:51 PST
    I think npd doesn't need to care about the timezone if can ensure timezone is agreement with runtime/kubelet and journald.


this point has been fixed by #558

  1. add npd health-checker deploy file
    Many users reflect that can not use health-checker because use the wrong specs: !439, !540, !482. I thinks this commit can fix these issues.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 10, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @smileusd. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 10, 2021
@smileusd smileusd changed the title add loopbacktime to reduce time of journalctl call improvement health-checker Mar 11, 2021
@smileusd
Copy link
Contributor Author

@Random-Liu PTAL

@Random-Liu
Copy link
Member

@abansal4032 Hey archit, can you take a look at this change?

@Random-Liu Random-Liu self-assigned this Mar 13, 2021
@Random-Liu
Copy link
Member

/assign @@abansal4032

@Random-Liu
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 13, 2021
@smileusd
Copy link
Contributor Author

/test pull-npd-e2e-test

@@ -25,6 +25,7 @@
"--component=kubelet",
"--enable-repair=true",
"--cooldown-time=1m",
"--loopback-time=5m",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe let the default config use the default loopback time to keep the same behavior as it was before this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default loopback time is zero, so if not this set. It is the same behavior as it was before. Here is a example to ues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set to 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

logPatternHealthy, err := logPatternHealthCheck(hc.systemdService, hc.logPatternsToCheck)
uptime, err := hc.uptimeFunc()
if err != nil {
return false, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will return the service as unhealthy in case of parsing errors in uptime. Might lead to unnecessary restarts of the service. We can log the error here and return true as we are not sure if the service was unhealthy at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

name: machine-id
readOnly: true
- mountPath: /run/systemd/system
name: systemd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you able to run the deployment and get the health signals without errors using this configuration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this deployment has been ran in our productions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack.

@abansal4032
Copy link
Contributor

/lgtm

@k8s-ci-robot
Copy link
Contributor

@abansal4032: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -82,7 +82,8 @@ func getUptimeFunc(service string) func() (time.Duration, error) {
if len(val) < 2 {
return time.Duration(0), errors.New("could not parse the service uptime time correctly")
}
t, err := time.Parse(types.UptimeTimeLayout, val[1])
timeEndIndex := strings.LastIndex(val[1], " ")
t, err := time.Parse(types.UptimeTimeLayout, val[1][:timeEndIndex])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, if you run NPD in a container, it may not be in the same time zone with the host, in that case would this cause any problem?

What is the problem if we don't make this change? I don't quite get it after reading the PR description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resoult for the cmd out is like this:

InactiveExitTimestamp=Wed 2021-04-14 01:11:39 -07

But types.UptimeTimeLayout is UptimeTimeLayout = "Mon 2006-01-02 15:04:05"
It is failed to parse time

Copy link
Member

@Random-Liu Random-Liu Apr 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UptimeTimeLayout was "Mon 2006-01-02 15:04:05 UTC" before your change.

And in your PR description, the systemctl output was:

[root@kubernetes-master-505-pkshx-4589367 software]# systemctl show kubelet --property=InactiveExitTimestamp
InactiveExitTimestamp=Wed 2021-03-10 06:05:51 PST

And on my node it is also:

$ systemctl show docker --property=InactiveExitTimestamp
InactiveExitTimestamp=Thu 2021-04-15 17:20:37 PDT

Why would the parsing fail?

Copy link
Contributor Author

@smileusd smileusd Apr 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rollback to old version and retry on my node, it is still wrong and reture error:

I0425 22:54:26.604844   18356 plugin.go:87] Add check result {Rule:0xc4200bd9d0 ExitStatus:2 Message:systemctl show: Sun 2021-04-25 11:22:50 PDT
error checking kubelet health: pars}

Then I code a test case like this:


import (
    "fmt"
    "time"
)

const UptimeTimeLayout = "Mon 2006-01-02 15:04:05 UTC"

func main() {
    strs := []string{
        "Wed 2021-03-10 03:39:13 PST",
        "Sun 2021-04-25 11:22:50 PDT",
    }

    t, err := time.Parse(UptimeTimeLayout, strs[0])
    if err != nil {
        fmt.Println("error: ", err)
    }
    t, err = time.Parse(UptimeTimeLayout, strs[1])
    if err != nil {
        fmt.Println("error: ", err)
    }
    fmt.Println(t.String())
}

It is also wrong:

error:  parsing time "Wed 2021-03-10 03:39:13 PST" as "Mon 2006-01-02 15:04:05 UTC": cannot parse "PST" as " UTC"
error:  parsing time "Sun 2021-04-25 11:22:50 PDT" as "Mon 2006-01-02 15:04:05 UTC": cannot parse "PDT" as " UTC"
0001-01-01 00:00:00 +0000 UTC
Exiting.

I think Golang don't recognize UTC is a time zoom, but only a suffix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. In that case, we probably need a better fix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

package main
import (
    "fmt"
    "time"
)

const UptimeTimeLayout = "Mon 2006-01-02 15:04:05 MST"

func main() {
    strs := []string{
        "Wed 2021-03-10 03:39:13 PST",
        "Sun 2021-04-25 11:22:50 PDT",
    }

    t, err := time.Parse(UptimeTimeLayout, strs[0])
    if err != nil {
        fmt.Println("error: ", err)
    }
    fmt.Println(t.String())
    t, err = time.Parse(UptimeTimeLayout, strs[1])
    if err != nil {
        fmt.Println("error: ", err)
    }
    fmt.Println(t.String())
}

Can you change it to MST?

When parsing a time with a zone abbreviation like MST, if the zone abbreviation has a defined offset in the current location, then that offset is used.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2021
@Random-Liu
Copy link
Member

Hi @smileusd Can you rebase your change and fix the timestamp format?

We plan to cut a release this week. :)

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 19, 2021
@smileusd
Copy link
Contributor Author

Hi @smileusd Can you rebase your change and fix the timestamp format?

We plan to cut a release this week. :)

Sorry to late see this.
Done

@smileusd
Copy link
Contributor Author

/test pull-npd-e2e-test

@smileusd
Copy link
Contributor Author

@Random-Liu @abansal4032 PTAL

@smileusd smileusd requested a review from Random-Liu May 27, 2021 02:59
@smileusd
Copy link
Contributor Author

@Random-Liu Any updates?

Copy link
Member

@Random-Liu Random-Liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 25, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abansal4032, Random-Liu, smileusd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 25, 2021
@k8s-ci-robot k8s-ci-robot merged commit e349323 into kubernetes:master Jun 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants