
Reason change once and it stay in wrong state with custom plugins #202


Closed
aga20 opened this issue Sep 5, 2018 · 17 comments · Fixed by #320

Comments

@aga20

aga20 commented Sep 5, 2018

Hello,

I think I found something again, but maybe I went about it the wrong way.
I'm using a custom config based on the ntp example:

{
    "plugin": "custom",
    "pluginConfig": {
        "invoke_interval": "30s",
        "timeout": "5s",
        "max_output_length": 80,
        "concurrency": 3
    },
    "source": "ntp-custom-plugin-monitor",
    "conditions": [
        {
            "type": "CustomProblem",
            "reason": "CustomIsUp",
            "message": "Status of the custom service"
        }
    ],
    "rules": [
        {
            "type": "permanent",
            "condition": "CustomProblem",
            "reason": "CustomIsDown",
            "path": "/usr/bin/custom.sh",
            "timeout": "3s"
        }
    ]
}

The /usr/bin/custom.sh script is very simple: exit with 0 or 1.

So when the node problem detector starts, it sets the condition:

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  CustomProblem    False   Wed, 05 Sep 2018 14:40:26 +0200   Wed, 05 Sep 2018 14:40:25 +0200   CustomIsUp                   Status of the custom service
  OutOfDisk        False   Wed, 05 Sep 2018 14:40:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Wed, 05 Sep 2018 14:40:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 05 Sep 2018 14:40:23 +0200   Wed, 05 Sep 2018 13:17:06 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 05 Sep 2018 14:40:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 05 Sep 2018 14:40:23 +0200   Thu, 30 Aug 2018 17:35:04 +0200   KubeletReady                 kubelet is posting ready status

After it ran the script (which returned 0 in this case), the Status stayed False, but the Reason field changed to what I set in the rules section:

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  CustomProblem    False   Wed, 05 Sep 2018 14:41:56 +0200   Wed, 05 Sep 2018 14:40:55 +0200   CustomIsDown                 Status of the custom service
  OutOfDisk        False   Wed, 05 Sep 2018 14:42:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Wed, 05 Sep 2018 14:42:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 05 Sep 2018 14:42:23 +0200   Wed, 05 Sep 2018 13:17:06 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 05 Sep 2018 14:42:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 05 Sep 2018 14:42:23 +0200   Thu, 30 Aug 2018 17:35:04 +0200   KubeletReady                 kubelet is posting ready status

OK, so in the next run the script exited with 1. The Status is True, and the Reason is still the same (this is what I set under the rule):

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  CustomProblem    True    Wed, 05 Sep 2018 14:43:56 +0200   Wed, 05 Sep 2018 14:43:55 +0200   CustomIsDown                 Status of the custom service
  OutOfDisk        False   Wed, 05 Sep 2018 14:43:53 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Wed, 05 Sep 2018 14:43:53 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 05 Sep 2018 14:43:53 +0200   Wed, 05 Sep 2018 13:17:06 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 05 Sep 2018 14:43:53 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 05 Sep 2018 14:43:53 +0200   Thu, 30 Aug 2018 17:35:04 +0200   KubeletReady                 kubelet is posting ready status

In the next round the script returned 0 again and the Status changed back to False, but the Reason didn't change:

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  CustomProblem    False   Wed, 05 Sep 2018 14:44:26 +0200   Wed, 05 Sep 2018 14:44:25 +0200   CustomIsDown                 Status of the custom service
  OutOfDisk        False   Wed, 05 Sep 2018 14:44:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Wed, 05 Sep 2018 14:44:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 05 Sep 2018 14:44:23 +0200   Wed, 05 Sep 2018 13:17:06 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 05 Sep 2018 14:44:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 05 Sep 2018 14:44:23 +0200   Thu, 30 Aug 2018 17:35:04 +0200   KubeletReady                 kubelet is posting ready status

As far as I can see, you overwrite the condition's reason, so the original condition is lost and the node problem detector can never set it back again: https://github.com/kubernetes/node-problem-detector/blob/master/pkg/custompluginmonitor/custom_plugin_monitor.go#L140
But maybe I missed something.
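For what it's worth, the behavior I expected can be sketched like this. This is a minimal illustration, not the actual NPD code, and `nextReason` is a hypothetical helper: an exit code of 0 should restore the default reason from the `conditions` section, and only an exit code of 1 should apply the rule's reason.

```go
package main

import "fmt"

// nextReason is a hypothetical sketch of the expected reason handling
// (not the actual NPD implementation): a healthy check (exit 0) falls
// back to the default reason from the "conditions" section, and only a
// failing check (exit 1) applies the rule's reason.
func nextReason(exitCode int, defaultReason, ruleReason string) string {
	if exitCode == 0 {
		return defaultReason // healthy: keep/restore e.g. CustomIsUp
	}
	return ruleReason // problem detected: e.g. CustomIsDown
}

func main() {
	fmt.Println(nextReason(0, "CustomIsUp", "CustomIsDown")) // CustomIsUp
	fmt.Println(nextReason(1, "CustomIsUp", "CustomIsDown")) // CustomIsDown
	fmt.Println(nextReason(0, "CustomIsUp", "CustomIsDown")) // back to CustomIsUp
}
```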

Thank you!

@aga20 aga20 changed the title Reason changed once and it stay in wrong state with custom plugins Reason change once and it stay in wrong state with custom plugins Sep 5, 2018
@aga20
Author

aga20 commented Sep 6, 2018

@andyxning could you check this too? Thanks

@andyxning
Member

Thanks for the detailed reproduction steps. It seems some logic in updating the condition is wrong. Will check this later today.

@andyxning
Member

Filed PR #203 to fix this. @aga20 PTAL.

@aga20
Author

aga20 commented Sep 17, 2018

Thank you! @andyxning, do you know when you can merge this PR? Also, when do you plan to release a new version?
Thanks for your help!

@andyxning
Member

@aga20 Too busy these days. Will do this later this week. After the PR is merged, I will ping @Random-Liu to do a new release, as we have some important bug fixes since the latest release.

@aga20
Author

aga20 commented Oct 1, 2018

@andyxning
Just FYI: I tried your branch from #203, but the message was strange. At first I saw the script output (the condition was False and my script's output was "Every partition is ok!"). Then I filled the partition that my custom plugin checks: the Status became True, but the message wasn't the script output, just the condition's default message ("All partition is ok"). After the next run of the custom plugin the message was updated with the script output.
And sorry, but could you say when my two issues will be fixed? In our next release we want to use NPD with custom plugins if possible.
Thank you!

@andyxning
Member

@aga20 Will take a look at this later.

@byxorna

byxorna commented Oct 25, 2018

I've been running into the same issue. Any progress on this? This makes integrating NPD into external tools more difficult.

@andyxning
Member

I will update this later today. Sorry for the delay.

@andyxning
Member

@aga20 Sorry for taking so long to update this. Too busy.

I have updated the PR in #203 to address the comments @Random-Liu made. Could you please give it a try and leave feedback here?

@aga20
Author

aga20 commented Nov 19, 2018

Hello,
@andyxning Sorry, busy days here. It seems OK to me. Thank you!

When do you plan to merge this and release a new version?

@MaksymTrykur

@andyxning It seems like there is still an issue when the status transitions from False -> Unknown and back.

When the status is False, shouldn't the reason always be defaultConditionReason, meaning the normal state?

The current implementation will set a faulty reason on a condition with an OK result; see change 3 and change 4.

https://github.com/kubernetes/node-problem-detector/blob/v0.6.2/pkg/custompluginmonitor/custom_plugin_monitor.go#L129

if condition.Status == types.True && status != types.True {
  // change 1: Condition status change from True to False/Unknown

  condition.Reason = defaultConditionReason
} else if condition.Status != types.True && status == types.True {
  // change 2: Condition status change from False/Unknown to True

  condition.Reason = result.Rule.Reason
} else if condition.Status != status {
  // change 3: Condition status change from False to Unknown or vice versa

  condition.Reason = result.Rule.Reason
} else if condition.Status == status &&
  (condition.Reason != result.Rule.Reason ||
    (*c.config.PluginGlobalConfig.EnableMessageChangeBasedConditionUpdate && condition.Message != result.Message)) {
  // change 4: Condition status do not change.
  // condition reason changes or
  // condition message changes when message based condition update is enabled.

  condition.Reason = result.Rule.Reason
}

The Status is False, the message is OK, and there is no problem at this moment, but the reason shows NTPIsDown.

    - lastHeartbeatTime: "2019-04-04T14:31:29Z"
      lastTransitionTime: "2019-04-04T12:59:58Z"
      message: 'OK: time is synchronized'
      reason: NTPIsDown
      status: "False"
      type: NTPProblem
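The four branches quoted above can be replayed with a small standalone sketch (not the actual NPD source; the `defaultConditionReason` value and the `update` helper are made up for illustration). A False -> Unknown -> False sequence goes through change 3 twice, so the rule's faulty reason sticks to a healthy condition:

```go
package main

import "fmt"

// Condition mirrors just the fields relevant to the bug.
type Condition struct {
	Status string // "True", "False" or "Unknown"
	Reason string
}

const defaultConditionReason = "NTPIsHealthy" // hypothetical default from the config

// update replays the four branches quoted from custom_plugin_monitor.go
// (message-based updates omitted for brevity).
func update(c Condition, newStatus, ruleReason string) Condition {
	switch {
	case c.Status == "True" && newStatus != "True":
		c.Reason = defaultConditionReason // change 1: True -> False/Unknown
	case c.Status != "True" && newStatus == "True":
		c.Reason = ruleReason // change 2: False/Unknown -> True
	case c.Status != newStatus:
		c.Reason = ruleReason // change 3: False <-> Unknown also takes the rule's reason
	case c.Reason != ruleReason:
		c.Reason = ruleReason // change 4: same status, reason differs
	}
	c.Status = newStatus
	return c
}

func main() {
	c := Condition{Status: "False", Reason: defaultConditionReason}
	c = update(c, "Unknown", "NTPIsDown") // change 3: reason becomes NTPIsDown
	c = update(c, "False", "NTPIsDown")   // change 3 again: reason stays NTPIsDown
	fmt.Printf("status=%s reason=%s\n", c.Status, c.Reason)
	// status is False (healthy) yet the reason is still NTPIsDown
}
```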

@AlexShemeshWix

I have the same issue on 0.6.3.
Hi, I've defined a custom plugin like this:

{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s",
    "max_output_length": 80,
    "concurrency": 3,
    "enable_message_change_based_condition_update": true
  },
  "source": "file-exists",
  "conditions": [
    {
      "type": "CustomProblem",
      "reason": "NoFailures",
      "message": "No failures found"
    }
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "CustomProblem",
      "path": "/srv/scripts/test_2.sh"
    },
    {
      "type": "permanent",
      "condition": "CustomProblem",
      "reason": "FailureDetected",
      "path": "/srv/scripts/test_2.sh"
    }
  ]
}

test_2.sh looks like this:

#!/bin/bash
FILE=/tmp/kube.fail

if [ -f "$FILE" ]; then
    echo "Warning. $FILE exist"
    exit 1
else
    echo "Failure file not found"
    exit 0
fi

The idea is to touch /tmp/kube.fail to test condition monitoring.
It sort of works, but looks wrong.
This is what my node conditions look like:

  DockerProblem        False     Mon, 10 Jun 2019 16:00:52 +0300   Mon, 10 Jun 2019 14:49:08 +0300   DockerUp                     Docker service is ok
  CustomProblem        False     Wed, 12 Jun 2019 17:15:17 +0300   Wed, 12 Jun 2019 17:12:14 +0300   FailureDetected              Failure file not found
  NTPProblem           Unknown   Wed, 12 Jun 2019 17:15:17 +0300   Wed, 12 Jun 2019 17:12:14 +0300   NTPIsDown                    Systemd is not supported
  KernelDeadlock       False     Wed, 12 Jun 2019 17:15:17 +0300   Wed, 12 Jun 2019 17:11:44 +0300   KernelHasNoDeadlock          kernel has no deadlock

The CustomProblem condition has the proper Status, and the message says the file was not found, but the Reason is FailureDetected.

@wangzhen127
Member

@andyxning and @Random-Liu, can you take a look?

What @MaksymTrykur described in #202 (comment) seems correct.

What I don't understand is that @AlexShemeshWix's example should never enter change 3 or change 4, yet he got something unexpected.

@wangzhen127
Member

I found the bug. Will put up a fix PR shortly.

@wangzhen127
Member

The fix is now included in NPD v0.6.6. @AlexShemeshWix and @MaksymTrykur, can you double check if your problem is fixed?

@MaksymTrykur

Fixed for me in v0.6.6:

Normal NodeIsHeatlhy 9m59s (x8 over 26m) general-monitor, ua-core01 Node condition NodeProblem is now: False, reason: NodeIsHeatlhy

vs before

Normal NodeIsUnhealthy 15m (x57 over 23h) general-monitor, ua-core01 Node condition NodeProblem is now: False, reason: NodeIsUnhealthy
