Skip to content

"startPattern" is fragile and wrong on newer kernels #48

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
euank opened this issue Dec 8, 2016 · 5 comments · Fixed by #81
Closed

"startPattern" is fragile and wrong on newer kernels #48

euank opened this issue Dec 8, 2016 · 5 comments · Fixed by #81

Comments

@euank
Copy link
Contributor

euank commented Dec 8, 2016

Broken out from here

Currently, the config has a default of "startPattern": "Initializing cgroup subsys cpuset",

This pattern is meant to detect a node's boot process. Prior to the 4.5 kernel, this message was typically printed during boot of a node. After 4.5 however, due to this change, it is quite unlikely for that message to appear.

Furthermore, there's rarely a reason to detect whether a message is for the current boot in such a fragile way.

With the kern.log reader, every message is for the current boot because kern.log is usually handled where each kern.log file corresponds to one boot (e.g. kern.log is this boot, kern.log.1 is the boot before, kern.log.2.gz the one before, etc). (EDIT: I'm wrong about this for gci at least)

With journald, the boot id is annotated in messages, and so it can accurately be correlated with the current boot id (see the "_BOOT_ID" record in journald messages).

With a kmsg reader, all messages will only be the current boot because kmsg is not persistent.

In none of those cases is startPattern useful. Each kernel log parsing plugin should be responsible for doing the right thing itself I think.

@euank euank mentioned this issue Dec 8, 2016
@Random-Liu
Copy link
Member

Random-Liu commented Dec 8, 2016

With the kern.log reader, every message is for the current boot because kern.log is usually handled where each kern.log file corresponds to one boot (e.g. kern.log is this boot, kern.log.1 is the boot before, kern.log.2.gz the one before, etc).

I'm not sure with this. At least, I usually see kern.log contains multiple reboots. :)

In Kubernetes cluster, we use logrotate to rotate all logs under /var/log.
https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/saltbase/salt/logrotate/conf

Is it smart enough to ignore kern.log?

Anyhow, if each reboot, kern.log will be rotated once, then it should work. Let me check it. :)

@euank
Copy link
Contributor Author

euank commented Dec 8, 2016

Even if it's been copy-trunc rotated, it still only applies to the current boot, so it doesn't matter.

But that salt template is not applied to kern.log, only this list of files: https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/saltbase/salt/logrotate/init.sls#L5

GCI however doesn't use that salt stuff and has its own logrotate chunk which does catch kern.log here: https://github.com/kubernetes/kubernetes/blob/dfe801de1021a005a7c43742c9357ef05dac0f0d/cluster/gce/gci/configure-helper.sh#L113-L136

I don't see why that changes anything though so long as it's true that rsyslog is also doing the typical rotation for kern.log.

@Random-Liu
Copy link
Member

But that salt template is not applied to kern.log, only this list of files.

Yeah, that's right. I missed that.

With the kern.log reader, every message is for the current boot because kern.log is usually handled where each kern.log file corresponds to one boot (e.g. kern.log is this boot, kern.log.1 is the boot before, kern.log.2.gz the one before, etc).

I'm still not sure with this. I just rebooted one of my VMs, here is the kernel log: https://gist.github.com/Random-Liu/e77348945d4482c5f3fda034a3da7f90

@euank
Copy link
Contributor Author

euank commented Dec 8, 2016

Ah, well, that's unfortunate. I thought that was the default config on ubuntu and debian, but it looks like I'm wrong.

That being said, ratherthan using a startPattern, the logic of detecting that break should still be internal to the syslog parser IMO.

Perhaps the better heuristic here is to notice the "time since boot" at the beginning going backwards.

@Random-Liu
Copy link
Member

Random-Liu commented Dec 8, 2016

When the NPD is introduced, we are still running on ContainerVM. At that time, we only have kern.log, and it's hard to a identify a reboot without defining a StartPattern.

Initially I was using the kernel timestamp [ 0.000000], but changed to the "start pattern" later.

Let me see whether I could find some relation between boot id and kern.log. I think using boot id is the best solution, and journald is so convenient. Haha

@Random-Liu Random-Liu added the bug label Dec 15, 2016
Random-Liu added a commit to Random-Liu/node-problem-detector that referenced this issue Jan 21, 2017
* Remove `unregister_netdevice` rule to fix kubernetes#47.
* Change `KernelPanic` to `KernelOops` because we can't handle kernel
panic currently.
* Use system boot time instead of "StartPattern" to fix kubernetes#48.
Random-Liu added a commit to Random-Liu/node-problem-detector that referenced this issue Feb 1, 2017
* Remove `unregister_netdevice` rule to fix kubernetes#47.
* Change `KernelPanic` to `KernelOops` because we can't handle kernel
panic currently.
* Use system boot time instead of "StartPattern" to fix kubernetes#48.
Random-Liu added a commit to Random-Liu/node-problem-detector that referenced this issue Feb 7, 2017
* Remove `unregister_netdevice` rule to fix kubernetes#47.
* Change `KernelPanic` to `KernelOops` because we can't handle kernel
panic currently.
* Use system boot time instead of "StartPattern" to fix kubernetes#48.
Random-Liu added a commit to Random-Liu/node-problem-detector that referenced this issue Feb 8, 2017
* Remove `unregister_netdevice` rule to fix kubernetes#47.
* Change `KernelPanic` to `KernelOops` because we can't handle kernel
panic currently.
* Use system boot time instead of "StartPattern" to fix kubernetes#48.
Random-Liu added a commit to Random-Liu/node-problem-detector that referenced this issue Feb 10, 2017
* Remove `unregister_netdevice` rule to fix kubernetes#47.
* Change `KernelPanic` to `KernelOops` because we can't handle kernel
panic currently.
* Use system boot time instead of "StartPattern" to fix kubernetes#48.
Random-Liu added a commit to Random-Liu/node-problem-detector that referenced this issue Feb 10, 2017
* Change `unregister_netdevice` to be an event to fix kubernetes#47.
* Change `KernelPanic` to `KernelOops` because we can't handle kernel
panic currently.
* Use system boot time instead of "StartPattern" to fix kubernetes#48.
Random-Liu added a commit to Random-Liu/node-problem-detector that referenced this issue Feb 10, 2017
* Change `unregister_netdevice` to be an event to fix kubernetes#47.
* Change `KernelPanic` to `KernelOops` because we can't handle kernel
panic currently.
* Use system boot time instead of "StartPattern" to fix kubernetes#48.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants