"startPattern" is fragile and wrong on newer kernels #48

euank · 2016-12-08T23:17:08Z

Broken out from here

Currently, the config has a default of "startPattern": "Initializing cgroup subsys cpuset",

This pattern is meant to detect a node's boot process. Prior to the 4.5 kernel, this message was typically printed during boot of a node. After 4.5 however, due to this change, it is quite unlikely for that message to appear.

Furthermore, there's rarely a reason to detect whether a message is for the current boot in such a fragile way.

With the kern.log reader, every message is for the current boot because kern.log is usually handled where each kern.log file corresponds to one boot (e.g. kern.log is this boot, kern.log.1 is the boot before, kern.log.2.gz the one before, etc). (EDIT: I'm wrong about this for gci at least)

With journald, the boot id is annotated in messages, and so it can accurately be correlated with the current boot id (see the "_BOOT_ID" record in journald messages).
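
A minimal sketch of that correlation, assuming the journal is read as one JSON object per line (the format `journalctl -o json` emits) and that the kernel exposes the current boot id at `/proc/sys/kernel/random/boot_id`. The helper names are hypothetical and not part of NPD. The only subtlety is that the proc file prints the id as a dashed UUID while `_BOOT_ID` is plain hex, so both are normalized before comparing.

```python
import json
from pathlib import Path

# Assumption: standard Linux location of the current boot id (dashed UUID).
BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")


def normalize_boot_id(raw: str) -> str:
    """Strip whitespace and dashes so the proc and journald formats compare equal."""
    return raw.strip().replace("-", "").lower()


def current_boot_id(path: Path = BOOT_ID_PATH) -> str:
    """Read the boot id of the running system."""
    return normalize_boot_id(path.read_text())


def entries_for_boot(json_lines, boot_id):
    """Yield journal entries (dicts) whose _BOOT_ID matches boot_id."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        entry = json.loads(line)
        if normalize_boot_id(entry.get("_BOOT_ID", "")) == boot_id:
            yield entry
```

In practice `journalctl -k -b` already restricts output to kernel messages from the current boot, so a journald-based reader may not need to filter by hand at all.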

With a kmsg reader, all messages are from the current boot because kmsg is not persistent.

In none of those cases is startPattern useful. Each kernel log parsing plugin should be responsible for doing the right thing itself, I think.


Random-Liu · 2016-12-08T23:25:42Z

> With the kern.log reader, every message is for the current boot because kern.log is usually handled where each kern.log file corresponds to one boot (e.g. kern.log is this boot, kern.log.1 is the boot before, kern.log.2.gz the one before, etc).

I'm not sure about this. At least, I usually see a single kern.log spanning multiple reboots. :)

In a Kubernetes cluster, we use logrotate to rotate all logs under /var/log.
https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/saltbase/salt/logrotate/conf

Is it smart enough to ignore kern.log?

Anyhow, if kern.log is rotated once on each reboot, then it should work. Let me check it. :)

euank · 2016-12-08T23:29:31Z

Even if it's been copy-trunc rotated, it still only applies to the current boot, so it doesn't matter.

But that salt template is not applied to kern.log, only this list of files: https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/saltbase/salt/logrotate/init.sls#L5

GCI however doesn't use that salt stuff and has its own logrotate chunk which does catch kern.log here: https://github.com/kubernetes/kubernetes/blob/dfe801de1021a005a7c43742c9357ef05dac0f0d/cluster/gce/gci/configure-helper.sh#L113-L136

I don't see why that changes anything though so long as it's true that rsyslog is also doing the typical rotation for kern.log.

Random-Liu · 2016-12-08T23:39:39Z

> But that salt template is not applied to kern.log, only this list of files.

Yeah, that's right. I missed that.

> With the kern.log reader, every message is for the current boot because kern.log is usually handled where each kern.log file corresponds to one boot (e.g. kern.log is this boot, kern.log.1 is the boot before, kern.log.2.gz the one before, etc).

I'm still not sure about this. I just rebooted one of my VMs, and here is the kernel log: https://gist.github.com/Random-Liu/e77348945d4482c5f3fda034a3da7f90

euank · 2016-12-08T23:44:43Z

Ah, well, that's unfortunate. I thought that was the default config on ubuntu and debian, but it looks like I'm wrong.

That being said, rather than using a startPattern, the logic for detecting that break should still be internal to the syslog parser IMO.

Perhaps the better heuristic here is to notice the "time since boot" at the beginning going backwards.
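
A rough sketch of that heuristic, assuming each kern.log line carries the kernel's bracketed seconds-since-boot prefix such as `[    0.000000]`. The function name is hypothetical. It scans the file once and restarts the "current boot" window every time the timestamp drops below the previous one.

```python
import re

# Matches the kernel's "[seconds.microseconds]" prefix, e.g. "[    0.000000]".
KERNEL_TS = re.compile(r"\[\s*(\d+\.\d+)\]")


def current_boot_lines(lines):
    """Return the lines from the last point where the kernel timestamp went backwards."""
    start = 0
    previous = None
    for index, line in enumerate(lines):
        match = KERNEL_TS.search(line)
        if match is None:
            continue  # lines without a kernel timestamp cannot signal a reboot
        stamp = float(match.group(1))
        if previous is not None and stamp < previous:
            start = index  # time since boot decreased: a new boot starts here
        previous = stamp
    return lines[start:]
```

This treats any decrease as a reboot. A real parser would probably want some tolerance, since kernel timestamps from different CPUs may not be perfectly ordered.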

Random-Liu · 2016-12-08T23:45:20Z

When NPD was introduced, we were still running on ContainerVM. At that time, we only had kern.log, and it was hard to identify a reboot without defining a StartPattern.

Initially I was using the kernel timestamp [ 0.000000], but changed to the "start pattern" later.

Let me see whether I could find some relation between boot id and kern.log. I think using boot id is the best solution, and journald is so convenient. Haha

* Remove `unregister_netdevice` rule to fix kubernetes#47.
* Change `KernelPanic` to `KernelOops` because we can't handle kernel panic currently.
* Use system boot time instead of "StartPattern" to fix kubernetes#48.

* Change `unregister_netdevice` to be an event to fix kubernetes#47.
* Change `KernelPanic` to `KernelOops` because we can't handle kernel panic currently.
* Use system boot time instead of "StartPattern" to fix kubernetes#48.
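
For illustration only, the "system boot time" idea from the commit message might look roughly like the sketch below. It is not taken from the actual change; the function names are hypothetical, and it assumes the first field of `/proc/uptime` (seconds since boot) is available and that each log message has already been parsed into a `(timestamp, text)` pair.

```python
from datetime import datetime, timedelta, timezone


def parse_uptime(proc_uptime_text: str) -> float:
    """Return seconds since boot, the first field of /proc/uptime."""
    return float(proc_uptime_text.split()[0])


def boot_time(now: datetime, uptime_seconds: float) -> datetime:
    """Derive the wall-clock boot time from the current time and the uptime."""
    return now - timedelta(seconds=uptime_seconds)


def since_boot(messages, booted_at: datetime):
    """Keep (timestamp, text) pairs logged at or after booted_at."""
    return [(ts, text) for ts, text in messages if ts >= booted_at]
```

On a live node this would be called as `boot_time(datetime.now(timezone.utc), parse_uptime(open("/proc/uptime").read()))`. Unlike a start pattern, it does not depend on any particular kernel message being printed during boot.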

euank mentioned this issue Dec 8, 2016

Journald support #39

Merged

Random-Liu added the bug label Dec 15, 2016

Random-Liu mentioned this issue Jan 21, 2017

Fix kernel monitor issues #81

Merged

dchen1107 closed this as completed in #81 Feb 10, 2017
