Fix kernel monitor issues #81
@Random-Liu #39 was merged. Can you rebase this one so that I can review it?
7a55ab7 to 4885f07
@dchen1107 Rebased.
👍 PTAL, we just had a few heart attacks after seeing this show up in our node descriptions, thinking there was some sort of serious problem. Curiously, it seems to have occurred when our nodes ran out of space: every node that ran out of inodes (using overlay2) showed this issue, and the nodes that didn't run out did not.
Initially this was added for a serious kernel issue (moby/moby#5618, coreos/bugs#254). However, on newer kernels it doesn't always seem to be a serious issue. This PR changes the problem to an event. After this PR, if you see tons of events from a node, you may want to log in and take a look. :) If docker really hangs, the hung task detection will detect it and set the node condition: https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json#L45.
Hm, that's an interesting observation. In fact, I don't really know what happens behind the scenes with the "unregister_netdevice" bug. :( According to the discussion in coreos/bugs#254, the upstream kernel fix may be relevant: http://www.spinics.net/lists/netdev/msg351337.html
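To make the event versus node condition distinction above concrete, here is a minimal Go sketch. It is not the node-problem-detector implementation: the `Rule`, `Report`, `match`, and `exampleRules` names, the `DockerHung` and `UnregisterNetDevice` reason strings, and both regular expressions are assumptions made for illustration only. The idea it shows is that a temporary rule only yields an event, while a permanent rule sets a condition such as `KernelDeadlock`.

```go
package main

import (
	"fmt"
	"regexp"
)

// RuleType distinguishes problems that are only reported as events
// (temporary) from problems that change a node condition (permanent).
// All names in this sketch are hypothetical.
type RuleType string

const (
	Temporary RuleType = "temporary"
	Permanent RuleType = "permanent"
)

// Rule pairs a kernel log pattern with how a match should be reported.
type Rule struct {
	Type      RuleType
	Condition string // only meaningful for permanent rules
	Reason    string
	Pattern   *regexp.Regexp
}

// Report describes what a matching log line would produce.
type Report struct {
	Kind      string // "event" or "condition"
	Condition string
	Reason    string
}

// match returns the report for the first rule whose pattern matches line.
func match(rules []Rule, line string) (Report, bool) {
	for _, r := range rules {
		if !r.Pattern.MatchString(line) {
			continue
		}
		if r.Type == Permanent {
			return Report{Kind: "condition", Condition: r.Condition, Reason: r.Reason}, true
		}
		return Report{Kind: "event", Reason: r.Reason}, true
	}
	return Report{}, false
}

// exampleRules holds illustrative rules; the patterns are assumptions.
func exampleRules() []Rule {
	return []Rule{
		{
			Type:    Temporary,
			Reason:  "UnregisterNetDevice",
			Pattern: regexp.MustCompile(`unregister_netdevice: waiting for \w+ to become free\. Usage count = \d+`),
		},
		{
			Type:      Permanent,
			Condition: "KernelDeadlock",
			Reason:    "DockerHung",
			Pattern:   regexp.MustCompile(`task docker:\w+ blocked for more than \w+ seconds\.`),
		},
	}
}

func main() {
	lines := []string{
		"unregister_netdevice: waiting for lo to become free. Usage count = 1",
		"task docker:1234 blocked for more than 120 seconds.",
	}
	for _, l := range lines {
		if r, ok := match(exampleRules(), l); ok {
			fmt.Printf("%s -> %s (%s)\n", l, r.Kind, r.Reason)
		}
	}
}
```

With rules shaped like this, a burst of `unregister_netdevice` lines would show up only as events, and the node condition would change only if the hung-task pattern matched.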
4885f07 to 1fda257
* Change `unregister_netdevice` to be an event to fix kubernetes#47.
* Change `KernelPanic` to `KernelOops` because we can't handle kernel panic currently.
* Use system boot time instead of "StartPattern" to fix kubernetes#48.
1fda257 to d281cb8
LGTM
Based on #39. Will rebase after #39 gets merged.
Only the last commit is new.
This PR:
* Change `unregister_netdevice` to be an event to fix "unregister_netdevice" isn't necessarily a KernelDeadlock #47. /cc @euank
* Change `KernelPanic` to `KernelOops` because we can't really handle kernel panic by simply parsing kernel log.
  * `ramoops`, but GCE doesn't support it now.
  * `serial console` or `PVPANIC`, but it could not be achieved inside the node scope. We may introduce a cluster level node health monitoring component in the future.
* Use system boot time instead of `StartPattern` to fix "startPattern" is fragile and wrong on newer kernels #48 /cc @euank. Since `KernelPanic` is not in the scope any more, we can only care about current boot now.

/cc @dchen1107 @jfilak
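The "system boot time instead of `StartPattern`" idea can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: it assumes uptime is available in the Linux `/proc/uptime` format ("seconds-up seconds-idle"), and the `parseUptime`, `bootTime`, `entry`, `sinceBoot`, and `demo` names are hypothetical. Log entries with a timestamp before the computed boot time are treated as belonging to a previous boot and are dropped.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseUptime reads the first field of /proc/uptime-style content
// ("<seconds up> <seconds idle>") as a duration.
func parseUptime(content string) (time.Duration, error) {
	fields := strings.Fields(content)
	if len(fields) == 0 {
		return 0, fmt.Errorf("empty uptime content")
	}
	secs, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, err
	}
	return time.Duration(secs * float64(time.Second)), nil
}

// bootTime derives the boot time from the current time and the uptime.
func bootTime(now time.Time, uptime time.Duration) time.Time {
	return now.Add(-uptime)
}

// entry is a hypothetical parsed kernel log line.
type entry struct {
	Timestamp time.Time
	Message   string
}

// sinceBoot keeps only entries logged at or after the boot time.
func sinceBoot(entries []entry, boot time.Time) []entry {
	var kept []entry
	for _, e := range entries {
		if !e.Timestamp.Before(boot) {
			kept = append(kept, e)
		}
	}
	return kept
}

// demo runs the filter on fixed sample data and returns the number of
// kept entries and the message of the first one.
func demo() (int, string) {
	now := time.Date(2017, time.February, 1, 12, 0, 0, 0, time.UTC)
	uptime, err := parseUptime("3600.00 7000.00")
	if err != nil {
		return 0, err.Error()
	}
	boot := bootTime(now, uptime)
	entries := []entry{
		{Timestamp: boot.Add(-time.Minute), Message: "kernel oops from previous boot"},
		{Timestamp: boot.Add(30 * time.Minute), Message: "kernel oops from current boot"},
	}
	kept := sinceBoot(entries, boot)
	if len(kept) == 0 {
		return 0, ""
	}
	return len(kept), kept[0].Message
}

func main() {
	n, first := demo()
	fmt.Println(n, first)
}
```

Compared with matching a boot banner in the log, this approach does not depend on the exact wording a given kernel version prints at startup.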