Fix kernel monitor issues #81

Merged (1 commit) on Feb 10, 2017

Conversation

Random-Liu
Member

@Random-Liu Random-Liu commented Jan 21, 2017

Based on #39. Will rebase after #39 gets merged.

Only the last commit is new.

This PR:

/cc @dchen1107 @jfilak


@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 21, 2017
@Random-Liu Random-Liu added the bug label Jan 21, 2017
@Random-Liu Random-Liu added this to the Kubernetes v1.6 milestone Jan 21, 2017
@dchen1107
Member

@Random-Liu #39 was merged. Can you rebase this one so that I can review it?

@Random-Liu Random-Liu force-pushed the fix-kernel-monitor-issues branch from 7a55ab7 to 4885f07 Compare February 1, 2017 21:02
@Random-Liu
Member Author

@dchen1107 Rebased.

@dalehamel

👍 PTAL, we just had a few heart attacks after seeing this show up in our node descriptions, thinking there was some sort of serious problem.

Curiously, it seems like it occurred when our nodes ran out of space.

Every node that ran out of inodes (using overlay2) showed this issue; the ones that didn't run out, didn't.

@Random-Liu
Member Author

Random-Liu commented Feb 10, 2017

> 👍 PTAL, we just had a few heart attacks after seeing this show up in our node descriptions, thinking there was some sort of serious problem.

Initially this was added for a serious kernel issue: moby/moby#5618, coreos/bugs#254.

However, on newer kernels this doesn't always seem to be a serious issue. This PR changes the problem to an event instead of a node condition. After this PR, if you see tons of events from a node, you may want to log in and take a look. :)

If docker really hangs, the hung task detection will detect it and set the node condition: https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json#L45.
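To illustrate the event vs. node condition distinction above, here is a minimal sketch in Go. It is not the actual node-problem-detector code: the `Rule` struct, the `report` function, the reason names, and the regex patterns are all hypothetical stand-ins, assuming only that a "temporary" rule yields an event while a "permanent" rule sets a condition such as `KernelDeadlock`.

```go
package main

import (
	"fmt"
	"regexp"
)

// RuleType says how a matched kernel log line is reported.
type RuleType string

const (
	// Temporary problems are reported as one-off events.
	Temporary RuleType = "temporary"
	// Permanent problems set a node condition.
	Permanent RuleType = "permanent"
)

// Rule is a simplified, hypothetical stand-in for one kernel monitor rule.
type Rule struct {
	Type      RuleType
	Condition string // only meaningful for permanent rules
	Reason    string
	Pattern   *regexp.Regexp
}

// Illustrative rules; the reasons and patterns are assumptions,
// not copied from kernel-monitor.json.
var rules = []Rule{
	{
		Type:    Temporary,
		Reason:  "UnregisterNetDevice",
		Pattern: regexp.MustCompile(`unregister_netdevice: waiting for \w+ to become free`),
	},
	{
		Type:      Permanent,
		Condition: "KernelDeadlock",
		Reason:    "DockerHung",
		Pattern:   regexp.MustCompile(`task docker:\w+ blocked for more than \w+ seconds`),
	},
}

// report describes how a log line would be surfaced,
// or returns "" when no rule matches.
func report(line string) string {
	for _, r := range rules {
		if !r.Pattern.MatchString(line) {
			continue
		}
		if r.Type == Permanent {
			return fmt.Sprintf("condition %s=True (%s)", r.Condition, r.Reason)
		}
		return fmt.Sprintf("event %s", r.Reason)
	}
	return ""
}

func main() {
	fmt.Println(report("unregister_netdevice: waiting for lo to become free. Usage count = 1"))
	fmt.Println(report("INFO: task docker:1234 blocked for more than 120 seconds."))
}
```

With rules shaped like this, an `unregister_netdevice` line only produces an event, while a hung docker task still flips the node condition.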

> Curiously, it seems like it occurred when our nodes ran out of space.

Hm, that's an interesting observation. In fact, I don't really know what happens behind the scenes with the "unregister_netdevice" bug. :(

According to the discussion in coreos/bugs#254, this upstream kernel fix may be relevant: http://www.spinics.net/lists/netdev/msg351337.html

@Random-Liu Random-Liu force-pushed the fix-kernel-monitor-issues branch from 4885f07 to 1fda257 Compare February 10, 2017 00:08
* Change `unregister_netdevice` to be an event to fix kubernetes#47.
* Change `KernelPanic` to `KernelOops` because we can't handle kernel panic currently.
* Use system boot time instead of "StartPattern" to fix kubernetes#48.
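The boot-time change in the last bullet can be sketched roughly as follows. This is an assumption-laden illustration, not the code from this PR: it assumes the goal is to ignore log lines from before the current boot, and that boot time can be derived from the first field of `/proc/uptime`. The names `bootTime`, `Entry`, `sinceBoot`, and `demo` are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// bootTime derives the boot time from the current time and the contents of
// /proc/uptime, whose first field is the number of seconds since boot.
func bootTime(now time.Time, procUptime string) (time.Time, error) {
	fields := strings.Fields(procUptime)
	if len(fields) == 0 {
		return time.Time{}, fmt.Errorf("empty uptime string")
	}
	secs, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return time.Time{}, err
	}
	return now.Add(-time.Duration(secs * float64(time.Second))), nil
}

// Entry is a hypothetical parsed kernel log line.
type Entry struct {
	Timestamp time.Time
	Message   string
}

// sinceBoot keeps only the entries logged at or after the boot time.
func sinceBoot(entries []Entry, boot time.Time) []Entry {
	var kept []Entry
	for _, e := range entries {
		if !e.Timestamp.Before(boot) {
			kept = append(kept, e)
		}
	}
	return kept
}

// demo runs the filter on fixed inputs and returns the kept messages.
func demo() []string {
	now := time.Date(2017, 2, 10, 12, 0, 0, 0, time.UTC)
	boot, err := bootTime(now, "3600.00 7000.00") // uptime of one hour
	if err != nil {
		return nil
	}
	entries := []Entry{
		{now.Add(-2 * time.Hour), "kernel oops from previous boot"},
		{now.Add(-30 * time.Minute), "kernel oops from current boot"},
	}
	var msgs []string
	for _, e := range sinceBoot(entries, boot) {
		msgs = append(msgs, e.Message)
	}
	return msgs
}

func main() {
	fmt.Println(demo())
}
```

Comparing timestamps against boot time avoids depending on the exact text of any kernel log line, which is the fragility the linked "startPattern" issue title describes.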
@Random-Liu Random-Liu force-pushed the fix-kernel-monitor-issues branch from 1fda257 to d281cb8 Compare February 10, 2017 00:09
@dchen1107
Member

LGTM

@dchen1107 dchen1107 added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2017
@dchen1107 dchen1107 merged commit 5e56393 into kubernetes:master Feb 10, 2017
@Random-Liu Random-Liu deleted the fix-kernel-monitor-issues branch February 10, 2017 19:17
Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
Development

Successfully merging this pull request may close these issues.

"startPattern" is fragile and wrong on newer kernels
"unregister_netdevice" isn't necessarily a KernelDeadlock
4 participants