check_disk: no longer hangs on hanging filesystems #1186
Open
waja wants to merge 1 commit into
441913d to 40c870e
reverted by 11c5796
Our idea is to |

Just turning the patch attached to GitHub issue monitoring-plugins#867 into a pull request.
"Hi,
I created a patch for check_disk (v2025/1.4.13) that can handle hanging NFS filesystems. Imagine you have mounted a share from a NAS at the mountpoint /mnt. If the storage device, or whatever acts as the NFS server, dies or runs into a network problem, you will see messages like "NFS server nas.naprax.de not responding still trying", and every process accessing files inside the /mnt directory will be blocked, possibly forever. Depending on the mount options, the hanging processes may even be immune to kill -9. This also applies to check_disk: if you have a service monitoring the usage of /mnt with "check_disk ... -p /mnt", it will be blocked too, and Nagios will report a timeout. The bad part is that every minute another check_disk is started, which then hangs as well. Sooner or later your process list fills up with unkillable check_disk processes.
The critical piece of code inside check_disk is the stat system call, which is currently needed to find out whether a path exists at all. If that stat call hits a directory mounted from a dead NFS server, it does not return with an error code; it does not return at all.
I found out that although processes cannot be killed in such situations, threads can. So I rewrote the stat_path subroutine so that the critical stat is executed in its own thread. If this thread does not terminate within the --timeout interval, it is considered to be blocked by a dead NFS filesystem, and the thread is detached.
I tested it on Linux 2.6.18 (gcc 4.1.2) and Solaris 10/x86 (gcc 3.4.3)."