check_disk: no longer hangs on hanging filesystems by waja · Pull Request #1186 · monitoring-plugins/monitoring-plugins · GitHub

check_disk: no longer hangs on hanging filesystems #1186

Open
waja wants to merge 1 commit into monitoring-plugins:master from waja:github867
Conversation

waja commented Oct 1, 2013

Member

Just turning the patch attached to GitHub issue #867 into a pull request.

"Hi,
I created a patch for check_disk (v2025/1.4.13) that can handle hanging NFS filesystems. Imagine you have mounted a share from a NAS at the mount point /mnt. If the storage device, or whatever acts as the NFS server, dies or runs into a network problem, you will see messages like "NFS server nas.naprax.de not responding still trying", and every process accessing files inside the /mnt directory will block, possibly forever. Depending on the mount options, the hanging processes may even be immune to kill -9. This also applies to check_disk: if you have a service monitoring the usage of /mnt with "check_disk ... -p /mnt", it will block as well, and Nagios will report a timeout. Worse, every minute another check_disk is started, which then hangs too. Sooner or later your process list fills up with unkillable check_disk processes.
The critical piece of code inside check_disk is the stat system call, which is currently needed to find out whether a path exists at all. If that stat call hits a directory mounted from a dead NFS server, it does not return with an error code; it does not return at all.
I found out that although processes cannot be killed in such a situation, threads can. So I rewrote the stat_path subroutine so that the critical stat is executed in its own thread. If this thread does not terminate within the --timeout interval, it is considered to be blocked by a dead NFS filesystem and the thread is detached.
I tested it on Linux 2.6.18 (gcc 4.1.2) and Solaris 10/x86 (gcc 3.4.3)."
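
The idea described in the quoted report (run the blocking stat in its own thread, wait for at most the timeout, then detach the thread) can be sketched roughly as follows. This is not the patch from issue #867, only a minimal illustration under assumptions: the names `stat_job`, `stat_worker` and `stat_with_timeout` are hypothetical, and the timeout is passed in as a number of seconds instead of being taken from `--timeout`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical job record shared between the caller and the worker thread. */
struct stat_job {
    char *path;          /* private copy, so the caller's buffer may go away */
    struct stat st;      /* result buffer, written only by the worker */
    int rc, err;         /* stat() return value and the errno it left behind */
    int done;            /* set by the worker once stat() has returned */
    int abandoned;       /* set by the caller when it gives up waiting */
    pthread_mutex_t mu;
    pthread_cond_t cv;
};

static void job_free(struct stat_job *job)
{
    pthread_cond_destroy(&job->cv);
    pthread_mutex_destroy(&job->mu);
    free(job->path);
    free(job);
}

static void *stat_worker(void *arg)
{
    struct stat_job *job = arg;
    int rc = stat(job->path, &job->st);  /* may never return on a dead NFS mount */
    int err = errno;

    pthread_mutex_lock(&job->mu);
    job->rc = rc;
    job->err = err;
    job->done = 1;
    int abandoned = job->abandoned;
    pthread_cond_signal(&job->cv);
    pthread_mutex_unlock(&job->mu);

    if (abandoned)
        job_free(job);   /* the caller timed out earlier, so the worker cleans up */
    return NULL;
}

/* Returns 0 on success, -1 with errno set on failure (ETIMEDOUT on timeout). */
int stat_with_timeout(const char *path, struct stat *out, unsigned timeout_sec)
{
    assert(path != NULL && out != NULL);

    struct stat_job *job = calloc(1, sizeof *job);
    if (job == NULL)
        return -1;
    job->path = strdup(path);
    if (job->path == NULL) {
        free(job);
        return -1;
    }
    pthread_mutex_init(&job->mu, NULL);
    pthread_cond_init(&job->cv, NULL);

    pthread_t tid;
    int rc = pthread_create(&tid, NULL, stat_worker, job);
    if (rc != 0) {
        job_free(job);
        errno = rc;
        return -1;
    }

    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&job->mu);
    int wait_rc = 0;
    while (!job->done && wait_rc == 0)
        wait_rc = pthread_cond_timedwait(&job->cv, &job->mu, &deadline);

    if (!job->done) {
        job->abandoned = 1;          /* hand ownership of the job to the worker */
        pthread_mutex_unlock(&job->mu);
        pthread_detach(tid);
        errno = ETIMEDOUT;
        return -1;
    }
    pthread_mutex_unlock(&job->mu);
    pthread_join(tid, NULL);

    int result = job->rc;
    int saved_errno = job->err;
    if (result == 0)
        *out = job->st;
    job_free(job);
    if (result != 0)
        errno = saved_errno;
    return result;
}
```

A caller would treat a return of -1 with `errno == ETIMEDOUT` as "path is probably on a hanging filesystem". The sketch only keeps the caller from blocking; the detached thread stays blocked inside `stat` for as long as the filesystem hangs, and how the process behaves at exit in that state is not covered here.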

Just turning attached patch of github issue monitoring-plugins#867 into a push request.
waja modified the milestones: 2.2, 2.1 Oct 6, 2014
waja force-pushed the master branch 2 times, most recently from 441913d to 40c870e October 19, 2014 21:31
waja added the squash label Nov 28, 2014
weiss self-assigned this Nov 28, 2014
weiss closed this in 14d306f Dec 2, 2014
waja commented Oct 12, 2015

Member Author

reverted by 11c5796

weiss commented Oct 13, 2015

Member

Our idea is to fork(2) child processes for checking remote file systems, instead.
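
One way such a fork(2)-based check could look is sketched below. It is an illustration under assumptions, not code from monitoring-plugins: `probe_path_in_child` is a hypothetical name, the child reports only success or failure of `stat` through its exit status, and the parent polls with `waitpid(..., WNOHANG)` so that it never blocks on a child that is stuck in the kernel.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Returns 0 if stat() succeeded in the child, 1 if it failed there,
 * and -1 on timeout (errno == ETIMEDOUT) or on an internal error. */
int probe_path_in_child(const char *path, unsigned timeout_sec)
{
    assert(path != NULL);

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* Child: the only process that touches the possibly dead mount. */
        struct stat st;
        _exit(stat(path, &st) == 0 ? 0 : 1);
    }

    const struct timespec nap = { 0, 50 * 1000 * 1000 };  /* poll every 50 ms */
    struct timespec deadline, now;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec += timeout_sec;

    for (;;) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) {
            if (WIFEXITED(status))
                return WEXITSTATUS(status) == 0 ? 0 : 1;
            errno = EIO;            /* child ended abnormally, e.g. by a signal */
            return -1;
        }
        if (r == -1 && errno != EINTR)
            return -1;

        clock_gettime(CLOCK_MONOTONIC, &now);
        if (now.tv_sec > deadline.tv_sec ||
            (now.tv_sec == deadline.tv_sec && now.tv_nsec >= deadline.tv_nsec))
            break;
        nanosleep(&nap, NULL);
    }

    /* Timed out: ask the kernel to kill the child, but do not wait for it,
     * because a child blocked on a hanging filesystem may not die promptly. */
    kill(pid, SIGKILL);
    errno = ETIMEDOUT;
    return -1;
}
```

With this shape the parent can always return within roughly the timeout, whatever state the child is in. A child that stays blocked is left behind, so the pile-up described in the original report would still have to be addressed separately, for example by checking for leftover probes before starting a new one.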

weiss reopened this Oct 13, 2015
weiss modified the milestones: 2.3, 2.2 Oct 13, 2015
weiss removed the squash label Oct 13, 2015
waja closed this Nov 20, 2016
waja deleted the github867 branch November 20, 2016 21:17
waja restored the github867 branch November 20, 2016 21:31
waja reopened this Nov 20, 2016
@thatsafunnyname

waja modified the milestones: 2.3, 2.4 Dec 15, 2020
waja modified the milestones: 2.4, 2.5 Jul 23, 2024
RincewindsHat modified the milestones: 2.5, 3.1.0 Jan 9, 2026
4 participants