Check cluster health via nagios plugin
We can monitor cluster health using a Nagios plugin as described below. Note: this procedure has not been tested in production.
- Refer to "Configuring nrpe based internal service checks" for how nrpe based internal checks work for remote systems.
- On the cluster host, create the plugin '/usr/lib64/nagios/plugins/cluster_check.sh', which will be called via nrpe, with the content below. Make the file executable.
#!/bin/bash
# Nagios/NRPE plugin: report cluster health based on 'crm status' output.

# Run crm status command and capture output (stdout and stderr)
crm_output=$(crm status 2>&1)

# Host details included in every alert message
hostname=$(hostname)
ip=$(hostname -I | awk '{print $1}')

# Check for error or warning in output, ignoring case
shopt -s nocasematch
if [[ "$crm_output" =~ error || "$crm_output" =~ warning ]]; then
    echo "Cluster status is not healthy on $hostname ($ip)!"
    exit 2  # Nagios exit code for critical
fi
shopt -u nocasematch

# Check if all nodes are online
# (assumes crm status lists down nodes after "OFFLINE:" or marks them UNCLEAN)
if [[ "$crm_output" =~ OFFLINE: || "$crm_output" =~ UNCLEAN ]]; then
    echo "Not all nodes are online on $hostname ($ip)!"
    exit 2  # Nagios exit code for critical
fi

# Check if all resources are started
# (assumes stopped or failed resources are shown with the word Stopped or FAILED)
if [[ "$crm_output" =~ Stopped || "$crm_output" =~ FAILED ]]; then
    echo "Not all resources are started on $hostname ($ip)!"
    exit 2  # Nagios exit code for critical
fi

echo "Cluster status is healthy!"
exit 0  # Nagios exit code for OK
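The matching logic of such a plugin can be tried without a live cluster by moving it into a function that takes the status text as an argument. The sketch below is only an illustration: check_crm_text is a hypothetical helper name, the sample inputs are hand-written rather than captured from a real cluster, and the "OFFLINE:" marker is an assumption about how crm status lists nodes that are down.

```shell
#!/bin/bash
# Sketch: classify 'crm status' text without needing a live cluster.
# check_crm_text is a hypothetical helper, not part of crmsh or Nagios.
check_crm_text() {
    local text=$1
    local problem_re='error|warning'

    shopt -s nocasematch            # make the regex match case-insensitive
    if [[ "$text" =~ $problem_re ]]; then
        shopt -u nocasematch
        echo "CRITICAL: error or warning reported"
        return 2
    fi
    shopt -u nocasematch

    # Assumption: nodes that are down are listed after "OFFLINE:"
    if [[ "$text" =~ OFFLINE: ]]; then
        echo "CRITICAL: offline nodes listed"
        return 2
    fi

    echo "OK: no problems matched"
    return 0
}

# Hand-written sample inputs (illustrative only)
healthy_sample='Online: [ node1 node2 ]'
degraded_sample=$'Online: [ node1 ]\nOFFLINE: [ node2 ]'
```

Calling check_crm_text "$degraded_sample" returns 2, which is the code Nagios treats as critical. On a real cluster, the same function could be fed with "$(crm status 2>&1)".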
- Edit '/etc/nagios/nrpe.conf' (the file is named 'nrpe.cfg' on many installations) to add the line below:
- command[check_cluster_status]=/usr/lib64/nagios/plugins/cluster_check.sh
- Restart nrpe on the cluster machine
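Before moving on to the Nagios server, it can help to run the plugin by hand and see which state its exit code maps to. Nagios plugins use exit code 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The sketch below wraps any plugin command and prints the state name with the first line of output; nagios_state_name and run_plugin are hypothetical helper names used only for this illustration.

```shell
#!/bin/bash
# Map a plugin exit code to the Nagios state name.
nagios_state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;    # 3 is the defined UNKNOWN code; other codes are treated the same here
    esac
}

# Run a plugin command, print its state and first output line,
# and return the plugin's own exit code.
run_plugin() {
    local output rc=0
    output=$("$@" 2>&1) || rc=$?
    echo "$(nagios_state_name "$rc"): ${output%%$'\n'*}"
    return "$rc"
}
```

For example, run_plugin /usr/lib64/nagios/plugins/cluster_check.sh on the cluster host. From the Nagios server, the same helper can wrap check_nrpe -H <cluster-host> -c check_cluster_status, assuming the check_nrpe plugin is installed there and is on the PATH.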
- Then, on the Nagios server, configure a remote service check that uses the above plugin for the appropriate host, using the Nagios configuration below:
define host {
    use                     linux-server
    host_name               example-host
    alias                   Example Host
    address                 192.0.2.100
}

define service {
    use                     generic-service
    host_name               example-host
    service_description     Check Cluster Status
    check_command           check_nrpe!check_cluster_status
    check_interval          60      ; Check every 60 minutes (units of interval_length, 60 seconds by default)
    retry_interval          10      ; Retry every 10 minutes if check fails
    notification_interval   120     ; Send a notification every 2 hours
    contact_groups          admins
}
- Restart nagios service on server
- Validate that the cluster health reported in Nagios matches the actual cluster status.
- Optionally, stop a resource and check that the new status is reflected in Nagios. On production systems, consider adding a temporary virtual IP resource for this test. This virtual IP can be removed after testing.
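The optional test with a temporary virtual IP could be scripted roughly as follows. This is a sketch under assumptions: the resource name test_vip and the address 192.0.2.200 are placeholders, the IPaddr2 resource agent is assumed to be available on the cluster, and the run helper is hypothetical. It only prints the crm commands unless DRY_RUN=0 is set, so the commands can be reviewed first.

```shell
#!/bin/bash
# Dry-run by default: print the crm commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [[ "$DRY_RUN" == 1 ]]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Placeholder resource name and address; replace with unused values.
VIP_NAME=${VIP_NAME:-test_vip}
VIP_ADDR=${VIP_ADDR:-192.0.2.200}

# Add a temporary virtual IP resource (assumes the IPaddr2 agent is installed).
add_test_vip() {
    run crm configure primitive "$VIP_NAME" ocf:heartbeat:IPaddr2 \
        params ip="$VIP_ADDR" op monitor interval=10s
}

# Stop the resource so the Nagios check has something to report.
stop_test_vip() {
    run crm resource stop "$VIP_NAME"
}

# Remove the stopped resource once testing is finished.
remove_test_vip() {
    run crm configure delete "$VIP_NAME"
}
```

A typical sequence would be add_test_vip, then stop_test_vip, then a look at the service state in Nagios after the next check, and finally remove_test_vip.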