Stephen Fritz on Systems Engineering: Nagios / Icinga Logic, Alerts and Notifications

Nagios / Icinga provide a variety of notification types and a corresponding range of media over which they are transmitted. The notification system is robust because it can use IP, POTS and other defined media for transmission. It is also intelligent: the built-in host and service check and state processing logic distinguishes not only failed from degraded states, but also hosts that are unreachable and whose status therefore cannot be reliably determined.

Consider this example. A switch to which 48 hosts are connected fails. What is the status of the hosts connected to the switch? Are they also failed or simply unreachable? From a troubleshooting standpoint, the problem is the switch. A notification system that sends 49 (or more) error notifications when a switch fails buries the real cause of the problem -- a failed switch -- under a large number of related, but misleading, notifications.

Host and Service Check Logic

Notifications are very important. What matters most, as the example above shows, is that they are accurate, so that corrective action can be fast and informed. Nagios' / Icinga's notification accuracy begins with the host and service check logic.

Parent - Child Relationships

Parent - child relationships are specified in host definitions. They are derived from the physical relationship between monitored nodes. In the demonstration network, a monitoring server in Pittsburgh checks a host in Harrisburg; there are two monitored routers between them. Thus, we can define a top-level parent -- the Pittsburgh Monitoring Server -- and a chain of child hosts below it, from the Pittsburgh Router to the Harrisburg Router to the Harrisburg Backup Server. The parent definitions are:

define host{
 host_name  pit-monitor.pittsburg.mydomain.com   ; <-- The monitoring host
 }

define host{
 host_name  pit-router.pittsburg.mydomain.com    ; <-- The Pittsburgh gateway router
 parents    pit-monitor.pittsburg.mydomain.com
 }

define host{
 host_name  hbg-router.harrisburg.mydomain.com   ; <-- The Harrisburg router
 parents    pit-router.pittsburg.mydomain.com
 }

define host{
 host_name  hbg-backup.harrisburg.mydomain.com   ; <-- A Harrisburg host
 parents    hbg-router.harrisburg.mydomain.com
 }

Host Up - Down - Unreachable Status

A host that responds to a monitoring server check is considered in an UP state. Without host check parent - child logic, any host that does not respond would be considered in a DOWN state. Nagios / Icinga have a more sophisticated view of the network. If a host is DOWN, its child hosts (and their children) will initially report DOWN, but the check logic then determines that the child hosts are UNREACHABLE. For instance, if 48 hosts are connected to a switch, each of the 48 hosts has a parent definition for the switch. If the switch fails, it reports DOWN and the 48 connected hosts report UNREACHABLE.
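The switch scenario could be written in the same object syntax as the parent definitions above. This is only a sketch: the host names `access-switch-01` and `server-01` are hypothetical, only the first of the 48 connected hosts is shown, and other required host directives are omitted as in the definitions above.

```cfg
define host{
 host_name  access-switch-01   ; hypothetical name for the 48-port switch
 }

define host{
 host_name  server-01          ; hypothetical: one of the 48 connected hosts
 parents    access-switch-01   ; reachable only through the switch
 }
```

With the `parents` directive in place, a failure of `access-switch-01` leaves `server-01` UNREACHABLE rather than DOWN.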

Host and Service Error States

The parameter max_check_attempts is defined for each host and service (often in templates). This parameter determines when a Soft state becomes a Hard state. When a host or service check returns a non-UP or non-OK result, the host or service is initially considered in a Soft state. Once the number of consecutive non-UP or non-OK check results equals the value of max_check_attempts, it is considered in a Hard state. A transition from one Hard error state to another (for example, WARNING to CRITICAL) is also Hard.
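A minimal sketch of how these settings might appear in a service definition follows. The service description and the `check_ssh` command are assumptions (a command of that name ships with the Nagios sample configuration), the values are illustrative, and other required directives are omitted.

```cfg
define service{
 host_name            hbg-backup.harrisburg.mydomain.com
 service_description  SSH         ; hypothetical service
 check_command        check_ssh   ; assumed to be defined elsewhere
 max_check_attempts   3           ; the third consecutive non-OK result makes the state Hard
 check_interval       5           ; time units (minutes by default) between regular checks
 retry_interval       1           ; time units between rechecks while the state is Soft
 }
```

With these values, a failing service is rechecked at the shorter retry interval and reaches a Hard state on the third consecutive non-OK result.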

The distinction between Soft and Hard states is important for Nagios / Icinga logic. Event Handlers -- actions that attempt to, for instance, correct an error -- can execute while a host or service is still in a Soft state, before anyone has been notified. Notifications do not occur until hosts and services enter a Hard state.

Delaying the notifications until Hard states occur is also important because it allows Nagios / Icinga time to determine whether monitored hosts and, thus, their services are DOWN or UNREACHABLE.
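An event handler is attached to a host or service with the `event_handler` directive, which names a command. The sketch below assumes a hypothetical `restart-httpd` script at an assumed path; the script itself is not shown and would have to decide, from the macros passed to it, whether to act.

```cfg
define command{
 command_name  restart-httpd   ; hypothetical command name
 ; The script path is an assumption; the three macros pass the state, state type and attempt number
 command_line  /usr/local/nagios/libexec/eventhandlers/restart-httpd $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
 }

define service{
 host_name            hbg-backup.harrisburg.mydomain.com
 service_description  HTTP            ; hypothetical service
 check_command        check_http      ; assumed to be defined elsewhere
 max_check_attempts   4
 event_handler        restart-httpd   ; runs the command above on state changes
 }
```

Because `$SERVICESTATETYPE$` expands to SOFT or HARD, such a script could attempt a restart while the state is still Soft, before any notification is sent.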

Defining Notifications

Notifications occur when hosts and services enter a Hard state. Additionally, if a host is in a DOWN or UNREACHABLE state, there are no notifications for its services. That is, if the host is DOWN, the failure of its services is implied and no separate notifications are necessary.

Who is notified

Notification destinations are defined as Contact and Contact Group definitions. Each Contact includes a contact name and alias, the commands by which notifications are sent (e.g. e-mail and pager) and the addresses to which the notifications are sent. Contact Groups simply define multiple logically-grouped Contacts. Thus, you may define a group of server administrators who are notified when servers fail and a group of infrastructure administrators who are notified when switches and routers fail.

You may also define timeperiods for which notifications are sent. Thus, notifications for critical equipment may be sent 24 x 7 x 365, while notifications for development servers are only sent 9-5 Monday through Friday.
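A Contact and a Contact Group might be defined as in the sketch below. The contact name, alias, e-mail address and group name are hypothetical. The `24x7` timeperiod and the two notification commands are assumed to be defined elsewhere, as they are in the Nagios sample configuration.

```cfg
define contact{
 contact_name                   jsmith                    ; hypothetical contact
 alias                          J. Smith (server admin)   ; hypothetical alias
 email                          jsmith@mydomain.com       ; hypothetical address
 host_notification_period       24x7                      ; assumed timeperiod
 service_notification_period    24x7
 host_notification_options      d,r                       ; DOWN and recovery
 service_notification_options   w,u,c,r                   ; WARNING, UNKNOWN, CRITICAL, recovery
 host_notification_commands     notify-host-by-email      ; assumed command
 service_notification_commands  notify-service-by-email   ; assumed command
 }

define contactgroup{
 contactgroup_name  server-admins           ; hypothetical group
 alias              Server Administrators
 members            jsmith
 }
```

A host or service is then pointed at the group with its `contact_groups` directive, for example `contact_groups server-admins`.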
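The two schedules mentioned above could be sketched as timeperiod definitions. The names `24x7` and `workhours` follow the Nagios sample configuration but are otherwise arbitrary, and `dev-server-01` is a hypothetical host shown with its other directives omitted.

```cfg
define timeperiod{
 timeperiod_name  24x7
 alias            24 Hours A Day, 7 Days A Week
 sunday           00:00-24:00
 monday           00:00-24:00
 tuesday          00:00-24:00
 wednesday        00:00-24:00
 thursday         00:00-24:00
 friday           00:00-24:00
 saturday         00:00-24:00
 }

define timeperiod{
 timeperiod_name  workhours
 alias            Normal Work Hours
 monday           09:00-17:00
 tuesday          09:00-17:00
 wednesday        09:00-17:00
 thursday         09:00-17:00
 friday           09:00-17:00
 }

define host{
 host_name            dev-server-01   ; hypothetical development server
 notification_period  workhours       ; notifications are sent only 9-5, Monday through Friday
 }
```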

Notification Methods

Nagios / Icinga are capable of sending notifications over multiple media types. If one medium is not available (e.g. IP for e-mail notifications), redundant media (such as POTS) are available. The media include:

Email
Pager
Phone (SMS)
WinPopup message
Yahoo, ICQ, or MSN instant message
Audio alerts

The multiple media provide additional robustness for delivering notifications during service outages.
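Each medium is driven by a notification command that the contact definition references. The e-mail sketch below is modeled on the Nagios sample configuration. The `/usr/bin/printf` and `/bin/mail` paths are assumptions about the monitoring server, and a pager or SMS command would follow the same pattern with a different delivery program.

```cfg
define command{
 command_name  notify-host-by-email
 ; Builds the message body from standard macros and pipes it to the local mail program
 command_line  /usr/bin/printf "%b" "Notification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
 }
```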

Nagios / Icinga Notification Demonstration

The video below demonstrates two cases:

A service failure on a single host at the end of a parent-child chain, and
The failure of all WAN links on a router that isolates hosts in a data center from the monitoring server.

The monitoring server is in Pittsburgh, at the lower right of the map. The service failure (1) occurs in Philadelphia, and a single notification each for its failure and recovery is dispatched by e-mail. The router WAN link failure (2) occurs on the Harrisburg router in the bottom center. That failure isolates the backup server in the Harrisburg data center from the Pittsburgh monitoring server. However, the host and service check logic resolves the failure into a Host DOWN state for the router and a Host UNREACHABLE state for the data center. A single notification -- Harrisburg Router DOWN -- is dispatched by Nagios / Icinga because the notifications are configured to ignore UNREACHABLE states. Thus, a single root-cause notification is sent to the proper administrator instead of dozens of notifications for hosts and services in the data center.
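One way to configure notifications so that unreachable hosts are ignored is the host `notification_options` directive. The fragment below is a sketch for the Harrisburg backup server, not the demonstration's actual configuration; other directives are omitted.

```cfg
define host{
 host_name             hbg-backup.harrisburg.mydomain.com
 parents               hbg-router.harrisburg.mydomain.com
 notification_options  d,r   ; notify on DOWN and recovery; "u" (UNREACHABLE) is left out
 }
```

With `u` omitted, the backup server produces no notification while the router above it is DOWN, so only the router's own DOWN notification is sent.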


Wednesday, April 9, 2014
