Consider this example. A switch to which 48 hosts are connected fails. What is the status of those hosts? Have they also failed, or are they simply unreachable? From a troubleshooting standpoint, the problem is the switch. A notification system that sends 49 (or more) error notifications when a switch fails only floods the support staff: the real cause of the problem, a failed switch, is hidden among a large number of relevant but misleading notifications.
Host and Service Check Logic
Notifications are very important. However, as described above, what matters most is that they are accurate enough to support fast and informed corrective action. The notification accuracy of Nagios / Icinga begins with the host and service check logic.
Parent - Child Relationships
Parent - child relationships are specified in host definitions. They are derived from the physical relationship between monitored nodes. In the demonstration network, a monitoring server in Pittsburgh checks a host in Harrisburg; there are two monitored routers between them. Thus, we can define a top-level parent (the Pittsburgh Monitoring Server) and a chain of child hosts beneath it, from the Pittsburgh Router to the Harrisburg Router to the Harrisburg Backup Server. The parent definitions are:

    define host{
        host_name   pit-monitor.pittsburg.mydomain.com   ; <-- The monitoring host
    }
    define host{
        host_name   pit-router.pittsburg.mydomain.com    ; <-- The Pittsburgh gateway router
        parents     pit-monitor.pittsburg.mydomain.com
    }
    define host{
        host_name   hbg-router.harrisburg.mydomain.com   ; <-- The Harrisburg router
        parents     pit-router.pittsburg.mydomain.com
    }
    define host{
        host_name   hbg-backup.harrisburg.mydomain.com   ; <-- A Harrisburg host
        parents     hbg-router.harrisburg.mydomain.com
    }

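The same parents directive could express the switch example from the introduction. The fragment below is only a sketch in Nagios 3.x / Icinga 1.x object syntax. The host names access-switch-01 and host-01, the addresses, and the generic-host template are assumptions made for illustration; they are not part of the demonstration network.

```cfg
# Hypothetical access switch, attached below the Harrisburg router
define host{
    use         generic-host     ; assumed template supplying check settings
    host_name   access-switch-01.harrisburg.mydomain.com
    address     192.0.2.10       ; documentation address, assumption
    parents     hbg-router.harrisburg.mydomain.com
}

# One of the 48 hosts behind the switch; the other 47 would follow the same pattern
define host{
    use         generic-host
    host_name   host-01.harrisburg.mydomain.com
    address     192.0.2.101
    parents     access-switch-01.harrisburg.mydomain.com
}
```

With definitions like these, each of the 48 hosts names the switch as its parent, which is what lets the check logic tell a failed switch apart from 48 failed hosts.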
Host Up - Down - Unreachable Status
A host that responds to a monitoring server check is considered to be in an UP state. Without host check parent - child logic, any host that does not respond would be considered DOWN. Nagios / Icinga have a more sophisticated view of the network. If a host is DOWN, its child hosts (and their children) will initially report DOWN, but the check logic then determines that the child hosts are UNREACHABLE. For instance, if 48 hosts are connected to a switch, each of the 48 hosts has a parent definition for the switch. If the switch fails, it reports DOWN and the 48 connected hosts report UNREACHABLE.
Host and Service Error States
The parameter max_check_attempts is defined for each host and service (often in templates). This parameter defines the difference between Hard and Soft host and service states. When a host or service check results in a non-UP or non-OK result, it is initially considered in a Soft state. Once the number of non-UP or non-OK check results equals the value of max_check_attempts, it is then considered in a Hard state. Hard error states are also applied to transitions from any hard error state to another.
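As a sketch of how this might look in a template (Nagios 3.x / Icinga 1.x syntax), the fragment below sets max_check_attempts together with the two interval directives that control how quickly a Soft state can become Hard. The template name and all values are assumptions chosen for illustration.

```cfg
define host{
    name                 hard-soft-sketch   ; hypothetical template name
    max_check_attempts   3    ; the 3rd consecutive non-UP result makes the state Hard
    check_interval       5    ; time units between regular checks
    retry_interval       1    ; time units between rechecks while the state is Soft
    register             0    ; template only, not a real host
}
```

With the default interval_length of 60 seconds, the time units above are minutes, so a failure would be rechecked at roughly one-minute intervals and become Hard on the third failed result.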
The distinction between Soft and Hard states is important for Nagios / Icinga logic. Event Handlers (actions that attempt to, for instance, correct an error) can execute while a host or service is still in a Soft state. Notifications do not occur until hosts and services enter a Hard state.
Delaying the notifications until Hard states occur is also important because it gives Nagios / Icinga time to determine whether monitored hosts and, thus, their services are DOWN or UNREACHABLE.
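An event handler is attached with the event_handler directive and a matching command definition. The following is a minimal sketch: the HTTP service, the check_http command, and the restart-httpd script are assumptions, and the script itself would have to be written and installed by the administrator.

```cfg
define service{
    host_name             hbg-backup.harrisburg.mydomain.com
    service_description   HTTP             ; hypothetical service
    check_command         check_http       ; assumed to be defined elsewhere
    max_check_attempts    4
    event_handler         restart-httpd    ; run on state changes, including Soft ones
}

define command{
    command_name   restart-httpd
    # The script path is an assumption; the macros pass the current state,
    # the state type (SOFT or HARD) and the attempt number to the script.
    command_line   $USER1$/eventhandlers/restart-httpd $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
```

Passing the state type and attempt number lets the script decide, for example, to attempt a restart only on a late Soft attempt, before the state turns Hard and a notification is sent.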
Notifications occur when hosts and services enter a Hard state. Additionally, if a host is in a DOWN or UNREACHABLE state, there are no notifications for its services. That is, if the host is DOWN, the failure of its services is implied and no notifications are necessary.
Defining Notifications
Who is notified
Notification destinations are defined as Contact and Contact Group definitions. Each contact includes a contact name and alias, definitions of the commands by which notifications are sent (e.g. e-mail and pager) and the addresses to which the notifications are sent. Contact Groups simply define multiple logically-grouped Contacts. Thus, you may define a group of server administrators who are notified when servers fail and a group of infrastructure administrators who are notified when switches and routers fail.
You may also define timeperiods during which notifications are sent. Thus, notifications for critical equipment may be sent around the clock, every day of the year, while notifications for development servers are only sent 9-5 Monday through Friday.
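The pieces described above might fit together as in the following sketch (Nagios 3.x / Icinga 1.x syntax). The contact jdoe, the group server-admins, the timeperiod names, and the e-mail address are hypothetical, and the two notification commands are assumed to be defined elsewhere, as they are in the sample configuration shipped with Nagios.

```cfg
define timeperiod{
    timeperiod_name   24x7
    alias             Around the clock
    sunday            00:00-24:00
    monday            00:00-24:00
    tuesday           00:00-24:00
    wednesday         00:00-24:00
    thursday          00:00-24:00
    friday            00:00-24:00
    saturday          00:00-24:00
}

define timeperiod{
    timeperiod_name   workhours
    alias             9-5 Monday through Friday
    monday            09:00-17:00
    tuesday           09:00-17:00
    wednesday         09:00-17:00
    thursday          09:00-17:00
    friday            09:00-17:00
}

define contact{
    contact_name                    jdoe                    ; hypothetical contact
    alias                           Jane Doe
    host_notifications_enabled      1
    service_notifications_enabled   1
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,r                     ; DOWN and recovery
    service_notification_options    w,u,c,r                 ; WARNING, UNKNOWN, CRITICAL, recovery
    host_notification_commands      notify-host-by-email    ; assumed to exist
    service_notification_commands   notify-service-by-email ; assumed to exist
    email                           jdoe@mydomain.com
}

define contactgroup{
    contactgroup_name   server-admins
    alias               Server Administrators
    members             jdoe
}
```

A contact for development servers could use workhours instead of 24x7 in its two notification period directives.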
Notification Methods
Nagios / Icinga are capable of sending notifications over multiple media types. If one medium is not available (e.g. IP for e-mail notifications), redundant media (such as POTS) are available. The media include:
- Pager
- Phone (SMS)
- WinPopup message
- Yahoo, ICQ, or MSN instant message
- Audio alerts
The multiple media provide additional robustness for delivering notifications during service outages.
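Whatever the medium, a notification method is ultimately a command definition that a contact references. The sketch below is loosely modeled on the e-mail command in the sample configuration shipped with Nagios; the paths to printf and mail and the message layout are assumptions that will differ between systems.

```cfg
define command{
    command_name   notify-host-by-email
    # Builds a short message from standard macros and pipes it to the local mailer
    command_line   /usr/bin/printf "%b" "Notification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nInfo: $HOSTOUTPUT$\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
}
```

A pager or SMS method would follow the same pattern, with the command line calling whatever gateway program is available and using $CONTACTPAGER$ in place of $CONTACTEMAIL$.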
Nagios / Icinga Notification Demonstration
The video below demonstrates two cases:
- A service failure on a single host at the end of a parent-child chain, and
- The failure of all WAN links on a router that isolates hosts in a data center from the monitoring server.
The monitoring server is in Pittsburgh, at the lower right of the map. The service failure (1) occurs in Philadelphia, and a single notification each for its failure and recovery is dispatched by e-mail. The router WAN link failure (2) occurs on the Harrisburg router in the bottom center. That failure isolates the backup server in the Harrisburg data center from the Pittsburgh monitoring server. However, the host and service check logic resolves the failure into a Host DOWN state for the router and a Host UNREACHABLE state for the data center. A single notification, Harrisburg Router DOWN, is dispatched by Nagios / Icinga because the notifications are configured to ignore UNREACHABLE states. Thus, a single root-cause notification is sent to the proper administrator instead of dozens of notifications for hosts and services in the data center.
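One way to have notifications ignore UNREACHABLE states is the notification_options directive on the host. The fragment below sketches how the Harrisburg backup server definition could be extended; the generic-host template and the server-admins contact group are assumptions, and the demonstration's actual configuration is not shown in this post.

```cfg
define host{
    use                    generic-host     ; assumed template
    host_name              hbg-backup.harrisburg.mydomain.com
    parents                hbg-router.harrisburg.mydomain.com
    notification_options   d,r              ; DOWN and recovery only; leaving out "u" suppresses UNREACHABLE
    contact_groups         server-admins    ; hypothetical contact group
}
```

With this setting, an isolated backup server in an UNREACHABLE state stays silent, while the DOWN router still produces its notification.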