Nagios and Icinga are enterprise-class system monitors that can track publicly available services, agent-collected data and SNMP data. Data is presented in web-based formats and -- through additional packages -- as graphs and user-configurable visualizations.
The Nagios project commenced in 1996 and was first publicly released -- as NetSaint -- in 1999. The project was renamed Nagios in 2002. By 2005, the project was receiving a great deal of attention from the open source community, and in 2007 its developers formed Nagios Enterprises, LLC. By 2009, the company had begun releasing commercial products and providing support contracts.
Also in 2009, a group of developers forked the Nagios project as Icinga. Icinga retains a great deal of compatibility with Nagios (particularly in the configuration files), but offers independently developed solutions and interfaces.
Although the projects have diverged since the fork, this article discusses the two from a common perspective.
Monitored Systems
Publicly Available Services
Publicly available services are those exposed to the network, such as HTTP, SMTP and FTP. Essentially, these are the TCP and UDP services reported by the netstat command.
Host Resources
Host resources are not shared across the network and include items such as hard drive and memory performance and utilization, processor utilization and network interface card statistics. These performance parameters cannot be queried directly over the network, so other methods -- including agents and SNMP -- are used to collect the data.
Agents
Agents are application-specific programs that run on the monitored host. In the Nagios and Icinga systems, the most common are:
- Nagios Remote Plugin Executor (NRPE)
- Nagios Service Check Acceptor (NSCA)
NRPE is polled by the monitoring server and returns data formatted for the Nagios application. As the name implies, it executes plugins on the monitored host and returns results that are compatible with the application storage format; this architecture relieves the monitoring server of some of the processing load and distributes it among the monitored hosts. NSCA works in the opposite direction: monitored hosts submit check results to an NSCA daemon on the monitoring server, which accepts them as passive checks.
Simple Network Management Protocol (SNMP)
SNMP is defined by a series of RFCs and provides a publicly available standard for querying monitored hosts and returning standardized data.
Configuration Files
Nagios and Icinga are configured with text files. The main configuration file and several of the add-on files (e.g. database configuration) differ slightly between the two projects, but the object configuration and command files are compatible.
Main Configuration File
The /etc/nagios3/nagios.cfg and /etc/icinga/icinga.cfg files define environment, scheduling, logging and performance parameters for the monitoring daemons.
Object Configuration Files
Objects are defined in a series of files in the /etc/nagios3/conf.d and /etc/icinga/objects directories. Objects may be divided into the following categories:
- Services
- Service Groups
- Hosts
- Host Groups
- Contacts
- Contact Groups
- Commands
- Time Periods
- Notification Escalations
- Notification and Execution Dependencies
Plugins
Nagios and Icinga do not maintain internal processes to perform service monitoring. Instead, these actions are performed by plugins: external executable files and scripts (Perl, shell, etc.) that perform the host queries. Nagios and Icinga distributions typically ship with a set of plugins, but many more may be added from external sources, such as the Nagios Exchange.
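To make the plugin model concrete, here is a minimal sketch of a plugin written in Python. It assumes the widely used plugin convention of reporting status through the exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) together with a single line of output. The check itself (disk usage), the function names and the thresholds are hypothetical choices made for illustration, not part of either project.

```python
import shutil

# Exit codes conventionally used by Nagios/Icinga plugins.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_pct, warn_pct, crit_pct):
    """Map a usage percentage to a plugin status code."""
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def check_disk(path="/", warn_pct=80.0, crit_pct=90.0):
    """Return (status_code, one_line_output) for the file system at `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself could not run, so the state is unknown.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_pct = 100.0 * usage.used / usage.total
    status = classify(used_pct, warn_pct, crit_pct)
    return status, f"DISK {LABELS[status]} - {used_pct:.1f}% used on {path}"


# A deployed plugin would finish with something like:
#     import sys
#     status, output = check_disk("/")
#     print(output)
#     sys.exit(status)
```

Keeping the threshold logic in its own function makes it easy to test without touching the file system; the monitoring daemon only ever sees the printed line and the exit code.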
Host Checks
Host checks only determine if a host is available or not. The host definitions include parent/child definitions that define host dependencies and reachability logic. Host checks are performed:
- At regular intervals, as defined by the host definition.
- On-demand when the state of a service on the host changes.
- On-demand and controlled/triggered by host reachability logic.
- On-demand and controlled/triggered by host dependency logic.
Hosts are reported as being in an UP, DOWN or UNREACHABLE state. UP and DOWN are self-explanatory. UNREACHABLE is a state in which access to a host is not available because an intermediary host is DOWN. Thus, if a server is behind a router and the router is reported DOWN, the server will be reported as UNREACHABLE. Each state may be either HARD or SOFT. The SOFT state is reported at a host's first state change. The host is then rechecked a specified number of times, after which it is reported in a HARD state. These states are used to control things such as notification, reducing the number of false alarms from the system.
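The reachability rule described above can be sketched in a few lines of Python. This is a simplified illustration, not the daemons' actual implementation: the function name, the dictionaries and the host names ("router", "server") are hypothetical, and the sketch assumes that a failed host is DOWN when at least one of its parents is UP and UNREACHABLE otherwise.

```python
def host_state(host, check_ok, parents):
    """Return "UP", "DOWN" or "UNREACHABLE" for `host`.

    check_ok: dict mapping host name -> True if the host answered its check
    parents:  dict mapping host name -> list of parent host names
    """
    if check_ok[host]:
        return "UP"
    host_parents = parents.get(host, [])
    if not host_parents:
        # Nothing sits between the monitoring server and this host.
        return "DOWN"
    # If any parent is UP, the path to the host works, so the host is DOWN.
    if any(host_state(p, check_ok, parents) == "UP" for p in host_parents):
        return "DOWN"
    # Every path to the host is broken, so its real state cannot be known.
    return "UNREACHABLE"


# The server sits behind the router, as in the example in the text.
parents = {"server": ["router"]}
results = {"router": False, "server": False}
print(host_state("router", results, parents))
print(host_state("server", results, parents))
```

With both checks failing, the router (which has no parent) is classified as DOWN and the server behind it as UNREACHABLE.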
Service Checks
Service checks poll the state of specific applications, hardware, software, etc. on individual hosts. The service definitions include parent/child definitions that define service dependencies and reachability logic. Service checks are performed:
- At regular intervals, as defined by the service definition.
- On-demand and controlled/triggered by predictive service dependency checks.
Services are reported as being in an OK, WARNING, UNKNOWN or CRITICAL state. OK means the service has responded as expected. WARNING means the service has reported back to the polling server, but provided information that indicates it is outside defined optimal performance. CRITICAL means the service has failed to respond or has responded with information that indicates it is outside defined acceptable performance. UNKNOWN is less clear-cut: it indicates that the check could not determine the service's state. For instance, an SNMP check that receives no response may report UNKNOWN rather than CRITICAL. Like hosts, services are reported in HARD and SOFT states.
Active Checks
Active checks are configured and controlled by the Nagios and Icinga applications.
Passive Checks
Passive checks are performed by processes outside the Nagios and Icinga applications, typically on the monitored hosts. The monitoring process accepts and interprets the results of these checks as they arrive. They are typically asynchronous. Examples include SNMP traps and security alerts.
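One common way to hand a passive result to the monitoring process is to write a PROCESS_SERVICE_CHECK_RESULT line to the daemon's external command file. The sketch below only builds that line; the function name, host name and service description are hypothetical, and the location of the command file varies by installation, so writing to it is left as a comment.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Format one passive service check result as an external command line."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0 (OK), 1 (WARNING), "
                         "2 (CRITICAL) or 3 (UNKNOWN)")
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")


line = passive_service_result("web01", "Backup Job", 0, "Backup completed")
print(line, end="")

# Submitting the result would mean appending `line` to the external
# command file configured for the daemon (the path is installation-specific):
#     with open(command_file_path, "w") as command_file:
#         command_file.write(line)
```

The host and service named in the line must match objects that are already defined and configured to accept passive checks; otherwise the result is discarded.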
State Types
State types indicate how reliable a given host or service state is.
- Soft -- the host or service has returned a changed state, but its reliability has not yet been verified by repeated rechecks.
- Hard -- the state is considered reliable because it has been repeatedly checked.
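The soft-to-hard progression can be sketched as a small state tracker. This is a deliberately simplified model: the class is hypothetical, the retry limit is named after the max_check_attempts directive found in object definitions, and recoveries are treated as immediately HARD, which glosses over the soft recoveries the real daemons distinguish.

```python
class StateTracker:
    """Track the state and state type of one service across check results."""

    def __init__(self, max_check_attempts=3):
        self.max_check_attempts = max_check_attempts
        self.state = "OK"
        self.state_type = "HARD"
        self.attempt = 1

    def update(self, new_state):
        """Record a check result and return (state, state_type)."""
        if new_state == "OK":
            # Simplification: any OK result is treated as reliable at once.
            self.state, self.state_type, self.attempt = "OK", "HARD", 1
        elif self.state == "OK":
            # First failure after OK: unverified unless only one attempt is allowed.
            self.state, self.attempt = new_state, 1
            self.state_type = "HARD" if self.max_check_attempts <= 1 else "SOFT"
        else:
            # Repeated failure: count rechecks until the limit is reached.
            self.state = new_state
            if self.state_type == "SOFT":
                self.attempt += 1
                if self.attempt >= self.max_check_attempts:
                    self.state_type = "HARD"
        return self.state, self.state_type


tracker = StateTracker(max_check_attempts=3)
for result in ["CRITICAL", "CRITICAL", "CRITICAL", "OK"]:
    print(result, "->", tracker.update(result))
```

With three attempts allowed, the first two CRITICAL results are SOFT and only the third becomes HARD, which is the point at which notifications would normally be considered.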
Time Periods and Notifications
Time periods define when:
- Scheduled host and service checks are performed
- Notifications are sent
- Notification escalations may be used
- Dependencies are valid
Notifications are messages sent via e-mail, pager, SMS, IM and other media in response to specific, defined events.
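As a rough illustration of how a time period gates checks and notifications, the sketch below tests whether a moment falls inside a weekly schedule. The data structure, the function name and the "workhours" period (Monday to Friday, 09:00 to 17:00) are assumptions made for this example; the real time period objects are defined in the object configuration files.

```python
from datetime import datetime, time

# Hypothetical "workhours" period: weekday number (Monday = 0) -> time ranges.
WORKHOURS = {day: [(time(9, 0), time(17, 0))] for day in range(5)}


def in_timeperiod(moment, period):
    """Return True if `moment` falls inside one of the period's ranges."""
    ranges = period.get(moment.weekday(), [])
    return any(start <= moment.time() < end for start, end in ranges)


# A notification raised now would only be sent if this is True.
print(in_timeperiod(datetime.now(), WORKHOURS))
```

The same test can be applied to scheduled checks, escalations and dependencies, each with its own period.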
Event Handlers
Event handlers are external scripts and actions that are triggered by Nagios and Icinga events. Typical uses include:
- Restarting a failed service
- Entering a trouble ticket into a help desk system
- Logging event information to a database
- Power cycling a host
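The first use in the list, restarting a failed service, might be sketched as follows. The example assumes the handler command is defined so that it receives the service state, the state type and the attempt number as arguments (for instance through the $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ macros). The function names, the restart policy and the injected restart callable are hypothetical.

```python
def should_restart(service_state, state_type, attempt, soft_attempt_trigger=3):
    """Decide whether the handler should try to restart the service."""
    if service_state != "CRITICAL":
        return False
    if state_type == "HARD":
        # The failure is confirmed: always attempt a restart.
        return True
    # Still SOFT: act once, just before the failure would become HARD.
    return state_type == "SOFT" and attempt == soft_attempt_trigger


def handle_event(args, restart_service):
    """args: [state, state_type, attempt] as passed by the monitoring daemon."""
    state, state_type, attempt = args[0], args[1], int(args[2])
    if should_restart(state, state_type, attempt):
        restart_service()
        return True
    return False


# The restart action is injected so the policy can be exercised safely.
handle_event(["CRITICAL", "SOFT", "3"], lambda: print("restarting service"))
```

Separating the decision from the action keeps the policy testable; in a deployed handler the callable would invoke whatever mechanism restarts the service on that platform.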
Distributed, Redundant and Fail Over Monitoring
Distributed, redundant and fail over monitoring are complex topics; Nagios and Icinga offer a great deal of software and implementations to achieve load distribution and redundancy, including:
- DNX
- Fusion
- MNTOS
- NDOutils, IDOutils
- Gearman
- Check_MK
The nature and implementations of these services are beyond the scope of this discussion and the reader is referred to project documentation for more detailed descriptions.
Predictive Monitoring
Predictive monitoring optimizes data collection using reachability logic and host and service dependencies. For instance, if a host is unreachable, there is no point expending resources monitoring its services. When the host is again reachable, obtaining updated information about its services is a priority. The following briefly illustrates an example.
In the first image, all hosts and services are available and monitored by a server in Pittsburgh, on the left of the image.
[Figure: Operational Network -- Hosts and Services Available]
A system failure occurs -- the Harrisburg Router shuts down -- and predictive monitoring directives override the default host and service checks. The image below depicts the Harrisburg Router as DOWN (red) and the Harrisburg Backup Server (connected to the WAN through the router) as UNREACHABLE (purple).
[Figure: Harrisburg Router Down, Harrisburg Backup Unreachable]
The detailed status of each office may then be reviewed. Note that the Harrisburg Router host is DOWN and its interfaces are CRITICAL (both red). The Harrisburg Backup host is UNREACHABLE (purple). However, the other services on these two hosts are a mix of CRITICAL (red) and OK (green). The monitoring server, upon determining that the router is DOWN, ceases to check additional services. Those that report CRITICAL were polled before the router host was reported DOWN.
[Figure: Harrisburg Data Center As Seen from Pittsburgh -- Router Down, Backup Unreachable, Several Services Down, Remaining Services not Polled]
The Pittsburgh Data Center is local to the monitoring server and the Philadelphia Data Center is reachable through a redundant link. Those data centers report that the far sides of the Harrisburg WAN links are down.
[Figure: Pittsburgh Data Center -- Harrisburg WAN Link Down]
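The scheduling behaviour illustrated in this example can be sketched as a simple planning function: service checks on hosts that are not UP are skipped, and services on hosts that have just become reachable again are moved to the front of the queue. The function, the host names and the service names are hypothetical and greatly simplify what the daemons actually do.

```python
def plan_checks(services, previous, current):
    """Order service checks using host state.

    services: list of (host, service) pairs
    previous: dict host -> host state at the last planning pass
    current:  dict host -> host state now
    """
    urgent, routine = [], []
    for host, service in services:
        if current.get(host) != "UP":
            continue  # host DOWN or UNREACHABLE: do not poll its services
        if previous.get(host) != "UP":
            urgent.append((host, service))  # host just became reachable again
        else:
            routine.append((host, service))
    return urgent + routine


services = [("pittsburgh-web", "HTTP"),
            ("harrisburg-backup", "SSH"),
            ("harrisburg-router", "eth0")]
healthy = {"pittsburgh-web": "UP", "harrisburg-router": "UP",
           "harrisburg-backup": "UP"}
outage = {"pittsburgh-web": "UP", "harrisburg-router": "DOWN",
          "harrisburg-backup": "UNREACHABLE"}

print(plan_checks(services, healthy, outage))   # during the outage
print(plan_checks(services, outage, healthy))   # after recovery
```

During the outage only the Pittsburgh service is polled; after recovery the two Harrisburg services are checked first so that their stale results are replaced as soon as possible.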
Performance Tuning
Applications with the scope of Nagios and Icinga must offer performance tuning options if they are to scale. The following list briefly summarizes what is available:
- Service Check Latency Monitoring to quickly identify generally poor performance.
- MRTG Performance Statistics to identify specific bottlenecks.
- Process tuning through configuration file options.
- Passive Checks that offload processing from the monitoring server to monitored hosts.
- Embedded Perl Interpreter that offers better performance than an Operating System-installed Perl Interpreter.
- Cached Logic and Host Checks that preclude unnecessary checks.
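The last item, cached checks, lends itself to a short sketch: a recent result is reused instead of running the check again. The class, the injected check function and the 15-second horizon are assumptions chosen for illustration; the real daemons expose their own configuration options for how long a cached result is trusted.

```python
class CachedCheck:
    """Wrap a check function and reuse results that are still recent."""

    def __init__(self, run_check, horizon_seconds=15.0):
        self._run_check = run_check
        self.horizon_seconds = horizon_seconds
        self._result = None
        self._checked_at = None

    def result(self, now):
        """Return the check result as of time `now` (in seconds)."""
        fresh = (self._checked_at is not None
                 and now - self._checked_at <= self.horizon_seconds)
        if not fresh:
            # The cached value is missing or too old: run the real check.
            self._result = self._run_check()
            self._checked_at = now
        return self._result


executions = []
check = CachedCheck(lambda: executions.append(1) or "UP", horizon_seconds=15.0)
check.result(now=100.0)   # runs the check
check.result(now=110.0)   # served from the cache
check.result(now=120.0)   # cache expired, runs the check again
print(len(executions))
```

A longer horizon saves more check executions at the cost of acting on older information, so the value is a trade-off between load and freshness.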