Stephen Fritz on Systems Engineering: November 2013

Nagios/Icinga Installation and Initial Configuration

Nagios / Icinga are the core packages of an entire platform. This installation will include add-on packages -- IDOUtils for MySQL database support and NagVis for visualization -- but the article will focus primarily on installing Icinga.

Icinga has a default web-based interface that is very similar to that of Nagios. However, there are add-on packages -- icinga-web and icinga-cgi -- that require a database. The illustration below shows the installation command on Debian Wheezy. During the installation, a series of pop-up screens will provide configuration prompts. The first step is to install and configure the Postfix mail server and the MySQL database server. After those are installed, add Icinga, Icinga-Web, Icinga-IDOUtils (for database support) and Nagios-NRPE / Nagios Service Check Acceptor for remote host checks. The video below illustrates a command-line installation.

The applications are now installed and operational -- accessible at http://<servername>/icinga. The default installation includes a basic configuration for "localhost" at the loopback address 127.0.0.1. However, there are several tasks to complete before the monitoring systems are functional.

1) Enable External Commands

sed -i -e 's/check_external_commands=0/check_external_commands=1/' /etc/icinga/icinga.cfg
dpkg-statoverride --update --add nagios www-data 2710 /var/lib/icinga/rw
dpkg-statoverride --update --add nagios nagios 751 /var/lib/icinga
service icinga restart

2) Enable IDO2DB Database Functionality
3) Supply a working Nagios-NRPE configuration file
4) Modify the php.ini file's timezone settings for the Icinga-Web Instance Monitoring to work correctly
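Step 4 can be sketched as a small shell function. This is only an illustration under stated assumptions: the php.ini path and the timezone name used in the usage note are not given in the article, and the function name set_php_timezone is hypothetical.

```shell
# Set date.timezone in a php.ini file, whether the existing line is
# commented out (";date.timezone =") or already has a value.
# Arguments: <php.ini path> <timezone name>
set_php_timezone() {
    ini_file=$1
    tz=$2
    tmp_file=$(mktemp) || return 1
    if grep -Eq '^;?[[:space:]]*date\.timezone[[:space:]]*=' "$ini_file"; then
        # Rewrite the existing directive.
        sed -E "s|^;?[[:space:]]*date\.timezone[[:space:]]*=.*|date.timezone = \"$tz\"|" \
            "$ini_file" > "$tmp_file"
    else
        # No directive present: copy the file and append one.
        cat "$ini_file" > "$tmp_file"
        printf 'date.timezone = "%s"\n' "$tz" >> "$tmp_file"
    fi
    # Write back with cat so the original file keeps its permissions.
    cat "$tmp_file" > "$ini_file" && rm -f "$tmp_file"
}
```

On a Debian Wheezy system running PHP 5 under Apache, the call would probably look like: set_php_timezone /etc/php5/apache2/php.ini America/New_York, followed by: service apache2 restart. Both the path and the timezone are assumptions to adjust for the local installation.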

Icinga's web interface to external commands will now work properly, and a basic Icinga installation is operational. The video below demonstrates these steps and also includes an initialization of the system by restoring configuration files that describe the network.

Subsequent articles shall describe additional Nagios / Icinga features and their configuration.

Nagios/Icinga Architecture

Nagios/Icinga are enterprise-class system monitors that can track publicly available services, agent-collected data and SNMP data. Data is presented in web-based formats and -- through additional packages -- graphs and user-configurable visualizations.

The Nagios project commenced in 1996 and was first publicly released -- as NetSaint -- in 1999. The project was renamed Nagios in 2002. By 2005, the project was receiving a great deal of attention from the open source community, and the developers formed Nagios LLC in 2007. By 2009, Nagios LLC had begun to release commercial products and provide support contracts.

Also in 2009, a group of developers forked the Nagios project as Icinga. The fork retains a great deal of compatibility with Nagios (particularly in the configuration files), but offers independently developed solutions and interfaces.

Although Icinga is a fork of Nagios, this article discusses the two projects from a common perspective.

Monitored Systems

Publicly Available Services

Publicly available services are those shared to the network, such as HTTP, SMTP and FTP. Essentially, these are the TCP and UDP services reported by the netstat command.
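As a rough illustration of "the services reported by netstat", the sketch below reduces netstat output to a list of protocol/port pairs. It assumes the column layout printed by the Linux net-tools command netstat -tuln; the function name list_listening_ports is hypothetical.

```shell
# Print "proto/port" for each socket listed on standard input.
# Input is expected in `netstat -tuln` format, where column 1 is the
# protocol and column 4 is the local address ending in ":port".
list_listening_ports() {
    awk '$1 ~ /^(tcp|udp)/ {
        n = split($4, parts, ":")   # the port is the last colon-separated piece
        print $1 "/" parts[n]
    }'
}
```

On a live host this would be used as: netstat -tuln | list_listening_ports. Each line of the result is a candidate for a service check.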

Host Resources

Host resources are not shared across the network and include items such as hard drive / memory performance and utilization, processor utilization, network interface card statistics, etc. These performance parameters cannot be queried directly from the network, so the monitoring server relies on other methods -- including agents and SNMP -- to collect the data.

Agents

Agents are application-specific software that runs on the monitored host. In the Nagios and Icinga systems, the most common are:

Nagios Remote Plugin Executor (NRPE)
Nagios Service Check Acceptor (NSCA)

NRPE is polled by the monitoring server. As the name implies, it executes the requested plugin on the monitored host and returns the result in the format the monitoring application expects; this architecture relieves the monitoring server of some of the processing load and distributes it among the monitored hosts. NSCA works in the opposite direction: monitored hosts submit their check results to a daemon on the monitoring server, which accepts them as passive checks.

Simple Network Management Protocol (SNMP)

SNMP is defined by RFCs and provides a publicly available standard for querying monitored hosts and returning standardized data.

Configuration Files

Nagios and Icinga are configured with text files. The two projects differ slightly in the main configuration file and in several of the add-on files (e.g. database configuration), but the object configuration and command files are compatible.

Main Configuration File

The /etc/nagios3/nagios.cfg and /etc/icinga/icinga.cfg files define environment, scheduling, logging and performance parameters for the monitoring daemons.

Object Configuration Files

Objects are defined in a series of files in the /etc/nagios3/conf.d and /etc/icinga/objects directories. Objects may be divided into the following categories:

Services
Service Groups
Hosts
Host Groups
Contacts
Contact Groups
Commands
Time Periods
Notification Escalations
Notification and Execution Dependencies
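To make the object syntax concrete, the following sketch prints one host and one service definition. It is wrapped in a shell function so the result can be redirected into an object file. The templates generic-host and generic-service and the check_http command are assumed to come from the packaged default configuration; the function name and the host details in the usage note are hypothetical.

```shell
# Emit a host definition and one service definition in Nagios/Icinga
# object syntax. Arguments: <host_name> <address>
print_host_objects() {
    cat <<EOF
define host {
    use                 generic-host      ; assumed template from the packaged defaults
    host_name           $1
    address             $2
}

define service {
    use                 generic-service   ; assumed template from the packaged defaults
    host_name           $1
    service_description HTTP
    check_command       check_http        ; assumed command from the packaged plugin configuration
}
EOF
}
```

For example: print_host_objects web01 192.0.2.10 > /etc/icinga/objects/web01.cfg. Running the daemon in verify mode (icinga -v /etc/icinga/icinga.cfg) before restarting is a sensible way to catch syntax errors.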

Plugins

Nagios and Icinga do not maintain internal processes to perform service monitoring. Instead, these actions are performed by plugins: external executable files and scripts (Perl, shell, etc.) that perform the host queries. Nagios and Icinga distributions typically ship with a set of plugins, but many more may be added from external sources, such as the Nagios Exchange.
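A plugin communicates with the monitoring daemon through one line of output and its exit status: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The sketch below shows that contract with a hypothetical threshold check; a real plugin would first measure something (disk usage, load, response time) and would normally be a standalone script that calls exit rather than a function that calls return.

```shell
# Compare a measured value against warning and critical thresholds and
# report the result the way a plugin does: one status line plus an
# exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
# Arguments: <value> <warning threshold> <critical threshold>
check_threshold() {
    if [ $# -ne 3 ]; then
        echo "UNKNOWN - usage: check_threshold <value> <warn> <crit>"
        return 3
    fi
    for arg in "$@"; do
        case $arg in
            ''|*[!0-9]*)
                echo "UNKNOWN - '$arg' is not a whole number"
                return 3
                ;;
        esac
    done
    if [ "$1" -ge "$3" ]; then
        echo "CRITICAL - value is $1 (critical at $3)"
        return 2
    elif [ "$1" -ge "$2" ]; then
        echo "WARNING - value is $1 (warning at $2)"
        return 1
    fi
    echo "OK - value is $1"
    return 0
}
```

Calling check_threshold 85 80 90 prints a WARNING line and returns status 1, which the daemon would record as a WARNING state for the service.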

Host Checks

Host checks only determine if a host is available or not. The host definitions include parent/child definitions that define host dependencies and reachability logic. Host checks are performed:

At regular intervals, as defined by the host definition.
On-demand when a host's service state changes.
On-demand and controlled/triggered by host reachability logic.
On-demand and controlled/triggered by host dependency logic.

Hosts are reported as being in an UP, DOWN or UNREACHABLE state. UP and DOWN are self-explanatory. UNREACHABLE is a state in which access to a host is not available because an intermediary host is DOWN. Thus, if a server is behind a router and the router is reported DOWN, the server will be reported as UNREACHABLE. Each state may be either HARD or SOFT. A SOFT state is reported at a host's first state change. The host is then rechecked a specified number of times, after which the state is reported as HARD. These state types are used to control things such as notification, reducing the number of false alarms from the system.
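The reachability rule can be sketched as a small decision function. This is a simplification that assumes a single parent per host (real definitions may list several parents), and the function name classify_host is hypothetical.

```shell
# Decide how a host should be reported.
# Arguments: <host check result: up|down>
#            <state of the parent host: UP|DOWN|UNREACHABLE|none>
classify_host() {
    if [ "$1" = "up" ]; then
        echo "UP"
    elif [ "$2" = "UP" ] || [ "$2" = "none" ]; then
        echo "DOWN"          # the path to the host is intact, so the host itself failed
    else
        echo "UNREACHABLE"   # the parent is not UP, so the host cannot be judged
    fi
}
```

With the router example above, classify_host down DOWN yields UNREACHABLE for the server, while classify_host down UP yields DOWN.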

Service Checks

Service checks poll the state of specific applications, hardware, software, etc. on individual hosts. The service definitions include parent/child definitions that define service dependencies and reachability logic. Service checks are performed:

At regular intervals, as defined by the service definition.
On-demand and controlled/triggered by predictive service dependency checks.

Services are reported as being in an OK, WARNING, UNKNOWN or CRITICAL state. OK means the service has responded as expected. WARNING means the service has reported back to the polling server, but provided information that indicates it is outside defined optimal performance. CRITICAL means the service has failed to respond or has responded with information that indicates it is outside defined acceptable performance. UNKNOWN means the check itself could not determine the state; for instance, an SNMP check that receives no response may report UNKNOWN rather than CRITICAL. Services also report HARD and SOFT states.

Active Checks

Active checks are configured and controlled by the Nagios and Icinga applications.

Passive Checks

Passive checks are performed outside the Nagios and Icinga applications, typically by processes on the monitored hosts. The monitoring process accepts and interprets the results as they arrive. They are typically asynchronous. Examples include SNMP traps and security alerts.
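Passive results reach the daemon as external commands, one line each in the form [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return code;output. The sketch below formats such a line. The function name and the host and service names in the usage note are hypothetical, and the command file path is an assumption based on the Debian Icinga package.

```shell
# Format a passive service check result as an external command line.
# Arguments: <host name> <service description> <return code 0-3> <plugin output>
format_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}
```

On the monitoring server the line would be written to the command file, for example: format_passive_result backup01 "Nightly Backup" 0 "OK - backup finished" > /var/lib/icinga/rw/icinga.cmd. This only works when external commands are enabled and the service is configured to accept passive checks.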

State Types

State types indicate how reliable a reported host or service state is.

Soft -- the host or service has returned a changed state, but its reliability has not been verified by repeated rechecks.
Hard -- the state is considered reliable because it has been repeatedly checked.
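The progression from SOFT to HARD can be traced with a short simulation. This is a simplified model, not the daemon's actual logic: it treats every non-OK result as the same problem, takes the recheck limit (the max_check_attempts setting) as its first argument, and does not model the state type of recoveries. The function name is hypothetical.

```shell
# Replay a sequence of check results and print the state type after
# each problem result. Arguments: <max check attempts> <result>...
trace_state_types() {
    max_attempts=$1
    shift
    attempt=0
    for result in "$@"; do
        if [ "$result" = "OK" ]; then
            attempt=0                       # a recovery resets the retry counter
            echo "OK"
        else
            if [ "$attempt" -lt "$max_attempts" ]; then
                attempt=$((attempt + 1))
            fi
            if [ "$attempt" -ge "$max_attempts" ]; then
                state_type="HARD"           # confirmed by repeated rechecks
            else
                state_type="SOFT"           # changed, but not yet confirmed
            fi
            echo "$result $state_type (attempt $attempt/$max_attempts)"
        fi
    done
}
```

For example, trace_state_types 3 OK CRITICAL CRITICAL CRITICAL reports the first two CRITICAL results as SOFT and the third as HARD, which is the point at which notifications would normally be sent.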

Time Periods and Notifications

Time periods define when:

Scheduled host and service checks are performed
Notifications are sent
Notification escalations may be used
Dependencies are valid

Notifications are messages sent via e-mail, pager, SMS, IM and other media in response to specific, defined events.

Event Handlers

Event handlers are external scripts and actions that are triggered by Nagios and Icinga events. Typical uses include:

Restarting a failed service
Entering a trouble ticket into a help desk system
Logging event information to a database
Power Cycling a host
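The first use, restarting a failed service, might be sketched as follows. The daemon passes the service state, state type and attempt number to the handler (through the $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ macros); the function name, the "third soft attempt" policy and the restart command in the usage note are assumptions for illustration.

```shell
# Decide what an event handler should do for one service event.
# Arguments stand in for the $SERVICESTATE$, $SERVICESTATETYPE$ and
# $SERVICEATTEMPT$ macros supplied by the monitoring daemon.
decide_event_action() {
    state=$1
    state_type=$2
    attempt=$3
    restart_after=3     # assumed policy: try a restart on the third soft failure
    if [ "$state" != "CRITICAL" ]; then
        echo "none"         # OK, WARNING and UNKNOWN need no action here
    elif [ "$state_type" = "HARD" ]; then
        echo "restart"      # failure confirmed by rechecks
    elif [ "$attempt" -ge "$restart_after" ]; then
        echo "restart"      # still SOFT, but it has failed several times in a row
    else
        echo "none"         # early soft failure: wait for the rechecks
    fi
}
```

A handler script would act on the answer, for example by running service apache2 restart when the function prints restart. The script is attached to a service through the event_handler directive in its definition.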

Distributed, Redundant and Fail Over Monitoring

Distributed, redundant and fail over monitoring are complex topics; Nagios and Icinga offer a great deal of software and implementations to achieve load distribution and redundancy.

DNX
Fusion
MNTOS
NDOUtils, IDOUtils
Gearman
Check_MK

The nature and implementations of these services are beyond the scope of this discussion and the reader is referred to project documentation for more detailed descriptions.

Predictive Monitoring

Predictive monitoring optimizes data collection using reachability logic and host and service dependencies. For instance, if a host is unreachable, there is no point expending resources monitoring its services. When the host is again reachable, obtaining updated information about its services is a priority. The following briefly illustrates an example.

In the first image, all hosts and services are available and monitored by a server in Pittsburgh, on the left of the image.

Operational Network -- Hosts and Services Available

A system failure occurs -- the Harrisburg Router shuts down -- and predictive monitoring directives override default host and service checks. The image below depicts the Harrisburg Router as DOWN (red) and the Harrisburg Backup Server (connected to the WAN through the router) as UNREACHABLE (purple).

Harrisburg Router Down, Harrisburg Backup Unreachable

A detailed status of each office may then be reviewed. Note that the Harrisburg Router host is DOWN and its interfaces are CRITICAL (both red). The Harrisburg Backup host is UNREACHABLE (purple). However, the other services on these two hosts are a mix of CRITICAL (red) and OK (green). The monitoring server, upon determining that the router is DOWN, ceases to check additional services. Those that report CRITICAL were polled before the router host was reported DOWN.

Harrisburg Data Center As Seen from Pittsburgh -- Router Down, Backup Unreachable, Several Services Down, Remaining Services not Polled

The Pittsburgh Data Center is local to the monitoring server and the Philadelphia Data Center is reachable through a redundant link. Those data centers report that the far side of the Harrisburg WAN links is down.

Pittsburgh Data Center -- Harrisburg WAN Link Down

Performance Tuning

Applications with the scope of Nagios and Icinga need performance tuning options in order to scale. The following list briefly summarizes what is available:

Service Check Latency Monitoring to quickly identify generally poor performance.
MRTG Performance Statistics to identify specific bottlenecks.
Process tuning through configuration file options.
Passive Checks that offload processing from the monitoring server to monitored hosts.
Embedded Perl Interpreter that offers better performance than an Operating System-installed Perl Interpreter.
Cached Logic and Host Checks that preclude unnecessary checks.

Saturday, November 23, 2013

Nagios/Icinga Installation and Initial Configuration

Thursday, November 21, 2013

Nagios/Icinga Architecture