
Thursday, November 21, 2013

Nagios/Icinga Architecture

Nagios and Icinga are enterprise-class system monitors that can track publicly available services, agent-collected data and SNMP data.  Results are presented in a web interface and -- through additional packages -- as graphs and user-configurable visualizations.
Icinga CGI Display

The Nagios project began in 1996 and was first publicly released -- as NetSaint -- in 1999.  The project was renamed Nagios in 2002.  By 2005, the project was receiving a great deal of attention from the open source community, and the developers formed Nagios LLC in 2007.  In 2009, Nagios LLC began to release commercial products and provide support contracts.

Also in 2009, a group of developers forked the Nagios project to create Icinga.  Icinga retains a great deal of compatibility with Nagios (particularly in its configuration files), but offers independently developed solutions and interfaces.

Although the two projects have diverged, this article discusses them from a common perspective.

Monitored Systems


Publicly Available Services

Publicly available services are those shared to the network, such as HTTP, SMTP, FTP, etc.  Essentially, these are the listening TCP and UDP services that the netstat command reports on a host.

Host Resources

Host resources are not shared across the network and include items such as hard drive and memory performance and utilization, processor utilization, network interface card statistics, etc.  These performance parameters cannot be queried directly over the network, so other methods -- including agents and SNMP -- are used to collect the data.

Agents

Agents are application-specific programs that run on the monitored host.  In the Nagios and Icinga systems, the most common are:


  • Nagios Remote Plugin Executor (NRPE)
  • Nagios Service Check Acceptor (NSCA)
NRPE is polled by the monitoring server: the check runs on the monitored host and returns results already formatted for the Nagios application.  This architecture relieves the monitoring server of some of the processing load and distributes it among the monitored hosts.  NSCA works in the opposite direction: monitored hosts submit their results to the monitoring server, which accepts them as passive checks.

Simple Network Management Protocol (SNMP)

SNMP is defined by a series of RFCs and provides a publicly available standard for querying monitored hosts and returning standardized data.

Configuration Files

Nagios and Icinga are configured with text files.  The two projects differ slightly in the main configuration file and in several of the add-on files (e.g. database configuration), but the object configuration and command files are compatible.

Main Configuration File

The /etc/nagios3/nagios.cfg and /etc/icinga/icinga.cfg files define environment, scheduling, logging and performance parameters for the monitoring daemons.

Object Configuration Files

Objects are defined in a series of files in the /etc/nagios3/conf.d and /etc/icinga/objects directories.  Objects may be divided into the following categories:


  • Services
  • Service Groups
  • Hosts
  • Host Groups
  • Contacts
  • Contact Groups
  • Commands
  • Time Periods
  • Notification Escalations
  • Notification and Execution Dependencies

Plugins

Nagios and Icinga do not maintain internal processes to perform service monitoring.  Instead, these actions are performed by plugins: external executables and scripts (Perl, shell, etc.) that perform the host queries.  Nagios and Icinga distributions typically ship with a set of plugins, but many more may be added from external sources, such as the Nagios Exchange.
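
To make the plugin idea concrete, here is a minimal sketch of a plugin written in Python.  It assumes the conventional plugin interface, in which the plugin prints one line of status text and returns an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).  The function names and the 80/90 percent thresholds are illustrative assumptions, not part of any shipped plugin.

```python
#!/usr/bin/env python3
"""Plugin sketch: report filesystem usage in the conventional plugin format."""
import shutil

# Conventional plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage onto an exit code (thresholds are illustrative)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/"):
    """Return (exit_code, one_line_output) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself failed, so the real state cannot be determined.
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    percent = 100.0 * usage.used / usage.total
    code = classify(percent)
    return code, f"DISK {STATE_NAMES[code]} - {percent:.1f}% used on {path}"


def main():
    """Print the status line and return the exit code for the caller."""
    code, message = check_disk()
    print(message)
    return code

# When installed as a plugin, end the script with:
#     import sys; sys.exit(main())
```

The monitoring daemon only needs the exit code and the first line of output, which is why plugins can be written in any language.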

Host Checks

Host checks determine only whether a host is available.  The host definitions include parent/child definitions that define host dependencies and reachability logic.  Host checks are performed:
  • At regular intervals, as defined in the host definition.
  • On-demand when a host's service state changes.
  • On-demand and controlled/triggered by host reachability logic.
  • On-demand and controlled/triggered by host dependency logic.

Hosts are reported as being in an UP, DOWN or UNREACHABLE state.  UP and DOWN are self-explanatory.  UNREACHABLE is a state in which access to a host is not available because an intermediary host is DOWN.  Thus, if a server is behind a router and the router is reported DOWN, the server will be reported as UNREACHABLE.  Each state may be either HARD or SOFT.  The SOFT state is reported at a host's first state change.  The host is then rechecked a specified number of times, after which it is reported in a HARD state.  These state types are used to control things such as notifications, reducing the number of false alarms from the system.
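
The reachability rule described above can be sketched in a few lines of Python.  This is a simplified model rather than the daemons' actual implementation: it assumes a failed host is DOWN if it has no parents or if at least one parent is UP, and UNREACHABLE otherwise.  The host names and the `parents` and `responds` dictionaries are hypothetical.

```python
def host_state(host, parents, responds):
    """Classify a host as UP, DOWN or UNREACHABLE from its parents' states."""
    if responds[host]:
        return "UP"
    host_parents = parents.get(host, [])
    if not host_parents:
        # Nothing stands between the monitor and this host.
        return "DOWN"
    if any(host_state(p, parents, responds) == "UP" for p in host_parents):
        # A path to the host exists, so the host itself has failed.
        return "DOWN"
    # Every path to the host is blocked by a failed parent.
    return "UNREACHABLE"


# Hypothetical topology: monitor -> router -> backup server.
parents = {
    "harrisburg-router": ["pittsburgh-monitor"],
    "harrisburg-backup": ["harrisburg-router"],
}
responds = {
    "pittsburgh-monitor": True,
    "harrisburg-router": False,
    "harrisburg-backup": False,
}

for name in responds:
    print(name, host_state(name, parents, responds))
```

With this data the router is classified as DOWN and the backup server behind it as UNREACHABLE.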

Service Checks

Service checks poll the state of specific applications, hardware, software, etc. on individual hosts.  The service definitions include parent/child definitions that define service dependencies and reachability logic.  Service checks are performed:


  • At regular intervals, as defined by the service definition.
  • On-demand and controlled/triggered by predictive service dependency checks.
Services are reported as being in an OK, WARNING, UNKNOWN or CRITICAL state.  OK means the service has responded as expected.  WARNING means the service has reported back to the polling server, but provided information that indicates it is outside defined optimal performance.  CRITICAL means the service has failed to respond or has responded with information that indicates it is outside defined acceptable performance.  UNKNOWN means the check could not determine the state of the service; for instance, an SNMP check that receives no response may report UNKNOWN rather than CRITICAL.  Services also report in HARD and SOFT states.

Active Checks

Active checks are configured and controlled by the Nagios and Icinga applications.

Passive Checks

Passive checks are performed on the hosts outside the Nagios and Icinga applications.  The monitoring process accepts and interprets these service checks as they occur.  They are typically asynchronous.  Examples include SNMP Traps and Security Alerts.
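
As an illustration, a passive service result is typically handed to the daemon as a PROCESS_SERVICE_CHECK_RESULT external command.  The sketch below only builds that command line; the function name and the host and service names are hypothetical, and the location of the external command file (set by the command_file directive in the main configuration file) varies by installation.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Build one external-command line reporting a passive service check."""
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


# Hypothetical example: report a CRITICAL (2) result for a backup job.
line = passive_service_result("backup01", "Nightly Backup", 2, "backup job failed")
print(line, end="")
```

Writing that line to the external command file submits the result; the daemon then processes it as if it had run the check itself.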

State Types

State types indicate how reliable a reported host or service state is.
  • Soft -- a host or service has returned a changed state, but its reliability has not yet been verified by repeated rechecks.
  • Hard -- the state is considered reliable because it has been confirmed by repeated rechecks.
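
The soft-to-hard transition can be modelled as a small state machine.  The sketch below is a simplification under stated assumptions: it uses a retry limit named after the max_check_attempts directive, and it treats every recovery to OK as immediately hard, which glosses over the soft recoveries the real daemons distinguish.  The `CheckState` class and `apply_result` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CheckState:
    state: str = "OK"
    state_type: str = "HARD"
    attempt: int = 1


def apply_result(current, new_state, max_check_attempts=3):
    """Return the next CheckState after one check result arrives."""
    if new_state == "OK":
        # Simplification: treat every recovery as immediately reliable.
        return CheckState("OK", "HARD", 1)
    if current.state == "OK":
        attempt = 1  # first sign of trouble
    else:
        attempt = min(current.attempt + 1, max_check_attempts)
    state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
    return CheckState(new_state, state_type, attempt)


# Three consecutive failures: SOFT, SOFT, then HARD.
status = CheckState()
for _ in range(3):
    status = apply_result(status, "CRITICAL")
    print(status)
```

Notifications are normally tied to hard states, so a single failed check that recovers on the next attempt never reaches a contact.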

Time Periods and Notifications

Time periods define when:
  • Scheduled host and service checks are performed
  • Notifications are sent
  • Notification escalations may be used
  • Dependencies are valid
Notifications are messages sent via e-mail, pager, SMS, IM and other media in response to specific, defined events.
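
To show how a time period gates an action, the following sketch tests whether a moment falls inside a period made of per-weekday time ranges.  The `WORKHOURS` data and the function names are hypothetical, and real time period definitions support more than this (exceptions and exclusions, for example).

```python
from datetime import datetime

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]

# Hypothetical period: weekdays, 09:00 to 17:00.
WORKHOURS = {day: [("09:00", "17:00")] for day in DAYS[:5]}


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_period(period, when):
    """Return True if `when` falls inside one of the period's ranges."""
    minute_of_day = when.hour * 60 + when.minute
    for start, end in period.get(DAYS[when.weekday()], []):
        if to_minutes(start) <= minute_of_day < to_minutes(end):
            return True
    return False


print(in_period(WORKHOURS, datetime(2013, 11, 21, 10, 30)))  # a Thursday morning
print(in_period(WORKHOURS, datetime(2013, 11, 23, 10, 30)))  # a Saturday morning
```

A notification or scheduled check would be allowed only when the corresponding test succeeds.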

Event Handlers

Event handlers are external scripts and actions that are triggered by Nagios and Icinga events.  Typical uses include:
  • Restarting a failed service
  • Entering a trouble ticket into a help desk system
  • Logging event information to a database
  • Power Cycling a host
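
The first use above, restarting a failed service, might look like the following sketch.  It assumes the handler is passed the service state, state type and attempt number (available as the $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ macros).  The policy of acting on the third soft attempt, the function names, and the restart command are all assumptions for illustration.

```python
import subprocess


def should_restart(state, state_type, attempt, soft_attempt_to_act=3):
    """Decide whether this event warrants a restart attempt."""
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    # Try a restart late in the soft retries, before the state turns hard.
    return state_type == "SOFT" and attempt == soft_attempt_to_act


def handle_event(state, state_type, attempt, restart_command):
    """Run `restart_command` (a list of arguments) when the policy says so."""
    if should_restart(state, state_type, int(attempt)):
        subprocess.run(restart_command, check=False)
        return True
    return False
```

A wrapper script would call `handle_event` with the macro values from its command line and a site-specific restart command; acting during the soft retries gives the service a chance to recover before anyone is notified.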

Distributed, Redundant and Fail Over Monitoring

Distributed, redundant and fail over monitoring are complex topics; Nagios and Icinga offer a great deal of software and implementations to achieve load distribution and redundancy.

  • DNX
  • Fusion
  • MNTOS
  • NDOUtils, IDOUtils
  • Gearman
  • Check_MK

The nature and implementations of these services are beyond the scope of this discussion and the reader is referred to project documentation for more detailed descriptions.

Predictive Monitoring

Predictive monitoring optimizes data collection using reachability logic and host and service dependencies.  For instance, if a host is unreachable, there is no point expending resources monitoring its services.  When the host is again reachable, obtaining updated information about its services is a priority.  The following briefly illustrates an example.

In the first image, all hosts and services are available and monitored by a server in Pittsburgh, on the left of the image.
NagVis Enterprise Visualization
Operational Network -- Hosts and Services Available
A system failure occurs -- the Harrisburg Router shuts down -- and predictive monitoring directives override default host and service checks.  The image below depicts the Harrisburg Router as DOWN (red) and the Harrisburg Backup Server (connected to the WAN through the router) as UNREACHABLE (purple).
NagVis Enterprise Visualization
Harrisburg Router Down, Harrisburg Backup Unreachable
A detailed status of each office may then be reviewed.  Note that the Harrisburg Router host is DOWN and its interfaces are CRITICAL (both red).  The Harrisburg Backup host is UNREACHABLE (purple).  However, the other services on these two hosts are a mix of CRITICAL (red) and OK (green).  The monitoring server, upon determining that the router is DOWN, ceases to check additional services.  Those that report CRITICAL were polled before the router host was reported DOWN.
NagVis Data Center Visualization
Harrisburg Data Center As Seen from Pittsburgh -- Router Down, Backup Unreachable, Several Services Down, Remaining Services not Polled
The Pittsburgh Data Center is local to the monitoring server and the Philadelphia Data Center is reachable through a redundant link.  Those data centers report that the far side of the Harrisburg WAN links is down.
NagVis Data Center Visualization
Pittsburgh Data Center -- Harrisburg WAN Link Down
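
The behaviour in this example, where services on a DOWN or UNREACHABLE host are no longer polled, can be sketched as a simple filter.  This models the idea only; the function name and sample data are hypothetical, and the daemons' real scheduling is driven by the configured dependencies.

```python
def services_to_check(services_by_host, host_states):
    """Return (host, service) pairs worth polling: only services on UP hosts."""
    due = []
    for host, services in services_by_host.items():
        if host_states.get(host) != "UP":
            continue  # polling would only waste resources
        due.extend((host, service) for service in services)
    return due


# Hypothetical data loosely following the example above.
services_by_host = {
    "pittsburgh-router": ["WAN Link", "CPU"],
    "harrisburg-router": ["WAN Link", "CPU"],
    "harrisburg-backup": ["Disk", "Backup Job"],
}
host_states = {
    "pittsburgh-router": "UP",
    "harrisburg-router": "DOWN",
    "harrisburg-backup": "UNREACHABLE",
}
print(services_to_check(services_by_host, host_states))
```

Only the Pittsburgh services remain in the polling list until the Harrisburg hosts return to UP.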


Performance Tuning

Monitoring systems with the scope of Nagios and Icinga need performance tuning options in order to scale.  The following list briefly summarizes what is available:
  • Service Check Latency Monitoring to quickly identify generally poor performance.
  • MRTG Performance Statistics to identify specific bottlenecks.
  • Process tuning through configuration file options.
  • Passive Checks that offload processing from the monitoring server to monitored hosts.
  • Embedded Perl Interpreter that offers better performance than an Operating System-installed Perl Interpreter.
  • Cached Logic and Host Checks that preclude unnecessary checks.
