Tuesday, March 25, 2014

High-Availability Nagios / Icinga on a DRBD - Corosync - Pacemaker Failover Cluster

Failover clustering is relatively uncomplicated and provides high availability.  A failover cluster consists of two servers -- preferably almost identical -- of which one is active and the other passive.  If the active server fails or is manually taken offline, the passive server assumes the active role.

This may sound as though the two nodes are independent, but they share resources.  There are data and shared services that must be available to both servers (but not simultaneously).  Shared data is stored on a shared drive controlled by the operating system with the help of clustering software or cluster-aware file systems on a SAN; shared services are controlled by clustering software.
DRBD High-Availability Clustering


Installing and Configuring DRBD, Corosync and Pacemaker with LCMC

The first step is to provide the servers with a shared data drive.  In this case, the shared drive will be two physical drives -- one on each server -- replicated by the Distributed Replicated Block Device (DRBD) software package.  DRBD is analogous to drive mirroring, except that the drives reside on different servers and are mirrored over a network connection.  The / partition of each server is installed on drive /dev/sda, and a blank drive -- /dev/sdb -- will be controlled by DRBD.  Once configured, /dev/sdb will no longer be accessed directly and will instead be addressed as /dev/drbd0.

The second step is to share the drive with the servers, control which one is active and assign a shared IP address by which the cluster will be available.  This is controlled by the Pacemaker - Corosync clustering software packages.  Pacemaker and Corosync use network interfaces (preferably at least two) to maintain a "ring" over which the servers exchange status updates.  The cluster will move resource control from one server to the other, either manually or automatically.

For this setup, the servers each have four network controllers:

  • eth0 -- publicly-available
  • eth1 -- dedicated to DRBD data control and replication
  • eth2 and eth3 -- dedicated to Pacemaker - Corosync services control
There are several files that control these applications and they may be configured manually. DRBD is set up with the files /etc/drbd.d/global_common.conf and /etc/drbd.d/r0.res (or more if there are multiple drive resources).  Pacemaker - Corosync is set up with the /etc/corosync/corosync.conf file.
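
For readers who prefer to see what a manually written DRBD resource file contains, here is a minimal Python sketch that renders one for the two-node layout described above.  The resource name r0 and the devices /dev/drbd0 and /dev/sdb come from this post; the hostnames, the eth1 addresses, port 7788 and the internal meta-disk are my assumptions for illustration only.  DRBD expects each "on" hostname to match the output of uname -n on that node.

```python
def render_drbd_resource(name, device, disk, nodes, port=7788):
    """Build the text of a DRBD resource file for a two-node cluster.

    `nodes` maps each node's hostname to its replication-network IP address.
    """
    lines = [f"resource {name} {{"]
    for hostname, address in nodes.items():
        lines += [
            f"  on {hostname} {{",
            f"    device    {device};",
            f"    disk      {disk};",
            f"    address   {address}:{port};",
            "    meta-disk internal;",
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical hostnames and eth1 addresses; substitute the real ones.
    print(render_drbd_resource(
        "r0", "/dev/drbd0", "/dev/sdb",
        {"node1": "192.168.10.1", "node2": "192.168.10.2"},
    ))
```

The printed text is what would be saved as /etc/drbd.d/r0.res on both servers; LCMC generates an equivalent file, so this is only useful when configuring by hand.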

There is another application -- the Linux Cluster Management Console (LCMC) -- that automates setting up such a cluster.  LCMC is Java-based and will install / configure all of the software necessary to operate a failover cluster.

The video below demonstrates adding the two unconfigured servers to LCMC, installing and configuring DRBD, Pacemaker and Corosync, setting up and replicating a shared ext4 partition on the shared data drive, adding a Pacemaker-controlled drive partition, adding a shared IP address and testing Active-Passive operations and failover.

Shared Apache, Postfix, MySQL and Icinga Resources

Once the basic shared resources are configured and tested, we can add applications that are controlled by the cluster.  There are a number of other posts that describe setting up Nagios and Icinga in this blog; refer to them for details.

Begin by installing and configuring the Apache2 web server, Postfix mail server and MySQL database server.  Then install and configure all of the Icinga monitoring, web and database packages.  These must be operating correctly on both servers and have identical configurations before clustering may proceed.

Now an important distinction must be made:  data and services that are controlled by the operating system versus those controlled by the cluster.  The clustering software runs independently on each server and is controlled by the operating system, so it is available at boot time.  The web, mail and database servers share data and resources and are controlled by the cluster; they must be disabled at boot time and started by the clustering software.  This is very important.  For instance, MySQL will be configured so that the configuration files (/etc/mysql/*.*) and data are moved to the shared drive and symlinked back to their original locations.  Since only one server at a time has access to the configuration and data, the MySQL daemon must be disabled at boot; the DRBD - Pacemaker - Corosync clustering software decides which server has access to the files and starts the services there.  I use the Webmin interface to disable all shared services at boot time.

This process is illustrated in the video below.  Upon completion, the cluster will be in control of all shared services (web, mail, database, Icinga, etc.).  Failover is illustrated.

Testing High-Availability Nagios / Icinga on a DRBD - Corosync - Pacemaker Failover Cluster

Once failover is demonstrated, it is time to allow the servers to collect, process and display data.  Which shared data files to move to the replicated DRBD drive depends upon what services are installed.  For instance, PNP4Nagios -- the RRDTool graphing add-on -- stores shared data in /var/lib/pnp4nagios; this directory is moved to the DRBD drive and symlinked back to each node.
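
The move-and-symlink step can be sketched in Python as follows.  This is only an illustration, not the procedure used in the video: the function names are hypothetical, the mount point of the DRBD drive is whatever you pass in as shared_root, and the ".local" suffix used to set aside the passive node's copy is my own convention.  The services that use the directory should be stopped first.

```python
import os
import shutil
from pathlib import Path


def relocate_to_shared(source, shared_root):
    """Active node: move `source` under `shared_root`, leave a symlink behind."""
    source = Path(source)
    target = Path(shared_root) / source.name
    if source.is_symlink():
        # Already relocated, e.g. the script was run before on this node.
        return Path(os.readlink(source))
    if target.exists():
        raise FileExistsError(f"{target} already exists on the shared drive")
    shutil.move(str(source), str(target))
    source.symlink_to(target, target_is_directory=True)
    return target


def link_to_shared(source, shared_root):
    """Passive node: set the local copy aside and point at the shared path.

    The symlink dangles until the cluster mounts the drive on this node.
    """
    source = Path(source)
    target = Path(shared_root) / source.name
    if source.is_symlink():
        return target
    if source.exists():
        source.rename(source.with_name(source.name + ".local"))
    source.symlink_to(target, target_is_directory=True)
    return target
```

On the active node one would call relocate_to_shared("/var/lib/pnp4nagios", mount_point) with the DRBD drive mounted, and on the passive node link_to_shared with the same arguments.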

Also be aware that some directories do not move well.  There are numerous symlinks in the base installations that end up pointing to nothing if moved and symlinked back to the node file systems.  The /etc/icinga directory is a case in point.  If this directory is moved to the shared drive, Icinga's operation becomes, at best, unstable.  Thus, updates to the Icinga configuration must be manually installed on EACH node.
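
Before moving a directory, it can help to list the symlinks inside it that are likely to break.  The sketch below is a rough check under one assumption of mine: that the trouble comes from relative symlinks whose targets lie outside the directory being moved, since those are resolved from the link's new physical location after the move.  It does not explain every instability, and the function name is hypothetical.

```python
import os
from pathlib import Path


def fragile_symlinks(root):
    """Return (link, target) pairs for relative symlinks pointing outside `root`."""
    root = Path(root).resolve()
    fragile = []
    # os.walk does not descend into symlinked directories by default,
    # but it still lists them in `dirnames`, so both lists are checked.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():
                continue
            link = os.readlink(path)
            if os.path.isabs(link):
                continue  # absolute targets do not depend on where the link lives
            resolved = Path(os.path.normpath(os.path.join(dirpath, link)))
            if resolved != root and root not in resolved.parents:
                fragile.append((path, link))
    return sorted(fragile)


if __name__ == "__main__":
    for link, target in fragile_symlinks("/etc/icinga"):
        print(f"{link} -> {target}")
```

An empty result does not guarantee that a directory is safe to move; it only rules out this one class of broken link.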


However, as the video below demonstrates, sharing the MySQL databases and moving selected PNP4Nagios and NagVis directories to the shared drive together provide high-availability performance.