
Tuesday, July 14, 2015

Zabbix Template for Squid Proxy Server SNMP

This article describes how to configure SNMP on Squid Proxy Servers and a Zabbix Template to monitor data.


Introduction

The Squid Proxy Server does not use the Net-SNMP daemon typical of Linux installations.  Instead, it uses a separate binary that runs as its own process.  The binary is installed by default with Squid 3.x and later; for Squid 2.x and earlier, SNMP support must be enabled at compile time and the SNMP binary built separately.  Once the SNMP binaries are available, they may be configured in the squid.conf file.  The separate SNMP process may then be queried by SNMP monitoring tools such as Zabbix.

Configuring the Squid SNMP Process

The Squid Proxy web site provides a detailed HOW-TO page.  Assigning ACLs and listener IP addresses for security is strongly recommended: for instance, create a host ACL for the Zabbix Server's IP address, limit SNMP access to that ACL and bind the listener to a single interface.  For demonstration purposes, this article does not use a secure configuration.  Familiarity with Squid's ACLs is assumed for a live deployment.

Minimum squid.conf Configuration

A minimal Squid SNMP configuration requires the following lines in the squid.conf file:
snmp_port 3401
acl <acl_name> snmp_community <community_name>
snmp_access allow all
These lines set the SNMP listener port (3401 is a widely-referenced port), listening on all interfaces by default; create an snmp_community-type ACL and assign it a community name (public is a de facto, but insecure, standard); and allow access over port 3401 to all hosts.
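For a live deployment, a more restrictive configuration might look like the following sketch.  The listener address 10.128.0.254 and the Zabbix Server address 10.128.0.101 are placeholders; substitute your own values:
snmp_port 3401
snmp_incoming_address 10.128.0.254
acl snmppublic snmp_community public
acl zabbixsrv src 10.128.0.101/32
snmp_access allow snmppublic zabbixsrv
snmp_access deny all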

Squid Proxy Zabbix Template

The Zabbix Share web site hosts the Squid Proxy Server template, which includes several customizations for querying Squid Proxies.

Two macros are included in the template:
{$SQUID_SNMP_COMMUNITY} > public
{$SQUID_SNMP_PORT} > 3401
These reflect the settings applied in the squid.conf file above.  If the Squid Proxies in your enterprise are configured differently, adjust the macros accordingly.

Each host must also be configured with an SNMP interface (Hosts > Interfaces > SNMP) that specifies the listening IP address / UDP port configured in the squid.conf file.  This must also be the primary listening SNMP interface.
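Before adding the SNMP interface in Zabbix, the Squid SNMP process can be verified from the Zabbix Server with snmpwalk.  A sketch, assuming the public community, port 3401 and a proxy at 10.128.0.254 (1.3.6.1.4.1.3495.1 is the root of the Squid MIB):
snmpwalk -v 2c -c public 10.128.0.254:3401 .1.3.6.1.4.1.3495.1
If the configuration is correct, the command returns the Squid cache objects.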


Example Squid Proxy Server Cache Monitoring

The following architecture is a test environment.






There are two Linux host bridges (br0 and br1) on different subnets.  On br0, there is a Zabbix Server, a parent Squid Server (Squid01) and its child Squid Server (Squid02).  On br1, there is another child of Squid01: Squid03.  The two child Squid Servers are configured as siblings.  The Squid Servers begin with empty Disk Caches.

Squid01 -- the parent cache -- is configured to only disk cache files larger than 4096 bytes and smaller than 128 MB.  Squid02 and Squid03 -- the sibling child caches -- have the default configuration of only caching files less than 4MB.  There is an overlap of the caching size policies, but in general, large files will only be cached on the parent.   Smaller files will be cached on the children.

The Internet connection (through which the Debian repositories are accessed during installation) is limited to 10 Mb/s (1.25MB/sec).

The Zabbix Server utilizes the Squid Proxy SNMP template.  The custom screens below organize the three Squid Proxy Servers in columns.  The three respective rows present graphs for HTTP Byte Hit Ratio (%), HTTP Rate (KB/s) and HTTP Request Hit Ratio.




The test consists of performing a Debian 8 (Jessie) desktop network installation (packages retrieved from repositories) with the Debian hosts configured to use different Squid Proxies.  First, we perform the installation with empty caches; Squid03 is offline because we want to populate only Squid02's cache, and with both siblings online the configuration would distribute the load between them.  As illustrated below, the installation requires approximately 20 minutes to download the required packages from the repositories.  Notice the HTTP Rate (KB/sec) is limited to approximately 1.25 MB/sec (1,250 KB/sec, i.e. the 10 Mb/sec Internet connection bandwidth).  Parent Proxy Squid01 (in the left column) returns generally low cache hit ratios and Child Proxy Squid02 returns higher hit ratios because it is passing requests to the parent.

Unpopulated Parent-Child Proxy Caches

The second test installation mimics the first, but now accesses Proxy Caches populated with files from the first test.  Not only is the installation much faster (approximately five minutes compared to 20 in the first test), but the HTTP Rate is much higher -- 9 to 10 MB/sec (72 - 80 Mb/sec).  Internet bandwidth is not an issue for this test because the files are all cached locally; bandwidth is instead limited by the Linux Bridge, Virtio NIC and disk performance of the host system.
 
Populated Parent-Child Proxy Caches


The third test installation moves the Debian machine to Linux KVM host bridge br0 and uses Squid03 as its proxy server.  Here we see limitations of disk IO and peer caching because Sibling Proxies Squid02 and Squid03 are synchronizing and both accessing parent cache Squid01.  While the installation is faster than fetching all the files from the Internet-hosted repository, the disk contention and peer cache synchronization slow the total time of installation to approximately 10 minutes (compared to 20 for fetching all files from the Internet and five minutes for the populated two-cache proxy arrangement).

As above with an Unpopulated Child-Sibling Proxy Cache

The fourth test installation mimics the third but now accesses Proxy Caches populated with files from the first three tests, and the Sibling Proxies are also synchronized.  The Parent Proxy and the Squid03 Child Proxy both actively provide files for the installation, which completes in approximately five minutes.  HTTP Rate bandwidth is also comparable to the tuned proxy performance in Test 2 -- 7 to 11 MB/sec.  Squid02 -- the other Sibling Proxy, depicted in the center column -- displays very little activity because its disk cache contents are identical to Squid03's and no files are needed from it.

Populated Parent - Sibling Children Proxy Caches

The above tests are not representative of an enterprise proxy server deployment.  The caches are either unpopulated or fully populated with requested data, whereas in live deployments typically only about 5% of requests are cached.  There are also performance limitations (particularly disk IO) in the Linux KVM test environment.  The all-or-nothing caching depicted above does, however, demonstrate the increased HTTP and Cache Hit Rates associated with proxy caching.

The Zabbix Share web site hosts the Squid Proxy Server template.

Monday, June 1, 2015

Zabbix Templates for Windows Cluster Services LLD Discovery

This article provides an overview of Microsoft Clustering and its underlying technologies.  Microsoft Cluster Services provide failover clustering for increased availability; they support several types of shared storage and implement older technologies, such as SMB, in new ways.  Finally, the article presents Zabbix templates for resource discovery, monitoring, alerts and trend analysis.


The Zabbix templates (with PowerShell scripts and zabbix.conf agent modifications) may be downloaded from the Zabbix Share site.  These are zip files, and each contains a README.

Windows Cluster Services Components

Windows Clusters are primarily failover clusters, in which one node is active and the others stand by, ready to take over if the active node -- the Coordinating Node -- goes offline.  Shared resources -- such as services, the shared network address and some storage types -- will only be active on the Coordinating Node.  Monitoring strategies need to address the differences between the Cluster Server System, the Coordinating Node and the remaining nodes.

Monitoring should include Cluster Software Components and Cluster Objects.  Software Components are generally operating system level software items that control the cluster and its nodes.  Objects are hardware and software features/services controlled by the Cluster Software Components.  Cluster Software Components include:
  • Cluster Service
  • Disk Driver
  • Resource Monitor and DLLs
  • Database
  • Administration Interface
Cluster Objects include (among others):
  • Network Interfaces
  • Nodes
  • Resources
Cluster Object Resources are the raison d'etre of clusters, the IP addresses and services presented to the network.  The other components are more specific to the individual nodes. 

While all of these components are important, not all of them have Performance Monitor counters that provide actionable information.  Counters that merely provide running totals -- not rates or thresholds -- are of limited value; tracking the raw number of database flushes, for example, yields little diagnostic information.

The operation of individual nodes may be monitored at the operating system level as detailed here and here.  Items such as network interfaces, disk utilization, etc. are not unique to clusters and operate much as any other Windows operating system.  There are two applicable Zabbix Templates:  Windows Server Discovery (LLD) and Windows Server Discovery (LLD) Performance Monitoring.  The first template is designed for detailed day-to-day monitoring while the second (which is Zabbix Server intensive) is intended as an addition to the first for diagnosing more complex problems.

The distinction between persistent, global items and ephemeral, Coordinator/Standby Node items is also important.  The Windows Cluster Service is an item that runs on all nodes at all times and its operation is mandatory; if it fails on any node, we want to know about it.  Shared Storage may be more ephemeral.  Consider the case of Physical and Logical Disks used as Shared Storage: the Coordinator Node (left) recognizes two Logical Disks, F: and H:, while the Standby Node (right) recognizes none.

The illustration below depicts the Physical Disks recognized by the Coordinator Node on the left and a Standby Node on the right.  Each recognizes four physical disks.  The Coordinator Node lists two (Disks 2 and 4) as Logical Disks F: and H:.  Disk 1 is recognized as an Online Cluster Shared Volume and Disk 3 an Offline Cluster Shared Volume.  The Standby Node recognizes Disk 3 as Online.
The Windows Cluster Manager application provides another view of this:

Monitoring Global and Persistent Cluster Items

The Zabbix Windows Cluster Services template is designed to monitor global and persistent services.  It consists of fourteen items, three triggers, three graphs and a discovery rule; the discovery rule consists of four item, three trigger and one graph prototype.

The items monitor the Windows Cluster Service, Global Update Manager, Resource Control Manager and Total Resources (numeric, not by name).  The Discovery Rule is required because it enumerates the names of member nodes in the cluster and monitors network reconnections and message queues between the nodes.
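As a purely illustrative example (the template's own keys may differ), the Windows Cluster Service itself can be watched with the Zabbix Windows agent's built-in service state key:
service_state[ClusSvc]
ClusSvc is the Windows service name for the Cluster Service; the number the agent returns is translated to a readable state by a value mapping.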

Monitoring Shared Storage

Shared storage is ephemeral and moves between the Coordinator and Standby Nodes as needed.  Its monitoring is a much more complex issue.  Understanding how to monitor shared storage and interpret the results requires understanding the different types of shared storage Windows Clusters use.

Server Message Block (SMB)

Microsoft Server Message Block (SMB) is a client/server file sharing protocol.  On the client side, a Logical Drive may be mapped to a server SMB share.  SMB operates at the Presentation and Application Layers (6 and 7) of the OSI model.


The Distributed File System (Dfs) is an extension of the SMB protocol that creates a single share -- the Dfs Root -- that logically organizes shares on multiple servers.  These distributed shares may then all be accessed through the DFS Root.  Dfs also provides file replication for increased fault tolerance.


Windows Server 2012 introduced the SMB 3 protocol and among its capabilities is the Scale Out File Server (SOFS).  A SOFS system distributes data across many enclosures and manages all file replication; it can even be used with inexpensive commodity JBOD enclosures lacking RAID hardware.  SOFS is not intended for use as a traditional file share because the amount of metadata traffic (from opening, modifying and saving files) would consume excessive network bandwidth.  It is better suited to managing and replicating large files such as databases and Virtual Hard Disks, as used by failover clusters.

iSCSI, Serial Attached SCSI and Fibre Channel

Microsoft supports iSCSI, Serial Attached SCSI, Fibre Channel and SMB 3 shared storage in failover clusters.  From the Windows Server perspective, these types of storage present themselves (through Host Bus Adapters, Network Interfaces and drivers) as physically-attached SCSI storage addressed by its Logical Unit Number (LUN).  That is, they appear as Physical Drives and may be partitioned and formatted into one or more Logical Drives.  However, such storage is often not partitioned into Logical Drives.

iSCSI operates at the Session Layer 5; it has less network overhead than SMB shares and is often faster on the same hardware.  Fibre Channel does not operate using the OSI model because it is not Ethernet.  However, Fibre Channel switches do operate in layers in a manner somewhat analogous to the OSI model, but lacking layers that correspond to the OSI Session, Presentation and Application layers 5-7:
  • FC4: Protocol Mapping layer for protocols such as SCSI.
  • FC3: Common Services layer, a thin layer for encryption or RAID redundancy algorithms.
  • FC2: Network layer, consists of the core of Fibre Channel, and defines the main protocols.
  • FC1: Data Link layer, which implements line coding of signals.
  • FC0: PHY, includes cabling, connectors etc.
SAS is a technology that operates more like locally-attached storage; it does not correlate to networking models because it uses Host Bus Adapters and drivers to access storage.


SMB 3 shares operate at the Presentation and Application Layers 6 and 7 of the OSI model.  SMB 3 provides the shared storage space for Virtual Hard Disks (VHDs) used by Hyper-V and Clusters.

File Systems

The dominant Windows file systems are NTFS and ReFS.  NTFS remains the one used for the Windows Server system/boot partition.  It is more feature-rich than ReFS.  ReFS is designed for scalability and resiliency; it is the file system of choice for very large data storage.

These file systems may be mounted as drives by one or more SMB clients.  The SMB server arbitrates file access to prevent more than one client from accessing -- and corrupting -- data simultaneously.

iSCSI, SAS and Fibre Channel storage is also formatted as NTFS or ReFS, but if two or more servers mount them as drives, the file systems themselves will not prevent simultaneous data access and corruption.  NTFS and ReFS are not Clustered File Systems.  Microsoft has provided the Cluster Shared Volume File System (CSVFS) since Windows Server 2008 R2.  This file system allows two or more nodes to mount the drive.

CSVFS is really an NTFS- or ReFS-formatted file system managed by Windows Failover Cluster Services.  Each LUN is actually a CSVFS-formatted VHD that resides on an NTFS or ReFS partition.  The server in control of the LUN is called the Coordinator Node.  The Coordinator Node addresses the LUN using the appropriate SCSI protocol (iSCSI, SAS or Fibre Channel).  It then creates an SMB share under the %SystemDrive%\ClusterStorage directory that may be addressed by other nodes in the cluster using the SMB protocol.

CSVFS is quite different from other cluster-aware file systems such as Oracle's OCFS2 because it uses so many older technologies -- file systems, VHDs and SMB -- to provide simultaneous file system access.  There is involvement of the higher layers of the OSI model with attendant overhead, which is manifest as additional processor, memory, network utilization and potentially Logical Disk bottlenecks.

Yet there is a method to this seeming madness.  Microsoft's goal is to implement cloud storage protocols that scale beyond SAN technology on commodity hardware -- JBOD enclosures.  Thus, the Microsoft Cluster Services utilize technologies and protocols that are tested and closely related to its SMB 3 SOFS strategy.

Monitoring NTFS and ReFS Partitions

NTFS and ReFS partitions -- not being cluster-aware file systems -- are controlled by the operating system and cluster service.  The operating system will recognize the Physical Disks, which may be readily monitored using the Windows OS Discovery (LLD) template.  This template will also recognize Logical Drives, but only on the Coordinator Node.  Thus, standby nodes will either not recognize the presence of offline Logical Disks or (if they have been the Coordinator Node in the past) Zabbix will show them as Not Present.  This is a minor issue: if you go to the template's LogicalDisk Discovery rule and change the value of "Keep lost resource period (in days)" from the default 30 to 1 (or even 0), offline Logical Disks will be removed from the host's discovered items more quickly.

The most important information may be obtained by configuring a host for the cluster shared IP address with the Windows Discovery (LLD) template.  That host will always be aware of all physical and logical disks managed by the Cluster Service, and its discovered items should remain unchanged.

Monitoring CSVFS Partitions

Monitoring the underlying Physical and Logical disks will discover CSVFS partitions.  However, as described in the sections above, CSVFS relies upon the SMB protocol and any node in the cluster (Coordinator and Standby) may access a CSVFS partition.  Recent versions of Windows Cluster Services intentionally distribute the Online and Offline CSV partitions between nodes in the cluster, both coordinator and standby.  However, CSVFS partitions differ in that they are NOT recognized as Logical Disks by any node at the operating system level; they are logically controlled and addressed solely through the Cluster Service, at a higher level of the operating system than iSCSI, SAS and Fibre Channel shared storage.

CSVFS partitions are so different from iSCSI, SAS and Fibre Channel partitions they require a separate Zabbix template -- CSV Cluster Shared Volumes (LLD).  This template consists of three discovery rules:
  1. Volume Manager -- 6 Item, 6 Trigger and 2 Graph Prototypes
  2. File System -- 26 Item, 4 Trigger and 5 Graph Prototypes
  3. SMB Server -- 10 Item, 3 Trigger and 3 Graph Prototypes
CSVFS is designed to be fault-tolerant and error transparent.  The template -- PARTICULARLY THE TRIGGERS -- is intended to be applied to the shared cluster host and its corresponding IP address.  The triggers target Redirected IO, something that should be minimal on the Coordinator Node but may be expected on others.  Please read Microsoft's Shared Volume Performance Counters article for an in-depth description of these Performance Monitoring Counters.  The triggers are configured according to the article but may produce false positives under some circumstances.

The triggers are indicative of problems, but diagnosis may require additional trend analysis.

Zabbix Template Deployment

There is a README file included in the zip file download that explains how to deploy the templates, modify the zabbix.conf file and add scripts.

In summary, you may manually install the Zabbix Agent on Windows hosts, script the installation or use Group Policy.
  1. Create a Zabbix Discovery rule with a named macro, filters (if necessary), items, triggers and graphs.
  2. Add UserParameter statements to the client agent zabbix.conf file referencing the Zabbix Discovery rule and calling PowerShell scripts.
  3. Add PowerShell scripts to the client.
You may also wish to review the Windows Server Discovery LLD article for more detailed information.

Saturday, May 30, 2015

Zabbix Templates for Windows LLD Discovery

This article describes Windows Server Zabbix Low Level Discovery (LLD) Templates that monitor core server functions.  While tested on Windows Server 2012 R2, it is likely the Templates are compatible with other versions of Windows as well.

For those with Zabbix and Windows experience, the templates used are available from the Zabbix Share Templates page.  A previous version of Windows Templates is described in this blog post on Windows Server 2008 R2 Performance Monitoring.  The previous templates used Windows Performance Monitoring (_Total) and (*) instances to collect data.  While this provides overall systems performance indicators, it is not highly precise.  For instance, if the Disk Queue Write Length (_Total) exceeds the warning threshold, it applies to all disks on the server and does not identify the specific disk that is the problem.

Zabbix Low Level Discovery (LLD) provides more specific information about the hardware and software running on Windows Servers.  Three components are used:
  1. A Discovery Rule that defines what information will be obtained.
  2. A UserParameter statement in the Zabbix Agent zabbix.conf file defines what scripts will be used.
  3. A PowerShell script that queries the Windows Operating System and returns JavaScript Object Notation (JSON) formatted variables used in the Discovery Rule.
Instead of relying on (_Total) and (*) instances written into the counters, the Discovery Rule will enumerate individually-returned items, such as Logical Disks C:, D:, E:, etc.


Discovery Rules

Rule Definitions


Discovery Rules are written in the Zabbix Template.  The definition page requires:
  • Unique Name such as windowsldisk.discovery for Logical Disk Discovery
  • Type (Zabbix Agent for all rules in this article)
  • Key to define what UserParameter to run on the Zabbix Agent
  • Update Interval in seconds
You may also add Flexible Intervals and define how long items that are no longer discovered are retained.  The last feature is useful if you are, for example, monitoring SMB-mapped drives shared by a failover cluster.

Filters use regular expressions and macros to filter the returned results.  For instance, calling the {#FSTYPE} macro that uses the @File systems for discovery defined regular expression will return results for matching values (e.g. ext4, ntfs) and filter out those not desired (e.g. cdfs).

For Windows Server Discovery, the macro is simply defined from the data returned by a PowerShell script (see below) that filters on the server before returning values to Zabbix.  We do not need to define Zabbix-level filters and simply use the macro name defined in the PowerShell script.

Item Prototypes

Item prototypes are similar to regular items in format, except they typically reference a macro instead of a defined value.  For Windows Discovery, notice both the Name and Key contain the macro {#DISKNUMLET}.  This acts as a variable and references all of the items returned by the PowerShell script in JSON format.  As discussed in more detail below, the PowerShell script will filter all values to return the logical drives recognized by the operating system (C:, E:, F:, etc.) filtering out the CD drive.  Keep in mind logical drives include mapped SMB shares physically hosted on other servers.


Item details define how items will be stored, reported and -- importantly -- the macro and Zabbix operation performed.  The Type of operation is always Zabbix Agent because the process is sent to a remote agent for execution.  The perf_counter key instructs Zabbix to use a Windows Server-formatted Performance Monitoring item and pass the defined operation and macro name to the Agent.  The returned item is a Numeric (float) type and may be assigned appropriate units (Bytes, Bytes/sec, sec, millisec, etc.).  Value mapping may also be assigned.  These define how numeric values returned by the Agent are interpreted.  For Windows, the agent will return a numeric value for the service state.  The Value Mapping maps the numeric value to a human-readable value (e.g. running, paused, stopped, etc.).
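As an illustration (the exact counter paths depend on the template), an item prototype key for average Logical Disk read latency might look like this:
perf_counter["\LogicalDisk({#DISKNUMLET})\Avg. Disk sec/Read"]
When the discovery rule returns C:, E: and F:, Zabbix creates one item per drive by substituting the {#DISKNUMLET} macro.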

Trigger Prototypes

Collecting data is useful for trend analysis.  Triggers define thresholds at which Zabbix generates alerts.  A complete discussion of Triggers is beyond the scope of this article.  Windows Server-specific Triggers warrant description.

Microsoft's MSDN and Technet provide lists of suggested Performance Counter thresholds that are easily translated into Zabbix Triggers.  The illustration depicts a Warning Trigger for Logical Disk sec/Read (sec) on a specific drive; if the read time is greater than 0.015 seconds, Zabbix generates a Warning alert.  This trigger was then cloned and the threshold value changed to 0.025 seconds and Warning changed to High to create a higher-rated alert.  Decisions about forwarding may be made based upon the severity of the alert.
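A hypothetical trigger prototype expression for that threshold -- assuming the template name Windows Server Discovery (LLD), the item key sketched above and a five-minute average -- might read:
{Windows Server Discovery (LLD):perf_counter["\LogicalDisk({#DISKNUMLET})\Avg. Disk sec/Read"].avg(300)}>0.015
The cloned High trigger would simply substitute 0.025 for 0.015.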

Graph Prototypes

Graphical displays of information are useful for trend analysis and diagnosing problems.  Zabbix Discovery Graph Prototypes are much like standard graphs but use macros in place of defined objects.  One graph prototype that calls the macros, as illustrated, will generate a graph for each item returned.  Thus, a Windows Server with three logical drives will have three of each graph and the macro will list each drive name in the title.




Zabbix Agent

The Zabbix Agent controls communications between the Windows Server operating system and the Zabbix server.  Its configuration file -- zabbix.conf -- defines the UserParameter functions that the Zabbix server passes to the agent in order to execute Windows PowerShell scripts.

Following the Logical Disk Discovery described above, the zabbix.conf file requires a UserParameter statement for each different script used.
UserParameter=windowsldisk.discovery,powershell -NoProfile -ExecutionPolicy Bypass -File c:\scripts\get_ldisks.ps1
This UserParameter line responds to calls from the Zabbix server's windowsldisk.discovery definition and invokes PowerShell with the privileges necessary to execute the script c:\scripts\get_ldisks.ps1.  The agent then returns the values to the Zabbix server.
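One way to confirm the agent and script are wired correctly is to query the key from the Zabbix server with zabbix_get (the address 192.168.1.50 is a placeholder for the Windows host):
zabbix_get -s 192.168.1.50 -k windowsldisk.discovery
This should return the JSON structure shown at the end of the next section.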

PowerShell Scripts

PowerShell is the Microsoft Windows scripting language used to query Performance Monitoring. The following script queries the Logical Disks Performance Counters and returns the macro-defined {#DISKNUMLET} and drive number-letter name:

  1. $drives = Get-WmiObject win32_PerfFormattedData_PerfDisk_LogicalDisk | ?{$_.name -ne "_Total"} | Select Name
  2. $idx = 1
  3. write-host "{"
  4. write-host " `"data`":[`n"
  5. foreach ($perfDrives in $drives)
  6. {
  7. if ($idx -lt $drives.Count)
  8. {
  9. $line= "{ `"{#DISKNUMLET}`" : `"" + $perfDrives.Name + "`" },"
  10. write-host $line
  11. }
  12. elseif ($idx -ge $drives.Count)
  13. {
  14. $line= "{ `"{#DISKNUMLET}`" : `"" + $perfDrives.Name + "`" }"
  15. write-host $line
  16. }
  17. $idx++;
  18. }
  19. write-host
  20. write-host " ]"
  21. write-host "}"
Line 1 invokes the LogicalDisk query command and filters out the _Total item, returning only the counters' names (not the voluminous additional data associated with each).  Line 2 sets an index at 1 and Line 17 increments it.  Lines 3 and 4 write the required headers for JSON format.  Lines 5 through 16 query each drive item returned in Line 1 and write a formatted pair -- {#DISKNUMLET} and Drive Name -- in JSON format.  Lines 19 through 21 then complete the JSON-formatted response.
{

"data":[

{ "{#DISKNUMLET}" : "C:"},
{ "{#DISKNUMLET}" : "E:"},
{ "{#DISKNUMLET}" : "F:"},
{ "{#DISKNUMLET}" : "G:"}

 ]
}
Don't try to look up a complete list of Get-WmiObject performance classes; Microsoft's documentation is incomplete because there are simply too many and they are installed as needed with Roles and Applications.  Fortunately, PowerShell provides a command that will export all available performance classes to a .csv-format file:
Get-WmiObject -List | Where-Object { $_.name -match 'perfformatted' } | Export-CSV c:\scripts\perfformatted.txt
You may then search this lengthy file for the syntax needed to create other PowerShell scripts.

Summary

You may manually install the Zabbix Agent on Windows hosts, script the installation or use Group Policy.
  1. Create a Zabbix Discovery rule with a named macro, filters (if necessary), items, triggers and graphs.
  2. Add UserParameter statements to the client agent zabbix.conf file referencing the Zabbix Discovery rule and calling PowerShell scripts.
  3. Add PowerShell scripts to the client.

Windows Server Templates

There are two Templates on Zabbix Share for Windows Server Discovery: Windows Server Discovery (LLD) and Windows Server Discovery (LLD) Performance Monitoring.
The first template is adequate for day-to-day monitoring and trend analysis.  The second template is very thorough, Zabbix Server-intensive and intended for diagnosing difficult problems.

Each .zip file download contains the template, UserParameter statements to be added to the zabbix.conf file and PowerShell scripts to be placed in the c:\scripts directory.  There is a brief README file explaining what needs to be done.

Saturday, April 11, 2015

Linux Layer 3 Cisco Router and Open vSwitch with NetFlow Virtualization

Configuring Cisco and Open vSwitch routers and Layer 3 devices to send NetFlow data to a Linux NTOP NetFlow collector that interprets higher-layer network flows.

Introduction

Layer 2 and Layer 3 Linux Switching (using kernel-supported utilities) is fast and efficient.  Open vSwitch -- when combined with an OpenFlow controller -- is a full-blown Software Defined Networking (SDN) implementation that offers the advantages of offloading topology and switching decisions to a centralized controller, allowing the Open vSwitch devices to primarily switch frames instead of expending processor and memory resources on topology maintenance and decision-making.
Open vSwitch devices are also Layer 3 and above aware, able to modify traffic based upon Layer 3 (IP address) and Layer 4 (TCP port) information.  However, an administrator needs quality information to make prioritization and routing decisions and OpenFlow controllers provide mostly Layer 2 information.  Systems monitoring devices -- MRTG, Cacti, Munin, Nagios/Icinga and Zabbix -- operate primarily at the device level (e.g. interface utilization, interface errors, interface congestion) and are difficult, if not impossible, to configure for detailed analysis.  Firewalls and access lists operate at the requisite levels (Layers 2, 3 and 4) for detailed traffic analysis, but require a great deal of manual configuration and interpretation to characterize traffic.
The term used for this kind of detailed traffic characterization is Flows.  Each flow has unique end devices, IP addressing and TCP ports.  At the flow level, each step in the path between end devices (i.e. intermediary switches and routers) also records traffic.  As stated above, OpenFlow controllers report this quite well at Layer 2.  Other protocols, such as NetFlow, provide the higher-level information required to more precisely characterize and manage traffic.

Configuring Open vSwitch Layer 3 Switching

There are several ways to implement Layer 3 functionality, such as fake bridges and VLANs.  From a monitoring perspective, these can be somewhat problematic because they do not appear in SNMP MIBs.

This article will use multiple bridges on each Layer 3 Open vSwitch to define multiple networks.  Bridge and port membership is defined with the ovs-vsctl set of commands, as sketched below.  IP address assignments to the bridges are managed at boot time in the /etc/network/interfaces file.  Routing is managed using Quagga.
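As a sketch, the Switch-01 bridges and their member ports might be created with commands along these lines (the port list is abbreviated; the controller address matches the configuration shown later):
sudo ovs-vsctl add-br br0
sudo ovs-vsctl add-port br0 eth0
sudo ovs-vsctl add-port br0 eth1
sudo ovs-vsctl add-br br1
sudo ovs-vsctl add-port br1 eth4
sudo ovs-vsctl set-controller br0 tcp:10.128.0.102:6633
sudo ovs-vsctl set-controller br1 tcp:10.128.0.102:6633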

In the topology at the top of the article, there is routing configured on the Cisco R1 gateway and Open vSwitch devices Switch-01 and Switch-05.  All are members of OSPF Area 10.128.0.0 (range 10.128.0.0/13 with configured networks 10.128.0.0/24 and 10.128.1.0/24) and the Cisco router is also a member of the host laptop's Backbone Area 0.0.0.0 (range 172.16.0.0/12).

Switch-01 Configuration

The /etc/network/interfaces File

auto eth0
iface eth0 inet manual

auto eth1
iface eth1 inet manual

auto eth2
iface eth2 inet manual

...
auto eth7
iface eth7 inet manual

auto br0
iface br0 inet static
    address 10.128.0.254
    netmask 255.255.255.0
    network 10.128.0.0
    broadcast 10.128.0.255
    gateway 10.128.0.1
    # dns-* options are implemented by the resolvconf package, if installed
    dns-nameservers 192.168.1.1
    dns-search mydomain.com

auto br1
iface br1 inet static
    address 10.128.1.254
    netmask 255.255.255.0
    network 10.128.1.0
    broadcast 10.128.1.255
    # dns-* options are implemented by the resolvconf package, if installed
    dns-nameservers 192.168.1.1
    dns-search mydomain.com

Open vSwitch Configuration

18ce5ab3-d025-49c2-8cd8-0a41d3927c79
    Bridge "br1"
        Controller "tcp:10.128.0.102:6633"
            is_connected: true
        Port "eth4"
            Interface "eth4"
        Port "br1"
            Interface "br1"
                type: internal
    Bridge "br0"
        Controller "tcp:10.128.0.102:6633"
            is_connected: true
        Port "eth1"
            Interface "eth1"
        Port "eth6"
            Interface "eth6"
        Port "eth5"
            Interface "eth5"
        Port "eth3"
            Interface "eth3"
        Port "br0"
            Interface "br0"
                type: internal
        Port "eth7"
            Interface "eth7"
        Port "eth2"
            Interface "eth2"
        Port "eth0"
            Interface "eth0"
    ovs_version: "2.1.3"

The Quagga OSPF Configuration

router ospf
 ospf router-id 10.128.0.254
 network 10.128.0.0/24 area 10.128.0.0
 network 10.128.1.0/24 area 10.128.0.0
 area 10.128.0.0 range 10.128.0.0/13

Switch-05 Configuration

The /etc/network/interfaces file

auto eth0
iface eth0 inet manual

auto eth1
iface eth1 inet manual

auto eth2
iface eth2 inet manual
...
auto eth7
iface eth7 inet manual

auto br0
iface br0 inet static
    address 10.128.1.250
    netmask 255.255.255.0
    network 10.128.1.0
    broadcast 10.128.1.255
    # dns-* options are implemented by the resolvconf package, if installed
    dns-nameservers 192.168.1.1
    dns-search mydomain.com



Open vSwitch Configuration

18ce5ab3-d025-49c2-8cd8-0a41d3927c79
    Bridge "br0"
        Controller "tcp:10.128.0.102:6633"
            is_connected: true
        Port "eth1"
            Interface "eth1"
        Port "eth6"
            Interface "eth6"
        Port "eth0"
            Interface "eth0"
        Port "eth5"
            Interface "eth5"
        Port "eth3"
            Interface "eth3"
        Port "br0"
            Interface "br0"
                type: internal
        Port "eth4"
            Interface "eth4"
        Port "eth7"
            Interface "eth7"
        Port "eth2"
            Interface "eth2"
    ovs_version: "2.1.3"

The Quagga OSPF Configuration

router ospf
 ospf router-id 10.128.1.250
 network 10.128.1.0/24 area 10.128.0.0
 area 10.128.0.0 range 10.128.0.0/13

Installing NTOP on Ubuntu 14.04 Trusty Tahr

Keep in mind NTOP is not NTOP-NG.  NTOP-NG is the current version of the utility, but it decouples NetFlow from the application and requires the paid nProbe package.  For this article, we will use the older -- but still useful -- original NTOP.

The Easy (and Broken) Way

It's very easy -- apt-get install ntop.  You will need to configure the listening interface(s) and supply an administrator password.

The problem with this installation is that the package does not install the visualization components required for elements like pie graphs and host details, even though it is the last version (5.0.1, dated 2012-08-13).  In short, the package does not provide the full range of ntop features.

Compiling from Source

Install dependencies
sudo apt-get install libpcap-dev libgdbm-dev libevent-dev librrd-dev python-dev libgeoip-dev automake libtool subversion
sudo apt-get build-dep ntop 
Extract and compile
tar zxvf ntop-5.0.1.tar.gz
cd ntop-5.0.1
./autogen.sh
make
sudo make install
Copy the libraries to the correct location, change ownership of the ntop data directory and restart the service
sudo cp /usr/local/lib/libntop* /usr/lib/
sudo chown -R ntop.ntop /usr/local/share/ntop
sudo service ntop restart
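If ntop is run manually rather than through an init script, the first run typically sets the administrator password and subsequent runs specify the capture interfaces and web port.  A sketch -- the interface list and port 3000 are examples, not requirements:
sudo ntop -A
sudo ntop -i "eth0,eth1" -w 3000 -d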

Configuring NTOP

The installation process specifies one or more listening interfaces.  These are available without additional configuration.  The overview report provides information about these interfaces, such as packet distribution and size.




 There is also a tab for protocol distribution.
 

 And another for Application Protocols.

In a switched environment, an interface in promiscuous mode will still capture only the unicast traffic switched to its own port, plus broadcast and multicast traffic.  The capture does not reflect overall network activity.

One option to capture more traffic is port mirroring, in which all traffic on a specified interface is forwarded to another (in this case, the NTOP monitor).  The drawbacks to port mirroring are the manual configuration and the added bandwidth and processing required.  Even using virtio network devices on virtualized machines, the processing and memory bandwidth will contribute to added load on the host.

NetFlow

Netflow is a set of services developed by Cisco.   Briefly, a NetFlow capable device summarizes traffic, formats it and forwards it to a collector over UDP.  Version 9 is current at the time of writing and its format is detailed here.  Essentially, it provides a summary of TCP/IP protocol, address and application information the collector may store, format and present.

Configuring NTOP NetFlow Collector

From the main menu, select Plugins and click on the NetFlow option.
There are three important fields:  NetFlow Device, Local Collector UDP Port and Virtual NetFlow Interface Network Address.

NetFlow Device

This is a unique name to identify a network-specific collector.

Local Collector UDP Port

This specifies the UDP port on which the collector listens.  2055 is the current default.

Virtual NetFlow Interface Network Address

This is NOT an IP address, but a network address that corresponds to the network from which device(s) send information.  In the configuration used in this article it is 10.128.0.0/24, which will save as 10.128.0.0/255.255.255.0.  This virtual interface may be the destination for more than one collector on each defined network.  In this article, we will configure a Cisco router and Ubuntu Open vSwitch to forward NetFlow information to the NTOP server.

There are other options available, but they will not be discussed in this article. 

Configuring Cisco NetFlow


The following commands begin by configuring NetFlow to operate on interface FastEthernet0/1, collecting inbound and outbound data.  Then we configure the router's export source interface, timeouts, NetFlow version and (finally) the destination IP address and UDP port of the NTOP server.
R1#configure terminal
R1(config)#interface f0/1
R1(config-if)#ip route-cache flow
R1(config-if)#ip flow ingress
R1(config-if)#ip flow egress
R1(config-if)#exit
R1(config)#ip flow-export source f0/1
R1(config)#ip flow-cache timeout active 60
R1(config)#ip flow-cache timeout inactive 120
R1(config)#ip flow-export version 5
R1(config)#ip flow-export destination 10.128.1.104 2055
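Once configured, the router's flow cache and export settings can be verified with the standard IOS show commands:
R1#show ip cache flow
R1#show ip flow export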

Configuring Open vSwitch Netflow

First, some background on Open vSwitch and OpenFlow (e.g. Floodlight) controllers.  Open vSwitch alone operates quickly at Layer 2 -- as a bridge or switch.  The OpenFlow controller centralizes MAC address logic and topology decisions, relieving the connected Open vSwitch of that work.  Yet Open vSwitch is also aware of higher layers of the network stack -- Layer 3 IP, Layer 4 TCP, etc.  The deficiency of existing OpenFlow controllers is a lack of detailed Layer 3 and above information upon which to make configuration decisions such as prioritization.  Enter NetFlow.

The following command (one statement) configures Open vSwitch NetFlow for bridge br0 and forwards it to the NTOP NetFlow collector with an active timeout of 120 seconds:
sudo ovs-vsctl set Bridge br0 netflow=@nf0 -- --id=@nf0 create NetFlow targets=\"10.128.1.104:2055\" active_timeout=120
More than one bridge on each device may then be added, such as br1, with the command:
ovs-vsctl set Bridge br1 netflow=[_uuid]
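The _uuid referenced above can be read back from the NetFlow table before assigning it to additional bridges; a sketch:
sudo ovs-vsctl list netflow
sudo ovs-vsctl set Bridge br1 netflow=<_uuid from the list output>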

Viewing NetFlow Information on NTOP

Switch to the NetFlow virtual interface configured above if you have not already done so.  The display is identical, but now that NetFlow is collecting all port and host information for the network 10.128.0.0/24, there will be much more information.


The Zabbix Proxy in the network collects information from all hosts and devices.  Click on its address (DNS is not configured on this network, but NTOP will use it when available).
Details of traffic between the Zabbix Proxy and the hosts it monitors are presented as an overview and in detail.  Of particular interest, the process of querying monitored devices requires very little bandwidth compared to that required to forward data from the Zabbix Proxy (10.128.0.103) to the Zabbix Server (10.128.0.101).





Top talkers are also identified and graphed.