This section covers two categories of tools: tools developed before cloud computing and contemporary tools which were not designed specifically for cloud monitoring but have related goals. In the case of the former, these tools retain relevance to cloud computing either by employing concepts pertinent to cloud monitoring or by being commonly used in cloud monitoring.
Ganglia
Ganglia [31] is a resource monitoring tool primarily intended for HPC environments. Ganglia organises machines into clusters and grids: a cluster is a collection of monitored servers and a grid is the collection of all clusters. A Ganglia deployment operates three components: Gmond, Gmetad and the web frontend. Gmond, the Ganglia monitoring daemon, is installed on each monitored machine, where it collects metrics from the local machine and receives metrics over the network from the rest of the local cluster. Gmetad, the Ganglia meta daemon, polls aggregated metrics from Gmond instances and from other Gmetad instances. The web frontend obtains metrics from a Gmetad instance and presents them to users. This architecture forms a tree, with Gmond instances at the leaves and Gmetad instances at subsequent layers; the root of the tree is the Gmetad instance which supplies state to the web frontend. Ganglia makes use of RRDTool [32] to store time series data and XDR [33] to represent data on the wire.
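As an illustration of how monitoring state can be read from a running Gmond instance, the following Python sketch connects to Gmond's XML interface and prints metric names and values. The TCP port (8649) and the XML element names are assumptions based on a typical Ganglia deployment rather than details given in this section.

    # Minimal sketch: read cluster state from a Gmond instance.
    # Assumes Gmond exposes its XML dump on TCP port 8649 (a common default).
    import socket
    import xml.etree.ElementTree as ET

    def read_gmond_state(host="localhost", port=8649):
        # Gmond writes its full XML state to any client that connects.
        with socket.create_connection((host, port)) as sock:
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        root = ET.fromstring(b"".join(chunks))
        # Walk CLUSTER -> HOST -> METRIC elements and print name/value pairs.
        for cluster in root.iter("CLUSTER"):
            for host_el in cluster.iter("HOST"):
                for metric in host_el.iter("METRIC"):
                    print(host_el.get("NAME"), metric.get("NAME"), metric.get("VAL"))

    if __name__ == "__main__":
        read_gmond_state()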
Ganglia provides limited analysis functionality. A third party tool, the Ganglia-Nagios bridge [34], allows Nagios to perform analysis using data collected by Ganglia. This attempts to gain the analysis functionality of Nagios while preserving the scalable data collection model of Ganglia. It is not a perfect marriage, as the resulting system incurs the limitations of both tools, but it is often proposed as a stopgap for cloud monitoring. Similarly, Ganglia can export events to Riemann.
Ganglia is first and foremost a resource monitor and was designed to monitor HPC environments. As such, it is designed to obtain low level metrics including CPU, memory, disk and IO. It was not designed to monitor applications or services, nor was it designed for highly dynamic environments. Plugins and various extensions are available to provide additional features, but the requirements of cloud computing are very different from Ganglia's design goals. Despite this, Ganglia still sees some degree of usage within cloud computing due to its scalability.
Astrolabe
Astrolabe [8] is a tool intended for monitoring large scale distributed systems which is heavily inspired by DNS. Astrolabe provides scalable data collection and attribute based lookup but has limited capacity for performing analysis. Astrolabe partitions groups of hosts into a hierarchy of zones in a manner similar to DNS. Astrolabe zones have a recursive definition: a zone is either a host or a set of non-overlapping zones, where two zones are non-overlapping if they have no hosts in common. The smallest zones consist of single hosts, which are grouped into increasingly large zones; the top level zone includes all other zones and the hosts within them. Unlike DNS, Astrolabe zones are not bound to a specific name server, do not have fixed attributes and state updates are propagated extremely quickly.
Every monitored host runs the Astrolabe agent, which is responsible for the collection of local state and the aggregation of state from hosts within the same zone. Each host tracks the changes of a series of attributes and stores them as rows in a local SQL database. Aggregation in Astrolabe is handled through SQL SELECT queries which serve as a form of mobile code. A user or software agent issues a query to locate a resource or obtain state; the Astrolabe deployment rapidly propagates the query through a peer-to-peer overlay, aggregates results and returns a complete result. Astrolabe continuously recomputes these queries and returns updated results to any relevant clients.
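Although no public Astrolabe implementation exists, the role of SQL aggregation queries can be illustrated with a short sketch. The following Python fragment uses the standard library's sqlite3 module to mimic how a zone-level aggregate might be computed from per-host rows; the table layout and query are illustrative assumptions, not Astrolabe's actual schema.

    # Illustrative sketch of Astrolabe-style aggregation (not the real schema).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hosts (name TEXT, zone TEXT, load REAL, free_mem_mb INTEGER)")
    conn.executemany(
        "INSERT INTO hosts VALUES (?, ?, ?, ?)",
        [("a1", "zoneA", 0.7, 512), ("a2", "zoneA", 1.9, 128), ("b1", "zoneB", 0.2, 2048)],
    )

    # An aggregation query of the kind Astrolabe would continuously re-evaluate:
    # summarise each zone by its least loaded host and total free memory.
    for row in conn.execute(
        "SELECT zone, MIN(load), SUM(free_mem_mb) FROM hosts GROUP BY zone"
    ):
        print(row)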
Astrolabe has no publicly available implementation and the original work evaluates the architecture through simulation. As a result, there is no definitive means to evaluate Astrolabe's use for cloud monitoring. Irrespective of this, Astrolabe is an influential monitoring system which, unlike many contemporary monitoring tools, employs novel aggregation and grouping mechanisms.
Nagios
Nagios [35] is the de facto standard open source tool for monitoring server deployments. Nagios in its simplest configuration is a two tier hierarchy: there exists a single monitoring server and a number of monitored servers. The monitoring server is provided with a configuration file detailing each server to be monitored and the services each operates. Nagios then generates a schedule and polls each server, checking each service in turn according to that schedule. If servers are added or removed, the configuration must be updated and the schedule recomputed. Plugins must be installed on the monitored servers if the necessary information for a service check cannot be obtained by interacting with the available services. A Nagios service check consists of obtaining the relevant data from the monitored host and checking that value against an expected value or range of values, raising an alert if an unexpected value is detected. This simple configuration does not scale well, as the single server becomes a significant bottleneck as the pool of monitored servers grows.
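The structure of a service check can be illustrated with a short plugin-style sketch. Nagios plugins conventionally signal their result through exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL); the load metric and thresholds below are arbitrary illustrative values.

    # Sketch of a Nagios-style plugin: obtain a value, compare it against
    # thresholds and signal the result via the conventional exit codes.
    import os
    import sys

    OK, WARNING, CRITICAL = 0, 1, 2

    def main(warn=2.0, crit=5.0):
        load1, _, _ = os.getloadavg()      # value obtained from the monitored host
        if load1 >= crit:
            print(f"CRITICAL - load average: {load1:.2f}")
            return CRITICAL
        if load1 >= warn:
            print(f"WARNING - load average: {load1:.2f}")
            return WARNING
        print(f"OK - load average: {load1:.2f}")
        return OK

    if __name__ == "__main__":
        sys.exit(main())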
In a more scalable configuration, Nagios can be deployed in an n-tier hierarchy using an extension known as the Nagios Service Check Acceptor (NSCA). In this deployment there remains a single monitoring server at the top of the hierarchy, but it polls a second tier of monitoring servers which can in turn poll additional tiers of monitoring servers. This distributes the monitoring load over a number of monitoring servers and allows scheduling and polling to be performed in small subsections rather than en masse. The NSCA plugin that facilitates this deployment allows the monitoring results of one Nagios server to be propagated to another Nagios server. This requires each Nagios server to have its own independent configuration, and the failure of any one of the monitoring servers in the hierarchy will disrupt monitoring and require manual intervention. In this configuration, system administrators typically rely on a third party configuration management tool such as Chef or Puppet to manage the configuration and operation of each independent Nagios server.
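To illustrate how results flow up an NSCA-based hierarchy, the sketch below submits a passive check result to a higher tier by invoking the send_nsca client. The configuration path and hostnames are illustrative assumptions; the tab-delimited input format (host, service, return code, output) follows send_nsca's conventional usage.

    # Sketch of submitting a passive check result up an NSCA hierarchy by
    # invoking the send_nsca client. Paths and hostnames are illustrative.
    import subprocess

    def submit_passive_check(central_server, host, service, return_code, output):
        line = f"{host}\t{service}\t{return_code}\t{output}\n"
        subprocess.run(
            ["send_nsca", "-H", central_server, "-c", "/etc/nagios/send_nsca.cfg"],
            input=line.encode(),
            check=True,
        )

    # e.g. report an OK disk check for web-1 to the top-level monitoring server
    submit_passive_check("nagios-master.example.org", "web-1", "disk", 0, "DISK OK")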
A final, alternative configuration known as the Distributed Nagios Executor (DNX) introduces the concept of worker nodes [36]. In this configuration, a master Nagios server dispatches the service checks from its own schedule to a series of worker nodes. The master maintains all configuration; workers only require the IP address of the master. Worker nodes can join and leave in an ad hoc manner without disrupting monitoring services. This is beneficial for cloud monitoring, allowing an elastic pool of workers to scale in proportion to the monitored servers. If, however, the master fails, all monitoring will cease. Thus, for anything other than the most trivial deployments, additional failover mechanisms are necessary.
Nagios is not a perfect fit for cloud monitoring. There is an extensive amount of manual configuration required, including the need to modify configuration when monitored VMs are instantiated and terminated. Performance is an additional issue: many Nagios service checks are resource intensive and a large number of service checks can result in significant CPU and IO overhead. Internally, Nagios relies upon a series of pipes, buffers and queues which can become bottlenecks when monitoring large scale systems [37]. Many of these checks are, by default, non-parallelised and block until complete, which severely limits the number of service checks that can be performed. While most, if not all, of these issues can be overcome through the use of plugins and third party patches, this requires significant labour. Nagios was simply never designed or intended for monitoring large scale cloud systems and therefore requires extensive retrofitting to be suitable for the task [38]. The classification of Nagios is dependent upon its configuration; due to its age and extensive plugin library, Nagios has numerous configurations.
Collectd
Collectd [39] is an open source tool for collecting monitoring state which is highly extensible and supports a wide range of common applications, logs and output formats. It is used by many cloud providers as part of their own monitoring solutions, including Rightscale [40]. One of the appeals of collectd is its network architecture; unlike most tools it utilises IPv6 and multicast in addition to regular IPv4 unicast. Collectd uses a push model: monitoring state is pushed to a multicast group or single server using one of the aforementioned technologies (a schematic sketch of this push model follows the list of companion tools below). Data can be pushed to several storage backends; most commonly RRDTool [32] is used, but MongoDB, Redis, MySQL and others are supported. This allows for a very loosely coupled monitoring architecture whereby monitoring servers, or groups of monitoring servers, need not be aware of clients in order to collect state from them. The sophisticated networking and loose coupling allow collectd to be deployed in numerous different topologies, from simple two tier architectures to complex multicast hierarchies. This flexibility makes collectd one of the more popular emerging tools for cloud monitoring. Collectd, as the name implies, is primarily concerned with the collection and transmission of monitoring state. Additional functions, including the storage of state, are achieved through plugins. There is no functionality provided for analysing, visualising or otherwise consuming collected state. Collectd attempts to adhere to UNIX principles and eschews non collection related functionality, relying on third party programs to provide additional functionality if required. Tools which are frequently recommended for use with collectd include:
Logstash [41] for managing raw logfiles and performing text search and analysis, which is beyond the purview of collectd
StatsD [42] for aggregating monitoring state and sending it to an analysis service
Bucky [43] for translating data between StatsD, Collectd and Graphite's formats.
Graphite [44] for providing visualization and graphing
drraw [45], an alternative tool for visualising RRDTool data
Riemann [9] for event processing
Cabot [46] for alerting
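The push model referred to above can be sketched schematically. The fragment below pushes a metric over UDP to a multicast group so that any listening server can collect it without prior knowledge of the client; the JSON payload is illustrative (collectd itself uses a binary network protocol) and the multicast group and port are assumptions chosen for illustration.

    # Schematic sketch of a collectd-style push model: a client periodically
    # pushes metrics to a multicast group without the server knowing about it
    # in advance. The JSON payload, group and port are illustrative; collectd
    # itself uses a binary protocol.
    import json
    import socket
    import time

    GROUP, PORT = "239.192.74.66", 25826   # illustrative multicast group/port

    def push_metric(name, value):
        msg = json.dumps({"host": socket.gethostname(), "metric": name,
                          "value": value, "time": time.time()}).encode()
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
        sock.sendto(msg, (GROUP, PORT))
        sock.close()

    push_metric("load.shortterm", 0.42)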
Collectd, in conjunction with additional tools, makes for a comprehensive cloud monitoring stack. There is, however, no complete, ready to deploy distribution of a collectd based monitoring stack. Therefore, there is significant labour required to build, test and deploy the full stack. This is prohibitive for organisations that lack the resources to roll their own monitoring stack; collectd based monitoring solutions are therefore available only to those organisations that can develop and maintain their own stack. For other organisations, a more complete solution or a monitoring as a service tool may be more appropriate.
Riemann
Riemann [9] is an event based distributed systems monitoring tool. Riemann does not focus on data collection, but rather on event submission and processing. Events are representations of arbitrary metrics which are generated by clients, encoded using Google Protocol Buffers [47] and additionally contain various metadata (hostname, service name, time, TTL, etc). On receiving an event, Riemann processes it through a stream. Users can write stream functions in a Clojure based DSL to operate on streams. Stream functions can handle events, merge streams, split streams and perform various other operations. Through stream processing, Riemann can check thresholds, detect anomalous behaviour, raise alerts and perform other common monitoring use cases. Designed to handle thousands of events per second, Riemann is intended to operate at scale.
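Riemann's streams are written in Clojure; the fragment below is a language-neutral Python sketch of the same idea: events are records carrying metadata, and a stream function examines each event and forwards an alert when a threshold is exceeded. The field names mirror Riemann's event schema, but the code is conceptual rather than Riemann's actual DSL.

    # Conceptual sketch of Riemann-style stream processing (not the Clojure DSL).
    import time

    def make_event(host, service, metric, ttl=60, state="ok"):
        # Riemann events carry metadata such as host, service, time and ttl.
        return {"host": host, "service": service, "metric": metric,
                "time": time.time(), "ttl": ttl, "state": state}

    def threshold_stream(threshold, child):
        # A stream function: inspect each event and forward alerts downstream.
        def stream(event):
            if event["metric"] > threshold:
                child({**event, "state": "critical"})
        return stream

    alert = lambda event: print("ALERT:", event["host"], event["service"], event["metric"])
    cpu_stream = threshold_stream(0.9, alert)

    cpu_stream(make_event("web-1", "cpu", 0.95))   # triggers an alert
    cpu_stream(make_event("web-2", "cpu", 0.40))   # passes silently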
Events can be generated and pushed to Riemann in one of two ways. Either applications can be extended to generate events or third party programs can monitor applications and push events to Riemann. Nagios, Ganglia and collectd all support forwarding events to Riemann. Riemann can also export data to numerous graphing tools. This allows Riemann to integrate into existing monitoring stacks or for new stacks to be built around it.
Riemann is relatively new software with a developing user base. Due to its somewhat unique architecture, scalability and integration options, Riemann is a valuable cloud monitoring tool.
sFlow and host sFlow
sFlow [48] is a monitoring protocol designed for network monitoring. The goal of sFlow is to provide an interoperable monitoring standard that allows equipment from different vendors to be monitored by the same software. Monitored infrastructure runs an sFlow agent which receives state from the local environment and builds and sends sFlow datagrams to a collector. A collector is any software which is capable of receiving sFlow encoded data. The original sFlow standard is intended purely for monitoring network infrastructure. Host sFlow [49] extends the base standard to add support for monitoring applications, physical servers and VMs. sFlow is therefore one of the few protocols capable of obtaining state from the full gamut of data centre technologies.
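At its simplest, a collector is just a process listening for sFlow datagrams. The following sketch listens on UDP port 6343 (the port conventionally used for sFlow) and reports the size and source of each datagram; a real collector would additionally decode the XDR-encoded sample structures, which is omitted here.

    # Minimal sketch of an sFlow "collector": receive datagrams from agents.
    # Decoding the XDR-encoded sample structures is omitted; a real collector
    # would parse flow and counter samples from each datagram.
    import socket

    SFLOW_PORT = 6343   # conventional sFlow collector port

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", SFLOW_PORT))

    while True:
        datagram, (addr, port) = sock.recvfrom(65535)
        print(f"received {len(datagram)} byte sFlow datagram from {addr}")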
sFlow is a widely implemented standard. In addition to sFlow agents being available for most pieces of network equipment, operating systems and applications, there are a large number of collectors. Collectors vary in functionality and scalability, ranging from basic command line tools and simple web applications to large, complex monitoring and analysis systems. Other monitoring tools discussed here, including Nagios and Ganglia, are also capable of acting as sFlow collectors.
Rather than being a monitoring system, sFlow is a protocol for encoding and transmitting monitoring data, which makes it unique amongst the other tools surveyed here. Many sFlow collectors are special purpose tools intended for DDoS prevention [50], network troubleshooting [51] or intrusion detection. At present, there is no widely adopted general purpose sFlow based monitoring tool. sFlow is, however, a potential protocol for future monitoring solutions.
Logstash
Logstash is a monitoring tool quite unlike the vast majority of those surveyed here. Logstash is concerned not with the collection of metrics but rather with the collection of logs and event data. Logstash is part of the ElasticSearch family, a set of tools intended for efficient distributed real time text analytics. Logstash is responsible for parsing and filtering log data, while other parts of the ElasticSearch toolset, notably Kibana (a browser based analytics tool), are responsible for analysing and visualising the collected log data. Logstash supports an event processing pipeline not dissimilar to Riemann's, allowing chains of filter and routing functions to be applied to events as they pass through Logstash.
Similar to other monitoring tools, Logstash runs a small agent on each monitored host, referred to as a shipper. The shipper is a Logstash instance which is configured to take inputs from various sources (stdin, stderr, log files etc) and then 'ship' them over AMQP to an indexer. An indexer is another Logstash instance which is configured to parse, filter and route the logs and events that arrive via AMQP. The indexer then exports parsed logs to ElasticSearch, which provides analytics tools. Logstash additionally provides a web interface which communicates with ElasticSearch in order to analyse, retrieve and visualise log data.
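The shipper/indexer split can be sketched conceptually. The fragment below follows a log file and ships each new line as a JSON event; it stands in for a shipper only in spirit, as the transport (a plain TCP socket here), endpoint and event fields are illustrative assumptions rather than Logstash's actual broker protocol.

    # Conceptual sketch of a log "shipper": follow a file and forward new lines
    # as JSON events to an indexer. Transport and field names are illustrative;
    # Logstash itself ships events via a message broker such as Redis or AMQP.
    import json
    import socket
    import time

    def ship(logfile, indexer=("indexer.example.org", 5000)):
        sock = socket.create_connection(indexer)
        with open(logfile) as f:
            f.seek(0, 2)                      # start at the end of the file
            while True:
                line = f.readline()
                if not line:
                    time.sleep(0.5)
                    continue
                event = {"@timestamp": time.time(), "message": line.rstrip(),
                         "path": logfile, "host": socket.gethostname()}
                sock.sendall((json.dumps(event) + "\n").encode())

    # ship("/var/log/syslog")   # example invocation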
Logstash is written in JRuby and communicates using AMQP, with Redis serving as a message broker. Scalability is a chief concern for Logstash; it distributes work via Redis which, despite relying on a centralised model, can conceptually allow Logstash to scale to thousands of nodes.
MonALISA
MonALISA (Monitoring Agents in A Large Integrated Services Architecture) [52], [53] is an agent based monitoring system for globally distributed grid systems. MonALISA uses a network of JINI [54] services to register and discover a variety of self-describing agent-based subsystems that dynamically collaborate to collect and analyse monitoring state. Agents can interact with conventional monitoring tools, including Ganglia and Nagios, and collect state from a range of applications in order to analyse and manage systems on a global scale.
MonALISA sees extensive use within the eScience and High Energy Physics communities, where large scale, globally distributed virtual organisations are common. The challenges those communities face are relatively unique, but few other monitoring systems address widespread geographic distribution, a significant concern for cloud monitoring.
visPerf
visPerf [55] is a grid monitoring tool which provides scalable monitoring for large distributed server deployments. visPerf employs a hybrid peer-to-peer and centralised monitoring approach. Within geographically close subsets of the grid, a monitoring agent on each monitored server disseminates state via unicast to a local monitoring server. The local monitoring servers communicate with each other using a peer-to-peer protocol. A central server, the visPerf controller, collects state from each of the local servers and visualises it.
Communication in visPerf is facilitated by a bespoke protocol over TCP or, alternatively, XML-RPC. Either may be used, allowing external tools to interact with visPerf and obtain monitoring state. Each monitoring agent collects state using conventional UNIX tools including iostat, vmstat and top. Additionally, visPerf supports the collection of log data and uses a grammar based strategy to specify how to parse it. The visPerf controller can communicate with monitoring agents in order to change the rate of data collection, what data is obtained and other variables. Bi-directional communication between monitoring agents and monitoring servers is an uncommon feature which enables visPerf to adapt to changing monitoring requirements with a reduced need to manually edit configuration.
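Since visPerf exposes an XML-RPC interface, external tools can in principle query it with a few lines of code. The sketch below uses Python's standard xmlrpc client; the endpoint URL and the getMetrics method name are hypothetical, as the exact RPC surface is not documented in this section.

    # Hypothetical sketch of querying a visPerf monitoring server over XML-RPC.
    # The endpoint and method name are assumptions for illustration only.
    import xmlrpc.client

    controller = xmlrpc.client.ServerProxy("http://visperf-controller.example.org:8080/")

    # e.g. ask a (hypothetical) getMetrics method for the recent state of one host
    state = controller.getMetrics("node42.example.org")
    print(state)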
GEMS
Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems (GEMS) is a tool designed for cluster and grid monitoring [5]. Its primary focus is failure detection at scale. GEMS is similar to Astrolabe in that it divides monitored nodes into a series of layers which form a tree. A gossip protocol [56] is employed to propagate resource and availability information throughout this hierarchy. A gossip message at level k encapsulates the state of all nodes at level k and all levels above k. Layers are therefore formed based on the work being performed at each node, with more heavily loaded nodes being placed at the top of the hierarchy where less monitoring work is performed. This scheme is primarily used to propagate four data structures: a gossip list, a suspect vector, a suspect matrix and a live list. The gossip list contains the number of intervals since a heartbeat was received from each node, the suspect vector stores information regarding suspected failures, the suspect matrix is the aggregate of all nodes' suspect vectors and the live list is a vector containing the availability of each node. This information is used to implement a consensus mechanism over the tree which corroborates missed heartbeats and failure suspicions to detect failures and network partitions.
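The gossip list and suspect vector can be illustrated with a small sketch. Below, each node records the number of gossip intervals since it last heard from each peer and marks peers as suspected once a cleanup threshold is exceeded; the threshold and merge rule are simplified assumptions rather than GEMS' exact algorithm.

    # Simplified sketch of GEMS-style gossip bookkeeping: a gossip list of
    # heartbeat ages and a suspect vector derived from it. The cleanup
    # threshold and merge rule are illustrative simplifications.
    T_CLEANUP = 5   # intervals without a heartbeat before a node is suspected

    class GossipState:
        def __init__(self, nodes):
            self.gossip_list = {n: 0 for n in nodes}      # intervals since last heartbeat
            self.suspect_vector = {n: False for n in nodes}

        def tick(self):
            # One gossip interval passes with no news: ages grow, suspicions update.
            for n in self.gossip_list:
                self.gossip_list[n] += 1
                self.suspect_vector[n] = self.gossip_list[n] >= T_CLEANUP

        def merge(self, other_gossip_list):
            # On receiving a gossip message, keep the freshest (smallest) age seen.
            for n, age in other_gossip_list.items():
                if age < self.gossip_list.get(n, age):
                    self.gossip_list[n] = age
                    self.suspect_vector[n] = age >= T_CLEANUP

    state = GossipState(["n1", "n2", "n3"])
    for _ in range(6):
        state.tick()
    state.merge({"n2": 1})          # fresh news about n2 clears its suspicion
    print(state.suspect_vector)     # {'n1': True, 'n2': False, 'n3': True}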
This architecture can be extended to collect and propagate resource information. Nodes can operate a performance monitoring component which contains a series of sensors, small programs which collect basic system metrics including CPU, memory, disk IO and so forth, and propagate them throughout the hierarchy. Additionally, GEMS provides an API to allow this information to be made available to external management tools, middleware or other monitoring systems. These external systems are expected to analyse the collected monitoring state and take appropriate action; GEMS provides no mechanism for this. GEMS does, however, include significant room for extension and customisation by the end user: it can collect and propagate arbitrary state and has the capacity for real time programmatic reconfiguration.
Unlike Astrolabe, to which GEMS bears similarity, prototype code for the system is available [57]. This implementation has received a small scale (150 node) evaluation which demonstrates the viability of the architecture and of gossip protocols for monitoring at scale. Since its initial release, GEMS has not received widespread attention and has not been evaluated specifically for cloud monitoring. Despite this, the ideas expressed in GEMS are relevant to cloud computing: its problem domain of large heterogeneous clusters is not entirely dissimilar to some cloud environments, and the mechanism for programmatic reconfiguration is of definite relevance to cloud computing.
Reconnoiter
Reconnoiter [58] is an open source monitoring tool which integrates monitoring with trend analysis. Its design goal is to surpass the scalability of previous monitoring tools and cope with thousands of servers and hundreds of thousands of metrics. Reconnoiter uses an n-tier architecture similar to Nagios. The hierarchy consists of three components: noitd, stratcond and the Reconnoiter front end. An instance of the monitoring agent, noitd, runs in each server rack or datacenter (dependent upon scale) and performs monitoring checks on all local infrastructure. stratcond aggregates data from multiple noitd instances (or other stratcond instances) and pushes aggregates to a PostgreSQL database. The front end, named Reconnoiter, performs various forms of trend analysis and visualises monitoring state.
Reconnoiter represents an incremental development from Nagios and other hierarchical server monitoring tools. It utilises a very similar architecture to previous tools but is specifically built for scale, with noitd instances having an expected capacity of 100,000 service checks per minute. Despite being available for several years, Reconnoiter has not seen mass adoption, possibly due to the increasing abandonment of centralised monitoring architectures in favour of decentralised alternatives.
Other monitoring systems
There are other related monitoring systems which are relevant to cloud monitoring but are either similar to the previously described systems, poorly documented or otherwise uncommon. These systems include:
Icinga [59] is a fork of Nagios which was developed to overcome various issues in the Nagios development process and to implement additional features. Icinga, by its very nature, includes all of Nagios' features and additionally provides support for additional storage backends, built in support for distributed monitoring, an SLA reporting mechanism and greater extensibility.
Zenoss [60] is an open source monitoring system similar in concept and use to Nagios. Zenoss includes some functionality not present in a standard Nagios install, including automatic host discovery, inventory and configuration management, and an extensive event management system. Zenoss provides a strong basis for customisation and extension, offering numerous means to add additional functionality, including support for the Nagios plugin format.
Cacti [61], GroundWork [62], Munin [63], OpenNMS [64], Spiceworks [65] and Zabbix [66] are all enterprise monitoring solutions that have seen extensive use in conventional server monitoring and now see some usage in cloud monitoring. These systems have similar functionality and utilise a similar set of backend tools, including RRDTool, MySQL and PostgreSQL. These systems, like several discussed in this section, were designed for monitoring fixed server deployments and lack the support for elasticity and scalability that more cloud-specific tools offer. The use of these tools in the cloud domain is unlikely to continue as alternative, better suited tools become increasingly available.