+ +

+ CMS Features +

+ Problem Specification +

+ Original Problem +

+ This is the original specification given to us when we + started the project. The i-scream central monitoring system + meets this specification, and aims to extend it further. + This is, however, where it all began. +

+ Centralised Machine Monitoring +

+ The Computer Science department has a number of different + machines running a variety of different operating systems. + One of the tasks of the systems administrators is to make + sure that the machines don't run out of resources. This + involves watching processor loads, available disk space, + swap space, etc. +

+ It isn't practical to monitor a large number of machines by + logging on and running commands such as 'uptime' on the + Unix machines, or by using Performance Monitor for NT + servers. Thus this project is to write monitoring software + for each supported platform which reports resource usage + back to one centralised location. Systems administrators + would then be able to monitor all machines from this + centralised location. +

+ Once this basic functionality is implemented it could + usefully be expanded to include logging of resource usage + to identify long-term trends/problems, and alerter services + which can directly contact sysadmins (or even the general + public) to bring attention to problem areas. Ideally it + should be possible to run multiple instances of the + reporting tool (with all instances being updated in + real time) and to be able to run the reporting tool both + as a stand-alone application and embedded in a web page. +

+ This project will require you to write code for the Unix + and Win32 APIs using C, and to understand how the underlying + operating systems manage resources. It will also require + some network/distributed systems code and a GUI front end + for the reporting tool. It is important for students + undertaking this project to understand the importance of + writing efficient and small code, as the end product will + be most useful when machines start to run out of + processing power/memory/disk. +

+ John Cinnamond (email jc), whose idea this is, will provide + technical support for the project. +

+ Features +

+ Key Features of The System +

A centrally stored, dynamically reloaded, system-wide + configuration system +
A fully extensible monitoring system: nothing except + the Host (which generates the data) and the Clients (which + view it) knows any details about the data being sent, + allowing the data to be modified without changes to the server + architecture. +
Central server and reporting tools all Java-based for + multi-platform portability +
Distribution of core server components over CORBA to + allow appropriate components to run independently and to + allow new components to be written to conform with the + CORBA interfaces. +
Use of CORBA to create a hierarchical set of data entry + points to the system allowing the system to handle event + storms and remote office locations. +
One location for all system messages, despite the system being + distributed. +
XML data protocol used to make data processing and + analysis easily extensible +
A stateless server which can be moved and restarted at + will, while Hosts, Clients, and reporting tools are + unaffected and simply reconnect when the server is + available again. +
Simple and open end protocols to allow easy extension + and platform porting of Hosts and Clients. +
Self monitoring, as all data queues within the system + can be monitored and raise alerts to warn of event storms + and impending failures (should any occur). +
A variety of web-based information displays based on + Java/SQL reporting and PHP on-the-fly page generation to + show the latest alerts and data +
Large, overhead-monitor, Helpdesk-style displays for the + latest alerting information +
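
+ As an illustration of the "server knows nothing about the data" idea in the list above, the following minimal sketch walks an XML packet generically, so new fields need no change to the code that carries them. The packet layout and the function name flatten are hypothetical, invented for this example, and are not the actual i-scream protocol. +

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Turn an XML packet into {dotted.path: text} without knowing its schema."""
    root = ET.fromstring(xml_text)
    values = {}

    def walk(node, path):
        # Record any non-empty text found at this node.
        text = (node.text or "").strip()
        if text:
            values[path] = text
        # Recurse into children, extending the dotted path with each tag name.
        for child in node:
            walk(child, path + "." + child.tag)

    walk(root, root.tag)
    return values


# Hypothetical packet layout, invented for illustration only.
packet = "<packet><cpu><idle>93.5</idle></cpu><disk><free>1024</free></disk></packet>"
fields = flatten(packet)
```

+ A Host could add a new element to the packet and a Client could look it up by path, while the code in between stays unchanged. Note that in this sketch repeated sibling tags overwrite one another, which a real implementation would need to handle. +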

+ An Overview of the i-scream Central Monitoring System +

+ The i-scream system monitors status and performance + information obtained from machines feeding data into it and + then displays this information in a variety of ways. +

+ This data is obtained by running small + applications on the reporting machines. These applications + are known as "Hosts". The i-scream system provides a range + of Hosts which are designed to be small and lightweight in + their configuration and operation. See the website and + appropriate documentation to locate currently available + Host applications. A Host is simply told where to + contact the server, after which it is fully + autonomous. Hosts obtain their configuration from the + server, detect changes in that configuration, send data + packets (via UDP) containing monitoring information, and + periodically send so-called "Heartbeat" packets (via TCP) + to indicate to the server that they are still alive. +
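
+ The two Host-to-server channels described above might be sketched as follows. This is only a rough illustration: the function names, the heartbeat message contents and the payload are assumptions made for the example, not the real i-scream Host protocol. +

```python
import socket


def send_data_packet(server_addr, xml_payload):
    """Fire-and-forget: one UDP datagram per monitoring sample."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(xml_payload.encode("utf-8"), server_addr)


def send_heartbeat(server_addr, message=b"HEARTBEAT\n", timeout=5.0):
    """Open a short TCP connection so the server knows the Host is alive.

    The message contents are hypothetical. Returns True if the connection
    succeeded and the message was sent, False on any socket error.
    """
    try:
        with socket.create_connection(server_addr, timeout=timeout) as conn:
            conn.sendall(message)
        return True
    except OSError:
        return False
```

+ UDP suits the frequent data packets because a lost sample is soon replaced by the next one, while the TCP heartbeat gives the Host a definite answer about whether the server is reachable. +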

+ The data is then fed into the i-scream server, which + splits it two ways. First, it places the data in a + database system, typically MySQL based, for later + extraction and processing by the i-scream report generation + tools. Second, it passes the data on to real-time "Clients", which + handle the data as it enters the system. The system itself + has an internal real-time client called the "Local Client", + which runs a series of Monitors that analyse + the data. One of these Monitors also feeds the data to + a file repository, which is updated as new data comes in + for each machine. This data is then read and displayed by + the i-scream web services to provide a web interface to the + data. The system also allows TCP connections from non-local + clients (such as the i-scream supplied Conient). These + applications provide a real-time view of the data as it + flows through the system. +
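
+ The two-way split described above can be pictured with a small fan-out sketch. The class name Dispatcher and its methods are hypothetical, and plain callables stand in for the database writer and the real-time Clients. The actual server is Java and CORBA based, so this shows only the shape of the data flow. +

```python
class Dispatcher:
    """Send each packet to a persistent store, then to every live client."""

    def __init__(self, store):
        self._store = store      # callable that persists one packet
        self._clients = []       # callables that receive packets in real time

    def attach(self, client):
        self._clients.append(client)

    def dispatch(self, packet):
        # First path: keep the packet for later report generation.
        self._store(packet)
        # Second path: hand the packet to each real-time client.
        for client in list(self._clients):
            try:
                client(packet)
            except Exception:
                # A failing client is dropped so it cannot block the others.
                self._clients.remove(client)
```

+ For example, attaching one callable for a Local Client and another for a remote viewer means every packet reaches the store and both consumers, and a viewer that fails is simply detached. +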

+ The final section of the system links the Local Client + Monitors to an alerting system. These Monitors can be + configured to detect when values in the data cross threshold + levels. When a threshold is breached an alert is raised. + As long as the alert persists, it is escalated through + four live levels: NOTICE, WARNING, CAUTION and CRITICAL. + The alerting system tracks the current level, and when a + configured level is reached, the corresponding alerting mechanisms fire + through whatever medium they are configured to use. +
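
+ The escalation behaviour described above might look roughly like the sketch below. It assumes that an alert moves up one level after a fixed number of consecutive breaching samples. That rule, the class name ThresholdMonitor and the parameter samples_per_level are assumptions made for illustration, and the real Monitors may well escalate on elapsed time instead. +

```python
LEVELS = ["NOTICE", "WARNING", "CAUTION", "CRITICAL"]


class ThresholdMonitor:
    """Escalate an alert while a value stays above its threshold."""

    def __init__(self, threshold, samples_per_level=3):
        self.threshold = threshold
        self.samples_per_level = samples_per_level  # assumed escalation rule
        self.breaches = 0                           # consecutive breaching samples

    def update(self, value):
        """Feed one sample; return the current level name, or None if OK."""
        if value <= self.threshold:
            self.breaches = 0   # the alert clears as soon as the value recovers
            return None
        self.breaches += 1
        # Move up one level every samples_per_level breaches, capped at CRITICAL.
        index = min((self.breaches - 1) // self.samples_per_level, len(LEVELS) - 1)
        return LEVELS[index]
```

+ An alerting mechanism could then compare the returned level with its own configured minimum, for instance firing only once LEVELS.index(level) reaches the index of CAUTION. +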

+ +