A computer cluster is a group of computers, connected by a local area network (LAN), that work so closely together that they can be treated as a single computer. Computer clusters were created to improve performance. An office network can be considered a cluster in this sense: multiple computers and users are tied to a server to exchange information and perform their assigned tasks, and availability of service is ensured by having one or more fallback computers ready to take over in the event of a failure in the primary system.

Conceptual Basis for High-Availability Clusters

The overall concept of high-availability clusters is simple. A single server is, by itself, a single point of failure. If something causes the server to shut down (a power failure, a software glitch, a virus attack, a natural disaster), the whole network becomes non-operational until the problem is solved, the server is restarted, and the network is brought back up.

A high-availability cluster has (at minimum) two 'nodes': a primary and a backup, with the backup on standby until something happens to the primary. When it does, the backup takes over operations seamlessly, without the need for outside intervention.
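To make this concrete, here is a minimal sketch in Python of the heartbeat pattern that failover commonly relies on: the standby node periodically checks whether the primary is reachable and promotes itself after several consecutive failures. The host name, port, and promote_self() behaviour are hypothetical placeholders, not the API of any particular cluster product.

    import socket
    import time

    PRIMARY_ADDR = ("primary.example.com", 7000)  # hypothetical heartbeat endpoint
    CHECK_INTERVAL = 1.0  # seconds between health checks
    MAX_MISSES = 3        # consecutive failed checks before failover

    def primary_is_alive() -> bool:
        """Health check: can we open a TCP connection to the primary?"""
        try:
            with socket.create_connection(PRIMARY_ADDR, timeout=1.0):
                return True
        except OSError:
            return False

    def promote_self() -> None:
        """Placeholder for taking over: claim the service address, start serving."""
        print("Primary unreachable; backup promoting itself to primary.")

    def standby_loop() -> None:
        misses = 0
        while True:
            if primary_is_alive():
                misses = 0
            else:
                misses += 1
                if misses >= MAX_MISSES:
                    promote_self()
                    return  # this node now acts as the primary
            time.sleep(CHECK_INTERVAL)

Requiring several consecutive misses, rather than reacting to a single one, is what keeps a momentary network hiccup from triggering an unnecessary failover.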

While simple in concept, the actual design and implementation of a high-availability cluster is a complex undertaking involving both software design and hardware architecture.

Issues and Concerns with High-Availability Clusters

One issue is keeping the primary and backup nodes identical, running the same programs and operating systems and holding the same data, in real time, so that the switch from primary node to backup node happens with no disruption.

Think of it this way: most commercial software allows for automatic backup of data at pre-set or manually programmed intervals. If something happens to one's computer, it is simply a matter of finding the problem (e.g. a pulled power cord), restarting the computer, and resuming from the backed-up data. The problem, of course, is that a fair amount of work can be done between the time a backup is taken and the moment the disruption occurs, and that work is lost when operations resume.

An HA cluster must be designed so that no data or work is lost when something happens to the primary node, and so that the switch to the backup is automatic and effortless. This means the data on the backup must be updated continuously, in real time.
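One common way to meet this requirement is synchronous replication: the primary does not acknowledge a write until the backup has confirmed the same data, so the backup never lags behind. The following Python sketch illustrates the idea with in-memory dictionaries standing in for the two nodes' storage; the class and method names are illustrative, not taken from any real replication product.

    class ReplicatedStore:
        """Toy synchronous replication: a write is acknowledged only after
        both the primary copy and the backup copy have been updated."""

        def __init__(self) -> None:
            self.primary = {}  # stands in for the primary node's storage
            self.backup = {}   # stands in for the backup node's storage

        def write(self, key: str, value: str) -> None:
            self.primary[key] = value
            # In a real cluster this is a network round trip: the client is
            # not told the write succeeded until the backup confirms it.
            self.backup[key] = value

        def read_after_failover(self, key: str) -> str:
            """If the primary dies, the backup serves reads with nothing lost."""
            return self.backup[key]

    store = ReplicatedStore()
    store.write("order-1042", "paid")
    assert store.read_after_failover("order-1042") == "paid"

The trade-off is latency: every write waits on the backup's confirmation, which is the price paid for guaranteeing that no acknowledged work can be lost.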

Another issue concerns redundancy. While a cluster can be made up of two nodes, there is still a risk of simultaneous failure (especially if both nodes are in a single location). As such, most clusters have multiple backup nodes, interconnected through multiple links, to ensure immediate fallback even when several failures occur at once. Here again, the software design and hardware architecture of the cluster must be carefully considered by anyone setting up an HA cluster.
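As a rough illustration of fallback across several nodes, the sketch below walks an ordered list of backups and promotes the first one that still responds to a health check. The node names and health port are hypothetical, and a production cluster would add a quorum or consensus protocol so that two surviving nodes never promote themselves simultaneously.

    import socket
    from typing import Optional

    # Hypothetical failover chain, in order of preference
    # (e.g. same rack first, then another rack, then a remote site).
    NODES = ["node-a.example.com", "node-b.example.com", "node-c.example.com"]
    HEALTH_PORT = 7000  # hypothetical health-check port

    def is_reachable(node: str) -> bool:
        """Simple TCP health check standing in for a real heartbeat."""
        try:
            with socket.create_connection((node, HEALTH_PORT), timeout=1.0):
                return True
        except OSError:
            return False

    def pick_new_primary(failed: str) -> Optional[str]:
        """Promote the first healthy node after the failed primary;
        return None only in a total outage."""
        for node in NODES:
            if node != failed and is_reachable(node):
                return node
        return None

Ordering the candidates deliberately, with nearby nodes first and remote ones last, lets the cluster survive a whole-site failure while still preferring the fastest fallback in the common case.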