Planning for a Cluster Service Implementation

A number of factors have to be determined when you plan a Cluster Service implementation. A few items which you should include in your planning phase are listed here:

  • Determine which applications and network services are the mission-critical applications of the organization that need high availability.
  • Determine which clustering technology to implement to ensure high levels of availability for the mission-critical applications previously identified. Here, you should identify which applications should be used with Cluster Service, and which should be used with Network Load Balancing (NLB).
  • After you have decided on the clustering technology, you have to determine the server capacity requirements.
  • Determine the network risks.
  • Determine all potential points of failure and network connectivity issues.
  • Determine whether a preferred node is to be configured to support a specific resource.
  • Determine the failover timing properties and failback timing properties which you are going to implement.
  • Determine the role of each server in the context of the applications and services it will run. A server can be configured as a member server or as a domain controller.
  • Determine the cluster configuration model which suits the requirements of the organization.
  • After you have decided on the clustering technology and cluster configuration model, you have to determine the server hardware requirements.
  • Determine how the servers in the cluster are going to be secured.
  • Determine how you are going to back up data of the cluster.
  • When creating a new cluster, you would need to provide the following information:

    • The host name to specify for the cluster.
    • The IP address to set for the cluster.
    • The domain name that will host the cluster.
    • The name and password for the Cluster Service account.
  • When determining the applications for the cluster and failover, consider the following:

    • The application has to use Transmission Control Protocol/Internet Protocol (TCP/IP), or Distributed Component Object Model (DCOM) and Named Pipes, or Remote Procedure Call (RPC) over TCP/IP to function in the cluster.
    • NTLM authentication must be supported by the application.
    • An application has to be capable of storing its data on the disks connected to the shared bus if it is to be included in the failover process.

Requirements for Installing Cluster Service

A few requirements for installing Cluster Service are listed below:

  • Administrative permissions are needed on each node in the cluster.
  • There should be sufficient disk space on the system drive and shared device for the Cluster Service installation.
  • The appropriate network adapter drivers must be used.
  • The network adapters must have the proper TCP/IP configurations.
  • File and Print Sharing for Microsoft Networks has to be installed on a node to configure Cluster Service.
  • The nodes should be configured with the same hardware and device drivers.
  • Each node must belong to the same domain.
  • The domain account utilized should be identical on each cluster node.
  • The system paging file must have sufficient space to prevent decreased performance. When the file has insufficient space, it can result in a system lockup.
  • It is good practice to examine the system and event logs prior to, and after installing Cluster Service.
  • Before installing any additional nodes for the cluster, first ensure that the previously installed node is running.
  • You can use System Monitor to troubleshoot virtual memory issues.
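
Several of these requirements can be verified from the command prompt before you run the installation. The commands below are standard Windows tools; the node and domain controller names are placeholders for your environment:

```shell
rem Verify the TCP/IP configuration of each network adapter on the node.
ipconfig /all

rem Confirm connectivity to the other candidate node and to a domain controller
rem (Node2 and DC1 are placeholder names).
ping Node2
ping DC1

rem Confirm the node's domain membership and your administrative rights.
net config workstation
net localgroup Administrators
```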

A few shared disk, hardware, and network specific considerations for implementing Cluster Service are listed here:

  • The shared drives must be physically attached to the nodes that belong to the cluster.
  • The NTFS file system should be used to format the partitions of the shared disk.
  • Shared disks should be configured as a Basic disk.
  • The SCSI drives and adapters must each have a unique SCSI Identifier (ID).
  • Each server must have two PCI NICs.
  • The storage host adapter for SCSI or Fibre Channel must be separate from the adapter that services the system and boot disks.
  • An external drive which has multiple RAID configured drives must be connected to the servers of the cluster.
  • A cluster must have a unique NetBIOS name.
  • Nodes which are part of the cluster must belong to the same domain.

Planning Resource Groups for the Cluster

The hardware and software components of the cluster, including the services and applications it hosts, are called resources. Resources can be grouped to form a resource group. The specific properties of the resource group and of the application or service determine how Cluster Service takes the resource group offline and moves it between nodes.

The resources generally included in a resource group are:

  • Application hosted
  • IP Address
  • Network Name
  • Physical Disk

The factors to consider when planning resource groups for your cluster, as well as a few recommendations are listed here:

  • Resources must be grouped based on function and resource dependencies.
  • When one resource is dependent on another resource, the resources must reside in the same resource group.
  • Resources that are dependent on each other always run on the same node, because a resource group resides on a single node at any given time.
  • Consider drawing a dependency tree diagram to assist you when you are planning the resource groups for the cluster. The dependency tree should contain the resource groups and their related dependencies.
  • You can use the process below as a guideline when planning resource groups:

    • Identify the applications which are to be hosted in the cluster.
    • Identify those resource groups which will need failover capabilities.
    • Identify the dependencies for each application.
    • Identify which other resources such as file shares (not applications) are to be included in the cluster.

All server clusters have a default cluster group. The default cluster group has the following resources:

  • Quorum disk.
  • Cluster IP Address
  • Cluster name

Planning Failover Policies for the Cluster

As part of planning for a cluster implementation, you have to determine the failover policy for the cluster. The failover policy for a resource group determines how Cluster Service handles the resource when failover is initiated.

The options which can be configured to define the failover policy for a resource group are:

  • Failover timing; Cluster Service starts the failover process when a resource group fails. The resource group is then moved to another node in the cluster. You can configure Cluster Service to attempt to restart the resources of the failed resource group before it moves the group to another node.
  • Failback timing; the failed resource group is moved back to its primary node when that node is online again. You can configure failback to occur only during off-peak hours.
  • Preferred node; setting a preferred node for a resource group ensures that the resource group is automatically moved back to its specified preferred node.
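
These policies map onto the common properties of a resource group, which can also be set with the Cluster.exe command-line tool. The sketch below assumes a cluster named MyCluster and a group named SQL Group; adjust the names and values to suit your environment:

```shell
rem Failover policy: allow up to 10 failures within a 6-hour period
rem before the group is left in the failed state.
cluster /cluster:MyCluster group "SQL Group" /prop FailoverThreshold=10 FailoverPeriod=6

rem Enable failback, restricted to the 22:00-06:00 off-peak window.
cluster /cluster:MyCluster group "SQL Group" /prop AutoFailbackType=1 FailbackWindowStart=22 FailbackWindowEnd=6

rem Designate NodeA as the preferred owner of the group.
cluster /cluster:MyCluster group "SQL Group" /setowners:NodeA,NodeB
```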

Planning Security for the Cluster

Because clusters host mission-critical applications and services, securing the cluster is essential.

A few strategies for securing a cluster are listed here:

  • Physically secure the nodes of the cluster.
  • Restrict physical access to the cluster infrastructure.
  • Secure all DNS, WINS and DHCP servers as well.
  • All mission critical server clusters should be placed behind firewalls.
  • Use the firewall configuration to control traffic that is directed to the cluster.
  • You should refrain from combining the cluster heartbeat messages with other network traffic.
  • Use only a few nodes to administer the server cluster.
  • The security features of Windows 2000, Windows Server 2003, and Active Directory can be used to secure applications hosted in the cluster.
  • Assign NTFS file system permissions on the server cluster to secure data.
  • Through configuring NTFS file system permissions, ensure that only members of the Administrators group and the Cluster Service account have access to the cluster quorum disk.
  • The Cluster Service account should not be used to run applications.
  • You should use a unique cluster service account to administer each individual cluster. This would ensure that if one account is compromised, it cannot be used on all clusters.
  • Use domainlets if you want finer control over the security boundary for the server cluster.
  • You should regularly audit activities on the cluster.

How to create a new cluster

  • 1. Verify that only one node is connected.
  • 2. Ensure that the node can access the shared storage device.
  • 3. Ensure that the network interfaces have names and IP addresses.
  • 4. Log on to the domain.
  • 5. Click Start, Administrative Tools, and then click Active Directory Users and Computers to open the Active Directory Users and Computers management console.
  • 6. Navigate to the Users container.
  • 7. Create a Cluster Service user account.
  • 8. Close the Active Directory Users and Computers.
  • 9. Click Start, Administrative Tools, and then click Cluster Administrator to open the Cluster Administrator management console.
  • 10. On the Open Connection to Cluster dialog box, select the Create new cluster option from the Action drop-down list. Click OK.
  • 11. The New Server Cluster Wizard initiates.
  • 12. Click Next on the New Server Cluster Wizard Welcome screen.
  • 13. On the Cluster Name and Domain page, provide a name for the cluster in the Cluster name text box, and specify the domain in the Domain drop-down list box. Click Next.
  • 14. On the Select Computer page, provide the name of the first computer which will be the initial node in the new cluster. Click Next.
  • 15. On the Analyzing Configuration page, use the buttons available to determine what activities the Wizard performed to verify the node. Click Next.
  • 16. On the IP Address page, enter the IP address for the new cluster in the IP Address box, and then click Next.
  • 17. When the Cluster Service Account page opens, enter the user name, password and domain details of the cluster service account. Click Next.
  • 18. The Wizard now shows the configuration for the new cluster on the Proposed Cluster Configuration page.
  • 19. Click the Quorum button to select the quorum disk. Click OK.
  • 20. The Wizard next starts to create the new server cluster.
  • 21. When the Creating the Cluster page appears, click Next.
  • 22. Click Finish to close the Wizard.
  • 23. The Cluster Administrator management tool opens.
  • 24. The new cluster is displayed in the Cluster Administrator management tool.
  • 25. To configure properties for the new cluster, right-click the cluster and then select Properties from the shortcut menu.
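
On Windows Server 2003, the cluster.exe utility added a /create switch, so the wizard steps above can also be scripted. This is a sketch only; the cluster, node, network, and account names are placeholders, and the exact /ipaddr format should be confirmed against your version of the tool:

```shell
rem Form a new single-node cluster. All names, the address, and the
rem account details below are placeholders for your environment.
cluster MyCluster /create /node:NodeA /ipaddr:192.168.1.50,255.255.255.0,"Public" /user:MYDOMAIN\clustersvc /pass:P@ssw0rd /verbose
```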

Managing Clusters

The following mechanisms are available for managing a cluster:

  • Cluster Administrator (GUI tool): Cluster Administrator is the main tool used to manage and troubleshoot the cluster. Cluster Administrator is installed on each node in the cluster. You can also install Cluster Administrator on a computer that does not belong to the cluster if you want to remotely administer the cluster.
  • Cluster.exe (command-line utility): If you want to perform administrative tasks for the cluster from the command-line, then you can use Cluster.exe.
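
A few representative Cluster.exe commands are shown below; MyCluster is a placeholder cluster name:

```shell
rem List the clusters in the domain.
cluster /list

rem Show the state of every node, group, and resource in a cluster.
cluster /cluster:MyCluster node /status
cluster /cluster:MyCluster group /status
cluster /cluster:MyCluster resource /status
```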

The administrative tasks which you can perform for the cluster through Cluster Administrator are listed here:

  • View information on the state of the cluster.
  • View the properties of the default Cluster group, and the default Disk group.
  • Change the name of the cluster. For a cluster name change to take effect, you first have to take the Cluster Name resource offline and then bring it online.
  • Create resources for the cluster, and assign resource dependencies.
  • Delete resources. You can, though, only delete a resource once all resources that depend on it are deleted.
  • Create new resource groups for the cluster, and configure the failover and failback policy for the resource group.
  • Delete resource groups for the cluster. The resources of a resource group are deleted when a group is deleted.
  • Add applications to the cluster: You can initiate the Cluster Application Wizard from Cluster Administrator if you want to add applications to the cluster.
  • Change ownership for a resource group. Resources can be moved from one resource group to another group, and you can move a resource group from one node in the cluster to another node in the cluster. You would normally change ownership for a resource group when maintenance tasks need to be performed for the cluster.
  • Change properties of existing resources and resource groups. You can also rename existing resources and resource groups.
  • Change the state of resource groups. When you change the state of a resource group to either online or offline, the resources of the particular group are automatically updated to reflect the new state.
  • Configure the location of the Quorum resource and change the default size of the Quorum log file.
  • Initiate a failure for the cluster. This would usually be done to test the configured failover policies, and to test how resources restart.
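
For example, a resource failure can be initiated from the command line to exercise the failover policy. The resource name below is a placeholder:

```shell
rem Deliberately fail a resource to test how the group restarts or fails over.
cluster /cluster:MyCluster resource "App Resource" /fail
```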

How to create a new resource group

  • 1. Click Start, Administrative Tools, and then click Cluster Administrator.
  • 2. When the Open Connection To Cluster dialog box opens, enter the name of the cluster that you want to add a new group for.
  • 3. Click Open.
  • 4. Right-click Groups, and select New and then Group from the shortcut menu.
  • 5. The New Group Wizard initiates.
  • 6. In the Name box, enter a name for the new group.
  • 7. In the Description box, provide a brief description for the new group. Click Next.
  • 8. Enter the nodes which are preferred owners for the new group in the Preferred Owners list box.
  • 9. Click Finish to create the new group.
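
The same group can be created with Cluster.exe; the cluster, group, and node names below are placeholders:

```shell
rem Create a new resource group and set its preferred owners.
cluster /cluster:MyCluster group "Accounting Group" /create
cluster /cluster:MyCluster group "Accounting Group" /setowners:NodeA,NodeB
```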

How to move a resource group to another node

  • 1. Click Start, Administrative Tools, and then click Cluster Administrator.
  • 2. Expand the node that contains the resource group which you want to move.
  • 3. Click Active Groups.
  • 4. Double-click Groups.
  • 5. Right-click the resource group which you want to move, and then select Move Group from the shortcut menu.
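
The equivalent Cluster.exe command is shown below; omitting the node name moves the group to the next possible owner. The names are placeholders:

```shell
rem Move a resource group to a specific node in the cluster.
cluster /cluster:MyCluster group "Accounting Group" /move:NodeB
```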

How to create a file share resource

  • 1. Click Start, Administrative Tools, and then click Cluster Administrator.
  • 2. Expand the Groups folder.
  • 3. Right-click the group that will contain the file share (named Cluster Printer in this example), and select New and then Resource from the shortcut menu.
  • 4. When the New Resource dialog box opens, provide a Name, Description, Resource Type, and Group. Click Next.
  • 5. Enter the appropriate nodes in the Possible Owners list. Click Next.
  • 6. Add the dependency resource in the Resource Dependencies list, and then click Next.
  • 7. When the File Share Parameters dialog box opens, provide Share Name, Path, and Comment information.
  • 8. Click Finish.
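
The steps above can be scripted with Cluster.exe. The sketch below assumes a group named Accounting Group and a shared disk resource named Disk R:, both placeholders; ShareName, Path, and Remark are the standard private properties of the File Share resource type:

```shell
rem Create the File Share resource in an existing group.
cluster /cluster:MyCluster resource "Accounting Share" /create /group:"Accounting Group" /type:"File Share"

rem Make the share depend on the physical disk that holds the folder.
cluster /cluster:MyCluster resource "Accounting Share" /adddep:"Disk R:"

rem Set the share's private properties, then bring it online.
cluster /cluster:MyCluster resource "Accounting Share" /priv ShareName=Accounting Path=R:\Accounting Remark="Accounting data"
cluster /cluster:MyCluster resource "Accounting Share" /online
```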

How to create a virtual server

  • 1. Click Start, Administrative Tools, and then click Cluster Administrator.
  • 2. When the Open Connection To Cluster dialog box opens, enter the name of the cluster, and then click Open.
  • 3. Right-click Groups, and select New and then Group from the shortcut menu.
  • 4. The New Group Wizard initiates.
  • 5. In the Name box, enter a name for the new group.
  • 6. In the Description box, provide a brief description for the new group. Click Next.
  • 7. Enter the nodes which are preferred owners in the Preferred Owners list box.
  • 8. Click Finish to create the new group.
  • 9. To create an IP Address resource, in Cluster Administrator, expand the Groups folder.
  • 10. Right-click Virtual Server, and select New and then Resource from the shortcut menu.
  • 11. When the New Resource dialog box opens, provide a Name, Description, Resource Type, and Group. Click Next.
  • 12. Enter the appropriate nodes in the Possible Owners list. Click Next.
  • 13. Ensure that the Resource Dependencies list contains no information. Click Next.
  • 14. In the TCP/IP Address Parameters dialog box, provide the Address, Subnet Mask, and Network information.
  • 15. Click Finish.
  • 16. To create a Network Name resource, in Cluster Administrator, expand the Groups folder.
  • 17. Right-click Virtual Server, and select New and then Resource from the shortcut menu.
  • 18. When the New Resource dialog box opens, provide a Name, Description, Resource Type, and Group. Click Next.
  • 19. Enter the appropriate nodes in the Possible Owners list. Click Next.
  • 20. In the Resource Dependencies list, add the IP Address resource. Click Next.
  • 21. Enter the information for the Network Name Parameters dialog box.
  • 22. Click Finish.
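
A virtual server can likewise be assembled from the command line. The sketch below uses placeholder names and addresses throughout, and assumes the standard private properties of the IP Address and Network Name resource types:

```shell
rem IP Address resource for the virtual server.
cluster /cluster:MyCluster resource "VS IP Address" /create /group:"Virtual Server" /type:"IP Address"
cluster /cluster:MyCluster resource "VS IP Address" /priv Address=192.168.1.60 SubnetMask=255.255.255.0 Network="Public"

rem Network Name resource, dependent on the IP Address resource.
cluster /cluster:MyCluster resource "VS Name" /create /group:"Virtual Server" /type:"Network Name"
cluster /cluster:MyCluster resource "VS Name" /adddep:"VS IP Address"
cluster /cluster:MyCluster resource "VS Name" /priv Name=ACCTSERVER

rem Bring the whole group online.
cluster /cluster:MyCluster group "Virtual Server" /online
```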

How to create a user account for managing the cluster

  • 1. Click Start, Administrative Tools, and then click Active Directory Users and Computers to open the Active Directory Users and Computers management console.
  • 2. Navigate to the Users container.
  • 3. Right-click Users, and the select New, and then User from the shortcut menu.
  • 4. Provide the necessary information for the First Name, Last Name, and User Logon Name text boxes. Click Next.
  • 5. In the Password and Confirm Password text boxes, provide the password for the new cluster user account.
  • 6. Enable the User Cannot Change Password checkbox.
  • 7. Enable the Password Never Expires checkbox.
  • 8. Click Next. Click Finish.

How to pause and resume a node

  • 1. Click Start, Administrative Tools, and then click Cluster Administrator.
  • 2. In the left pane, select the node which you want to pause.
  • 3. Select the Pause Node command from the File menu item.
  • 4. To resume the node that was paused, click Start, Administrative Tools, and then click Cluster Administrator.
  • 5. In the left pane, select the node which was paused.
  • 6. Select the Resume Node command from the File menu item.
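
The equivalent Cluster.exe commands are shown below; NodeB is a placeholder node name:

```shell
rem Pause NodeB so it stops accepting resource groups, then resume it.
cluster /cluster:MyCluster node NodeB /pause
cluster /cluster:MyCluster node NodeB /resume

rem Confirm the node's state.
cluster /cluster:MyCluster node NodeB /status
```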

How to perform maintenance on a node without evicting the node

  • 1. Click Start, Administrative Tools, and then click Cluster Administrator.
  • 2. In the left pane, select the node which you want to perform maintenance tasks for.
  • 3. Select the Pause Node command from the File menu item.
  • 4. In the Details pane, double-click Active Groups, and then for each group perform the following: Select the group, select the File menu, and then select the Move Group command.
  • 5. Proceed to do the necessary maintenance for the node which was paused.
  • 6. When done, open Cluster Administrator.
  • 7. In the left pane, select the node.
  • 8. Select the Resume Node command from the File menu item.

How to perform maintenance on a node by evicting the node

  • 1. Click Start, Administrative Tools, and then click Cluster Administrator.
  • 2. Stop Cluster Service on the node.
  • 3. Select the Evict Node command from the File menu item.
  • 4. Remove the node from the shared bus.
  • 5. Uninstall Cluster Service.
  • 6. Proceed to do the necessary maintenance tasks.
  • 7. When done, connect the node to the shared bus.
  • 8. Install Cluster Service, and then join the cluster.
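
Steps 2 and 3 can also be performed from the command line; NodeB is a placeholder node name:

```shell
rem Stop the Cluster Service on the node, then evict it from the cluster.
net stop clussvc
cluster /cluster:MyCluster node NodeB /evict
```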

How to change the size of the Quorum log

  • 1. Click Start, Administrative Tools, and then click Cluster Administrator.
  • 2. In the left pane, right-click the cluster name, and then select Properties from the shortcut menu.
  • 3. Switch to the Quorum tab.
  • 4. Change the size of the Quorum log in the Reset quorum log at box.
  • 5. Click OK.
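
If you prefer the command line, the quorum settings can be changed with the /quorumresource switch. The sketch below assumes the maximum log size is given in KB; verify the units against your version of cluster.exe. Disk Q: is a placeholder resource name:

```shell
rem Point the cluster at its quorum resource and raise the maximum
rem quorum log size (assumed to be in KB).
cluster /cluster:MyCluster /quorumresource:"Disk Q:" /maxlogsize:8192
```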

The Cluster Service Log File

When a Cluster Service event takes place, such as the creation of a new resource group, the event is written to the cluster log file. The cluster log contains information on each Cluster Service event generated by the cluster. Logging is enabled by default.

The cluster log file has a maximum size of 8 MB by default, and is located at %windir%\Cluster\cluster.log. When the maximum log file size is reached, event entries are removed from the log file in the order that they were added.

All cluster log entries have the following information:

  • The process ID and thread ID which resulted in the entry.
  • Timestamp
  • Event Description.
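
The log's location and maximum size are controlled by system environment variables. The sketch below assumes the ClusterLog and ClusterLogSize (size in MB) variables recognized by Cluster Service, and a system that has the setx utility; the Cluster Service must be restarted for the changes to take effect:

```shell
rem Set the log location and cap (placeholder path; size in MB)
rem as SYSTEM environment variables.
setx ClusterLog "C:\WINDOWS\Cluster\cluster.log" /m
setx ClusterLogSize 16 /m

rem Restart the Cluster Service so the new settings are picked up.
net stop clussvc
net start clussvc
```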

Because Cluster Service consists of a number of components that each perform specific functions for the cluster, a component event log entry contains information on the interoperation of Cluster Service's components. A resource DLL log entry, on the other hand, contains information that is specific to the resource groups within the cluster.

Information contained in a component event log entry includes the following:

  • The component which resulted in the event being logged.
  • The node's state when the event was logged.
  • The combined component and state.

There are also a few cluster log entries that include a status code, error code, or state code. A state code is associated with the following types of objects:

  • Network interfaces
  • Networks
  • Nodes
  • Resource groups
  • Resources

Troubleshooting Cluster Service

A few strategies which you can use to troubleshoot Cluster Service and server cluster issues are detailed in this section of the article.

For Cluster Service to operate, the shared SCSI bus must exist and the necessary SCSI devices must be connected. One device on the shared bus must act as the Quorum disk. Some System event log errors pertain to cluster SCSI device issues.

When troubleshooting SCSI device event log errors, you can use the list below as a guideline.

  • Internal termination in the BIOS of the controller should be disabled.
  • The Automatic SCSI bus reset option should be disabled.
  • The total cable length of the bus must not be greater than the maximum SCSI length defined by the manufacturer.
  • Check whether the cables and connector pins are physically damaged.
  • Check whether there are any loose connections.
  • Check that the driver and firmware versions are the same for each server that resides in the cluster.
  • Verify that the SCSI bus is properly terminated. Check for duplicate termination.
  • Check whether there are any duplicated SCSI IDs on the bus.
  • If one cluster node can connect to the cluster drives and another node cannot:

    • Ensure that the problematic node is connected to the cluster drive.
    • Check that the SCSI IDs are unique.
    • Check that the SCSI controllers are configured correctly. They should be transferring data at the same rate.
    • The same drive letters should be assigned for the drive on each node in the cluster.

When clients are unable to access resources in the cluster, verify the following:

  • For each cluster node, examine the errors in the System event log.
  • For each resource group that cannot be accessed, ensure that the group has an IP Address resource and a Network Name resource.
  • For clients to connect to a resource group, the IP Address resource and Network Name resource should be online.
  • Ensure that network connectivity exists to the node that owns the resources which cannot be accessed.
  • Ensure that the clients are using the proper IP address or network name to connect to the resource in the cluster.
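
A quick way to run through these checks from the command line (all names and addresses below are placeholders):

```shell
rem Confirm the group's IP Address and Network Name resources are online.
cluster /cluster:MyCluster resource "VS IP Address" /status
cluster /cluster:MyCluster resource "VS Name" /status

rem Test connectivity from a client to the virtual server.
ping 192.168.1.60
net view \\ACCTSERVER
```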

You can view the state of the network interfaces, and the state of the private and public networks, through Cluster Administrator.

A network interface can be in one of the following states:

  • Up; indicates that this interface can communicate with the other interfaces on the network.
  • Unavailable; indicates that the node that owns the interface is down.
  • Unreachable; indicates that this interface cannot communicate with the other interfaces on the network that are in the Up state.
  • Failed; indicates that this interface cannot communicate with any other network interface. This state is typically caused by network adapter, driver, or cable failures.

The private and public networks can be in one of the following states:

  • Up; indicates that the interfaces on the cluster can communicate.
  • Down; indicates that the interfaces on the cluster cannot communicate with each other, or with other hosts.
  • Partitioned; indicates that one or more interfaces on the cluster are in the Unreachable state, but at least two interfaces on the cluster can still communicate.
  • Unavailable; indicates that one or more interfaces on the cluster are unavailable.
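
Both sets of states can also be queried with Cluster.exe; MyCluster is a placeholder cluster name:

```shell
rem Show the state (Up, Down, Partitioned, Unavailable) of each cluster network,
rem and the state of each node's network interfaces.
cluster /cluster:MyCluster network /status
cluster /cluster:MyCluster netinterface /status
```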

When troubleshooting Quorum disk problems, use the strategies below:

  • If the Quorum disk failed and you are unable to start Cluster Service, start Cluster Service with no Quorum resource. Once Cluster Service starts, specify a new Quorum resource.
  • If the Quorum resource fails to start:

    • Ensure that all connections and cables are connected.
    • Ensure that SCSI devices are properly terminated.
    • Ensure that the devices on the SCSI bus are connected and operational.
  • If the Quorum log is corrupt, Cluster Service first tries to automatically reset the log. If Cluster Service fails to start because of the corrupt Quorum log, you have to manually reset the Quorum log.
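
The quorum-recovery strategies above correspond to service startup switches. The sketch below assumes the /fixquorum and /resetquorumlog switches of the Cluster Service (clussvc); Disk R: is a placeholder name for the replacement quorum disk:

```shell
rem Start Cluster Service without bringing the quorum resource online,
rem so a replacement quorum disk can be designated.
net start clussvc /fixquorum
cluster /cluster:MyCluster /quorumresource:"Disk R:"

rem If the quorum log is corrupt and Cluster Service cannot start,
rem reset the quorum log at startup.
net start clussvc /resetquorumlog
```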

When troubleshooting node problems, use the strategies below:

  • If the cluster is down, first attempt to bring one node online. Next, examine the logs for more detail on the problem.
  • If one node is down, first ensure that the resources of the failed node have failed over to another node in the cluster, then attempt to bring the failed node online and check the logs.
  • If a node fails after operating poorly, check whether the CPU is running close to 100 percent. You might have an overloaded CPU.
  • If Cluster Service does not initiate fail over when a node fails, check whether Cluster Service is performing an update. When updates are being performed, Cluster Service will not initiate the failover process.
  • If the resources fail back continually while your nodes are operational, check whether the power supply is failing. It is recommended that you use an uninterruptible power supply (UPS).
  • If one node cannot access all drives:

    • Check the cabling between the drive and the node.
    • Check the shared drive from another node.
    • Check the configuration of the cluster.
    • Check whether you can access the drive from a different node.
  • If you cannot connect to a node using Cluster Administrator, check whether Cluster Service and the RPC service are running, and whether the node is online.
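
These checks can be run remotely with standard Windows tools; NodeB is a placeholder node name:

```shell
rem Verify the Cluster Service and the RPC service are running on the node.
sc \\NodeB query clussvc
sc \\NodeB query rpcss

rem Verify the node itself is reachable on the network.
ping NodeB
```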

When troubleshooting resource group problems, use the strategies below:

  • If you cannot bring a resource group online:

    • Ensure that the disk can be accessed.
    • Check whether hardware issues or configuration issues exist with the disk resources of the problematic group.
    • Check whether all dependencies of the resource have been specified.
    • Move the resource group to another node, and then check whether it can be brought online.
    • Attempt to bring each resource of the group online one at a time.
  • If a resource group does not failover to another node in the cluster:

    • Ensure that the resource's Affect The Group option is selected.
    • Ensure that the node is specified in the Possible Owner list of the resource.
  • If a resource group fails over but does not restart:

    • Check that the node is online.
    • Check the information in the Possible Owner list of the group and the resources.
    • Try to pinpoint the resource which is problematic by bringing each resource online one at a time.
  • If a resource group does not fail back:

    • Verify that the node which you expect the resource to fail back to is defined as the preferred owner of the resource group.
    • Verify that the Allow Failback option is selected.
    • Verify that the Prevent Failback checkbox is clear.

When troubleshooting resource problems, use the strategies below:

  • If you cannot bring a resource online:

    • Check whether the application is installed.
    • Check whether the resource is configured correctly.
    • Verify that the resource can run with Cluster Service.
  • If a specific resource does not fail over:

    • Check that the device is configured properly.
    • Check that the device and cables are operational.
  • If a resource does not fail back:

    • Check that the hardware is working.
    • Verify that the network connections are operational.
    • Check the configuration of the failback policy.
  • If a failed resource does not come online again:

    • Check that the Do Not Restart option of the resource is not selected.
    • Check whether the failure threshold of the resource has been reached.
    • Check whether there are any dependencies of the resource which are offline.
    • Check that all dependencies of the resource have been properly configured.
  • When you have IP Address resource problems, ensure that the IP address is unique. Next, ensure that the subnet mask defined is correct. You can use the Ping utility to test the IP Address resource.
  • When you have Network Name resource issues, verify that the IP Address resource dependency of the Network Name resource is configured correctly. The IP Address resource dependency should be online.
  • When you have Print Spooler resource problems, verify that the Physical Disk resource and Network Name resource dependencies of the Print Spooler resource are configured correctly and online. Check whether any NTFS permissions are preventing access. The issue might also be caused by a full disk hosting the spool directory.
  • When you have File Share resource issues, ensure that the Network Name resource and the Physical Disk resource dependencies and all other dependencies are operational. Ensure that the file share's directory was created, and that it can be accessed. Check whether there are any NTFS permissions which are preventing access to the file share.