|
For any operating system, whether or not it is Open Source, to be considered a full-fledged enterprise OS, it must be able to provide high availability and data integrity for commercial applications built on relational databases. The ability to support applications that are dynamically writing and updating data is the sine qua non for enterprise computing. Of secondary importance is support for IT infrastructure such as Web, file, and print services.
This review serves as an introduction to what will become a long series on clustering. In the longer run, the cluster will become the standard infrastructure on which OpenBench Labs evaluates all enterprise-class hardware and software. In particular, we will be revisiting such applications as Anteil's Open Source CRM package, which is built on MySQL and Apache. This is precisely the class of application for which clusters were designed. Similarly, the ability of systems-management applications to handle the intricacies of clusters and SANs will increasingly become a requirement in future reviews.
For our first foray into Linux clustering, OpenBench Labs chose to install Convolo Cluster 1.2 from Mission Critical Linux. Convolo is the commercial version of Kimberlite, an Open Source clustering project sponsored by Mission Critical Linux. The Convolo clustering package offers a more extensive GUI with which to configure and monitor the cluster. In addition, Convolo provides full NFS support, including file locking and server failover.
Convolo provides a very rich and robust environment. A substantial number of options directly affect the robustness of a Convolo Cluster as a high-availability solution. As a result, Mission Critical defines the base requirements as suitable for staging development cluster scenarios. The extra options include serial-line control over the power supply of each cluster member's partner (Convolo 1.2 clusters are currently limited to two members). This option gives a cluster member the capability to power-cycle its partner if it senses that the system is hung.
Clearly, such an option is complex enough to support a review on its own merits. Having started with only a limited number of the high-availability options that Convolo Clusters support, we will continue to refine our cluster in future reviews in the kind of staging process that most sites experience.
Convolo Clusters support both active-active and hot-standby configurations. In an active-active configuration, the two nodes run different services, all of which appear to be running on the single logical cluster alias. In our base-test configuration, OpenBench Labs set up NFS services on node Tuxilla1 and MySQL services on node Tuxilla2. Active-active is a much more sophisticated processing model than hot standby and requires extensive mechanisms to ensure the correctness of the cluster state.
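To make the active-active model concrete, here is a minimal Python sketch of service placement and failover in a two-node cluster. The node and service names come from our test configuration; the `fail_over` function and its relocation rule are our own illustration and are not Convolo code or its actual policy.

```python
def fail_over(placement, failed_node):
    """Return a new service-to-node map with the failed node's services relocated.

    Illustrative rule only: everything owned by the failed node moves to the
    first surviving node (in a two-member cluster there is exactly one).
    """
    survivors = sorted(set(placement.values()) - {failed_node})
    if not survivors:
        raise RuntimeError("no surviving node to take over services")
    target = survivors[0]
    return {
        service: (target if node == failed_node else node)
        for service, node in placement.items()
    }


# Base-test placement described above: each node runs a different service.
placement = {"nfs": "Tuxilla1", "mysql": "Tuxilla2"}
after_failure = fail_over(placement, "Tuxilla2")
```

In this sketch, losing Tuxilla2 leaves both services on Tuxilla1, which is the behavior clients see behind the single cluster alias.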
Convolo Clusters utilize extensive polling of nodes to track cluster-state information. The systems administrator can define any number of optional polling interconnects, but at least one must be defined, and it can be the Local Area Network. Typically, a production cluster polls across both a private Ethernet connection and an RS-232 serial line to increase the configuration's resilience in the face of interconnect failures.
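The value of redundant polling interconnects can be sketched in a few lines of Python. This is a simplified model, not Convolo's implementation: the `Interconnect` class, the `partner_status` function, the status labels, and the five-second timeout are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Interconnect:
    name: str          # illustrative label, e.g. "lan", "private-eth", "serial"
    last_reply: float  # time (in seconds) of the partner's last heartbeat reply


def partner_status(interconnects, now, timeout=5.0):
    """Classify the partner node from the age of the last reply on each channel."""
    if not interconnects:
        raise ValueError("at least one polling interconnect must be defined")
    alive = [ic.name for ic in interconnects if now - ic.last_reply <= timeout]
    if len(alive) == len(interconnects):
        return "up", alive
    if alive:
        # The partner still answers somewhere, so only an interconnect has failed.
        return "degraded", alive
    # Every channel is silent: polling alone cannot tell a dead partner
    # from a total interconnect failure.
    return "unknown", alive
```

With a LAN reply two seconds old and a serial reply ten seconds old, the sketch reports the partner as "degraded" rather than down. Only when every channel is silent does the state become "unknown".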
Unlike the developers of most competing cluster schemes, the developers of Convolo also asked the question: What happens when all of the polling interconnects fail? If this sounds like questioning how many angels can dance on the head of a pin, you may be right. Nonetheless, there is an even more important reason for asking this question. When Convolo Clusters support far more nodes than their current two, how does a node attempting to join the cluster quickly learn the state of all of the other nodes without a potentially long polling period?
The answer to this question is the use of a Quorum partition, which stores cluster-state information and is used by the cluster members to validate cluster membership in the event that all polling interconnects fail. A key feature of VMSclusters, a Quorum volume provides concurrent access to all servers in the cluster via a raw-disk partition. The Quorum volume is accessed as a raw device and not a mounted file system to eliminate any cache-timing issues. On a Convolo Cluster, the administrator must configure two raw-disk partitions, which the cluster mirrors.
Each server uses the cluster-state information on the two Quorum volumes, which includes node status, service definitions, activity time stamps, and node-locking status, to validate cluster membership. As a result, cluster members discover services dynamically through the Quorum verification mechanism or at boot time. If a server cannot access either Quorum partition, it will not join the cluster.
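The membership check can be illustrated with a short Python sketch. The record layout, the CRC32 checksum, and the function names below are hypothetical, since Convolo's actual on-disk format is not described here. The two raw partitions are simulated as byte strings, with `None` standing in for a partition that cannot be read. The sketch also assumes that one readable copy is enough to join, which is our reading of the rule above.

```python
import struct
import zlib

# Hypothetical record: node id, status flag, activity time stamp, CRC32 of those fields.
BODY = struct.Struct("<IId")
RECORD = struct.Struct("<IIdI")


def pack_record(node_id, status, timestamp):
    """Serialize one cluster-state record with a trailing checksum."""
    body = BODY.pack(node_id, status, timestamp)
    return body + struct.pack("<I", zlib.crc32(body))


def read_record(raw):
    """Return (node_id, status, timestamp), or None if the copy is unreadable or corrupt."""
    if raw is None or len(raw) < RECORD.size:
        return None
    node_id, status, timestamp, crc = RECORD.unpack(raw[:RECORD.size])
    if zlib.crc32(BODY.pack(node_id, status, timestamp)) != crc:
        return None
    return node_id, status, timestamp


def may_join(primary, mirror):
    """Allow a join only if a quorum copy is readable; prefer the newest record."""
    copies = [r for r in (read_record(primary), read_record(mirror)) if r is not None]
    if not copies:
        return False, None
    return True, max(copies, key=lambda record: record[2])
```

Reading the partitions as raw bytes, rather than through a file system, mirrors the reason given above for using raw devices: no cached copy stands between a node and the current cluster state.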
The need to support concurrent access to mirrored Quorum volumes adds a great deal of complexity for the hardware vendor. As a result, for correct handling of storage bus reset conditions, Mission Critical Linux strongly recommends that the shared storage be configured using a multiport storage controller with single-initiator SCSI bus interconnects (Fibre Channel is also supported). With only one cluster node connected to it, a single-initiator SCSI bus provides host isolation, enhances performance, and ensures that each cluster system is protected from disruptions due to workload or initialization on the other system.
In contrast, Wolfpack Windows NT/2000 clusters typically utilize a shared multi-initiator SCSI bus. This scheme makes it difficult to configure the shared SCSI storage or to terminate the bus correctly and complicates removal of a node from an operating cluster. As a result, SCSI host-bus adaptors must be carefully tested and qualified for use in a Wolfpack cluster.
For the OpenBench Labs Convolo Cluster, we chose the multiport FlashDisk RAID Controller from Winchester Systems in conjunction with QLogic's QLA12160 HBAs. The FlashDisk RAID can be thought of more as a RISC-based storage server that sits on a SCSI bus than as a "controller." After all, with its PowerPC processor, a FlashDisk has more in common with an RS6000 than with a board-level controller.
At the heart of the FlashDisk is a PowerPC processor, which oversees a highly configurable, and hence very complex, storage subsystem. In a standard rack-mount configuration, the FlashDisk can be configured with 12 disk drives (18-GB Seagate Cheetah X15 drives with an Ultra SCSI interface), four SCSI channels for drives (three drives per channel), four SCSI channels for host connections, and 256 MB of RAM for an I/O cache; everything is configurable.
The FlashDisk provides enough tools to allow the systems administrator to configure the physical drives in any number of array configurations and then to present this array in an equally broad range of logical configurations. OpenBench Labs chose to configure all of the drives in a single RAID1+0 array, which sacrifices storage capacity to gain both read and write performance. Next, we partitioned the RAID array as four logical drives. For the shared Quorum volumes, we created two 10-MB partitions. The remaining 108 GB was split into two very large logical volumes. A disk volume can be used by only one cluster service, which essentially owns the volume, and we had two services. In addition, the Linux device names must be the same for each shared-storage volume on each cluster system. Cluster services are then free to migrate among nodes without the need for a distributed lock manager.
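The capacity trade-off of RAID1+0 is easy to check with a little arithmetic, sketched here in Python. The figures are nominal drive capacities in decimal units and ignore any formatting or controller overhead, and the function name is ours.

```python
def raid10_usable_gb(drive_count, drive_gb):
    """Usable capacity of a RAID 1+0 array: every block is mirrored, so half the raw space."""
    if drive_count < 2 or drive_count % 2:
        raise ValueError("RAID 1+0 needs an even number of drives")
    return drive_count * drive_gb / 2


raw_gb = 12 * 18                          # twelve 18-GB drives
usable_gb = raid10_usable_gb(12, 18)      # mirroring halves the raw capacity
quorum_gb = 2 * 10 / 1000                 # two 10-MB Quorum partitions
per_volume_gb = (usable_gb - quorum_gb) / 2

print(f"raw {raw_gb} GB, usable {usable_gb:.0f} GB, "
      f"each data volume about {per_volume_gb:.0f} GB")
```

On these nominal figures, the twelve drives yield 216 GB raw and 108 GB usable. The Quorum partitions are negligible by comparison, which leaves roughly 54 GB for each of the two data volumes.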
The next step was to specify exactly how the four drives would be presented to the cluster nodes. Tuxilla1 and Tuxilla2 were connected to channels 4 and 5, the first host I/O channels on the FlashDisk. We assigned the four partitions to each of these channels so that the logical drives representing the partitions would be accessible to both cluster nodes. Note that in this configuration the nodes share the same cluster storage but do not share a single SCSI bus, the way the nodes in a Wolfpack cluster do. Finally, on both host SCSI channels, we assigned LUN 0 of the device IDs 0, 1, 2, and 3 to the four partitions, respectively.
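The resulting presentation can be written down as a small table, sketched here as a Python data structure. This is not the FlashDisk's configuration syntax, and the partition labels are placeholders of our own. The channel numbers, device IDs, and LUN come from the setup described above.

```python
# Placeholder labels for the four logical drives carved from the array.
PARTITIONS = ["quorum-primary", "quorum-mirror", "volume-a", "volume-b"]

# Host I/O channels on the FlashDisk and the node attached to each.
HOST_CHANNELS = {4: "Tuxilla1", 5: "Tuxilla2"}


def build_map():
    """Map (SCSI ID, LUN) to a partition on every host channel: IDs 0-3, LUN 0."""
    return {
        channel: {(scsi_id, 0): part for scsi_id, part in enumerate(PARTITIONS)}
        for channel in HOST_CHANNELS
    }


def identical_views(mapping):
    """True when every host channel presents the same ID/LUN-to-partition layout."""
    views = list(mapping.values())
    return all(view == views[0] for view in views)


lun_map = build_map()
```

Presenting an identical layout on both channels is one ingredient in having each node see the shared volumes under the same Linux device names. The names also depend on each host's own device ordering, which this sketch does not model.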
Once all of this housekeeping was complete, each QLogic HBA in Tuxilla1 and Tuxilla2 could see four drives connected at boot time. We were now ready to finish the cluster configuration. This involved building two raw devices for the Quorum volumes and configuring three heartbeat interconnects: the local LAN, a private Ethernet connection, and a serial connection. All that remained was either to run the standard console scripts or to invoke the Convolo GUI to configure the cluster services.
With the cluster up and running, we were able to test the performance of the Winchester FlashDisk RAID Controller with the OpenBench Labs lload benchmark. The performance characteristics of the FlashDisk were quite spectacular. Configured to bias caching towards random I/O, the FlashDisk proved to be an exceptional engine for database applications. Sustained I/O throughput from a single cluster node, measured in I/Os per second, was 3.5 times greater than that of a board-level RAID controller.
In the coming months, we'll be looking at how additional high-availability options for Convolo Clusters affect the failover and migration of software services. We'll also be looking at integration of Fibre Channel storage and the issues encountered when integrating a cluster into a SAN.
OPENBENCH LABS SCENARIO
UNDER EXAMINATION
Linux Cluster performance and functionality
WHAT WE TESTED
Mission Critical Linux Convolo Cluster 1.2b
www.missioncriticallinux.com
Winchester Systems FlashDisk RAID Controller
www.winsys.com
HOW WE TESTED
(2) Dell PowerEdge 2400 Servers
www.dell.com
Red Hat Linux v7.0
www.redhat.com
OpenBench Labs lload v1.0 benchmark
KEY FINDINGS
- The use of Quorum disks as shared raw devices dictates the use of multiported external hardware RAID devices.
- Software services own their respective "shared" SCSI volume, which simplifies failover and facilitates the migration of services in an active cluster.
- Extensive high-availability options are tied to sophisticated polling options and the presence of cluster-state information resident on the Quorum volumes.
|