back to reality
My “holiday” from my normal work is finally over, after five weeks; it was extended by an extra week when, after I returned from Germany, my boss needed some emergency NPI work done. NPI is short for New Product Introduction, which means that something new is hitting the market and we have to figure out what it takes to support it. In this case it was a new Linux-based high-availability cluster; we had already ordered the basic server hardware, but some add-on bits were missing, so I had to scrounge those up. Could be worse – actually having the primary hardware makes a nice change from the normal “muddle through” situation.
Then there was the software installation: a fairly standard version of SuSE Linux with the addition of “our” bits, supplied as a bootable Altiris image (same principle as Symantec Ghost). I say “our” in quotes because this particular Linux cluster software was not made by my company, but bought in from an OEM. My first attempt to burn a DVD failed with a “power calibration error”; then I remembered that I could use the image directly by extracting the ISO to a server and burning it from there under DOS – which worked. I hardly qualify as a Linux expert, SuSE or otherwise, but this side of the product is fairly standard, and I had no problems getting the servers ready. Then the cluster software and shared storage… oh hell.
A “high availability” cluster like this – as opposed to a “compute cluster” like Beowulf – uses shared SAN disks – disks that are visible to more than one server. This is an inherently dangerous thing to do without cluster software to manage access to the disks, but it’s about the only way to provide clients with quick access to data, with little or no interruption to normal activity.
Microsoft’s Cluster Services (MSCS) is the best-known “high availability” cluster software in general, and Linux has similar systems, such as Linux-HA. You can make analogies with the real world, such as a railway junction, or planes taking off from and landing on the same runway: without some kind of traffic control system, collisions will happen, and that traffic control is what cluster software provides for shared disks. A “disk collision” means lost or corrupted data, so write access in particular must be strictly controlled.
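To make the “traffic control” idea concrete, here is a toy Python sketch of reservation-style write control for a shared disk. The class and method names are mine, loosely inspired by SCSI reservations; no real cluster software works exactly like this, it just shows the principle that only one node at a time is allowed to write.

```python
import threading

class SharedDisk:
    """Toy model of a shared LUN: only the node holding the
    reservation may write. Illustrative only, not a real cluster API."""

    def __init__(self):
        self._lock = threading.Lock()   # serialises metadata updates
        self._owner = None              # node currently holding the disk
        self.blocks = {}                # block number -> data

    def reserve(self, node):
        """Try to take the reservation; True if this node now holds it."""
        with self._lock:
            if self._owner is None:
                self._owner = node
            return self._owner == node

    def release(self, node):
        with self._lock:
            if self._owner == node:
                self._owner = None

    def write(self, node, block, data):
        """Refuse writes from any node that doesn't hold the reservation."""
        with self._lock:
            if self._owner != node:
                raise PermissionError(f"{node} does not hold the reservation")
            self.blocks[block] = data

disk = SharedDisk()
disk.reserve("node-a")                   # node-a gets the disk
disk.write("node-a", 0, b"payload")      # allowed
try:
    disk.write("node-b", 0, b"clobber")  # node-b is fenced off
except PermissionError as e:
    print(e)
```

Without the reservation check, both nodes would happily scribble on the same blocks – that is the “disk collision” the real software exists to prevent.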
This software has some very specific requirements for the shared storage, which are poorly documented and which took some trial and error to figure out. For example, when software reports that a disk has a negative size, experience tells me that a size counter has overflowed – the software was written to handle disks only up to a certain size – so I made smaller ones. (You can do that with virtualized storage systems, where you can create volumes of arbitrary size from a pool of normal disks, each of a fixed size.)
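As a small illustration of where a “negative size” can come from: if a tool keeps the disk size as a signed 32-bit sector count (that this particular software does so is my guess, not anything documented), any disk past the counter’s limit wraps around to a negative number. A Python sketch:

```python
# Toy illustration: a sector count that overflows a signed 32-bit
# counter shows up as a negative number. The 32-bit assumption is
# mine; the real software's internals are not documented.

SECTOR = 512  # bytes per sector, the traditional unit

def as_int32(n):
    """Reinterpret an integer as a signed 32-bit value."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

disk_bytes = 3 * 2**40            # a hypothetical 3 TiB volume
sectors = disk_bytes // SECTOR    # 6442450944 sectors, > 2**31 - 1
print(as_int32(sectors))          # prints -2147483648
```

Carving smaller volumes out of the virtualized storage pool keeps the sector count under the limit, which matches what fixed the problem for me.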
Once I got past that and other humps, it went fairly smoothly, but I know what I’m doing (and have access to the people who put it all together). What would a customer do in the situations I found myself in? That’s the position I have to keep imagining myself in. I would not say this product is ready for the market in the form in which I’m seeing it – yet it already is on the market, hence the “emergency” nature of the work. We just haven’t sold any yet!