OS X Leopard and Snow Leopard servers have the interesting ability to cluster the mail services. The requirements to do such a thing carry a somewhat hefty price tag both in hardware and software, but it can be done. To make this go you’ll need to set up a fabric with some storage and make use of Apple’s clustered file system: Xsan.
We’re going to cover the setup of all this mess later down the line. For now, I thought I’d share with you a mishap that can occur when clustering mail services and how to get out of the snare without reinstalling your servers.
Mail services on Snow Leopard make use of several standard open source projects: dovecot (for IMAP and POP), postfix (for the mail transfer agent) and amavis/clamav (for message hygiene). These packages are available on just about any regular Linux or UNIX setup. Snow Leopard’s Server Admin utility generally takes the eye-bleeding work out of administering these services. Hey, that’s what makes Apple servers great, right? All of these packages come preconfigured and require very little in the way of administrative skills. That’s true until something goes wrong.
Imagine if you will… you’ve set up 4 Snow Leopard servers and clustered them with Xsan to produce a two-node active/active mail cluster. (Like I said, we’ll cover how to set that up another time). Now let’s imagine the horror of horrors: your Xsan implementation has absolutely, positively clocked out. Let’s say it’s clocked out so badly that you’re looking at an extended outage while you troubleshoot the issue.
Keep in mind that clustering the Mail services on Snow Leopard requires Xsan. There’s no getting around that. You just cannot have these services tagging the same data volumes without it. That’s Xsan’s job, after all… to manage all of that hilarity. Also keep in mind that Xsan requires some pretty heady thinking to avoid a single point of failure. If you manage to introduce a single point of failure into your deployment you may be facing this situation. Hopefully, you’ve implemented some type of backup strategy for your Xsan deployment. If that’s the case, chances are pretty good that you have a copy of the clustered Mail stores on another, non-Xsan volume that can be mounted and recovered quickly.
As a matter of fact, if you have such a volume with the data intact and your Xsan appears to be out of commission for a while, that’s where we are. You’ve mounted your backup volume onto a single server in the Mail cluster and have decided… Well, my Xsan is hosed so I can at least get the data mounted on a single server while I work on that, right?
If you were to open Server Admin at this point and try to reconfigure the mail cluster, you may start to bite your nails. To do this, you would open up Server Admin, click “Mail” on the left, then the “Settings” gear at the top. You would choose the “Advanced” tab and the “Clustering” subtab. You’ll note there that your Mail server is still configured as a clustered server. Normally, to change this configuration, you would click “Change” and walk through the wizard to set it back to a standalone server. However, I’m here to tell ya… it will fail.
You see, if your Xsan volume is not available (which it isn’t in our scenario here), then Server Admin will do its best to reconfigure the mail cluster to standalone at your behest… and then error out. The error will read something akin to “FILE_NOT_FOUND_ERR for action ‘setState.’”
So what is going on here? After running through this wizard multiple times and trying different configurations you may notice several things. Many of the options you choose will not stick. If you were to monitor what is going on in /var/log/system.log, you may also notice that the bloody thing keeps trying to mount your Xsan volume. Yeah, it doesn’t matter if the volume is available or not – Server Admin expects to find it there to revert the configuration.
Now what?
Thanks to the magic of fs_usage in Terminal, you can figure out what Server Admin (and servermgrd, the daemon that does its work behind the scenes) is trying to do. I’ll leave out the icky details on how I arrived at this conclusion, but I’ll hammer out what Server Admin did in the first place and what it’s trying to do now.
When you first set up a mail server in a clustered state, Server Admin performs several actions:
Postfix:
- Renames the directory “/etc/postfix” to “/etc/postfix.cl-rst”
- Renames the directory “/var/spool/postfix” to “/var/spool/postfix.cl-rst”
- Links the directory “/etc/postfix” to a hidden directory on your Xsan volume
- Links the directory “/var/spool/postfix” to a hidden directory on your Xsan volume
Dovecot:
- Renames the directory “/etc/dovecot” to “/etc/dovecot.cl-rst”
- Renames the directory “/var/spool/dovecot” to “/var/spool/dovecot.cl-rst”
- Links the directory “/etc/dovecot” to a hidden directory on your Xsan volume
- Links the directory “/var/spool/dovecot” to a hidden directory on your Xsan volume
AmavisD:
- Renames the file “/etc/amavisd.conf” to “/etc/amavisd.conf.cl-rst”
Note: the hidden directory referenced above is at the root of your Xsan volume. It’s named after the clustered server name you created. For instance, if you created a mail cluster called “MailXsan”, then the directory is physically located at “/Volumes/MailXsan/.MailXsan” and the configuration items listed above will be found deep inside there.
General plist file:
- Configures the file “/etc/MailServicesOther.plist” with information that tells Server Admin you’re in a clustered setup and provides information about the Xsan volume.
That last file was the hardest to figure out. It didn’t take much to realize that when you create the cluster, the old config files and directories are renamed to “*.cl-rst” and the new configurations are symlinked out to the Xsan volume. However, even after restoring those *.cl-rst files, the Server Admin configuration utility would continue to bomb with the same error. It took some tasty analysis with “fs_usage -w -f filesys | grep servermgrd” to figure out that last plist file.
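If you want to see where your own server stands before touching anything, a read-only check along these lines might help. This is only a sketch: the function name and the optional root argument are made up for illustration (the argument exists so you can point it at a scratch copy), and the path list simply mirrors the items described above.

```shell
#!/bin/bash
# Sketch: report, for each mail config path, whether it is a symlink,
# a real file/directory, or missing, and whether a ".cl-rst" copy exists.
# Nothing here modifies the system.

# Paths relative to the root, taken from the list in this article.
mail_cluster_paths="etc/postfix var/spool/postfix etc/dovecot var/spool/dovecot etc/amavisd.conf"

# report_cluster_state [ROOT]
# ROOT is a hypothetical prefix; leave it empty to inspect the live system.
report_cluster_state() {
    local root="${1:-}"
    local rel path
    for rel in $mail_cluster_paths; do
        path="$root/$rel"
        if [ -L "$path" ]; then
            # readlink prints the target even when the target is gone
            echo "$path: symlink -> $(readlink "$path")"
        elif [ -e "$path" ]; then
            echo "$path: real file or directory"
        else
            echo "$path: missing"
        fi
        if [ -e "$path.cl-rst" ]; then
            echo "$path.cl-rst: backup present"
        else
            echo "$path.cl-rst: backup missing"
        fi
    done
}
```

Calling `report_cluster_state` with no argument looks at the real /etc and /var/spool. On a server that is still in the clustered state, you would expect to see symlinks pointing into the hidden directory on the Xsan volume alongside their “.cl-rst” counterparts.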
How does one unravel all of this? Again, assuming your Xsan volume is no longer available, go through the above configurations and reset things to the prior state. The steps are provided below. All of these commands must be executed as root, so be sure to use sudo in front of every one of them. Also, BE CERTAIN that the files exist in the way I have described. If you’re unsure of what you’re doing, ALWAYS list your directory contents before targeting files for removal. Additionally, I shouldn’t have to remind you that you should back up all of these files before you do anything, should I?
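As for that backup, a minimal sketch like the following could do the job. The function name, the archive location and the optional root argument are all assumptions for illustration, not anything Server Admin provides. It relies on tar’s default behavior of storing symlinks as links rather than following them, so the dead Xsan targets shouldn’t stall the archive.

```shell
#!/bin/bash
# Sketch: archive the mail config items (symlinks and ".cl-rst" copies
# alike) into one tar file before changing anything.

# backup_mail_config DEST [ROOT]
# DEST is the tar file to create; ROOT is a hypothetical prefix that
# defaults to "/" and is only there so the function can be tried on a copy.
backup_mail_config() {
    local dest="$1" root="${2:-/}"
    local rel list=""
    for rel in \
        etc/postfix etc/postfix.cl-rst \
        var/spool/postfix var/spool/postfix.cl-rst \
        etc/dovecot etc/dovecot.cl-rst \
        var/spool/dovecot var/spool/dovecot.cl-rst \
        etc/amavisd.conf etc/amavisd.conf.cl-rst \
        etc/MailServicesOther.plist etc/MailServicesOther.plist.default
    do
        # -e follows symlinks, so also test -L to catch dangling links
        if [ -e "$root/$rel" ] || [ -L "$root/$rel" ]; then
            list="$list $rel"
        fi
    done
    if [ -z "$list" ]; then
        echo "nothing to back up under $root" >&2
        return 1
    fi
    # $list is left unquoted on purpose so it splits into separate paths
    tar -cf "$dest" -C "$root" $list
}
```

From a root shell, something like `backup_mail_config /var/root/mail-config-backup.tar` would write the archive; pick whatever destination suits you, as long as it isn’t on the broken Xsan volume.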
- sudo rm /etc/dovecot
- Removes the symlink to your Xsan volume
- sudo mv /etc/dovecot.cl-rst /etc/dovecot
- Restores the prior configuration for Dovecot
- sudo rm /var/spool/dovecot
- Removes the symlink to your Xsan volume
- sudo mv /var/spool/dovecot.cl-rst /var/spool/dovecot
- Restores the prior configuration for Dovecot
- sudo rm /etc/postfix
- Removes the symlink to your Xsan volume
- sudo mv /etc/postfix.cl-rst /etc/postfix
- Restores the prior configuration for postfix
- sudo rm /var/spool/postfix
- Removes the symlink to your Xsan volume
- sudo mv /var/spool/postfix.cl-rst /var/spool/postfix
- Restores the prior spool directory for postfix
- sudo rm /etc/amavisd.conf
- Removes the symlink to your Xsan volume
- sudo mv /etc/amavisd.conf.cl-rst /etc/amavisd.conf
- Restores the original config for amavisd
- sudo rm /etc/MailServicesOther.plist
- sudo cp /etc/MailServicesOther.plist.default /etc/MailServicesOther.plist
- Restores Server Admin’s plist of the mail services configuration
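If you’d rather not type all of that by hand, the same steps can be sketched as a pair of shell functions with a few guards bolted on. The function names and the optional root argument are hypothetical. Note that this sketch is stricter than the manual steps: it only removes a path if it really is a symlink and only when the matching “.cl-rst” copy exists, so anything that doesn’t match the layout described above gets skipped with a warning instead of deleted.

```shell
#!/bin/bash
# Sketch: guarded version of the manual restore steps. Run as root.

# restore_one PATH
# Replaces a cluster symlink at PATH with PATH.cl-rst.
restore_one() {
    local path="$1"
    if [ ! -e "$path.cl-rst" ]; then
        echo "skip $path: no $path.cl-rst to restore" >&2
        return 1
    fi
    if [ -L "$path" ]; then
        rm "$path"                      # removes only the link itself
    elif [ -e "$path" ]; then
        echo "skip $path: exists and is not a symlink" >&2
        return 1
    fi
    mv "$path.cl-rst" "$path"
}

# restore_mail_config [ROOT]
# ROOT is a hypothetical prefix; leave it empty to act on the live system.
restore_mail_config() {
    local root="${1:-}" rel status=0
    for rel in etc/dovecot var/spool/dovecot \
               etc/postfix var/spool/postfix \
               etc/amavisd.conf
    do
        restore_one "$root/$rel" || status=1
    done
    # Reset Server Admin's plist from the shipped default, if present.
    if [ -f "$root/etc/MailServicesOther.plist.default" ]; then
        rm -f "$root/etc/MailServicesOther.plist"
        cp "$root/etc/MailServicesOther.plist.default" \
           "$root/etc/MailServicesOther.plist"
    else
        echo "skip plist: no MailServicesOther.plist.default found" >&2
        status=1
    fi
    return $status
}
```

A non-zero return from `restore_mail_config` means at least one item was skipped; read the warnings and sort those out by hand rather than forcing it.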
Now, rerun Server Admin’s mail setup utility and you’ll get through the whole thing. Be sure to point it to the new volume where your restored data resides. Voila, you’ve recovered your server to a single node service until you can get the Xsan crap cleared up.