Problem Solved, But Why Did It Happen


I work in a large, global enterprise where we maintain several thousand Windows-, RedHat- and Solaris-based (in that order, with RH rapidly supplanting Solaris and even starting to eat into Windows by percentage of total hosts deployed) physical and virtual systems. In an environment as large and distributed as ours, our build and patching process is highly standardized and automated.

Early in the week, I was attempting to patch up the NetBackup media servers at one of our data centers in the western US. Both media servers were running the same RHEL release and patch level and were built from our standard OS sequence. Each had been brought up to the same patch levels by our automated server management system, and both had previously been problem free.

Monday afternoon, I took one of the media servers out of the NetBackup job-rotation so I could patch it. I punched the "remediate" button in the automated server management system's dashboard for that host, and went about doing other things while the system was patched up. I kept the progress window open elsewhere on my screen. Everything went fine until the post-patch reboot. The "reboot" job took an incredibly long time to return a success status - far longer than was normal.

So, I located my IPMI login information for the server and logged in. I found the server sitting at a boot error screen, displaying a lovely "Can't find VolGroup00" message. I exhaled a stream of expletives, then began the recovery process.

We don't generally keep bootable media on our networks, but I was fortunately able to reboot off the pre-patch kernel to bring the system online. I unpacked both the pre-patch and post-patch initrds and dug through them. Ultimately, I discovered that the post-patch initrd was missing its ata_piix, cciss, libata, scsi_mod and sd_mod drivers. Given that it was an HP blade with an onboard RAID controller providing the root drives as a single /dev/cciss device, the lack of the cciss driver alone would have caused this problem. That all the disk-related drivers seemed to be missing from the new initrd was mostly just icing on the cake.
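
For anyone curious, the unpack-and-compare exercise was roughly the following (the kernel version strings and the .bak filename are placeholders, not the actual values from the box):

```
# Unpack both initrds into scratch directories (RHEL 5 initrds are gzipped
# cpio archives); version strings and the .bak name are illustrative only
mkdir /tmp/initrd-old /tmp/initrd-new

cd /tmp/initrd-old
zcat /boot/initrd-2.6.18-308.el5.img.bak | cpio -idmv

cd /tmp/initrd-new
zcat /boot/initrd-2.6.18-308.el5.img | cpio -idmv

# Compare the kernel modules each image actually carries
( cd /tmp/initrd-old && find . -name '*.ko' | sort ) > /tmp/mods-old.txt
( cd /tmp/initrd-new && find . -name '*.ko' | sort ) > /tmp/mods-new.txt
diff /tmp/mods-old.txt /tmp/mods-new.txt
```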

At any rate, I manually rebuilt the initrd: the first time just calling mkinitrd with no special arguments, which produced an initrd that was similarly missing the disk modules; the second time forcing the inclusion of the missing modules. I booted the system to the hand-made initrd, and it came back up and was otherwise happy. I alerted our build-automation team that I'd encountered an issue on a physical system (since most of our RHEL builds are virtual and our build-testing is pretty much exclusively virtual, I figured it was best to let them know in case there was a QA issue).
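
The forced rebuild looked something like this - again, the kernel version string is a placeholder; `--with` is mkinitrd's flag for forcing a module into the image:

```
# Force the missing disk modules into a fresh initrd, overwriting the bad one
# (back it up first); the kernel version string is illustrative
mkinitrd -f -v \
  --with=scsi_mod --with=sd_mod --with=libata \
  --with=ata_piix --with=cciss \
  /boot/initrd-2.6.18-308.el5.img 2.6.18-308.el5
```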

That still left me needing to patch the problem system's partner. Obviously, for the patching of the second system, I kept the remote console device open in case patching blew it up too. For better or worse, it didn't get blown up by the patch process. Yay for not having to recover a second system, but now, instead of having a centrally-addressable systemic problem, I have a one-off problem. And I don't know whether that problem is a one-time one-off or whether I have what's going to be a problem-child system each time it needs to be patched.

At any rate, the point of this post is: why the hell would this one system have botched its post-patch `mkinitrd` action, especially given that it had been successfully patched at least five times before? No errors were logged during the patch process, and no errors were called out when I did the "no special arguments" manual mkinitrd. So, I'm left to wonder WTF was up with mkinitrd on this system.

Responses

Was this RHEL 5 or RHEL 6? If the former, can you post the contents of /etc/modprobe.conf?

Tom, you might like to tweak the discussion title to include a couple of keywords (mkinitrd etc) so that folks browsing will have a better idea of the topic.

I presume you're still able to reproduce the problem by just running "mkinitrd". Try "mkinitrd -v" to increase verbosity; maybe it will tell you where it's tripping up?
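
Something like this, writing to a throwaway path so the live initrd isn't touched (resolving the kernel version via uname is just for illustration):

```
# Verbose rebuild into /tmp so the in-use initrd isn't clobbered
mkinitrd -v /tmp/test-initrd.img $(uname -r)
```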

Given the software is identical, do the blades have the same version of system BIOS and any applicable firmware?

Tried putting four tags on it. The tag drop-down box doesn't make it easy to select more than one tag (and know that you've got more than one selected), however. :(

Ok, I was a touch unclear when stating "the first time, just calling mkinitrd with no special arguments". By "special arguments", I meant the forced-inclusion of specific modules. When I ran it with no "special arguments", I'd run it with several flags set - including the verbose flags. There were no errors, cautions, etc. noted in the output.

We have very tight version-control on our system firmware. It's kind of a pain in the ass, since it frequently causes us to run further behind latest-n-greatest than we'd probably really like. Basically, we have a release-schedule during which *all* system components get brought up to the currently-certified version (in other words, each blade in a given chassis is running all the same firmware). The two chassis the blades are in are on the same release level, thus all the chassis sub-components are on the same levels. 

RHEL 5(.8): our security folks haven't certified RHEL 6 for deployment into production datacenters, yet.

The modprobe file is basically empty (just eth and bond lines) - and identical across both hosts. Unfortunately, given the nature of our environment, while I can summarize/transcribe, I can't actually download anything off of them.

I was just referring to the discussion title. I know the tag drop-down is a real pain to use right now - we've got some improvements on the way.

I doubt these specific drivers will get added to the initrd image by mkinitrd if modprobe.conf doesn't include "alias scsi_hostadapterX <driver>" lines.
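
On RHEL 5, mkinitrd reads those scsi_hostadapter aliases out of /etc/modprobe.conf to decide which storage drivers to pull in, so you'd expect lines roughly like these (driver names shown here are just an example for an HP blade with a Smart Array controller plus onboard SATA):

```
# Example storage aliases mkinitrd looks for (illustrative driver names)
alias scsi_hostadapter cciss
alias scsi_hostadapter1 ata_piix
```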

You can open a case with verbose output of mkinitrd from both systems and details of modprobe.conf if you cannot give those details in this thread.

Unfortunately, with our production environment, I can't do that in any kind of practical way. Manually transcribing a verbose `mkinitrd` session is, to say the least, a painful undertaking and prone to errors.

That said, having had an opportunity to re-review the previously problematic system's modprobe.conf (and compare it to healthy peers), it was missing a number of directives (specifically the scsi_hostadapter aliases for cciss and ata_piix). Interestingly, it was also missing its bonding aliases, yet the bonding drivers started up just fine. Oddly, the permissions on the problematic box were also locked down relative to its healthier peers. I'm wondering if one of our security folks was "tweaking" this system to bring it into compliance with some (braindead) instruction-sheet.
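
For reference, the healthy peers carry roughly the following; the problem box was missing the scsi_hostadapter and bonding aliases (the NIC driver name and bonding options below are illustrative, not transcribed from the actual hosts):

```
# Roughly what a healthy peer's /etc/modprobe.conf contains
# (driver names and bonding options are illustrative)
alias eth0 bnx2
alias eth1 bnx2
alias bond0 bonding
options bond0 mode=1 miimon=100
alias scsi_hostadapter cciss
alias scsi_hostadapter1 ata_piix
```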