Problem Solved, But Why Did It Happen

I work in a large, global enterprise where we maintain several thousand Windows-, Red Hat- and Solaris-based (in that order, with RH rapidly supplanting Solaris and even starting to eat into Windows as a percentage of total hosts deployed) physical and virtual systems. In an environment as large and distributed as ours, our build and patching process is highly standardized and automated.

Early in the week, I was attempting to patch up the NetBackup media servers at one of our data centers in the western US. Both media servers were running the same RHEL release and patch level and had been built from our standard OS sequence. Each had been brought up to the same patch level by way of our automated server management system, and both had previously been problem-free.

Monday afternoon, I took one of the media servers out of the NetBackup job rotation so I could patch it. I punched the "remediate" button in the automated server management system's dashboard for that host and went about doing other things while the system was patched, keeping the progress window open elsewhere on my screen. Everything went fine until the post-patch reboot. The "reboot" job took an incredibly long time to return a success status, far longer than normal.

So, I located my IPMI login information for the server and logged in. I found the server sitting at a boot error screen, displaying a lovely "Can't find VolGroup00" message. I exhaled a stream of expletives, then began the recovery process.

We don't typically keep bootable media on our networks, but I was fortunately able to reboot off the pre-patch kernel to bring the system back online. I unpacked both the pre-patch and post-patch initrds and dug through them. Ultimately, I discovered that the post-patch initrd was missing its ata_piix, cciss, libata, scsi_mod and sd_mod drivers. Given that it was an HP blade with an onboard RAID controller presenting the root drives as a single /dev/cciss device, the lack of the cciss driver alone would have caused this problem. That all of the disk-related drivers seemed to be missing from the new initrd was mostly just icing on the cake.
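
For anyone who wants to do the same comparison: on RHEL 5-era systems, the initrd is just a gzip-compressed cpio archive, so digging through the two images looks roughly like the sketch below (the image filenames and kernel versions are placeholders, not the actual ones from this system).

```bash
# Unpack each initrd into its own scratch directory. RHEL 5-era
# initrds are gzip-compressed cpio archives. Filenames/versions here
# are placeholders.
mkdir -p /tmp/initrd-old /tmp/initrd-new

cd /tmp/initrd-old
zcat /boot/initrd-2.6.18-OLD.el5.img | cpio -idmv

cd /tmp/initrd-new
zcat /boot/initrd-2.6.18-NEW.el5.img | cpio -idmv

# Compare the kernel modules bundled into each image; in this case the
# new image had lost cciss, libata, ata_piix, scsi_mod and sd_mod.
diff <(cd /tmp/initrd-old && find . -name '*.ko' -printf '%f\n' | sort) \
     <(cd /tmp/initrd-new && find . -name '*.ko' -printf '%f\n' | sort)
```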

At any rate, I manually rebuilt the initrd: the first time just calling mkinitrd with no special arguments, which produced an initrd that was similarly missing the disk modules; the second time forcing the inclusion of the missing modules. I then booted the system off the hand-built initrd, and it came back up and was otherwise happy. I alerted our build-automation team that I'd encountered an issue on a physical system (since most of our RHEL builds are virtual and our build-testing is pretty much exclusively virtual, I figured it was best to let them know in case there was a QA issue).
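
The forced rebuild was along these lines; mkinitrd's --with= option pulls in a module that it wouldn't otherwise include (again, the kernel version is a placeholder).

```bash
# Keep a copy of the broken image, then force the storage modules in.
# KVER is a placeholder for the newly installed kernel version.
KVER=2.6.18-NEW.el5
cp -p /boot/initrd-${KVER}.img /boot/initrd-${KVER}.img.broken

mkinitrd -f \
  --with=scsi_mod --with=sd_mod \
  --with=libata --with=ata_piix \
  --with=cciss \
  /boot/initrd-${KVER}.img ${KVER}
```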

That still left me needing to patch the problem system's partner. Obviously, for the patching of the second system, I kept the remote console open in case patching blew it up, too. For better or worse, it didn't get blown up by the patch process. Yay for not having to recover a second system, but now, instead of having a centrally addressable systemic problem, I have a one-off problem. And I don't know whether it's a one-time fluke or whether this is going to be a problem-child system every time it needs to be patched.

At any rate, the point of this post is: why the hell would this one system have botched its post-patch `mkinitrd` run? Especially given that this particular system had been successfully patched at least five times before. No errors were logged during the patch process, and none were reported when I ran the "no special arguments" manual mkinitrd either. So, I'm left to wonder WTF was up with mkinitrd on this system.
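
If anyone else is chasing the same ghost: my understanding is that RHEL 5's mkinitrd picks its storage drivers largely from the scsi_hostadapter aliases in /etc/modprobe.conf plus whatever it resolves for the root device, so a missing or mangled modprobe.conf entry is one plausible way to end up with a disk-less initrd. A quick sanity check might look something like this (the expected alias shown is illustrative for this hardware):

```bash
# mkinitrd on RHEL 5 chooses storage drivers largely from the
# scsi_hostadapter aliases in /etc/modprobe.conf, so confirm the
# expected entry is present (illustrative for cciss-backed storage):
grep -i scsi_hostadapter /etc/modprobe.conf
#   alias scsi_hostadapter cciss

# Re-running the rebuild verbosely shows which modules it resolves.
mkinitrd -f -v /boot/initrd-$(uname -r).img $(uname -r)
```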
