haldaemon fails to start on systems with a large number of disks in RHEL 5 and RHEL 6

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux 5
  • Red Hat Enterprise Linux 6

Issue

  • At server boot, or when haldaemon is started via its init script, hald fails to start:

    # /etc/init.d/haldaemon start
    Starting HAL daemon:               FAILED
    
  • When running in the foreground, starting hald is successful:

    # hald --use-syslog --verbose=yes --daemon=no
    
  • The haldaemon service takes a long time at startup and eventually fails to start, but running hald --daemon=no manually works.

Resolution

  1. Upgrade to hal-0.5.8.1-62.el5 or later.
  2. Create the file /etc/sysconfig/haldaemon and add the following command-line argument for hald to it:

    --child-timeout=600
    
  3. Adjust the timeout value to match the maximum time the child process takes to probe all the LUNs on your system.
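
One way to pick the value in step 3 is to measure how long probing takes and add some headroom. The sketch below is only an illustration: the helper name suggest_child_timeout and the 20 percent default margin are assumptions, not part of hal or of Red Hat's guidance.

```shell
# Hypothetical helper: derive a --child-timeout value from a measured probe time.
# The 20 percent headroom default is an illustrative assumption.
suggest_child_timeout() {
    measured_seconds=$1
    headroom_percent=${2:-20}
    # Integer arithmetic; adding 99 before dividing rounds the result up.
    echo $(( ($measured_seconds * (100 + $headroom_percent) + 99) / 100 ))
}

# Example: a probe time measured at 500 seconds.
echo "--child-timeout=$(suggest_child_timeout 500)"
```

With a measured probe time of 500 seconds and the assumed margin, this prints --child-timeout=600, the value shown in step 2.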

Root Cause

  • The hald daemon is timing out waiting for the child process to probe all the devices. By default, hald waits for 250 seconds (4 minutes, 10 seconds) for its child process to complete device probing.
  • The issue seems to occur most frequently on systems with a large number of disks.
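
To get a rough sense of whether a system falls into this category, you can count the SCSI disk entries the kernel exposes. The sketch below assumes they appear as sd* entries under /sys/block. The function name count_sd_devices is hypothetical, and no particular count is known to trigger the timeout.

```shell
# Hypothetical helper: count sd* entries in a sysfs block directory.
# Defaults to /sys/block; a different directory can be passed for testing.
count_sd_devices() {
    sys_block_dir=${1:-/sys/block}
    count=0
    for entry in "$sys_block_dir"/sd*; do
        # Skip the literal pattern when nothing matches the glob.
        if [ -e "$entry" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

count_sd_devices
```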

Diagnostic Steps

  • Determine how long hald runs before it fails to start. You can do this by either:
    • Running service haldaemon restart and timing how long hald runs before it fails, or
    • Running the following command, and then examining the time stamps in the system log to determine when hald started and when it emitted its last message before exiting:

      hald --use-syslog --verbose=yes
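
Once you have the first and last hald time stamps from the system log, the run time is their difference. The sketch below assumes both time stamps are in HH:MM:SS form and fall on the same day. The function names and the example times are hypothetical, not taken from a real log.

```shell
# Convert an HH:MM:SS time stamp to seconds since midnight.
to_seconds() {
    old_ifs=$IFS
    IFS=:
    set -- $1          # split on ":" into $1 (hours), $2 (minutes), $3 (seconds)
    IFS=$old_ifs
    # Strip one leading zero so two-digit fields such as "08" are not read as octal.
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# Elapsed seconds between two same-day time stamps (start, end).
elapsed_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Hypothetical example: first hald message at 10:00:00, last at 10:04:10.
elapsed_seconds 10:00:00 10:04:10
```

For the example times this prints 250, which matches the default timeout described under Root Cause.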

Component

  • hal


5 Comments

I just saw this issue on RHEL 5.9 with hal 0.5.8.1-64.el5 installed already.
It does in fact have a high number of disks/luns.
Creating the /etc/sysconfig file with contents suggested here resolved the issue.

Great to hear. Thanks Jeffrey.

If you have to add a timeout because you have too many devices, then is there even a point to keep it enabled? I have my timeout set to 900 before it quits. I'm thinking that I should just disable haldaemon. Without the timeout, it takes one hour to start up.

The point in this article is to tell you how to INCREASE the timeout from the default to PREVENT the timeout.
On my system, which has 410 multipath devices with two /dev/sd* paths each, not to mention many logical volumes in LVM, tape drives, etc., setting the value to 600 (10 minutes) gave it plenty of time to start everything.

Having said that, when I looked into whether haldaemon really needs to be running, I found various links suggesting it does NOT on servers, as its main benefit is for X.
One such link that talks about other default services:
http://www.cyberciti.biz/faq/linux-default-services-which-are-enabled-at-boot/
"man hald" doesn't suggest it is restricted to X so I'm not planning on turning it off on my systems.

Since it CAN time out and your system will boot anyway, it seems disabling it wouldn't be a problem, so it sounds like it is entirely up to you.

Thanks for the link. Yeah, I'm pretty sure nothing we run is using haldaemon. And since we have almost 2,000 LUNs on this machine, haldaemon takes one hour to start.