2.10.3. Recovering Failed Node Hosts

Important

This section presumes you have backed up the /var/lib/openshift directory. See Section 2.10.2, “Backing Up Node Host Files” for more information.
A failed node host can be recovered if the /var/lib/openshift gear directory was stored fault-tolerantly and can be restored. SELinux contexts must be preserved on the gear directory for the recovery to succeed. This scenario rarely occurs, especially when node hosts are virtual machines in a fault-tolerant infrastructure rather than physical machines. Note that scaled applications cannot be recovered onto a node host with a different IP address than that of the original node host.
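For example, if the gear directory is preserved as an archive rather than on a surviving volume, the archive must be created and extracted with SELinux support so that the gear contexts survive. The following is a minimal sketch, assuming a tar build that supports the --selinux option, as on Red Hat Enterprise Linux, and a hypothetical archive path:

  # tar --selinux -czpf /mnt/backup/openshift-gears.tar.gz /var/lib/openshift
  # tar --selinux -xzpf /mnt/backup/openshift-gears.tar.gz -C /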

Procedure 2.7. To Recover a Failed Node Host:

  1. Create a node host with the same host name and IP address as the one that failed.
    1. The host name DNS A record can be adjusted if the IP address must be different. However, note that the application CNAME and database records all point to the host name and cannot be easily changed.
    2. Ensure the ruby193-mcollective service is not running on the new node host:
      # service ruby193-mcollective stop
    3. Copy all the configuration files in the /etc/openshift directory from the failed node host to the new node host and ensure that the gear profile is the same.
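      For example, the files can be copied from a backup of the failed host with rsync, using the -X option to preserve extended attributes, and the gear profile can then be checked in /etc/openshift/resource_limits.conf. This is a minimal sketch, with a hypothetical backup host and path, assuming the profile is recorded there as node_profile:
      # rsync -avX backup.example.com:/backups/failed-node/etc/openshift/ /etc/openshift/
      # grep node_profile /etc/openshift/resource_limits.conf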
  2. Attach and mount the backup at /var/lib/openshift, ensuring that the usrquota mount option is used:
    # echo "/dev/path/to/backup/partition /var/lib/openshift/ ext4 defaults,usrquota 0 0" >> /etc/fstab
    # mount -a
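    Verify that the mount is active with the usrquota option and that the gear directories retained their SELinux contexts, for example:
    # mount | grep /var/lib/openshift
    # ls -dZ /var/lib/openshift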
  3. Reinstate quotas on the /var/lib/openshift directory:
    # quotacheck -cmug /var/lib/openshift
    # restorecon /var/lib/openshift/aquota.user
    # quotaon /var/lib/openshift
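    You can confirm that quotas are active and report per-gear usage with, for example:
    # repquota /var/lib/openshift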
  4. Run the oo-admin-regenerate-gear-metadata tool, available starting in OpenShift Enterprise 2.1.6, on the new node host to replace and recover the failed gear data. The tool inspects each existing gear on the gear data volume, verifies that the gear has the correct entries in certain files, and performs any fixes that are needed:
    # oo-admin-regenerate-gear-metadata
    
    This script attempts to regenerate gear entries for:
      *  /etc/passwd
      *  /etc/shadow
      *  /etc/group
      *  /etc/cgrules.conf
      *  /etc/cgconfig.conf
      *  /etc/security/limits.d
    
    Proceed? [yes/NO]: yes
    The oo-admin-regenerate-gear-metadata tool does not make any changes unless it detects missing entries. Note that this tool can be added to a node host deployment script; a sketch follows this step.
    Alternatively, if you are using OpenShift Enterprise 2.1.5 or earlier, replace the /etc/passwd file on the new node host with the content from the original, failed node host. If that backup file was lost, see Section 2.10.4, “Recreating /etc/passwd Entries” for instructions on recreating the /etc/passwd file.
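    For example, a deployment script could run the tool non-interactively; a minimal sketch, assuming that piping yes answers the confirmation prompt shown above:
    # echo yes | oo-admin-regenerate-gear-metadata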
  5. When the oo-admin-regenerate-gear-metadata tool completes, it runs the oo-accept-node command and reports the output:
    Running oo-accept-node to check node consistency...
    ...
    FAIL: user 54fe156faf1c09b9a900006f does not have quotas imposed. This can be addressed by running: oo-devel-node set-quota --with-container-uuid 54fe156faf1c09b9a900006f --blocks 2097152 --inodes 80000
    If there are any quota errors, run the suggested quota command, then run the oo-accept-node command again to ensure the problem has been resolved:
    # oo-devel-node set-quota --with-container-uuid 54fe156faf1c09b9a900006f --blocks 2097152 --inodes 80000
    # oo-accept-node
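    If many gears report missing quotas, the suggested commands can be applied in bulk; a minimal sketch, assuming each failure message ends with the suggested command as in the output shown above:
    # oo-accept-node 2>&1 | sed -n 's/^.*This can be addressed by running: //p' | sh
    # oo-accept-node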
  6. Reboot the new node host to activate all changes, start the gears, and allow MCollective and other services to run.
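    After the reboot, confirm that MCollective is running and that the node host passes its consistency checks:
    # service ruby193-mcollective status
    # oo-accept-node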