The RHEL docker package does not currently support --live-restore


Containers started by a daemon running with --live-restore will not be accessible after you run systemctl restart docker.

What does --live-restore do?

First, let's look at --live-restore:

man dockerd
...
      --live-restore=false
         Enable live restore of running containers when the daemon starts so that they are not restarted.

By default, the docker daemon stops all running containers when you restart the daemon. Docker always operated this way until docker-1.12 added a new feature called --live-restore. This feature tells the docker daemon to leave the containers running when the daemon is restarted. This does not work on RHEL7 systems, because of a bug in the RHEL kernel.

If you look at docker.service, you will see that we run the docker daemon with the MountFlags=slave option:

$ grep slave /etc/systemd/system/docker.service 
MountFlags=slave

This tells systemd to create a slave mount namespace for the docker daemon. When the docker daemon creates containers, it sets up their mount points relative to its own mount namespace. If you run the docker daemon with the --live-restore flag, the containers stay set up under the mount namespace that the docker daemon/containerd was running in. When an administrator restarts the docker daemon, it gets placed in a new, different mount namespace. Because of this, it cannot find the running containers' mount points.
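One way to see whether two processes share a mount namespace is to compare the namespace links under /proc. The following is a minimal sketch, not part of the docker package: the helper names mnt_ns_of and same_mnt_ns are made up for illustration, and reading another process's namespace link generally requires root.

```shell
#!/bin/sh
# Hypothetical helpers (names invented for this sketch).

# Print the mount namespace identifier of a process, as exposed by the
# kernel under /proc/<pid>/ns/mnt. Fails if the link cannot be read.
mnt_ns_of() {
    readlink "/proc/$1/ns/mnt"
}

# Exit 0 when both PIDs are in the same mount namespace, 1 when they
# differ, and 2 when either namespace link cannot be read.
same_mnt_ns() {
    ns_a=$(mnt_ns_of "$1") || return 2
    ns_b=$(mnt_ns_of "$2") || return 2
    [ "$ns_a" = "$ns_b" ]
}

# Example (run as root): compare PID 1, which lives in the host mount
# namespace, with the docker daemon. DOCKER_PID is a placeholder; look the
# PID up however suits your system.
# if same_mnt_ns 1 "$DOCKER_PID"; then echo "shared"; else echo "private"; fi
```

With MountFlags=slave in the unit file, the comparison against PID 1 would be expected to report a private namespace for the daemon.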

Why do we do this?

This RHEL kernel bug forces us to run the docker daemon within its own mount namespace. The bug in the kernel is very simple. Create a directory named dmnt, create a new mount namespace, and mount a filesystem on top of dmnt in the new namespace. Now, outside of that mount namespace, attempt to remove the directory dmnt. Different things happen depending on the kernel. On the upstream kernel, the removal succeeds, and the other mount namespace continues to use the directory as a mountpoint. On RHEL7 systems, the removal fails with an error saying the directory is busy.

The docker daemon follows this sequence when creating a container. If someone in the host mount namespace creates a new mount namespace, docker will fail. When we run the docker daemon in the host namespace, it is very easy to trigger this situation, and suddenly docker starts to throw errors. By putting the docker daemon into its own mount namespace, we prevent this bug from being triggered. Sadly, this comes at the expense of --live-restore.
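The sequence described above can be sketched as a small script. This is an illustration under stated assumptions rather than an official reproducer: the function name repro_rmdir_under_mount is invented, it must run as root, it assumes unshare from util-linux is installed, and it marks mounts private inside the new namespace so the test mount does not propagate back to the host.

```shell
#!/bin/sh
# Hypothetical reproducer (name invented for this sketch). Requires root.
repro_rmdir_under_mount() {
    if [ "$(id -u)" -ne 0 ]; then
        echo "must run as root to create a mount namespace" >&2
        return 3
    fi

    # Stand-in for the "dmnt" directory from the text.
    dmnt=$(mktemp -d /tmp/dmnt.XXXXXX) || return 1

    # In a new mount namespace: stop mount propagation back to the host,
    # mount a tmpfs on the directory, and keep the namespace alive briefly.
    unshare -m sh -c \
        "mount --make-rprivate / && mount -t tmpfs tmpfs '$dmnt' && sleep 5" &
    holder=$!
    sleep 1

    # Back in the original namespace, try to remove the directory.
    if rmdir "$dmnt"; then
        echo "rmdir succeeded: the upstream behavior described above"
    else
        echo "rmdir failed: the RHEL7 behavior described above"
    fi

    # Let the namespace holder exit, then clean up if the directory remains.
    wait "$holder"
    rmdir "$dmnt" 2>/dev/null || true
}
```

Calling repro_rmdir_under_mount as root prints which of the two behaviors the running kernel shows.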

Good News

There is good news though. The Kernel team at Red Hat is working on a fix for this issue that will hopefully be in the RHEL7.4 kernel, which should be shipped in the early fall. Here are the bugzillas that cover the issue.
