Chapter 8. Linux Capabilities and Seccomp

Namespaces are one of the building blocks of isolation used by docker-formatted containers. They give a process an environment in which it cannot see or interact with processes outside that environment. For example, a process can have PID 1 inside a container while the same process has an ordinary PID on the host. The process ID (PID) namespace is the mechanism that remaps PIDs inside a container. Detailed information about namespaces can be found in the Overview of Containers in Red Hat Systems guide. However, containers can still access some host resources, such as the kernel and kernel modules, the /proc file system, and the system time. The Linux capabilities and seccomp features can limit the access that containerized processes have to these system features.

8.1. Linux Capabilities

The Linux capabilities feature breaks up the privileges available to processes run as the root user into smaller groups of privileges. This way, a process running with root privileges can be limited to the minimal permissions it needs to perform its operation. Docker supports Linux capabilities through the --cap-add and --cap-drop options of the docker run command. A container is started with a default set of capabilities, any of which can be dropped. Other capabilities can be added manually. Both --cap-add and --cap-drop accept the value ALL, to add or drop all capabilities.

The following list contains all capabilities that are enabled by default when you run a docker container, with their descriptions from the capabilities(7) man page:

  • CHOWN - Make arbitrary changes to file UIDs and GIDs.
  • DAC_OVERRIDE - Discretionary access control (DAC) - Bypass file read, write, and execute permission checks.
  • FSETID - Don’t clear set-user-ID and set-group-ID mode bits when a file is modified; set the set-group-ID bit for a file whose GID does not match the file system or any of the supplementary GIDs of the calling process.
  • FOWNER - Bypass permission checks on operations that normally require the file system UID of the process to match the UID of the file, excluding those operations covered by CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH.
  • MKNOD - Create special files using mknod(2).
  • NET_RAW - Use RAW and PACKET sockets; bind to any address for transparent proxying.
  • SETGID - Make arbitrary manipulations of process GIDs and supplementary GID list; forge GID when passing socket credentials via UNIX domain sockets; write a group ID mapping in a user namespace.
  • SETUID - Make arbitrary manipulations of process UIDs; forge UID when passing socket credentials via UNIX domain sockets; write a user ID mapping in a user namespace.
  • SETFCAP - Set file capabilities.
  • SETPCAP - If file capabilities are not supported: grant or remove any capability in the caller’s permitted capability set to or from any other process.
  • NET_BIND_SERVICE - Bind a socket to Internet domain privileged ports (port numbers less than 1024).
  • SYS_CHROOT - Use chroot(2) to change to a different root directory.
  • KILL - Bypass permission checks for sending signals. This includes use of the ioctl(2) KDSIGACCEPT operation.
  • AUDIT_WRITE - Write records to kernel auditing log.
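
To check which of these capabilities a process actually holds, one option is to read the CapEff line in /proc/self/status, which is a hexadecimal bit mask, and test individual bits. The following is a minimal sketch, not a definitive tool. It assumes the bit numbers defined in <linux/capability.h> (for example, CHOWN is bit 0 and SYS_TIME is bit 25). The helper name has_cap is hypothetical, and the mask 00000000a80425fb is the value that corresponds to exactly the fourteen capabilities listed above, not output captured from a particular system.

```shell
#!/bin/sh
# has_cap MASK BIT
# Succeeds (exit status 0) when bit number BIT is set in the hexadecimal
# capability mask MASK, as printed in the CapEff line of /proc/<pid>/status.
# Hypothetical helper; bit numbers are those in <linux/capability.h>.
has_cap() {
    mask=$((0x$1))
    [ $(( (mask >> $2) & 1 )) -eq 1 ]
}

# Mask with exactly the fourteen default capabilities set (illustrative).
default_mask=00000000a80425fb

has_cap "$default_mask" 0  && echo "CHOWN (bit 0) is set"
has_cap "$default_mask" 25 || echo "SYS_TIME (bit 25) is not set"
```

Inside a container, the live value can be read with grep CapEff /proc/self/status. Where the capsh utility from libcap is installed, capsh --decode=<mask> prints the capability names directly.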

For most containerized applications, you can drop the following capabilities from this default list: AUDIT_WRITE, MKNOD, SETFCAP, and SETPCAP. The command will be similar to the following:

# docker run --cap-drop AUDIT_WRITE --cap-drop MKNOD --cap-drop SETFCAP --cap-drop SETPCAP <container> <command>

The rest of the capabilities are not enabled by default and can be added according to your application’s needs. You can see the full list in the capabilities(7) man page.

A good strategy is to drop all capabilities and add the needed ones back:

# docker run --cap-drop ALL --cap-add SYS_TIME ntpd /bin/sh

Important

The minimum set of required capabilities depends on the application, and figuring it out can take some time and testing. Do not use the SYS_ADMIN capability unless the application specifically requires it. Although capabilities break the root user’s powers into smaller chunks, SYS_ADMIN by itself covers a large share of privileged operations and can therefore expose considerably more attack surface.
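
While testing which capabilities an application needs, the drop-all-then-add-back strategy can be scripted so that the list is easy to adjust between runs. The following is a minimal sketch under stated assumptions: cap_flags is a hypothetical helper name, and the capabilities in the example call are illustrative rather than a recommendation for any particular application.

```shell
#!/bin/sh
# cap_flags CAP...
# Prints docker run options that drop every capability and then add back
# only the ones named as arguments. Hypothetical helper for illustration.
cap_flags() {
    flags="--cap-drop ALL"
    for cap in "$@"; do
        flags="$flags --cap-add $cap"
    done
    printf '%s\n' "$flags"
}

# Example: a service that only needs to bind a privileged port and switch users.
cap_flags NET_BIND_SERVICE SETUID SETGID
```

The output can be passed to docker unquoted so that the shell splits it into separate options, for example: docker run $(cap_flags NET_BIND_SERVICE SETUID SETGID) <image_name>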

EXAMPLE #1 If you are building a container that runs the Network Time Protocol (NTP) daemon, ntpd, you need to add SYS_TIME so that this container can modify the host’s system time. Otherwise the container will not run. Use this command:

# docker run -d --cap-add SYS_TIME ntpd

EXAMPLE #2 If you want your container to be able to modify network states, you need to add the NET_ADMIN capability:

# docker run --cap-add NET_ADMIN <image_name> sysctl -w net.core.somaxconn=256

This command sets the maximum number of pending connections that can be queued for a listening socket to 256.

Note

You cannot modify the capabilities of an already running container.

8.2. Limiting syscalls with seccomp

Secure Computing Mode (seccomp) is a kernel feature that allows you to filter system calls to the kernel from a container. The combination of restricted and allowed calls is arranged in profiles, and you can pass different profiles to different containers. Seccomp provides more fine-grained control than capabilities, leaving an attacker with only a limited set of syscalls available from inside the container.
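
One way to confirm whether a seccomp filter is active for a process is the Seccomp field in /proc/<pid>/status, which on kernels that expose it reports 0 (disabled), 1 (strict mode), or 2 (filter mode). The following is a minimal sketch, assuming it is run inside the container whose filter you want to check. The helper name seccomp_mode_name is hypothetical.

```shell
#!/bin/sh
# seccomp_mode_name N
# Maps the numeric Seccomp field of /proc/<pid>/status to a readable name.
seccomp_mode_name() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}

# Read the mode of the current process; fall back to an empty value
# when /proc/self/status is missing or has no Seccomp field.
mode=$(awk '/^Seccomp:/ {print $2}' /proc/self/status 2>/dev/null) || mode=""
echo "seccomp mode: $(seccomp_mode_name "$mode")"
```

A container started with a seccomp profile would be expected to report filter mode.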

The default seccomp profile for docker is a JSON file and can be viewed here: https://github.com/docker/docker/blob/master/profiles/seccomp/default.json. It blocks 44 system calls out of more than 300 available. Making the list stricter would be a trade-off with application compatibility. A table with a significant part of the blocked calls and the reasoning for blocking them can be found here: https://docs.docker.com/engine/security/seccomp/.

Seccomp uses the Berkeley Packet Filter (BPF) system, which is programmable on the fly, so you can make a custom filter. You can also restrict a particular syscall conditionally by specifying how or when it should be limited, for example based on its arguments. Once a seccomp filter is installed, the kernel runs the BPF program each time the process makes a syscall, and the program’s result determines whether the syscall is allowed to proceed. All children of a process with this filter inherit the filter as well. The docker option used to work with seccomp is --security-opt. To explicitly use the default policy for a container, the command will be:

# docker run --security-opt seccomp=/path/to/default/profile.json <container>

If you want to specify your own policy, point the option to your custom file:

# docker run --security-opt seccomp=/path/to/custom/profile.json <container>
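
As an illustration of what a custom profile file can contain, the following sketch writes a small deny-list profile that allows every syscall except three chmod variants. The file name no-chmod.json and the choice of syscalls are assumptions made for the example. The JSON layout (defaultAction, plus a syscalls list with names and action) follows the format of the default profile linked above. Older docker releases may expect a different key layout, so compare it against the default profile shipped with your version before relying on it.

```shell
#!/bin/sh
# Write a small deny-list seccomp profile: every syscall is allowed except
# the ones named below, which fail with an error instead of executing.
# The file name and the choice of syscalls are illustrative assumptions.
profile=./no-chmod.json

cat > "$profile" <<'EOF'
{
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [
        {
            "names": ["chmod", "fchmod", "fchmodat"],
            "action": "SCMP_ACT_ERRNO"
        }
    ]
}
EOF

echo "wrote $profile"
```

A profile passed with --security-opt is used in place of the default one, so a deny-list like this is considerably more permissive than the default profile and is meant only to show the file format. With this profile applied, a chmod command run inside the container would be expected to fail with a permission error.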