There is a common misperception that now that containers support seccomp we no longer need SELinux to help protect our systems. WRONG. The big weakness in containers is the container possesses the ability to interact with the host kernel and the host file systems. Securing the container processes is all about shrinking the attack surface on the host OS and more specifically on the host kernel.
seccomp does a great job of shrinking the attack surface on the kernel. The idea is to limit the number of syscalls that container processes can use. It is an awesome feature. For example, on an x86_64 bit machine, there are around 650 system calls. If the Linux Kernel has a bug in any one of these syscalls, a process could get the kernel to turn off security features and take over the system, i.e. it would break out of confinement. If your container does not run 32 bit code, you can turn on seccomp and eliminate all x86 syscalls, basically cutting the number of syscalls in half. This means that if the kernel had a bug in a 32 bit syscall that allowed the process to take over the system, this syscall would not be available to the processes in your container, and the container would not be able to break out. We also eliminate a lot of other syscalls that we do not expect processes inside of a container to call.
But seccomp is not enough
This still means that if a bug remains in the kernel that can be triggered in the 300 remaining syscalls, then the container process can still take the system over, and/or create havoc. Just having open/read/write/ioctl on things like files/devices etc, could allow a container process the ability to break out. And if they break out they would be able to write all over the system.
You could continue to shrink the seccomp syscall table to such a degree that processes can not escape, but at some point it will also prevent the container processes getting any real work done.
As usual, any single security mechanism by itself will not fully protect your containers. You need lots of security mechanisms to control what a process can do inside and outside a container.
Read-Only file systems. Prevent open/write on kernel file systems. Container processes need read access to kernel file systems like /proc, /sys, /sys/fs ... But they seldom need write access.
Dropping privileged process capabilities. This can prevent things like setting up the network or mounting file systems, (seccomp can also block some of these, but not as comprehensively as capabilities).
SELinux. Prevents which file system objects like files, devices, sockets, and directories a container process can read/write/execute. Since your processes in a container will need to use open/read/write/exec syscalls, SELinux controls which file system
objects you can interact with. I have heard a great analogy, SELinux is telling people which people they can talk to, seccomp is telling them what they can say.
prctl(NO__NEW__PRIVS). Prevents privilege escalation through the use of setuid applications. Running your container
processes without privileges is always a good idea, and this keeps the processes non privileged.
PID Namespace. Makes it harder to see other processes on the system that are not in your container.
Network Namespace. Controls which networks your container processes are able to see.
Mount Namespace. Hides large parts of the system from the processes inside of the container.
User Namespace. Helps remove remaining system capabilities. It can allow you to have privileges inside of your containers namespaces, but not outside of the container.
kvm. If you can find some way to run containers in a kvm/virtualization wrapper, this would be a lot more secure. (ClearLinux and others are working on this).
The more Linux security services that you can wrap around your container processes the more secure your system will be.
It is the combination of all of these kernel services along with administrators continuing to maintain good security practices that begin to keep your container processes contained.