In a sea of data that contains a tiny speck of evidence of maliciousness somewhere, where do we start? What is the most efficient way to swim through the inconsequential information to get to that small cluster of anomalous spikes? Big data in information security is a complicated problem due to the sheer volume of data generated by various security agents and the variety of attack profiles, ranging from script kiddies to nation-state hackers.
In this post, we will use a case study to demonstrate how to detect anomalous activity using a toolkit comprising Capsule8 Investigations, BigQueryML, and a machine learning model. We will show you that your organization can leverage these tools for sophisticated security detection without an official data science team or big budget. A little high-level learning and some SQL query skills are all that’s required.
An Unintentional Honeypot: Proof of Concept
It would be prudent to note that security ML has a somewhat garbage reputation, and rightly so — false positives, overfitting, poor training data, and uninterpretable black box models commonly plague defenders who just wanted to detect attacks better. Vendors are infamous for applying ML as the answer without first considering the question they’re trying to answer.
We want to approach things more intelligently, starting by asking, “What problem do we need to solve?” and then carefully evaluating whether an ML approach will help us get closer to the optimal solution.
Today, the problem we want to solve is one that arose from a real incident. We will use the tale of an unintentional honeypot as a proof of concept of the power of machine learning applied to our C8 Investigations capabilities. This tale starts with the compromise of one of Capsule8’s test clusters that was running vulnerable applications for demo purposes, which was alerted on by Capsule8’s detection running in the cluster. After some analysis by the research team, we determined that an RCE vulnerability in Kibana was being exploited for cryptomining.
This incident gave our data science team a concrete problem statement to solve: can we use off-the-shelf models from Google’s BigQueryML to build an unsupervised anomaly detection model? More specifically:
- Could we start with the smallest possible dataset to train a machine learning model that successfully detects future anomalous activity?
- What would be the overhead for this model?
- Could our model detect additional data points and anomalies that were not previously uncovered?
What can this unintentional honeypot teach us about the application of machine learning to quickly identify outliers? Is it a tool we should consider having in our security toolbox? Let’s try to answer these questions by exploring the dataset, applying BigQueryML and anomaly detection logic, and analyzing the results.
Exploring our Dataset
Leveraging incident data pertaining to the Kibana compromise — stored via our C8 Investigations capability — we began by only using process events out of all possible system events. We wanted to see how the model performed with a limited scope first to evaluate the results, and the results were indeed fruitful, as you’ll see! But first, let’s explore the raw data to set the context for setting up our model.
The specific Kubernetes cluster that generated our dataset was set up on October 4th, and we received the first alert from Capsule8 about the aforementioned attack on October 24th. To visualize how the process events look over this given period of time, we will create a data summary.
If you look at our data summary below, which shows the raw count of process events seen each day, the event counts are pretty consistent — around 5 million events per day from October 16th until October 22nd. However, there is approximately a 9% dip on October 23rd, and around 13% rise on October 24th to 25th, which suggests something fishy is happening on those specific dates.
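The day-over-day comparison described above can be sketched in a few lines of Python. The daily counts here are hypothetical values chosen only to illustrate the calculation (they are not the real figures from the cluster), and `pct_change` is a helper name introduced for this sketch.

```python
def pct_change(previous: int, current: int) -> float:
    """Percentage change from one day's event count to the next."""
    return (current - previous) / previous * 100.0


# Hypothetical daily process-event counts, for illustration only.
daily_counts = {
    "10-22": 5_000_000,
    "10-23": 4_550_000,
    "10-24": 5_141_500,
}

days = list(daily_counts)
changes = {
    day: round(pct_change(daily_counts[prev], daily_counts[day]), 1)
    for prev, day in zip(days, days[1:])
}
print(changes)
```

A summary like this is only a first look: it shows which days deserve attention, not which events are malicious.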
With this incident data in hand, we now move to applying BigQueryML to our data.
Leveraging BigQueryML for Anomaly Detection
With Google’s BigQuery ML, building simple machine learning (ML) models is more accessible than ever. It allows users to build and execute ML models using SQL queries, which removes the traditional dependency on specialized data science teams — anyone who can write SQL queries can use the service. BigQuery ML currently supports linear regression, logistic regression, K-means clustering, and any pre-trained TensorFlow model. Further information on BigQueryML can be found in its documentation.
Preprocessing is a critical step for security data, given the large volume of data and duplicate events within it. Our raw process events data had features including (but not limited to) the process path, unix timestamp, and the login username. Thus, as part of our preprocessing, we encoded the process path to numeric values and extracted the date and hour of day from the unix timestamps.
Given we have around 5 million data points per day, we also wanted to see how many of the data points are unique. For instance, is the same process (e.g., dhclient) being spawned by the same parent process (/usr/sbin/NetworkManager) at the same time of day? By determining the unique data points, we can combine them and reduce the size of our data.
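As a rough illustration of these preprocessing steps, here is a minimal Python sketch. It is not the pipeline we ran: the event field names (`unix_ts`, `path`, `username`), the sample events, and the integer encoding scheme are all assumptions made for the example.

```python
from datetime import datetime, timezone


def preprocess(events):
    """Encode strings as integers, split the timestamp, and drop duplicates."""
    path_ids, user_ids = {}, {}
    unique_rows = set()
    for event in events:
        ts = datetime.fromtimestamp(event["unix_ts"], tz=timezone.utc)
        # setdefault assigns the next free integer the first time a value is seen.
        proc_encoded = path_ids.setdefault(event["path"], len(path_ids))
        username_encoded = user_ids.setdefault(event["username"], len(user_ids))
        # Rows that share date, hour, process, and user collapse into one.
        unique_rows.add(
            (ts.date().isoformat(), ts.hour, proc_encoded, username_encoded)
        )
    return sorted(unique_rows), path_ids, user_ids


# Hypothetical events; the timestamps and field names are illustrative.
events = [
    {"unix_ts": 1571954400, "path": "/sbin/dhclient", "username": "root"},
    {"unix_ts": 1571955000, "path": "/sbin/dhclient", "username": "root"},
    {"unix_ts": 1571958000, "path": "/bin/sh", "username": "kibana"},
]
rows, path_ids, user_ids = preprocess(events)
print(rows)
```

The first two events fall in the same hour with the same process and user, so they collapse into a single row, which is the size reduction described above.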
After preprocessing, our sanitized dataset looks as follows:
As you can see, it includes readable date and hour values, as well as numeric values corresponding to specific path and username strings. Now that we have our data cleaned, let’s proceed with testing our use case.
But first, which BigQueryML model to use?
As mentioned before, K-means clustering is one of the off-the-shelf models available in BigQueryML. The goal of the algorithm is to split the data into K groups based on the features provided. Those of you who are security or ops engineers without any data science expertise might be wondering how difficult it will be to work with this model. The great news is that K-means clustering is one of the most accessible models available in the data science stack.
The two requirements you need to use K-means clustering in BigQueryML are a high-level understanding of the algorithm and some experience in writing SQL queries. The rest of the mathematical magic is covered by readily available functions from BigQueryML.
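For readers who want that high-level understanding, the core of K-means fits in a short Python sketch. This is a teaching version of the textbook algorithm (Lloyd's algorithm) with caller-supplied starting centroids, not BigQueryML's implementation; the function names and the toy points are assumptions for the example.

```python
import math


def nearest(point, centroids):
    """Return (index, Euclidean distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]


def kmeans(points, initial_centroids, iterations=10):
    """Alternate between assigning points and moving centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            idx, _dist = nearest(p, centroids)
            groups[idx].append(p)
        # Move each centroid to the mean of its members; keep it if empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


# Toy two-dimensional points forming two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
print(centroids)
```

The distance returned by `nearest` is the same quantity the anomaly logic later relies on: how far a data point sits from the center of its own cluster.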
For our experiment, we will be following Google’s helpful and accessible tutorial on how to build a K-means clustering model using BigQueryML, as well as its companion on how to use K-means clustering for anomaly detection. We will be adding onto these tutorials in a few places, which we’ll note so you can follow along.
Creating our Anomaly Detection Logic
With the common complaints of security ML in mind, we decided that the model’s inherent logic should generate very few alerts. We also realized that the parameters of the model must be designed to be extremely tunable. These two design choices combined should give us enough leeway to detect as few or as many anomalies as we want to analyze further.
Using K-means clustering as an unsupervised classification method helped us meet these design criteria, and so we proceeded to apply it to our problem of anomaly detection on the security event data we sanitized. Following the suggestion in Google’s blog post mentioned earlier, we find the outliers in each cluster by computing the 95th percentile of the distances from the cluster’s centroid.
This method splits each cluster’s data points at a distance threshold: 95% of the points lie closer to the centroid than the threshold, and the remaining 5% lie farther away and hence are the farthest outliers of that cluster. Distance here is the Euclidean distance from the cluster’s centroid. Data points farther than the threshold are considered anomalous.
However, a threshold at the 95th percentile is insufficiently precise for our use case. Assuming we have minimized the variance within each cluster, the farthest 5% outliers are still not specific enough to denote which ones are malicious. This method would just produce a large number of false positives, which we don’t want.
To solve this conundrum, we go one level deeper, Inception-style, and fish out the data points that are N standard deviations away from the mean among the farthest 5% outliers. In this experiment, we tested different values and settled on N=3, which demonstrated high sensitivity and specificity (higher N values yield fewer anomalies, which reduces the number of false positives in a good model). We optimized for these two characteristics because we are extremely cautious about acceptable false positive rates.
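The two-stage logic can be sketched in plain Python for a single cluster. This is an illustration of the idea rather than the BigQueryML queries shown in the steps that follow: the function names are made up for the example, and the percentile is computed exactly here instead of with an approximate quantile function.

```python
import math
import statistics


def percentile_threshold(distances, pct=95):
    """Distance at or below which pct percent of the points fall."""
    ordered = sorted(distances)
    index = max(0, math.ceil(len(ordered) * pct / 100) - 1)
    return ordered[index]


def anomaly_threshold(distances, pct=95, n_sigma=3):
    """Mean + n_sigma * stddev, computed only over the farthest outliers."""
    cutoff = percentile_threshold(distances, pct)
    outliers = [d for d in distances if d > cutoff]
    if len(outliers) < 2:
        return None  # sample standard deviation needs at least two points
    return statistics.mean(outliers) + n_sigma * statistics.stdev(outliers)


# Toy distances from one cluster's centroid: 1.0, 2.0, ..., 100.0.
distances = [float(d) for d in range(1, 101)]
cutoff = percentile_threshold(distances)
threshold = anomaly_threshold(distances)
print(cutoff, threshold)
```

Both `pct` and `n_sigma` are left as parameters on purpose: they are the tuning knobs that control how many anomalies reach an analyst.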
Let’s run through the steps of creating the K-means clustering model and creating the anomaly detection logic.
Step 1: Training and Model creation
From BigQueryML, we can use the CREATE OR REPLACE MODEL statement to create a K-means clustering model and train it on our dataset from October 16 until October 20:
CREATE OR REPLACE MODEL eric300.proc_clusters4
OPTIONS (model_type='kmeans', num_clusters=4, standardize_features = TRUE) AS
SELECT DISTINCT hour, proc_encoded, username_encoded
FROM `cap8-big-query-testing.eric300.process_train_data`
Step 2: Detecting outliers in each cluster
Once the model is trained, we use the ML.PREDICT function to predict the cluster that each training data point belongs to, which also gives us the distance of each data point from its cluster centroid. We then calculate the 95th percentile distance threshold (using the APPROX_QUANTILES function) and keep the data points beyond it, which are the farthest 5%. Note that the query as written computes one threshold across all 4 clusters combined; computing it per cluster would require grouping by CENTROID_ID in the Threshold step. Here’s how that query looks:
WITH Distances AS (
  SELECT DISTINCT
    ML.CENTROID_ID, hour, proc_encoded, username_encoded,
    MIN(NEAREST_CENTROIDS_DISTANCE.DISTANCE) AS distance_from_closest_centroid
  FROM ML.PREDICT(MODEL eric300.proc_clusters4, (
    SELECT DISTINCT hour, proc_encoded, username_encoded
    FROM `cap8-big-query-testing.eric300.process_train_data`)) AS ML
  CROSS JOIN UNNEST(NEAREST_CENTROIDS_DISTANCE) AS NEAREST_CENTROIDS_DISTANCE
  GROUP BY ML.CENTROID_ID, hour, proc_encoded, username_encoded),
Threshold AS (
  SELECT ROUND(APPROX_QUANTILES(distance_from_closest_centroid, 10000)[OFFSET(9500)], 2) AS threshold
  FROM Distances),
TrainingOutliers AS (
  SELECT d.*
  FROM Distances d
  JOIN Threshold ON d.distance_from_closest_centroid > Threshold.threshold)
Step 3: Calculating our anomaly threshold
Now that we have the farthest 5% outliers in each cluster, let’s set the threshold for anomalies as 3 standard deviations away from the mean among the outliers. The query for setting the threshold looks like this:
MaxClusterDistance AS (
  SELECT
    CENTROID_ID,
    AVG(distance_from_closest_centroid) + 3*(STDDEV(distance_from_closest_centroid)) AS max_distance
  FROM TrainingOutliers
  GROUP BY CENTROID_ID )
We now have a fixed threshold for each cluster: any data point in the validation dataset (used in step 4) whose distance from its cluster’s centroid exceeds the threshold computed during training will be classified as anomalous.
Note: This method of calculating N=3 standard deviations away from the mean of the outliers reduces the number of anomalies the security analyst has to review. Specifically, the farthest 5% outliers across all clusters contained 385 data points, which was reduced to 109 anomalies after applying the 3 standard deviation threshold. That is about 72% fewer data points!
Step 4: Running the clustering model on test/validation dataset
Next, we repeat step 2 using validation data instead of the training dataset. This assigns every data point in the test dataset to one of the 4 clusters, and also gives us the distance of each data point from its cluster centroid. The query in step 5 refers to this result as TestingOutliers, with the distance stored in distance_from_closest_centroid1.
Step 5: Finding anomalies
Now that we have predicted the clusters for the validation dataset, we must compare each data point’s distance from its centroid to the anomaly threshold for that cluster. Every data point whose distance from its cluster centroid is greater than the corresponding threshold value is classified as an anomaly. The query looks like this:
KMeansAnomalyPred AS (
  SELECT
    Y.date, Y.hour, Y.path, Y.login_username, Y.proc_encoded, Y.username_encoded,
    Y.CENTROID_ID, Y.distance_from_closest_centroid1, X.max_distance
  FROM MaxClusterDistance X
  JOIN TestingOutliers Y ON Y.CENTROID_ID = X.CENTROID_ID
  WHERE Y.distance_from_closest_centroid1 > X.max_distance)
SELECT * FROM KMeansAnomalyPred
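The comparison in this final step can also be sketched outside SQL. The Python below is a simplified stand-in for the join above, assuming each scored validation point is a dictionary with `centroid_id` and `distance` keys; those key names, the function name, and the sample values are hypothetical.

```python
def find_anomalies(scored_points, max_distance_by_cluster):
    """Keep points whose distance exceeds their own cluster's threshold."""
    anomalies = []
    for point in scored_points:
        limit = max_distance_by_cluster.get(point["centroid_id"])
        # Clusters without a threshold are skipped, mirroring the
        # inner JOIN on CENTROID_ID in the query.
        if limit is not None and point["distance"] > limit:
            anomalies.append(point)
    return anomalies


# Hypothetical thresholds and validation points, for illustration only.
max_distance_by_cluster = {1: 2.5, 2: 4.0}
scored_points = [
    {"path": "/bin/sh", "centroid_id": 1, "distance": 3.1},
    {"path": "/sbin/dhclient", "centroid_id": 1, "distance": 0.4},
    {"path": "/usr/bin/pkill", "centroid_id": 2, "distance": 4.2},
    {"path": "/bin/ls", "centroid_id": 3, "distance": 9.9},
]
anomalies = find_anomalies(scored_points, max_distance_by_cluster)
print([p["path"] for p in anomalies])
```

Because each cluster carries its own threshold, a distance that is unremarkable in a spread-out cluster can still be flagged in a tight one.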
Examining Our Results
With these results in hand, let’s return to the 3 questions we asked ourselves to ensure we are creating a valuable model rather than a cool-but-useless model.
1) Could we start with the smallest possible dataset to train a machine learning model that successfully detects future anomalous activity?
Yes! We only selected the process data from Capsule8 Investigations, trained the readily available K-means clustering model on baseline normal data, and successfully detected malicious activity (validated by security researchers). Out of 13,149 total test data points, the BigQueryML K-means clustering model detected 109 total anomalies.
We were immediately able to detect specific anomalous spikes in our input features, like process activity, weird username activity, and time of day, which were later confirmed to be cryptomining activities. We also found that most of these anomalies (93.5%) occurred from 6pm to 11pm ET, which is outside of normal work hours. Process and temporal anomalies were exactly the types of signals we were seeking: outliers or anomalous spikes identified by modeling the true baseline behavior.
If data science happens in a forest without a visualization, does anyone hear it? To visualize our data, we can use Data Studio, the visualization tool that’s included with every BigQuery project. Using these visualizations, let’s explore the processes that were classified as anomalous by our model:
As the chart shows, there was quite a variety of Linux processes related to events that were classified as anomalous. Unsurprisingly, /bin/sh, the Linux system shell, was most common, since interactive shells on Kibana instances are not typical workflow components. Also of note is /usr/bin/pkill, which is used by some cryptominers to kill competitive cryptomining processes, but isn’t a common process on an otherwise quiet Kibana host.
Additionally, we can see the timeline of detected anomalies:
For those defending Linux infrastructure, these sorts of visualizations can help inform detection policies, as well as rules on what processes and scripts are allowed on your infrastructure (and when). For instance, adopting immutable infrastructure allows you to disable shell access by default, which would eliminate the most common anomaly found by our model.
2) What would be the overhead for this model?
There are three major overhead costs which we need to consider for any real world application of ML: anomaly-free baseline data to train the model, training time, and false positive rate.
Baseline training data: A common cause of concern for applied ML in infosec is the need to train the model in a controlled, anomaly-free environment to ensure only pure baseline behavior is modelled. This is not an inconsequential ask and is generally expensive and time-consuming. In our experiment, we started with the assumption that the training data represents true baseline. This assumption is supported by the fact that we didn’t receive any unexpected alerts from Capsule8, which was deployed in the same cluster.
Training time: As model complexity grows, the training time and infrastructure costs tend to skyrocket, which is understandably painful to organizations. For our experiment, we tackled this problem with the use of intelligent preprocessing steps to condense the data, and by using BigQueryML to deploy the model on a sanitized dataset. Our training time was around 3 minutes 28 seconds, which is very short compared to complex models that can take multiple hours.
False positive rate: To validate the results, we enlisted our security research team to label all the 109 supposedly malicious data points. Out of the 109 anomalies our model detected, 99 were labelled as true positives and 10 of them as false positives.
But to calculate false positive rate, we need “ground truth” — labels to validate the model result. To do so, we enlisted our own product, which tags all events related to the alerts as “suspicious incidents.” We are considering these “suspicious incidents” as true positives, while all events not tagged as incidents are considered true negatives. With this data in hand, let’s calculate the False Positive Rate.
False Positive Rate = False Positives / (False Positives + True Negatives)
= 10 / (10 + 11917)
This means our false positive rate is 0.08%. This is a promising result, given that 98% of security teams experience false positive rates greater than 1% when using EDR products (and 77% of teams experience EDR false positive rates above 25%!).
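For anyone who wants to reproduce the arithmetic, here is the same calculation as a short Python sketch. The counts come from the figures above; the function name is just a label for this example.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """FPR = FP / (FP + TN)."""
    return false_positives / (false_positives + true_negatives)


# 10 false positives and 11,917 true negatives, as reported above.
fpr = false_positive_rate(10, 11917)
print(f"{fpr:.2%}")
```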
Unfortunately, calculating a false negative rate here is tricky, since we don’t know the exact number of true malicious data points unless all 13,149 data points are manually labeled (which is expensive and unnecessary for this experiment). The bottom line is that the model outputs a readable number of anomalies that doesn’t exacerbate alert fatigue, while at the same time offering a desirably minimal number (<0.1%) of false positives.
3) Could our model detect additional data points and anomalies that were not previously uncovered?
Yes, it could! All the alerts generated from deterministic detection denoted either a new malicious interactive shell or a file created by a malicious program. Our model not only found some of these malicious interactive shells, but also many attacker-controlled processes descending from the shells.
Part of Capsule8’s detection special sauce is that we automatically tag malicious descendants to “specific incidents” that are presented as part of the same alert, creating an attack narrative to aid incident response (which is pretty cool!). Therefore, it’s a great sign that our BigQueryML model can also detect these malicious descendants as anomalous. It validates the deterministic detection engine’s logic of tagging the suspicious events so they are readily available to those investigating an incident’s alert, without alerting on each event individually. Our ML model helps verify the suspicions, enhancing detection confidence.
We hope we’ve shown that the fate of machine learning applied to security doesn’t have to be garbage! In this post, we’ve seen how we can leverage Capsule8’s investigations data along with Google’s BigQueryML to build powerful anomaly detection models for monitoring your infrastructure, like Kubernetes clusters.
This approach does not require you to build a whole new data science pipeline, or a fancy data science team — BigQueryML combined with Capsule8’s Investigations capability takes care of it all. The same combination can also be used to build user behavior models using pre-built TensorFlow models.
Many people are understandably wary about the application of ML to infosec due to the need for specialized data science teams and the humongous number of false positives. But as we’ve shown, not only are the BigQueryML models less resource-intensive to create than traditional machine learning systems, our anomaly threshold logic for the model also offers a low false positive rate.
Thus, with a bit of quick studying and SQL query elbow-grease, you can gain the benefits of mathematical magic without needing to be a data science wizard, or burning yourself out with alert fatigue.
The Capsule8 Labs team conducts offensive and defensive research to understand the threat landscape for modern infrastructure and to continuously improve Capsule8’s attack coverage.
Harini Kannan is a data scientist at cybersecurity company Capsule8, where she applies her skills in statistics, visualization, and machine learning to a broad range of threat detection and computer security problems. She enjoys using Python, Jupyterlab, R, and TensorFlow in her daily work.