Major Key Alert: Data Discovery for Red Teams with an ML Tool for Keylogging

With the glut of security vendors promising to secure you to the moon and back aboard the star-glazed spaceship of Machine Learning (ML) technology, where is the equivalent for red teams? Imagine a scene: an earnest red teamer hunched at their desk, hand under chin, eyes hazy with fatigue as their finger presses the down arrow for what seems like eternity, sifting through irrelevant search history, Word document creation, and Slack messages between colleagues who also enjoy dogs and house-hunting and other trivial things, all to find the treasure they so tenaciously hunt — the user’s credentials.

Keyloggers are an essential tool for red teams, covertly capturing users’ keystrokes to discover secrets that help further the goal of pwnage — with the keyboard output created by credential entry being one of the most valuable plums to pluck from the system. Gaining credentials through keylogging can assist with lateral movement or accessing valuable data — so it can be worthwhile for red teams to sit and wait for hours or days for a user to input their password within a veritable ocean of unimportant keystrokes.

For those of you who are ATT&CK Framework fanatics, keylogging corresponds to the “Input Capture” technique, part of the Collection and Credential Access tactics. As you can see under MITRE’s “Examples” section, basically every notable malware / RAT family (like DarkHotel, Duqu, and PoisonIvy) uses a keylogger, as do notorious APTs like APT3 (China), APT28 (Russia), and the Lazarus Group (North Korea).

Capsule8 is built by former black hats to help teams operating Linux production systems ruin other black hats’ fun, so we can empathize with the pain red teams face when having to sift through a mountain of junk data just to find the one credential they need. Leveraging the brilliance of our lead data scientist, Harini Kannan, we’ll demonstrate how to use topic modeling (a natural language processing (NLP) technique) to help downtrodden red teamers more quickly complete their quest of grabbing credentials out of their keystroke or system data dumps.

Topic Modeling 101

First, what is topic modeling? Topic modeling is a text-mining tool that uses the magic of statistics to uncover abstract concepts that appear in a set of textual data sources (usually documents). The model ingests text data, automatically tries to learn the context (called “topics”), defines the topics, and then splits the dataset into subsets of words based on the identified topics. You can think of a topic as just a repeating pattern of words in a corpus (a collection of texts), and each document in the corpus can be described by a mixture of those topics. As a result, topic modeling is a great way to organize, search, and filter text data.

While there are a number of topic models out there, we used latent Dirichlet allocation (LDA) for this project. LDA is most commonly used for things like clustering similar news articles together or analyzing Shakespeare for themes, but it can be used for any sort of textual data. Although Google doesn’t state it explicitly, it very likely uses LDA (or a similar model) to enhance search functionality through what it calls the “Topic Layer.” The goal of LDA is to map documents to topics and topics to words, helping us more quickly understand what is present in the source material.

Efficient keylogging analysis may not be the first use case that comes to mind, but the ability to map disparate log files to topics and topics to words — like passwords — is a decidedly useful endeavor towards that devious ambition.

Setting up the Model

First, we install gensim, a Python library for topic modeling, and nltk, which helps us perform NLP with Python. We also need to install pyLDAvis, a Python library for interactive topic model visualization, and import all other necessary libraries (like pandas and matplotlib). Once all the dependencies are installed, we upload our dataset (if we aren’t running the model locally). 
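As a rough sketch of that setup (assuming a Colab or Jupyter notebook, and the newer pyLDAvis.gensim_models module; older releases expose pyLDAvis.gensim instead), it boils down to a handful of installs and imports. The wordcloud package is our own addition for the topic word clouds shown later:

```python
# Install the dependencies (run once, e.g. in a notebook cell):
#   pip install gensim nltk pyLDAvis pandas matplotlib wordcloud

import gensim
import gensim.corpora as corpora
from gensim.models import LdaModel, CoherenceModel, Phrases
from gensim.utils import simple_preprocess

import nltk
from nltk.corpus import stopwords

import pandas as pd
import matplotlib.pyplot as plt

import pyLDAvis
import pyLDAvis.gensim_models  # use pyLDAvis.gensim on older versions

nltk.download("stopwords")  # fetch NLTK's English stop word list
```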

Cleaning up the dataset

We need to clean up the dataset to transform it from a messy trash pile into a streamlined trash pile — the data pre-processing step. To begin pre-processing our data, we remove stop words, which are extremely common words with little value — think “and”, “in”, or “the” in English. Since system logs don’t really follow English grammar rules (Stannis would be so upset), we added stop words specific to systems, too. For example, stop words for TCP data would include “com”, “request”, and “response”.
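A minimal sketch of that step, assuming the raw log lines live in a list called raw_logs (the sample lines and the extra stop words here are purely illustrative):

```python
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

# English stop words plus system-specific noise words (illustrative additions).
stop_words = set(stopwords.words("english"))
stop_words.update(["com", "request", "response"])

raw_logs = [
    "GET /index.html HTTP/1.1 response 200 example.com",
    "the user ran aws configure and the request succeeded",
]

# Tokenize each log line, lowercase it, and drop the stop words.
tokenized = [
    [tok for tok in simple_preprocess(line) if tok not in stop_words]
    for line in raw_logs
]
```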

Our pre-processing also includes the creation of bigrams and trigrams, as well as data lemmatization. Text that is read two words at a time is a “bigram”; text read three words at a time is a “trigram.” A bigram of the sentence “I pet a fluffy cat” would be: “I pet”, “pet a”, “a fluffy”, “fluffy cat”. A trigram of the same sentence would be: “I pet a”, “pet a fluffy”, “a fluffy cat”. Bigrams and trigrams facilitate our model’s understanding of sentences (or strings), and help it figure out the context around words. Using this understanding, the model can begin predicting subsequent words — think of how Google autocompletes queries for you based on what you’ve typed into the search bar thus far.
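Here is one way to build those n-grams with gensim's Phrases model, continuing from the tokenized list above (the min_count and threshold values are just illustrative knobs):

```python
from gensim.models import Phrases
from gensim.models.phrases import Phraser

# Learn which word pairs and triples co-occur often enough to be treated as one token.
bigram = Phrases(tokenized, min_count=5, threshold=10)
trigram = Phrases(bigram[tokenized], threshold=10)

bigram_mod = Phraser(bigram)
trigram_mod = Phraser(trigram)

# e.g. ["aws", "configure"] becomes ["aws_configure"] when that pair is frequent.
docs_grams = [trigram_mod[bigram_mod[doc]] for doc in tokenized]
```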

Lemmatization seeks to find the base of words (called the “lemma”) and removes inflectional endings. For instance, lemmatization removes the “ing” from “walking,” outputting the lemma of “walk.” Importantly, lemmatization can understand context, understanding that “hearing” can be either a verb or a noun, and only outputting the lemma “hear” when it is a verb. Lemmatizing data is an important step in pre-processing so that all forms of a word are treated the same.
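A small sketch with NLTK's WordNetLemmatizer; note that it needs a part-of-speech hint to disambiguate cases like “hearing,” so a fuller pipeline would POS-tag first (spaCy is a common alternative that handles this for you):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # lexical database the lemmatizer relies on

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("walking", pos="v"))  # -> walk
print(lemmatizer.lemmatize("hearing", pos="v"))  # -> hear
print(lemmatizer.lemmatize("hearing", pos="n"))  # -> hearing (noun form is left alone)

# Lemmatize every token in the n-grammed corpus, treating tokens as verbs by default.
docs_lemmatized = [[lemmatizer.lemmatize(tok, pos="v") for tok in doc]
                   for doc in docs_grams]
```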

Training the model

To train the model, we first need to compute coherence scores on this pre-processed data. A coherence score measures how semantically close the top words within a topic are to one another; think of it as a measure of word similarity. We train several candidate models, each with a different number of topics, and compute a coherence score for each one. We then select the model with the highest coherence score, because higher coherence implies more distinct topics, and distinct topics expedite data spelunking. This is how we arrive at the optimal number of topics, ensuring that each topic is sufficiently distinct but still relevant.
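A minimal sketch of that search, assuming the docs_lemmatized corpus from the pre-processing steps above (the range of candidate topic counts is arbitrary):

```python
import gensim.corpora as corpora
from gensim.models import LdaModel, CoherenceModel

# Map each token to an integer id, then build the bag-of-words corpus.
id2word = corpora.Dictionary(docs_lemmatized)
corpus = [id2word.doc2bow(doc) for doc in docs_lemmatized]

best_model, best_score = None, -1.0
for k in range(2, 11):  # candidate numbers of topics
    model = LdaModel(corpus=corpus, id2word=id2word, num_topics=k,
                     random_state=42, passes=10)
    score = CoherenceModel(model=model, texts=docs_lemmatized,
                           dictionary=id2word, coherence="c_v").get_coherence()
    print(f"{k} topics -> coherence {score:.3f}")
    if score > best_score:
        best_model, best_score = model, score
```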

Topic learning by the model

Once we’ve chosen our optimal model, we can extract the dominant topics. For instance, dominant topics from an article about Linux might include distributions, system calls, or vulnerabilities. Each of these topics would have distinct keywords, such as “Debian,” “Ubuntu,” or “Red Hat” for the distributions topic. These keywords can be visualized through a word cloud, with the size of the keyword reflecting its weight within the topic. We’ll show you examples of these topic word clouds in our case studies below.
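As a sketch of how those clouds can be generated, assuming the best_model chosen above and the third-party wordcloud package:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# show_topic() returns the top (word, weight) pairs for a given topic id.
for topic_id in range(best_model.num_topics):
    weights = dict(best_model.show_topic(topic_id, topn=30))
    cloud = WordCloud(background_color="white").generate_from_frequencies(weights)
    plt.figure()
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Topic {topic_id}")
plt.show()
```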

With these defined topics and keywords, we can now plainly see where the appropriate breadcrumb trail begins and start intelligently searching within a particular topic. It’s like if Gretel and Hansel could apply a topic model to the magical forest and directly explore Topic 0: Candy.

Now that we understand topic models and how to set them up, let’s walk through the three use cases we created at Capsule8 — finding AWS keys in keylogging data, passwords in TCP pcaps, and passwords in HDD logs. If you are impatient, intrigued, overconfident, or all of the above, you can dive right into our model in the Google Colab doc.

Case Study 1: AWS Keys in Keylogger Data

This is where the fun begins! We knew our problem — how to pick out credentials from keylogging data — and our statistical model (LDA), but we didn’t have any keylogging data with which to play. The Capsule8 Labs team offered up their machines to the Elder Research Gods in tribute, so we had multiple users run execsnoop for approximately one day to generate data. In the process, one or more of our colleagues may have “accidentally” spilled their credentials while doing their work on those machines.

We could’ve cheated and used their daily work profile to help us with the analysis, but we wanted to mimic real attackers who are attempting to analyze keylogging data of victims who aren’t sitting next to them in the open office every day. Again, the goal is to figure out if any sensitive information leaked, and if so, to pinpoint those data points from the execsnoop log. Importantly, we must do so without manually combing through the data set — the (super lame) status quo.

Before even touching simple search tools like grep, we want to filter and partition the dataset to gain a sense of where to start within this sea of data. Luckily, the topic model dramatically reduces the time it takes to fish out our target data. Our optimal model had five topics, which is digestible for most humans:

Let’s explore these topics with their respective top keywords (and the weights of those keywords) to see how we can assign categories for each topic.

Topic 0: Folders and source control

Topic 1: Docker-related activity

Topic 2: “Cloud Stuff” (it’s a technical term, promise)

Topic 3: Capsule8 build & GO-related

Topic 4: C-related

Based on the optimal topics identified by the model, we can now determine which topic holds the most potential as a grove of succulent leaked credentials ripe for plucking. Topic 2, aka “Cloud Stuff,” logically seems like it would contain login and configuration commands for various cloud services. Let’s take a look at the list of top 30 words within the topic below to see if any interesting keywords jump out:

(note: the graphic above says “Topic 3” because its numbering begins at 1, rather than 0 as the model uses)

As you can see, the intriguing keyword “aws” is second on this list, which gives us our first breadcrumb. With any luck, these users performed actions like configuring AWS credentials, making the keyword a solid starting point for hunting leaked credentials — particularly if users didn’t follow best practices during the configuration steps, which is regrettably common. 

Within Topic 2, the “aws” keyword has three data points we can investigate, which means the model identified “aws” as the primary keyword in three instances. It’s important to focus just on the relevant keywords for a specific topic, as there can be instances of the same keyword within other topics that are irrelevant (and are deemed as such by the model). For instance, there are 773 data points for “aws” within Topic 3, but only because “aws” appears in a URL path related to build commands, which isn’t relevant for our purposes.
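We won't reproduce the notebook's exact bookkeeping here, but one way to approximate this “data points per keyword” view is to tag every log line with its dominant topic and the token that carries the most weight in that topic, then filter (the names df and raw_logs carry over from the earlier sketches):

```python
import pandas as pd

def primary_keyword(tokens, topic_id, model):
    """Return the token in this document with the highest weight in its dominant topic."""
    weights = dict(model.show_topic(topic_id, topn=1000))
    return max(tokens, key=lambda tok: weights.get(tok, 0.0)) if tokens else None

rows = []
for i, bow in enumerate(corpus):
    # Dominant topic = the topic with the highest probability for this document.
    topic_id, _ = max(best_model.get_document_topics(bow, minimum_probability=0.0),
                      key=lambda t: t[1])
    rows.append({"doc_id": i,
                 "dominant_topic": topic_id,
                 "primary_keyword": primary_keyword(docs_lemmatized[i], topic_id, best_model),
                 "text": raw_logs[i]})

df = pd.DataFrame(rows)

# The handful of rows where "aws" is the primary keyword within Topic 2.
print(df[(df.dominant_topic == 2) & (df.primary_keyword == "aws")])
```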

Let’s take a look at these three data points for “aws” in Topic 2 to see if there’s anything that could help us:

Fantastically for us, the user made a heckin’ DevOops by using the wrong command, which leaked their AWS key. Now that we have their AWS access key, we could perform any actions we want in their AWS account (that are allowed by policy). We could use the key to launch EC2 instances for cryptomining or to access and exfiltrate sensitive customer data stored in S3 buckets — both of which represent a pretty big win for attackers. 

By using the power of topic models, we were able to quickly narrow down places to hunt based on relevant instances of the “aws” keyword. In fact, we didn’t even know we wanted the “aws” keyword (instead of a keyword like “password”) before we saw the topic the model generated — so it granted us a path previously unknown to us. This process helped us get to our goal in less than 30 minutes, rather than conducting an exhaustive manual search across the thousands of “aws” mentions within the entire dataset.

Case Study 2: Passwords in Captured TCP Traffic

Passwords are even harder for most red teams to discover as they sift through keylogs or syslogs, but using an LDA model helps with password discovery, too. Passwords could end up in a number of places or utilities, so they may be spread across multiple topics. For instance, on Linux, password input could occur frequently when using MySQL (mysql -u root --password=<password>), though passwords could also be pasted into bash by accident, too.

In the dataset from our first case study, any password mistakenly pasted into bash would be present in Topic 0 (Folders & Source control). If we wanted the MySQL password, we would want to find the mysql keyword somewhere in one of the topics (which we didn’t have in our dataset). Thus, any red team’s starting point should be, “What topics are relevant for password leakage?” This helps you drill down into the right data, whether you’re seeking AWS keys or MySQL passwords.

To show the potency of the topic modeling approach, we will re-apply the LDA model to TCP dumps. We are operating as if we do not already know what the log contains, but have the mission to discover any passwords or credentials that could abet our malicious machinations. To test this out, we downloaded pcaps from the Mid-Atlantic CCDC, a competition testing students’ offense and defense skills. To our knowledge, none of the passwords or other information present in the data is sensitive.
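The model needs plain text rather than raw packets, and we won't pretend this is exactly how the pcaps were flattened for the notebook, but one straightforward approach is to pull printable TCP payloads out with scapy and feed those strings into the same pre-processing pipeline (the filenames below are illustrative):

```python
from scapy.all import rdpcap, TCP, Raw

def tcp_payload_strings(pcap_path):
    """Extract printable TCP payloads from a pcap so they can feed the topic model."""
    lines = []
    for pkt in rdpcap(pcap_path):
        if pkt.haslayer(TCP) and pkt.haslayer(Raw):
            text = pkt[Raw].load.decode("utf-8", errors="ignore")
            if text.strip():
                lines.append(text)
    return lines

raw_logs = []
for path in ["ccdc_capture_01.pcap", "ccdc_capture_02.pcap"]:  # illustrative filenames
    raw_logs.extend(tcp_payload_strings(path))
```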

Let’s yeet ourselves right into the topic word clouds. The model found the optimal number of topics to be ten:

Topic 0 seems to be the most relevant to us, containing content related to login and authentication:

We can see the keyword “password” present in the top 30 list for Topic 0, so let’s dig deeper to see if we can find any passwords. As before, this is as simple as looking for a string containing pass within the topic:
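In code, reusing the per-document DataFrame pattern sketched in the first case study (rebuilt here on the pcap-derived corpus), that search amounts to a one-line filter:

```python
# Restrict the search to documents whose dominant topic is Topic 0,
# then look for any payload containing the substring "pass".
topic0 = df[df.dominant_topic == 0]
hits = topic0[topic0.text.str.contains("pass", case=False, na=False)]
print(hits[["doc_id", "text"]])
```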

We found them! They’re pretty simple passwords, probably because it’s a student competition. However, whittling down 17 separate pcap files into topics and then searching for the pass string within one topic saves a lot of manual effort compared to searching through each file and hitting matches for pass that may have the wrong context. Really, what is a valuable tool but something that facilitates laziness without reducing success?

Case Study 3: Passwords in a Windows HDD

We at Capsule8 enjoy a proper challenge, so we wanted to make the password discovery problem even harder for ourselves. For this final case study, let’s assume we gain direct access to our target’s hard drive, running Windows. We find a subdirectory that contains log files (an admittedly small sample of the entire dataset). Our goal now is to find passwords within it. As discussed in the two previous case studies, we’ll start by building an LDA model that learns the optimal number of topics, then analyze the topic clusters and see if any topics discernibly divulge the presence of passwords.

The model found five topics to be the optimal number. Because we used a real data set (i.e. a real, wild-caught HDD), we will only show the two topics relevant to our interests, with some sensitive keywords obfuscated:

The word cloud in Topic 3 exposes the keyword “password,” so we know the best starting point for our search. In fact, the model uncovered 162 passwords within Topic 3, which should make any attacker salivate: 

“Excelsior!” we cry out in furious joy, for we now hold the golden keys to paradise, where attackers can drape themselves on velvet chaises while languidly munching plump grapes by making the machines do all the thinky-thinky. But our brow furrows, imposter syndrome descending like a dark shroud, and the spark of ambition flares anew, curiosity pricking our neurons and leading to the harrowing question, “But what if there’s more?”

Looking back at the word clouds above, Topic 0 also looks promising, with beguiling keywords such as “verification”, “verify”, “account”, and “code”. Our initial instinct was that this might contain 2FA codes, and the hunch was correct:

Note: we are only displaying a snippet of these 2FA codes for a popular chat app; our model uncovered 3,655 codes in Topic 0.

HDDs contain a ton of data — far more than is manageable for an analyst to comb through in a day. Using topic models, you can extract context from a huge volume of disparate data and divide it into manageable subsets for further analysis. In this last case study, these subsets included password information as well as verification codes — and this represents only a tiny fragment of the data on the HDD.

Conclusion

Using a bit of data magic, attackers can sift through large quantities of data to determine specific topics, identify promising keywords in those topics, and then analyze those keywords’ data points to see if juicy treasure lies within. Topic models help point you in the right direction to hunt for leaked credentials. And, the best part is that topic models only get better as the dataset increases in size, which makes it a lot easier (and hopefully more fun!) for attackers to find the data they need to advance their operations.

It would be natural and reasonable to think that perhaps we’ve overengineered this problem and that we are using fancy math for fancy math’s sake. Well, yes, but also no. Grepping is undeniably efficient if you already know what you seek and where to look for it. A key advantage topic models have over just grepping is that they surface topics you didn’t even know existed, which means you would never have known to search for them without the model’s guidance.

Model training took just under 30 minutes for the first case study and only 15 minutes for the last one, but it could take more time with a larger dataset (like many days’ worth of system activity across many users). However, that represents a significant improvement over the amount of time it currently takes attackers to sift through keylogging data for the data they need. Let the model do the hard work of sifting through the data mountains to find the gilded path that leads to the cave of wonders within.

To explore our topic modeling tool yourself, check out our public Google Colab doc.

The Capsule8 Labs team conducts offense and defense research to understand the threat landscape for modern infrastructure and to continuously improve Capsule8’s attack coverage. 

Harini Kannan (Data Scientist)

Harini Kannan is a data scientist at cybersecurity company Capsule8, where she applies her skills in statistics, visualization, and machine learning to a broad range of threat detection and computer security problems. She enjoys using Python, Jupyterlab, R, and TensorFlow in her daily work.