In case you hadn’t heard, Linux is a big deal. It’s used in the vast majority of production systems, the ones running the apps and services everyone uses. But, as said by the great infosec #thoughtleader and uncle to Spider-Man, “with great power comes great responsibility.” These systems need to be performant all the time; downtime isn’t considered cool these days.
Most organizations aren’t yet fully resilient, so a compromised system might even mean having to pull systems out and swap in new ones. And, there’s the annoying risk of bitcoin miners wanting your delicious computing power that vastly exceeds the systems in their basements.
You might think, “Well, we can just monitor Linux prod systems and detect the attackers!” and basically everyone wishes it were that easy. This is not the case, however (at least in this Earth’s timeline). The most common approaches to monitoring Linux add components that cause chaos — kind of like adding dinosaurs to a theme park on an otherwise wild and beautiful tropical island. It may sound cool, but it’s likely to end poorly for everyone, except the dinosaurs — or in this case, the solution vendors who will gladly chomp up your dollars.
I’ll go through some of the approaches to monitoring Linux for production systems in a way that is digestible for folks learning about Linux monitoring for the first time and hopefully still fun for Linux nerds, too. Disclaimer: a lot of things will be really simplified, since the Linux Kernel is kind of like a pocket full of spaghetti with an extra heaping of nuance sauce required.
I’ll discuss any reservations DevOps has about the different monitoring approaches (hint: there is scar tissue around this topic for them) and the general pros and cons of each method. We’ll go through a brief primer on the Linux kernel, then dive into AuditD, kernel modules, LD_PRELOAD, kprobes, ring buffers, perf, BPF/eBPF, and the role of ML/AI to explore the amazing world of Linux monitoring, and conclude with the kind of monitoring approach that will get the thumbs up from both DevOps and SecOps.
The Linux Kernel
The kernel is like a super detailed gatekeeper. Similar to a bank teller, any program has to ask the kernel (teller) to do something on its behalf. Really the only thing for which a program doesn’t need the kernel’s assistance is arithmetic or raw computation. Use of hardware, like a graphics card, or other resources, like /etc/shadow (which stores password hashes), must be facilitated by the kernel.
Syscalls represent the point at which you transition into the kernel. Think of these like the window at the bank under which you slide your request to the bank teller: you have to wait for a response to that request before anything happens. Syscalls range from requests to open a file to requests for more memory. Generally a more privileged entity, like the kernel, is the one that provides those things.
Some requests, however, will be a bit trickier, so the response won’t be immediate. Think of this like asking your bank teller, “Hey, can you give me a new bank account?” Your bank teller responds that they’ll have to collect a bunch of stuff to give to you. Likewise, for some requests, the kernel will tell the program to wait because it has to give it a bunch of stuff.
Thus, syscalls become an interesting point for inspection, because they are showing all the requests programs are making to the kernel. Basically every approach to Linux monitoring is either looking at syscalls or at other internal kernel functions for this reason — because you get better insight into the internal workings of the system.
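To make the bank-teller idea a bit more concrete, here is a minimal Python sketch. It assumes Python’s os module, whose open, read, and close functions are thin wrappers over the corresponding syscalls. The helper names (read_via_syscalls, try_read) and the demo file are made up for illustration.

```python
import os
import tempfile


def read_via_syscalls(path, max_bytes=4096):
    """Read a file using only low-level calls that map onto syscalls."""
    fd = os.open(path, os.O_RDONLY)    # request 1: "please open this file"
    try:
        return os.read(fd, max_bytes)  # request 2: "please hand me its bytes"
    finally:
        os.close(fd)                   # request 3: "I'm done with it"


def try_read(path):
    """Return the bytes, or None if the kernel refuses the request."""
    try:
        return read_via_syscalls(path)
    except OSError:
        # Covers a missing file, a permission refusal, and similar errors.
        return None


# Demo against a throwaway file, so nothing sensitive is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello kernel")
    demo_bytes = read_via_syscalls(demo_path)

# The teller decides: an unprivileged user normally gets refused here.
shadow_readable = try_read("/etc/shadow") is not None
```

Every one of those calls crosses into the kernel, which is why a monitor watching syscalls sees the open, the read, and the close no matter which language or library made them.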
AuditD is a subsystem of the Linux kernel designed for monitoring access on Linux, writing as much audit info as possible to disk. All you probably need to know is that a core maintainer for syscall infrastructure on Linux hates AuditD and recommends staying away from it. But, here’s a very short explanation on why it’s so hated by the Ops community anyway.
The root cause of the issue is that AuditD conforms to a government standard that all events have to be auditable. So, you can’t have the ring buffer (more on that below) dropping anything — think of this like you can’t have any events leaking out of the information pipeline. In practice, this means that the whole system blocks processing until it’s confirmed that an auditable event was consumed. In some modes, this results in instability.
Another concerning issue is that AuditD’s log output is notoriously difficult to use and work into other systems. Unless you’re optimizing for the worst performance possible, this isn’t how to go about security observation on Linux.
Although it’s not frequently used, LD_PRELOAD is a way to monitor Linux while staying in userspace. The loader manages libraries (think of things like pthread in Linux, which programs use to do stuff). This means with LD_PRELOAD, you can leverage any sort of library you want and make sure it is loaded before the main program starts running.
LD_PRELOAD says, “Before the program starts, load all my code in there as a library.” This approach provides a way of placing hooks inside the program without having a debugger attached, similar to the approach EMET uses on Windows. The thinking here is, “Okay, program, I know you go through these points to talk to the kernel, so I’ll set my hooks from the inside so I can see what you’re doing.”
As far as stability concerns go, if your hooking function isn’t well behaved, then you’ll crash the program. The program expects memory and registers to be returned in a certain way, so if your hook doesn’t meet those expectations, the program will panic. In practice, you normally know the calling convention of the functions you want to hook with LD_PRELOAD, so it’s usually a safe thing to do and pretty fast, making it less of a risk than something like AuditD.
Another downside of LD_PRELOAD is that the visibility you get isn’t global and is instead process by process, because you can only see the processes you’ve hooked. As you try to make it more global, it becomes more fragile. This is because variations in library versions result in different symbols being exported; that is, data or instructions end up at different places in memory from one version to the next. Finally, it doesn’t work with statically compiled binaries; static binaries don’t load other libraries, so you can’t preload your own library in them.
If you’re an attacker and clued into the fact that LD_PRELOAD is being used, you’ll be motivated to make the extra (but not difficult) effort to evade monitoring by performing syscalls directly. Even for programs linked to a particular library (like libc), it’s still possible to perform the syscall instruction (x86_64) or int 0x80 instruction (x86) directly. A preload-based monitor would completely miss events like those, and the last thing security people want is a big oof by missing easily-performed bypass maneuvers.
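Here is a rough Python sketch of why hooking named library functions leaves a gap. It is only an approximation: it goes through libc’s generic syscall() function via ctypes rather than emitting the instruction inline as a real attacker would, and it assumes 64-bit Linux on x86_64, where getpid is syscall number 39. The function name getpid_by_number is hypothetical.

```python
import ctypes
import os
import platform
import sys

SYS_GETPID_X86_64 = 39  # getpid's number in the x86_64 syscall table


def getpid_by_number():
    """Ask the kernel for our PID by syscall number, skipping getpid()."""
    on_linux_x86_64 = (
        sys.platform.startswith("linux")
        and platform.machine() == "x86_64"
        and ctypes.sizeof(ctypes.c_void_p) == 8
    )
    if not on_linux_x86_64:
        return None  # syscall numbers differ elsewhere, so do nothing
    libc = ctypes.CDLL(None)  # symbols already loaded into this process
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(SYS_GETPID_X86_64))


direct_pid = getpid_by_number()  # never touches the getpid() wrapper
wrapper_pid = os.getpid()        # the path a preload hook would watch
```

A hook placed on the getpid() wrapper would log the second call and never see the first, even though the kernel handled both requests identically.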
The original approach to monitoring activity on Linux was through kernel modules. With a kernel module, you add some code to the kernel that executes as part of the kernel and becomes as powerful as the rest of it, allowing you to do things like see all the syscalls coming in, plus any other capabilities you might want.
If this doesn’t sound safe, it’s because it isn’t! By adding code to the kernel, you’re adding attack surface that isn’t nearly as reviewed and tested as the base Linux kernel itself (which already has plenty of bugs on its own). Forcing custom C code into the kernel creates more code for attackers to exploit, although there are serious concerns beyond security for kernel modules as well. There are some workarounds, such as through Dynamic Kernel Module Support, but it requires recompiling — thereby requiring a compiler to be in production, which isn’t great ops hygiene.
If there were a Master Wizard of Linux gracing us with their sagely commandments, one would probably be “don’t break userland.” This means you shouldn’t change how all the programs on the system are written to talk to the kernel. While kernel maintainers adhere to this commandment to preserve outward-facing behavior of the kernel, the kernel itself thrives in a state of chaos all the time, so whatever kernel module you write risks not melding into the chaos and may instead cause a catastrophe. The kernel module is like a drama queen joining a carefully orchestrated fight scene: the likelihood that someone is going to get hurt shoots up.
Most Ops people understand the risks involved with using kernel modules, which engenders wariness. After all, the last thing you want is to crash the kernel on a production system — the likelihood of which can increase dramatically if a third-party kernel module is present. Ops teams also won’t be super jazzed at the fact that adding a kernel module often invalidates support contracts with their providers, like Red Hat. Think about it like jailbreaking your iPhone and getting Apple technical support.
Even worse, a kernel module means that every time a new kernel is deployed or updated, you have to recompile the module against that kernel. This adds manual labor and a layer of complexity that is a huge turn off for Ops teams. Ultimately, the risk tends to outweigh the reward with kernel modules.
But kernel modules aren’t the only way to collect data from the kernel. Linux provides tons of subsystems to help people collect data for performance and debugging. The Linux Kernel org specifically gave the community little mechanisms to use for visibility through the introduction of kprobes well over a decade ago. Basically, this was Linux saying, “Hey, we know you want to collect system data, so we’re going to make a standardized mechanism for you to do it.”
A kprobe is a means of saying, for a given function in the kernel, “I want to record when this function is called, and collect information such as the arguments it was passed.” The way it would be written on Linux using its special subsystem is like:
p:[name-for-probe-output] [function-you-want-to-probe] [what-will-it-collect]
For instance, it could be p:myexecprobe execve $ARG1, $ARG2. This would output the kprobe myexecprobe, which would include the programpath ($ARG1) and cmdline ($ARG2) when a user uses the execve syscall. You name your kprobe to tie any resulting output back to what you wanted to monitor. If you’re familiar with canarytokens, it’s similar to how you name a particular token, like MyPictures, so when you receive the alert, you know the source of the alert is your MyPictures token.
Most commonly, kprobes are placed on syscalls between specific processes (which run specific programs) and the kernel, as shown in the above diagram. There are also uprobes, which snoop on userspace functions instead of kernel functions, but they’re a topic for another time.
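To show how the pieces of a probe definition fit together, here is a small Python sketch that splits the simplified format from earlier into its three parts and tags a collected event with the probe’s name. This parses the simplified notation used in this post, not the kernel’s exact syntax; ProbeSpec, parse_probe, and label_event are hypothetical names, and the sample values are invented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProbeSpec:
    kind: str           # "p" marks a probe placed on a function
    name: str           # label attached to everything the probe outputs
    function: str       # kernel function being probed
    collect: List[str]  # what to record each time the function is called


def parse_probe(definition: str) -> ProbeSpec:
    """Split 'p:[name] [function] [what-to-collect]' into its parts."""
    head, _, rest = definition.strip().partition(" ")
    kind, colon, name = head.partition(":")
    function, _, args = rest.strip().partition(" ")
    if not colon or not name or not function:
        raise ValueError(f"not a probe definition: {definition!r}")
    collect = [arg.strip() for arg in args.split(",") if arg.strip()]
    return ProbeSpec(kind, name, function, collect)


def label_event(spec: ProbeSpec, values: List[str]) -> Dict[str, str]:
    """Tie collected values back to the probe that produced them."""
    event = {"probe": spec.name, "function": spec.function}
    event.update(zip(spec.collect, values))
    return event


spec = parse_probe("p:myexecprobe execve $ARG1, $ARG2")
event = label_event(spec, ["/bin/ls", "ls -la"])
```

The name is the useful part when many probes feed one stream: every event carries the label you chose, so you can tell at a glance which probe fired.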
The output from kprobes is a super useful data source, but this output needs to be collected and analyzed. Unfortunately, most people still use a kernel module to handle the output from kprobes. This is still adding risk, which DevOps doesn’t (and shouldn’t) want. But, let’s continue to explore how they actually can be used more responsibly.
Ring buffers are also known as circular buffers. If you’re familiar with accounting, the ring buffer follows a FIFO approach. Or, you can think of it like an empty pizza pan where slices are added one by one and the next slice eaten is always the oldest slice added to the pan (because who wants to eat cold pizza?). Given the finite space within the ring buffer, when new data needs to be added, the oldest data will always be overwritten.
Ring buffers are used rather than logging syscalls to disk because writing everything to disk would be expensive resource-wise and slow the system down. With the ring buffer, you ensure that the resource overhead is the pizza pan, rather than accumulating the equivalent of a ball pit full of pizza in your room.
I’ll be talking here about ring buffers for kprobes specifically. The first “block” (think: open space for that first slice of pizza on the pan) is the one that keeps track of the rest of the pan. This block will tell you where to begin reading and where to end reading based on what you’ve already read — that way you’re never missing anything.
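The pizza pan can be sketched in a few lines of Python. This is a toy, assuming a fixed number of slots and two counters playing the role of that bookkeeping block; it is not the kernel’s implementation, and the class and attribute names are made up.

```python
class RingBuffer:
    """Fixed-size FIFO that overwrites the oldest entry when full."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._head = 0    # how many items have ever been written
        self._tail = 0    # how many items have been read or overwritten
        self.dropped = 0  # how many unread items were overwritten

    def push(self, item):
        capacity = len(self._slots)
        if self._head - self._tail == capacity:
            # Pan is full: the oldest unread slice gets replaced.
            self._tail += 1
            self.dropped += 1
        self._slots[self._head % capacity] = item
        self._head += 1

    def pop(self):
        if self._tail == self._head:
            raise IndexError("ring buffer is empty")
        item = self._slots[self._tail % len(self._slots)]
        self._tail += 1
        return item

    def __len__(self):
        return self._head - self._tail


pan = RingBuffer(3)
for event in ["e1", "e2", "e3", "e4", "e5"]:
    pan.push(event)
oldest_remaining = pan.pop()
```

Memory use never grows past three slots here, and the dropped counter records what a slow reader missed. AuditD makes the opposite trade: rather than dropping anything, it makes the writer wait.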
So, what data is actually in these kprobe ring buffers? Let’s go back to our prior example of a kprobe: p:myexecprobe execve $ARG1, $ARG2. These last two parameters, $ARG1 and $ARG2, define what myexecprobe will write to the ring buffer. In the case of execve, it might be the programpath and cmdline values. This is particularly helpful if you’re only looking to monitor a specific sort of metadata for particular functions; that way you get exactly what you need (arguments, return values, etc.).
For monitoring, you need something fast so you catch any bad activity quickly. You want to copy things as little as possible for performance, and mmap, which is short for memory mapping, can help with that.
Think of using mmap like cutting out a middleperson. Let’s say there’s a factory that is bulk producing a variety of widgets that get packaged and shipped out. Rather than waiting to receive one of those packages, having to unpackage it, and then finally being able to see what the widget is, wouldn’t it save more time getting a window into the factory to see the widget? Mmap provides that sort of window, allowing you to specify the window of how many widgets you can see, saving you a lot of time.
To recap, kprobe ring buffers allow you to efficiently output the data you defined wanting from your kprobes, and mmap makes the process of accessing that data even more efficient.
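For a feel of what that window looks like in code, here is a small sketch using Python’s standard mmap module on an ordinary temp file. The file stands in for the factory; real kprobe data would come from the kernel’s buffer instead, and the fixed-size record layout and the helper name map_file_readonly are invented for the demo.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 11  # each fake record looks like b"event-0042;"


def map_file_readonly(path):
    """Map a whole (non-empty) file into memory without reading it up front."""
    with open(path, "rb") as handle:
        # Length 0 maps the entire file; the mapping outlives the handle.
        return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.bin")
    with open(path, "wb") as handle:
        handle.write(b"".join(b"event-%04d;" % i for i in range(1000)))

    mapped = map_file_readonly(path)
    window = memoryview(mapped)  # a view onto the mapping, not a copy
    start = 42 * RECORD_SIZE
    record_42 = bytes(window[start:start + RECORD_SIZE])  # copy 11 bytes only
    total_bytes = len(mapped)
    window.release()             # views must be released before closing
    mapped.close()
```

Only the one record you asked for gets copied; the other 999 stay where they are until something actually looks at them.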
About a decade ago, the Linux Kernel introduced a new mechanism to help extract data collected via sources like kprobes, called perf. As you might imagine, perf is short for performance, because its original use case was performance analysis on Linux. Perf collects the data from kprobes and puts it into the ring buffer. Let’s say Apache performs a syscall on which you have a kprobe; perf then says “Okay, collect this data based on the kprobe’s specifications, and we’ll write the data for you here in this buffer.”
The great news is that perf grants you access to kprobes, but in a much safer way than kernel modules. Using perf is extremely stable, in contrast to the instability and chaos of using kprobes from a kernel module. Simply put, by using perf, you’re a lot less likely to mess anything up in the kernel. Really the biggest downside of perf is that it doesn’t backport well to ancient kernels (think the Linux equivalent of Windows XP). For more on how perf can be used, see Brendan Gregg’s excellent write-up.
BPF & eBPF
BPF (aka Berkeley Packet Filter) programs are essentially itty bitty programs. They are written to take baby rules like source_ip = X.X.X.X and translate them into a tiny program. Because BPF is its own baby instruction set, it can only do simple things, like basic arithmetic. It can jump forwards, but not backwards, so no loops for these little ones. All access to data is validated to be safe by the kernel. These restrictions timebox BPF programs and ensure access to all resources is provably safe, meaning you can run custom BPF programs in the kernel without risking stability.
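The “forward jumps only” rule is easy to demonstrate with a toy. The sketch below is not real BPF: it invents a three-instruction set (load, jeq, ret) and a verify function purely to show why banning backward jumps guarantees a program finishes. The IP addresses are documentation placeholders.

```python
def verify(program):
    """Reject any program that could loop or run off the end."""
    size = len(program)
    if size == 0 or not program[-1] or program[-1][0] != "ret":
        raise ValueError("program must end with ret")
    for pc, ins in enumerate(program):
        op = ins[0] if ins else None
        if op in ("load", "ret"):
            if len(ins) != 2:
                raise ValueError(f"malformed {op} at {pc}")
        elif op == "jeq":
            if len(ins) != 4:
                raise ValueError(f"malformed jeq at {pc}")
            for skip in ins[2:]:
                if skip < 0:
                    raise ValueError(f"backward jump at {pc}")
                if pc + 1 + skip >= size:
                    raise ValueError(f"jump past the end at {pc}")
        else:
            raise ValueError(f"unknown instruction at {pc}")


def run(program, packet):
    """Execute a verified program; pc only ever increases, so it halts."""
    verify(program)
    acc, pc = None, 0
    while True:
        ins = program[pc]
        if ins[0] == "load":   # ("load", field): read a packet field
            acc = packet.get(ins[1])
            pc += 1
        elif ins[0] == "jeq":  # ("jeq", value, skip_if_equal, skip_if_not)
            pc += 1 + (ins[2] if acc == ins[1] else ins[3])
        else:                  # ("ret", verdict)
            return ins[1]


# The baby rule "source_ip = 192.0.2.1" as a tiny program.
match_source = [
    ("load", "source_ip"),
    ("jeq", "192.0.2.1", 0, 1),
    ("ret", True),
    ("ret", False),
]
accepted = run(match_source, {"source_ip": "192.0.2.1"})
rejected = run(match_source, {"source_ip": "198.51.100.7"})
```

Because every step moves the program counter forward and the last instruction must be a ret, a program of four instructions runs at most four steps. A program with a negative skip never gets to run at all; verify throws it out first.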
eBPF is extended BPF. eBPF is a means of loading tiny programs into the kernel to execute during certain events. For instance, you could set little programs to evaluate data you collected through your kprobe.
By far the biggest downside of BPF and eBPF is backwards compatibility. The ability to attach eBPF to kprobes is only in kernels from the past five years, and, as most Ops people know, a good chunk of enterprises have kernels way older than that. Therefore, while it might be the method of the future, it lacks support for the reality of most organizations’ environments today.
For more about how all the aforementioned mechanisms work together, I highly recommend Julia Evans’ blog and infographic on Linux tracing.
ML, AI, & Unicorn Dust
In addition to using the drama queen kernel modules, some solutions will go really heavy-handed with machine learning and artificial intelligence in order to improve their monitoring and detection. Regrettably, despite the cool math, there’s a significant training period upfront and a lot of tuning required.
DevOps doesn’t want a heavy footprint, however — not on the network, and certainly not on the CPU. But for most of these AI and ML-driven solutions, the architecture generally doesn’t scale well. These solutions collect data points all the time and send them over the network, which makes the network really sad and tired. It’d be like UPS having to deliver all Christmas presents on one day — the streets would be clogged, front steps and hallways would be packed, and drivers would be absolutely miserable.
Looking for anomalous behavior across the entire syscall boundary can give a sense of whether an app is acting weird, but not really if it’s acting weird for a security reason. This means a lot of false positives, which is obviously a big thumbs down for SecOps.
Batch analysis is generally required for solutions heavy in machine learning or artificial intelligence, which also means real-time monitoring isn’t possible. Real-time, despite being an overused buzzword, actually does matter to SecOps, because it’s better to catch exploitation as it’s happening than to read your alerts only to find a compromise that has already occurred.
Batch analysis isn’t real time because the mountain of data must be collected, sent out over the network, and analyzed for policy violations or anomalies before an alert is finally generated, minutes or even hours later. If you don’t put a lot of money into a machine collecting all the data in one place and performing analysis, it could take even longer. This kind of computing is super expensive, and businesses tend not to like expensive things, particularly for areas like security that are already perceived as “cost centers.”
Another issue for machine learning-led approaches is that the algorithms have to learn from training data. The problem is that in order for the algorithm to catch an attack, it needs to have seen it before to identify it — which is not a solid bet when relying on historical data for training. Machine learning can be useful at catching some basic stuff at scale, but it will always be linked to the past, and worse at catching novel or sneakier exploitation.
The Capsule8 Way
This is where I show us off a bit — you are more than welcome to skip to the conclusion, but you’ll miss out on some pretty awesome stuff! We’ve put a lot of thought into our product by thinking about what would make us most mad as hackers if we encountered it while attacking an organization.
One difference between Capsule8 and other Linux detection solutions is that our detection happens locally. It’s far less expensive for everyone to do computations locally vs. sending data off the machine via the network, and we like avoiding unnecessary pain. We also make Capsule8’s resource cost tunable, so you can set how much CPU time goes towards Capsule8 doing its thing.
How you configure Capsule8 depends on your risk tolerance and your preference. This is how we’ve gotten the thumbs up from Ops, even at large enterprises. But what’s extra cool about Capsule8 is that we use kprobes + perf in an intelligent way, driven by our extensive experience breaking into systems.
Think of it like this: if you want to know whether someone broke into your apartment, there are a few strategies you could create. The most obvious ones would be catching the window breaking, the doors being bashed in, or the safe with the jewels opening. This is where our exploitation expertise comes in handy (more on that below) because we look at what attackers must do to complete an attack.
We maintain multiple vantage points, with an eye to the attacker’s “kill chain” — like looking at the front door, window, stairs, and closet door to the safe. You could also create a policy like: normally you hear the front door open before the window opens, so if you hear the window open before the front door, that’s weird. Think of this as the “check the context around the window opening” strategy.
Most ML, AI, and other mathematically-magical solutions are more like: “Let’s spend some time counting all the times the doors opened, and then send alerts on anything that deviates from this baseline we established.” Some go to an extreme and perform analysis that sounds cool, akin to asking, “Are the bricks moving? We should analyze the atomic states of the bricks in the building, maybe we should look at the chemical composition of the glass, too.”
For too many of these ML and AI-based solutions, they not only ignore how attackers actually behave, but also end up wasting a lot of network traffic and compute time. No one wants their time wasted, and the organization certainly doesn’t want its resources wasted.
At Capsule8, we also want to make attackers sad — hopefully crushing their souls. Ideally, you want to pinpoint exactly which points are interesting to attackers in the kernel and watch those closely. This is where a deep bench of exploitation experience (like our Capsule8 Labs team) comes in handy. That way, you know which places in the kernel are most loved by attackers, and can collect data that matters rather than collecting all the things.
For example, attackers love popping shells. But developers and programs love using bash or bin shells, too — it’s just a handy tool that everything uses. Most solutions will create a rule like, “show me anytime a shell runs,” which is going to be all the time.
This sort of rule becomes functionally useless unless you absolutely love sifting through the same alert over and over. Capsule8 instead can determine the difference between a user or a program using it, and answer questions like, “Is this shell executing to routinely process a script, or is it interacting with a live user?”
Capsule8’s approach means we can detect the entire spectrum of unwanted behavior on Linux, from devs risking production stability to sexy kernel exploitation by motivated attackers. This includes real zero-days — not zero-days as in “a slightly modified malware strain,” but the top-shelf expensive stuff that is exploiting the system in a way you can’t patch.
If you want a cool recent example, check out our Part 1 and Part 2 of how we exploited systemd (the first public exploit, I might add). The best part is, our stack pivot detection catches it — and that’s just one of the many detections we have in place. So it’s not about finding CVEs or anomalous behavior, it’s about finding the exact techniques attackers must use to compromise the system.
IT Ops, DevOps, InfraOps, and other *Ops people hate kernel modules because they:
- Create instability
- Invalidate support contracts
- Require recompiling the module for every kernel update
- For real, they require compilers on prod instances 😱
- Make you wait to rebuild the kernel if there’s a critical vuln
- Totally break the build chain people use
- (but at least it isn’t AuditD?)
A kprobe + perf approach is the safer way to perform Linux monitoring, even for ancient systems that are probably covered in moss like Ancient Guardians in Zelda at this point, but still happily running along despite their age. Keep an eye on BPF and eBPF, but keep in mind that until the vast majority of enterprises move beyond those ancient kernels, they aren’t ready for widespread usage.
Ultimately, if you don’t want to make DevOps miserable, you need Linux monitoring that:
- Can’t crash the kernel
- Won’t flood the network
- Won’t require extra labor by Ops
- Fits into the existing build chain
- No, seriously, it really can’t disrupt prod
If you’re SecOps or DevSecOps, you’ll probably be happy if:
- Real attacks are detected
- Even better, real attacks are prevented
- You aren’t spammed with alerts
- Your time isn’t wasted with false positives
- You aren’t drowning in a data swamp
- You can control analysis costs
- Ops isn’t telling you your monitoring idea is bad, and that you should feel bad
Obviously, this is Capsule8’s mission — for security and operations to live in harmony, while making attackers mad that they can’t compromise your production Linux systems.
Kelly Shortridge is currently VP of Product Strategy at Capsule8. In her spare time, she researches applications of behavioral economics to information security, on which she’s spoken at conferences internationally, including Black Hat, AusCERT, Hacktivity, Troopers, and ZeroNights.
Most recently, Kelly was the Product Manager for Analytics at SecurityScorecard. Previously, Kelly was the Product Manager for cross-platform detection capabilities at BAE Systems Applied Intelligence as well as co-founder and COO of IperLane, which was acquired. Prior to IperLane, Kelly was an investment banking analyst at Teneo Capital covering the data security and analytics sectors.