
Zero-Day Attack Detection: Focus on the Catch, not the Patch


When high-profile zero-day vulnerabilities hit the headlines, security professionals around the world scramble to patch and remediate the damage. Zero-days such as ImageTragick, Shellshock, and most recently Meltdown and Spectre showed how even complex, modern infrastructures are susceptible to highly impactful security issues. Meltdown and Spectre, in particular, also signaled a shift in focus toward security issues at deeper levels of the computing ecosystem.

The Problems with Software Patches

As with most things, the allure of quick fixes did not live up to reality. The patches and remediation strategies, while important, are neither quick nor a fix, and in some cases add to the pile of problems instead of helping solve them. While it is unambiguously better to be invulnerable to these issues, their full impact can neither be fully seen nor patched away. Software patches are slow-to-install roadblocks in the arms race between attacker and defender. While you're trying to patch, and often even after you have, attackers will get around it. And you need to know the second they do.

Deciding Between Zero-Day Attack Detection and Patches

No one would recommend skipping important mitigations or applying them any more slowly than is reasonable, but rolling them out across entire enterprise environments takes significant attention and time. This complexity allows more agile attackers to take advantage of these vulnerabilities, possibly even using readily published proofs-of-concept, while defenders are still performing their own risk analysis. As CloudFlare's analysis of Shellshock exploitation activity in the wild showed, attackers began attacking far sooner than defenders could even hope to patch. Protections based on zero-day attack detection can be deployed much more quickly than mitigations based on patching alone.

Performance impact is another major reason why relying on patching may not be as effective as detection. In some situations, if the workload doesn't execute untrusted code and is fairly locked down, the performance hit of the mitigations seems like a high cost for little to no benefit. Additionally, much of the infrastructure affected by Meltdown and Spectre ran on older kernels that are a challenge to upgrade — doing so would carry huge cost and stability risk. Existing mitigations (kernel upgrades and recompiling software) will also fall well down the priority list, as the risk of a successful attack will be outweighed by the cost of the remediation. That does not even include other potential costs, such as additional performance overhead or false positives.

Remediation vs. Zero-Day Attack Detection

There is a very long tail in remediation; it's not going to happen quickly and it's not a panacea. That's why companies need to think critically about zero-day attack detection as part of their strategy for dealing with these high-profile vulnerabilities. If a workload seems unlikely to be practically exploitable without other major failures, detection makes much more sense than relying on mitigation and/or remediation. Even in cases where there's legitimate risk, zero-day attack detection is still a viable alternative if you can automate the detection and shutdown of an offending process before sensitive information is fully exposed. High-risk environments also need to investigate a hybrid approach that combines mitigation and detection, as mitigation alone is not enough.

In the end, organizations do not have to choose between one approach and the other. Mature and advanced environments combine both to take advantage of each one's strengths. Eliminating the underlying cause of a vulnerability is the most effective approach, but it also takes the most time, effort, and money. Detecting exploitation of broad classes of vulnerabilities and attacks against your infrastructure through advanced security monitoring, and automating real-time responses, provides protection while known vulnerabilities are in the process of being eliminated. It can also provide valuable signals of malicious activity and reveal exploitation of yet-unknown vulnerabilities.

To learn about preparing for zero-day attacks, download our infographic here!


Detecting Meltdown and Spectre by Detecting Cache Side Channels


Last week, we delivered an open source detector for some variants of the Meltdown attack and promised that we’d provide a more generic detection for more variants of Meltdown and Spectre. Today we are delivering on that promise with the introduction of our Apache-licensed cache side channel detector for Linux.

In addition to releasing that detector, we are urging a broadening of the conversation beyond mitigation to also focus on accurate and effective detection. We believe that solely focusing on expensive mitigation steps can leave many environments vulnerable while they are in the process of updating their systems. For most organizations, Linux kernel security updates are among the most disruptive security updates and require a large amount of planning and time to execute. Lightweight, non-disruptive, and reliable detection can provide effective protection against newly disclosed vulnerabilities where attackers can move much faster than defenders can.

The Case for Detection

It's been an eventful week in many respects. Seeing the major cloud providers upgrade their entire infrastructure so seamlessly was a huge positive, even though it came with a widely noted downside: the performance cost of the mitigations.

Some sources indicate that the performance issues in the Kernel Page Table Isolation (KPTI) mitigation strategy are overblown, especially on recent hardware. However, other reputable sources argue that the technique does have a practical impact, depending on the application, possibly landing in the range of 5% for most workloads, with some instances being significantly higher (especially workloads with lots of calls into the kernel).

To make matters worse, none of the existing mitigations completely solve the problem. For example, Google warns that despite its exhaustive efforts, some applications could still require their own mitigations, such as web browsers that execute remotely provided untrusted code. There's also the strong possibility that relatives of these attacks could circumvent existing mitigations; side-channel attacks are not something the industry has historically handled well, despite the general understanding that modern CPU architectures have significant potential for such problems.

Many environments without mitigation could, in practice, be very low risk for a practical attack. If the workload doesn’t execute untrusted code and is fairly locked down, then the performance hit of the mitigations seems like a high cost for little to no benefit (Note: an unmitigated machine would be at high risk of an unprivileged remote execution vulnerability turning into full root).

Additionally, while it was inspirational to see cloud providers able to move so quickly, there is going to be a massively long tail on upgrades. Most organizations will never be able to upgrade their fleet that quickly. Their production environments often run old software on old OSs, where any upgrade comes with a tremendous amount of risk. It’s likely more cost effective to focus on detection and response strategies, rather than full mitigation, particularly when the probability of a practical attack is low for the environment.

For Spectre and Meltdown, we believe some use cases warrant focusing on detection rather than full mitigation. First, it’s reasonable to believe that detection will often be more efficient than mitigation. For instance, our simplistic detector below uses 1-3% of a single CPU core for typical workloads, and its worst case in our testing was not higher than 10% of a single CPU core on a very heavy load. This results in minimal overhead to workloads.

Additional performance overhead generally translates to additional cost for someone. However, false positives can lead to additional cost too, so the effectiveness of the detection has to be considered.

Still, there should be hope that generic strategies looking for anomalies could detect a potentially larger class of cache side channel attacks than we previously knew how to exploit. Again, as long as the detection doesn't create a huge burden from false positives, broader detection should be valuable.

If a workload seems unlikely to be practically exploitable without other major failures, detection could certainly be preferable. However, we feel that even in cases where there’s more legitimate risk, detection is still a decent alternative, especially if a response can be automated. For instance, it should often be feasible to detect and shut down an offending process before sensitive information is fully exposed.

Additionally, where performance truly isn’t an issue, high-risk environments should be looking at a “belt-and-suspenders” approach where there is both mitigation and detection, in case the mitigation is not enough.  

Certainly, detection is critical to dealing with zero-day vulnerabilities. Assuming a mechanism like KPTI will get rid of all cache-related side channel attacks might turn out to be as wrong as assuming that ASLR with good randomization would be impossible to circumvent.

To that end, we wanted to make one of our more straightforward detection techniques available to the Linux world to help manage the problem for workloads where existing mitigations are not a good match.

Our detection strategy seems to give highly accurate results in practice, meaning low numbers of false positives and false negatives. But there are definitely trade-offs: to keep performance acceptable, we have tuned it so that small reads (under a cache line in size, like the 40-byte default read of the Spectre PoC in the original paper) may go undetected under system load, and we prefer to generate a "low" severity false positive rather than miss a small read.

Please do let us know if you find any cases where accuracy is suspect, as we have only tested with a handful of representative workloads.

Cache Side Channel Attacks

A common element of all the published attacks across the three vulnerability variants so far has been the use of cache timing attacks to leak the speculatively read data to the attacker. In this section, we'll briefly explain what these attacks are and how they work, in order to make clear why detecting them is key to detecting exploitation of these vulnerabilities.

Cache timing attacks take advantage of a few aspects of modern processors. First and foremost, the time required to access a memory address in the cache is significantly less (usually 80 clock cycles or fewer) than the time required to read a memory address not in the cache (at least 200 clock cycles). In addition, multi-core processors have a last-level cache that is shared across all cores and is affected by both privileged and unprivileged processes. Finally, multiple memory addresses map to the same physical storage in the cache in a deterministic way. This sharing, together with the overloaded memory address mappings, is what allows an unprivileged process on one CPU core to discern the contents of privileged memory loaded into the cache by a privileged process running on another CPU core.

Given these facts, cache side channel attacks work by putting the cache into a known state and then timing operations to determine how the cache's state has changed. This all works because of the overloaded associativity between addresses and cache lines. To make this concrete, let's start with an example of the FLUSH+RELOAD technique.

FLUSH+RELOAD

The attack described in the FLUSH+RELOAD paper by Yuval Yarom and Katrina Falkner works in three simple steps (a minimal code sketch of the timing probe follows the list):

  1. Flush an address that maps to a chosen cache line
  2. Wait for some time for your victim process to do something
  3. Time accessing the address again
    1. If it's slow, then the victim did not access an address mapping to the same cache line
    2. If it's fast, then the victim did access an address mapping to the same cache line
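
To make the flush and reload steps concrete, here is a minimal C sketch of the timing probe, written for this post rather than taken from the FLUSH+RELOAD paper. It assumes an x86-64 processor with the clflush and rdtscp instructions available, and the 80-cycle threshold is only a rough calibration point; a real attack measures it on the machine it runs on.

#include <stdint.h>
#include <x86intrin.h>

/* Step 1: flush the cache line containing addr. */
static inline void flush(const void *addr)
{
    _mm_clflush(addr);
}

/* Step 3: time a reload of addr and decide whether the victim touched an
 * address mapping to the same cache line. Step 2 is simply whatever waiting
 * happens between the call to flush() and the call to was_accessed(). */
static inline int was_accessed(const void *addr)
{
    unsigned int aux;
    uint64_t start, end;

    start = __rdtscp(&aux);                    /* timestamp before the reload */
    (void)*(volatile const uint8_t *)addr;     /* reload the probed address   */
    end = __rdtscp(&aux);                      /* timestamp after the reload  */

    return (end - start) < 80;                 /* fast reload => cache hit    */
}

Meltdown and Spectre use the same probe, but over an array the attacker controls, as described next.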

In the context of the original paper, the authors assumed the presence of shared code and, by inferring access patterns to that shared code, were able to recover RSA keys. In the case of Meltdown, this is described in Section 4.2 of the paper, but it works like this:

  1. The attacker allocates an array of bytes in userspace and flushes its contents from the cache
  2. Then, using code that gets executed speculatively, the attacker accesses memory that's normally inaccessible due to page permissions (e.g., kernel memory) and stores it in a register
  3. The speculatively executed code checks a bit of the value read in step 2:
    1. If it's a 1, it accesses an entry in the array from step 1
    2. If it's a 0, it doesn't access any memory
  4. The attacker's code times accesses to entries in the array from step 1:
    1. If an entry is fast to access (fewer than 80 clock cycles), then the bit at the inaccessible address was a 1
    2. Otherwise, if no entries are fast, then the bit was a 0
  5. Repeat to filter out noise that might come from context switches

The Achilles' heel of FLUSH+RELOAD is that it uses cache misses to signal 0s, which by its very nature causes last-level cache (LLC) miss counters to increment by large margins.

In the case of Spectre, attackers are able to read out a byte at a time by having their victim function perform an operation like the following:

uint8_t temp;

void victim_function(size_t x) {
    if (x < array1_size) {
        temp &= array2[array1[x] * 512];
    }
}

After successive calls with in-bounds values of x, the branch predictor is trained so that the branch is taken speculatively even for an out-of-bounds x, which causes two out-of-bounds accesses.

The first one generates an address; the second uses that generated address to read from. This read causes a cache line to be filled. Since the first address is actually victim process data, determining which cache line of the array2 access was filled is equivalent to leaking the byte value. While this does not directly use cache misses to transmit data, it does cause a significant number of cache misses by not accessing memory linearly.

Detecting Cache Side Channels with Linux Perf

Right after the vulnerabilities were announced last week, we started discussing whether we could reliably detect their exploitation and our own Pete Markowsky suggested that the counters for Last-Level Cache misses may provide a strong signal that a cache side channel was being used to leak the data (as was also noticed by the researchers at Endgame). These types of side channel attacks are used to exploit vulnerabilities like Meltdown and Spectre and are often also utilized in exploiting other hardware-level vulnerabilities like Rowhammer.

The Linux Perf subsystem performs system and software profiling using both software and hardware performance counters. It is also the built-in interface to the Intel Performance Counters. Since we already wrote our own 100% pure Go interface to Perf in our open-source Capsule8 Sensor, it was trivial to make the changes necessary to support accessing hardware-based events through it as well.

Our detection strategy for cache side channels involves setting up the LLC Loads and LLC Load Misses hardware cache counters on each logical CPU and configuring Perf to record a sample every 10,000 LLC loads. Each sample includes the logical CPU number, active process ID and thread ID, sample time, and cumulative count of LLC Loads and LLC Load Misses. This is a very low-impact way to continuously calculate and monitor the cache miss rate on an entire system. In our testing, running this detection consumes an average of 3% CPU on one core, peaking at 10%, during our simulated CPU and cache intensive workloads.
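
For readers who want to see what that configuration looks like closer to the metal, here is a rough C sketch using the raw perf_event_open(2) interface. This is not the detector's code (the detector uses the Go Perf interface in the Capsule8 Sensor), and the exact sample fields below are illustrative rather than a copy of its configuration.

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper; glibc does not provide a perf_event_open() function. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Open an LLC-loads counter on one logical CPU, sampling every 10,000 loads.
 * A sibling event configured with PERF_COUNT_HW_CACHE_RESULT_MISS would be
 * opened in the same group so that each sample reports loads and misses
 * together, which makes the miss rate straightforward to compute. */
static int open_llc_loads(int cpu)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_LL |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16);
    attr.sample_period = 10000;                       /* every 10,000 LLC loads */
    attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
                       PERF_SAMPLE_CPU | PERF_SAMPLE_READ;
    attr.read_format = PERF_FORMAT_GROUP;
    attr.disabled = 1;

    /* pid = -1 with a specific cpu: count every process running on that CPU. */
    return (int)perf_event_open(&attr, -1, cpu, -1, 0);
}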

Our detector readily detects the Spectre proof-of-concept published in the original paper, as shown below:

$ sudo ./cache_side_channel
I0109 02:33:56.943214   13788 main.go:61] Starting Capsule8 cache side channel detector
I0109 02:33:56.944320   13788 main.go:109] Monitoring for cache side channels
I0109 02:33:59.609506   13788 main.go:156] cpu=4 pid=13838 tid=13838 LLCLoadMissRate=0.9551


-----

$ ./spectre_poc
Reading 40 bytes:
Reading at malicious_x = 0xffffffffffdd75c8... Success: 0x54='T' score=2
Reading at malicious_x = 0xffffffffffdd75c9... Success: 0x68='h' score=2
Reading at malicious_x = 0xffffffffffdd75ca... Success: 0x65='e' score=2
Reading at malicious_x = 0xffffffffffdd75cb... Success: 0x20=' ' score=2
Reading at malicious_x = 0xffffffffffdd75cc... Success: 0x4D='M' score=2
Reading at malicious_x = 0xffffffffffdd75cd... Success: 0x61='a' score=2
Reading at malicious_x = 0xffffffffffdd75ce... Success: 0x67='g' score=2
Reading at malicious_x = 0xffffffffffdd75cf... Success: 0x69='i' score=2
Reading at malicious_x = 0xffffffffffdd75d0... Success: 0x63='c' score=2
[...]

Our detector becomes even more effective as a cache side channel transfers more data, so running the published PoC with a larger length specified on the command line will generate significantly noisier (in a good way) detection alerts. The more data that is transferred through the cache side channel, the stronger the signal from our detector that something malicious may be going on.

Our detector’s full source code is available under an Apache 2.0 license as an example in our open-source repository.

Detecting Meltdown using Capsule8


Meltdown and Spectre are such pervasive issues that they're news on every major outlet. The security world is simultaneously in awe of the attacks and panicking about remediation. What nobody is talking about is detection!

Remediation can be effective, and thanks to increased use of the public cloud, we can expect that applications running in the three major cloud providers are all going to be in good shape as long as they update to new AMI images that use patched kernels, although some unknown subset of applications will suffer performance-wise due to the page table isolation that Linux is using to remediate the issue.

However, we need detection strategies because there’s a very long tail in remediation. A lot of infrastructure runs on older kernels that are a challenge to upgrade — doing so would result in huge cost and stability risk. Existing mitigations (kernel upgrades and recompiling software) are probably not going to be a priority in many such environments — the risk of a successful attack will be outweighed by the cost of the remediation.

We at Capsule8 don't think it's feasible to do generic detection of these attacks at the network level, due to their nature. But we've already developed practical strategies for detecting them, which we've implemented on Linux systems. By practical, we mean:

  • Easy to deploy: There is no need to recompile software, or update a kernel.
  • Stable: The detection runs in userland, without the need of a kernel module, etc.
  • Efficient: The sensors run with minimal CPU overhead.
  • Portable: The sensor works for any out-of-the-box version of Linux, dating back to the Linux 2.6 Kernel.
  • Effective: There is an extremely low chance of a false negative in the majority of environments. In many environments, we’d expect no false positives. In some environments, there could be, but we’d expect them to be easily manageable.

In this series of blog posts, we’ll introduce our most basic detections, and provide Apache-licensed code you can use, on top of Capsule8’s Open Source Sensor. While these strategies would certainly work for Windows and other operating systems, we leave that as an exercise for the reader.

Today we’ll focus on a simple and effective detection strategy that can detect the most basic Meltdown attack with practically no chance of a false positive. Next week we’ll go through a more generic strategy that can detect both of these speculative execution attacks and more.

The Meltdown Vulnerability

Today we’re only going to concern ourselves with the Meltdown vulnerability, saving Spectre for next week.

The Meltdown vulnerability is the result of speculative execution, which is an optimization mechanism the processor uses to anticipate operations it may need to perform. Speculative execution optimizes execution by pre-emptively computing the results of instructions before necessarily knowing that those instructions should be executed. You can think of this as doing computational work just in case it becomes necessary. For instance, consider the following conditional operations:

IF user orders eggs special:
    calculate special price of eggs
ELSE:
    calculate the price of normal breakfast

In the case of speculative execution, the processor will calculate both prices before knowing which price is correct based on the user’s order, discarding the results of the incorrect calculation. If the instructions in one of these calculations involve memory read operations, the speculative execution will affect memory mechanisms such as processor caching.

The Meltdown vulnerability specifically is due to the impact speculative execution can have on reading memory contents — not just the caching of the addresses where instructions are, but also the memory those instructions access during execution. This occurs because in some cases Intel processors speculatively execute instructions before checking memory access privileges, for instance, checks to ensure that userland instructions do not refer to kernel memory. Thus the speculative execution of privileged memory-read operations affects the processor's cache even though the instructions themselves are never permitted to actually execute. This impact on the cache can be timed, and by measuring many successive repetitions of speculative execution, it is possible to conduct a side-channel attack that determines the contents of kernel memory from userland.

For much more detailed information on Meltdown, see the Meltdown paper.

Meltdown Detection Strategy

To understand and implement our first Meltdown detection strategy we are going to use the Linux Tracing facility.

The Linux kernel includes a suite of subsystems to enable performance and diagnostic tracing, including ftrace, tracepoints, uprobes, kprobes, perf, and eBPF. These subsystems have been available for many years and have more recently become better understood and more accessible. For an introduction to Linux Tracing, start with this overview by Julia Evans. Then, for a deeper dive, check out Brendan Gregg's Linux Performance page.

Linux Tracing is designed for high-performance, non-intrusive introspection into running systems from user mode software (no kernel module required). It allows hooking of statically defined tracepoints as well as arbitrary symbols in kernel space for introspection purposes (such hooks are called kprobes). This facility is broadly supported in Linux 2.6 and later kernels. All of this makes it an ideal foundation for building security monitoring tools, and that is why we chose to use it as the basis of the Capsule8 Sensor.

The Linux kernel supports an exceptions:page_fault_user tracepoint that generates events for user process page faults. This tracepoint, introduced in Linux kernel 3.13, includes fields for the faulting virtual memory address and an error code indicating the nature of the page fault. If we add a filter on the tracepoint, we can restrict the generated trace events to those that occur when an attempt to read a kernel memory address from user mode generates a protection fault.

Recall that the Meltdown vulnerability involves attempting to read a kernel memory address, which results in a segmentation violation signal being sent to the user process. The malicious process can handle the signal and proceed to recover the targeted memory contents from the cache. In order to perform any meaningful attack, an attacker would need to generate and handle a number of segmentation violations for kernel memory addresses significant enough that their activity is easily discernible from accidental program crashes.
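
To illustrate why this pattern is so visible, here is a simplified C sketch, written for this post rather than taken from any particular PoC, of how published proofs-of-concept typically survive each faulting read; some variants suppress the fault with Intel TSX instead of handling a signal. Every iteration produces exactly the kind of user-mode page fault on a kernel address that the tracepoint filter below keys on.

#include <setjmp.h>
#include <signal.h>
#include <stdint.h>

static sigjmp_buf retry;

/* Each attempted kernel read delivers SIGSEGV; catch it and jump back so
 * the attack loop can continue with the next attempt or the next address. */
static void on_sigsegv(int sig)
{
    (void)sig;
    siglongjmp(retry, 1);
}

static void attempt_read(const uint8_t *kernel_addr, volatile uint8_t *probe_array)
{
    signal(SIGSEGV, on_sigsegv);

    if (sigsetjmp(retry, 1) == 0) {
        /* Architecturally this load faults, but on affected CPUs the
         * dependent probe_array access may still execute speculatively and
         * leave a cache footprint that a FLUSH+RELOAD pass can recover. */
        uint8_t value = *(volatile const uint8_t *)kernel_addr;
        (void)probe_array[value * 4096];
    }
}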

Detecting Meltdown Using Capsule8

The open-source Capsule8 Sensor uses Linux Tracing under the hood to produce behavioral system security telemetry. While the currently supported telemetry events do not surface an event that would indicate exploitation in progress of Meltdown, we can use its lower-level EventMonitor interface to easily tap into a Linux tracepoint that indicates an attempted exploitation of Meltdown.

In order to improve performance of the tracepoint, we also attach a trace event filter for in-kernel evaluation. The filter matches when a user mode process causes a page protection fault trying to access a kernel-space memory address. This is exceedingly rare and has generated minimal false positives in our (albeit limited) testing.

[...]

//
// Look for segmentation faults trying to read kernel memory addresses
//
filter := "address > 0xffff000000000000"
_, err = monitor.RegisterTracepoint("exceptions/page_fault_user",
    onPageFaultUser, perf.WithFilter(filter))

[...]

Best of all, the performance impact of our Meltdown exploitation detector is negligible, since page faults are relatively rare and the kernel-based trace event filtering minimizes the number of events sent to userland for processing. Even on a heavily loaded system, this simple Meltdown detector incurs significantly less performance impact than the Linux KAISER patches that mitigate exploitation of Meltdown.

The meltdown detector is designed to emit alerts to server logs and can also easily be packaged in a container and run as a Kubernetes DaemonSet to quickly deploy it across an entire cluster. We’ve supplied a Dockerfile and Kubernetes manifest to assist in this.

To quickly deploy our meltdown detector across your Kubernetes cluster, you can deploy our DaemonSet with the following single command:

$ kubectl apply -f https://raw.githubusercontent.com/capsule8/capsule8/master/examples/meltdown/capsule8-meltdown-detector.yaml

Our meltdown detector can also be run outside of a container by simply running the static binary. The example below shows the output while running one of the public meltdown proofs-of-concept:

$ sudo ./meltdown
I0105 16:00:04.112232       1 main.go:46] Starting Capsule8 Meltdown Detector
I0105 16:00:04.204312       1 main.go:66] Monitoring for meltdown exploitation attempts
I0105 16:00:07.044165       1 main.go:84] pid 19599 kernel address page faults = 1
W0105 16:00:07.044222       1 main.go:87] pid 19599 kernel address page faults = 10
W0105 16:00:07.044892       1 main.go:90] pid 19599 kernel address page faults = 100
E0105 16:00:07.048324       1 main.go:93] pid 19599 kernel address page faults = 1000
E0105 16:00:07.081778       1 main.go:96] pid 19599 kernel address page faults = 10000

The full Meltdown detector is available as an example in the open-source Capsule8 GitHub repository.

Stay tuned, there's more…

Our simple detector tracks page faults for kernel memory addresses by process ID (PID) and alerts with low, medium, or high severity as a process's count of such events crosses defined thresholds. These thresholds are all triggered by published proof-of-concept exploits for Meltdown and are exceedingly unlikely to be triggered otherwise. The performance impact of this strategy is virtually immeasurable, but it definitely will not cover all variations of these speculative execution attacks.
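
As a purely hypothetical sketch of that thresholding logic (the real detector is written in Go, and the mapping of counts to severities here is inferred from the log output above rather than copied from its source), the escalation could look something like this:

#include <stdio.h>
#include <sys/types.h>

/* Hypothetical per-PID escalation: raise the severity each time the count of
 * kernel-address page faults for a process crosses another power of ten. */
static void on_kernel_addr_fault(pid_t pid, unsigned long *count)
{
    unsigned long n = ++(*count);

    if (n == 1)
        printf("LOW:    pid %d kernel address page faults = %lu\n", (int)pid, n);
    else if (n == 10 || n == 100)
        printf("MEDIUM: pid %d kernel address page faults = %lu\n", (int)pid, n);
    else if (n == 1000 || n == 10000)
        printf("HIGH:   pid %d kernel address page faults = %lu\n", (int)pid, n);
}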

Stay tuned to the Capsule8 Blog — next week, we will go deeper into Meltdown and Spectre detection. We will describe a more general strategy that covers both exploitation of Meltdown and Spectre, including the exception-suppression approach using Intel TSX.

Read Part 2 
