
From Catastrophe to Chaos in Production


Production is the remunerative reliquary where we can realize value from all of this software stuff we wrangle day in and day out. As has been foretold from crystal balls through to magic eight balls, the deployment and operation of services in production is increasingly the new engine of business growth, and few industries are spared from this prophetic paradigm foretold by the Fates. A rather obvious conclusion is that we should probably keep these systems safe — that our endeavors should maximize confidence that these systems will operate as intended.

“Keep prod safe?” you say, incredulous that I would impart such trite advice. Rest easy, for the discussion of how we should approach keeping prod safe turns from trite to tart! Simply put, the best way to keep production systems safe is by harnessing failure as a tool rather than a stain to wash out in a Lady Macbethian fashion. If we seek to recover from failure as swiftly and elegantly as possible so that service disruption is minimized, we must wield failure as a magical staff with which we can illuminate essential knowledge of our systems.

Thus is the way of Security Chaos Engineering. To whet your appetite for implementing chaos magic in your own organization, this post will explore how failure can manifest in production systems and provide examples of security chaos engineering experiments that can help you gain the confidence in production that any venerable coven of software sorcerers deserves.

Failure in Production

The natural response to fear of failure in production is an attempt to prevent any and all potential issues from manifesting in prod. In the realm of security, this often looks like removing vulnerabilities before software is deployed to prod. While that can certainly be useful, it’s unrealistic to assume that all problems will be caught and remediated ahead of deployment — especially in components over which you lack control, like container runtimes, orchestrators, or the underlying kernel itself. Deploying to production will always involve some level of ambiguity, however much our brains crave certainty.

Another wrinkle in the fight against failure in prod is that said failure can manifest in a mess of multiplicitous manners. Production systems can consist of a diverse range of infrastructure, including private or public clouds, VPSs, VPCs, VMs, containers, serverless functions, and, one distant day, perhaps even computerless compute!

As a result, production systems are justifiably classified as complex systems full of interrelated components that affect each other. Failure, therefore, is like a flame catching one strand of an interwoven tapestry: it can spread to the rest and engulf the whole thing. That is, there is a strong risk of contagion in prod, which can transform an event in one node into a poison that seeps into other components and severely disrupts your services.

To complicate matters further, there is a dizzying array of activity that can jeopardize production operations. We can simplify this activity into two high-level types of unwanted activity: deliberately malicious (attackers) and accidentally careless (devs). For instance, attackers can spawn a remote interactive shell, disable SELinux, and exploit a kernel vulnerability to get root access. Or, developers can download production data or accidentally delete log files. Oftentimes, these two types of activity can look pretty similar. Attackers and developers can both be tempted to attach a debugger to a production system, which can facilitate privilege escalation or exposure of sensitive information. 

Finally, when we think about production failures, we would be remiss not to note that most production infrastructure runs on Linux, where everything is a file. This means that failure in production often bubbles up like rancid alphabet soup (which I’ll deem “Spaghetti-uh-ohs”) from unwanted file-related activity.

To make these considerations more concrete, let’s explore three examples of failure in production (of which countless more are available in the Security Chaos Engineering e-book):

  1. Log files are deleted or tampered with. To be honest, if this occurs, your ops is likely screwed. Log pipelines are critical for operational monitoring and workflows, so telling your local ops or SRE that this failure has happened likely counts as a jump scare.
  2. Boot files, root certificate stores, or SSH keys are changed. Well, this presents quite the pickle! Modification of these critical assets is akin to defenestration of your system stability — even aside from the potential stability snafus arising from an attacker maintaining access and rummaging around the system.
  3. Resource limits are disabled. This activity is highly sus and doubtless disastrous. Whether due to an infinite script or a cryptominer gorging itself on compute like a mosquito on blood, the dissipation of your resource limits can lead to overhead overload in your system that results in service disruption. 
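These failure modes lend themselves to cheap, automated tripwires. The sketch below is a stdlib-only Python illustration (the log paths are placeholders you would swap for your own, and this is not a description of any particular product's detection) that checks two of them: whether expected log files have vanished, and whether a process's resource limits look disabled.

```python
import os
import resource  # Unix-only stdlib module for process resource limits

def check_failure_signals(log_paths):
    """Cheap tripwires for two of the failure modes above (illustrative only)."""
    signals = {}
    # Failure mode 1: log files deleted or tampered with.
    signals["logs_missing"] = [p for p in log_paths if not os.path.exists(p)]
    # Failure mode 3: resource limits disabled (soft limit raised to infinity).
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    signals["nofile_limit_disabled"] = soft == resource.RLIM_INFINITY
    return signals
```

A real pipeline would watch file events rather than poll, but even this much turns "jump scare" into an alert you hear about first.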

Now, how do we constructively cope with this complexity? By replacing catastrophe with chaos.

Security Chaos Engineering in Production

The tl;dr on Security Chaos Engineering (SCE) is that it seeks to leverage failures as a learning opportunity and as a chance to build knowledge of systems-level dynamics towards continuously improving the resilience of your systems. The first key benefit of SCE is that you can generate evidence from experiments which will lay the foundation for a greater understanding of how your systems behave and respond to failure.

Leveraging that understanding, a second benefit is that SCE helps build muscle memory around responding to failure, so incidents become problems with known processes for solving them. SCE experiments not only allow teams to continuously practice recovering from failure, but also encourage software engineering and security teams to practice together — so when a real incident transpires, an accomplished alliance can rapidly respond to it.

Another core benefit of SCE for production systems is its utility for grokking complex systems. Identifying relationships between components in production helps reduce the potential for contagion during incidents, which helps you recover and restore service faster. Performing security chaos experiments can facilitate this discovery, since simulating failure in one resource can exhibit which of the resources connected to it are also impacted.

Naturally, if you want to proactively understand how your prod systems respond to certain failure conditions, you really have no choice but to conduct chaos tests in production. The reality is that if you don’t run your tests in prod, you won’t be as prepared when inevitable incidents occur. Of course, it’s pretty reasonable that your organization might be reluctant to start SCE experiments in prod, so staging environments can serve as the proving grounds for gaining an approximation of how your systems respond to failure. This should be viewed as a stop-gap, however; it’s important to have a plan in place to migrate your SCE tests to prod (the SCE book enumerates those steps).

“Enough of philosophy,” you cry out, “show me some chaos summoning spells!” A reasonable request, dear reader. Let’s now explore three examples of SCE experiments in production and what important questions they answer (again, this is just a sample to get you thirsty for the full list available in the free SCE e-book download).

  1. Create and execute a new file in a container. Is your container actually immutable? How does your container respond to new file execution? Does it affect your cluster? How far does the event propagate? Are you containing impact? How quickly is the activity detected?
  2. Inject program crashes. Is your node restarting by itself after a program crash? How quickly are you able to redeploy after a program crash? Does it affect service delivery? Are you able to detect a program crash in prod?
  3. Time travel on a host — changing the host’s time forward or backward (my fav test tbh). How are your systems handling expired and dated certificates or licenses? Are your systems even checking certificates? Do time-related issues disrupt or otherwise bork service delivery? Are you relying on external NTP? Are time-related issues (e.g. across logging, certificates, SLA tracking, etc.) generating alerts?
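As a sketch of experiment #1, the Python below is a hypothetical, stdlib-only harness (run it only in an environment you are allowed to break): it drops a new executable file into a directory and tries to run it, recording each outcome so you can compare reality against your hypothesis about immutability and detection.

```python
import os
import stat
import subprocess
import tempfile

def new_file_execution_experiment(target_dir=None):
    """Drop a new script into target_dir and attempt to execute it.

    Returns what happened at each step, so the result can be compared
    against the hypothesis (e.g. "our containers are immutable").
    """
    target_dir = target_dir or tempfile.mkdtemp(prefix="sce_")
    path = os.path.join(target_dir, "sce_experiment.sh")
    result = {"created": False, "executed": False, "output": ""}
    try:
        with open(path, "w") as f:
            f.write("#!/bin/sh\necho chaos\n")
        os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
        result["created"] = True
        proc = subprocess.run([path], capture_output=True, text=True, timeout=5)
        result["executed"] = proc.returncode == 0
        result["output"] = proc.stdout.strip()
    except OSError:
        pass  # a read-only filesystem or noexec mount halts the experiment here
    finally:
        if os.path.exists(path):
            os.remove(path)
    return result
```

If `created` comes back `True` inside a container you believed was immutable, congratulations: your experiment just generated evidence.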

Parting Thoughts

As in life, failure in production is inevitable… so you might as well turn it into a strength by learning from it early and often. Experimentation with chaos magic by injecting failure into prod systems uncovers new knowledge about your systems and builds muscle memory around incident response. This is why Security Chaos Engineering is the optimal path forward to build confidence in the safety of our production systems — the moneymakers on which our organizations desperately depend.

If you want to learn more about how to adopt a Security Chaos Engineering approach in your own organization, download the free O’Reilly e-book I co-wrote with Aaron Rhinehart. It’s full of guiding principles as well as case studies from Cardinal Health and Capital One that can get you started on your SCE journey.

And, if you want to detect some of the production failures and unwanted activity enumerated above, Capsule8 can help with that — so give us a ping.

Grubbing Secure Boot the Wrong Way: CVE-2020-10713


Today, researchers at Eclypsium disclosed a buffer overflow vulnerability in GRUB2, CVE-2020-10713, affectionately termed “Boothole.” It basically results in a total pwn of Secure Boot in systems using GRUB, which is a lot of them — all Linux distros, a bunch of Windows machines, and more. Additionally, the mitigation process is a certified hot mess, so even the proposed solution is kind of grubbing salt in the wound.

Why is it cool?: Secure Boot, as the name suggests, is designed to verify that all the firmware required to start up a computer is trusted. A complete defeat of Secure Boot, therefore, means the computer is essentially helpless against malicious tampering from the first blinky light of its digital dawn after the power button is pressed. Such boot bamboozlery facilitates furtive persistence useful for ransomware (see Lockbit’s variant employing boot record hijinks), Zeus-like keyloggers, cryptominers, espionage activity (see Rockboot by APT41), etc., because the code operates before the OS — where security tools usually reside — is up and running.

Kelly’s spicy take: We might see more damage from people attempting the mitigation (more on revocations later) rather than attackers leveraging this in dastardly digital crimes.

Digging deeper: The problem essentially lies in GRUB’s inadequate error handling. So, what even is GRUB? It’s a bootloader designed to load and kick off any OS on any hardware. But let’s back up and take a look at a highly simplified version of how a Linux system with Secure Boot boots up:

  1. Firmware loads the smol first-stage bootloader binary that contains a trusted certificate (known as a “shim”)
  2. Shim loads the GRUB binary (another bootloader) and validates it with the certificate
  3. GRUB loads any required configurations, located in grub.cfg (Chekhov’s config file in this tale), which point to where the kernel image can be loaded
  4. GRUB validates the kernel via keys stored in the firmware’s trust database (db and dbx, the authorized and forbidden signature databases, respectively)
  5. GRUB hands over control to the kernel
  6. Kernel boots the OS

This vulnerability resides in step #3. GRUB uses a language parser (generated via flex and bison) to read the config file. If the text in the config file is too large, the flex engine says “no thank you!” with the expectation that the processing function will exit or be halted. Quite unfortunately, GRUB’s implementation does not fulfill that expectation. Instead, GRUB is like, “Oh, a fatal error indicating that the string is too big for the buffer? Cool, cool, cool, I’ll copy it straight into the buffer so we can proceed with executing the function!”

As a result, you can put massive strings into grub.cfg (which isn’t signed or verified) that the parser will happily copy into memory, leading to a buffer overflow. From there, attackers can write anything they want to system memory without any constraints (gaining what’s known as a write-what-where primitive). 
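The pattern is easier to see in miniature. The Python below is an analogy, not GRUB's actual code: the "flawed" version notices the oversized input and then copies it anyway, while the "fixed" version halts, as the flex engine expects. (In C, the flawed copy smashes past the fixed-size buffer; Python's `bytearray` just silently grows, which stands in for the overflow.)

```python
BUF_SIZE = 64  # stand-in for a fixed-size parser buffer

def flawed_copy(dest: bytearray, src: bytes) -> bool:
    """Mimics the vulnerable pattern: the length check fires... and is ignored."""
    if len(src) > BUF_SIZE:
        pass  # "fatal" error raised by the lexer, then not acted upon
    dest[: len(src)] = src  # copies anyway; in C this writes past the buffer
    return True

def fixed_copy(dest: bytearray, src: bytes) -> bool:
    """What the flex engine expected: refuse oversized input and halt."""
    if len(src) > BUF_SIZE:
        return False
    dest[: len(src)] = src
    return True
```

The one-line difference between these two functions is the difference between a parser error and a write-what-where primitive.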

Boot and EFI land relies on memory being in fixed locations, so it doesn’t possess ASLR or other fancy exploitation mitigations that exist in OS space — which is a relief for attackers. This total control over the system means the attacker can insert their own bootloader, allowing them to hijack the boot process and maintain control every time the system starts.

Okay, but: Exploiting this vulnerability requires root / admin access to access the grub.cfg file located in the EFI System Partition, which means the attacker must first gain a foothold on the system and escalate privileges (physical access also works). The vuln only helps with persistence across system reboots, so it’s unnecessary — and perilously noisy — for attackers to employ this if they already have root on a system that never reboots. It’s also preposterously unlikely that any attacker will spontaneously write on-the-fly real mode shellcode that will perform boot injection and OS loading. If they do, they probably deserve the win. 

And yet: This will become incredibly bad news if enterprising criminals incorporate this vulnerability into their nefarious bots as part of the standard “be hacker, do crimes” pipeline of bootkit creation -> licensing the bootkit to a bot author -> deploying or selling the bootkit-armed bots (like in a botnet). This pipeline will not pop out pwnage overnight, so the question becomes whether mitigations can be successfully rolled out before criminals can scale this attack.

What’s the impact?: A bunch of Windows machines will be affected, but I’ll be focusing on Linux land. Every Linux distro using Secure Boot with the standard UEFI certificate is affected, since all signed versions of GRUB are vulnerable. The infosec community will tell you that Secure Boot has been broken for 10 years, and yet nobody cared — but the reality is that a non-trivial number of organizations rely on it to protect more sensitive systems.

If you’re using Linux in the cloud, you may not be impacted, depending on your cloud provider. Google Cloud Platform’s Shielded VMs use Secure Boot, which are now the default for the Google Compute Engine (GCE) service and are also used as the underlying infrastructure for Google Kubernetes Engine (GKE), Cloud SQL, and more. Elsewhere in the digital heavens, Azure cloud instances don’t support Secure Boot, and AWS doesn’t seem to support it, either — so they seem to be unaffected by this vuln.

Is there a patch?: Well yes, but actually no. The first part of the mitigation process is a logistical challenge; the subsequent part is Kafkaesque. The first part requires coordination among Linux distros using GRUB2 (which is all of them), relevant open-source projects, vendors (like blinky box security solution peddlers), and Microsoft. The issue first must be fixed in GRUB2, then the Linux distros, relevant open-source maintainers, and vendors need to update their bootloaders, installers, and shims. These new shims need to be signed by Microsoft, who, as it turns out, is the designated signer of third-party UEFI certificates.

On the sysadmin side, you’ll need to update your Linux systems to the latest distro versions — both in your base images and in your deployed systems (don’t forget about your backup and disaster recovery stuff). But the next phase of mitigation descends into a nightmare.

The plan is for Microsoft to release an updated denylist (the UEFI revocation list located in the aforementioned dbx database) that will block malicious shims that can load unpatched versions of GRUB susceptible to this vulnerability. The problem is that updating the denylist before you update the bootloader or OS results in a bricked or borked system that can break workflows. Thus, for now, it’ll be on ops to manually apply the updated denylist given that it’s far too risky to push out updates automatically. Please send good vibes to your ops team when that day comes.

The bottom line: If or when criminals operationalize this in an automated fashion, then you probably need to press the panic button (if you haven’t yet mitigated it at that point). For now, try to apply updates once they’re available, but ops folks should put on their metaphorical safety goggles for the revocation process to avoid borking systems. You can also do some manual threat hunting to look for changes in the config files if you’re the paranoid type, but it’s not strictly necessary.

For Capsule8 customers, you can enable our detection that checks for attempts to write to the boot partition, in addition to detection of users escalating privileges (which is a necessary preliminary step for exploiting this vuln).

The Capsule8 Labs team conducts offensive and defensive research to understand the threat landscape for modern infrastructure and to continuously improve Capsule8’s attack coverage.

High STEKs: On-path attacks in GnuTLS (CVE-2020-13777)


This month, Fiona Klute disclosed a vulnerability in GnuTLS, CVE-2020-13777. It can either enable on-path attackers for TLS 1.3, or facilitate passive decryption of traffic between servers running GnuTLS for TLS 1.2. Either way, it’s not great!

Why it’s cool: Attackers could exploit this vuln to recover previously captured network traffic, like conversations (for servers using TLS 1.2), which is pretty bad. Even in the TLS 1.3 case, allowing on-path attackers to gain control over resumed sessions isn’t ideal. 

The underlying cause of the bug is attention-grabbing, too: the encryption and decryption key used to restore session state was all zeros. An all-zero key basically means that anyone listening in on the conversation can read the key used to encrypt the conversation after the initial full handshake. Unpredictable keys are essential to properly encrypt plaintext, and a key that’s all zeros is about as far from unpredictable as you can get. Thus spake Zarathustra: “Encryption is hard.”

Digging deeper: Session Tickets were designed with stateless servers in mind, including the need for scalability. By using Session Tickets, there are fewer round trips required for clients to resume a TLS connection with a server with which they’ve already communicated. This makes reconnecting faster and saves bandwidth on both the client and server side.

Let’s explore by way of metaphor — you’re attending a concert. You present your ticket receipt and photo ID to the venue staff, which is like initiating a new session and performing all the necessary verification between a client and server. In exchange for presenting your receipt, you receive a wristband that allows you to come and go during that evening. The wristband is like a server using Session Tickets, which are protected with the Session Ticket Encryption Key (STEK), to show that the necessary checks were previously performed.

You realize you forgot earplugs, so you walk out of the venue to buy some at the bodega on the corner. When you come back to the venue, you present your wristband to enter without having to present your receipt or photo ID again. This is akin to the client asking the server to re-initiate a session by presenting the Session Ticket. 

Unfortunately, this ticket exchange presents ample opportunity for bamboozling. As Filippo Valsorda noted in 2017, the encrypted ticket is sent in plaintext at the beginning of the handshake, because the cipher negotiation didn’t happen until after the key exchange and ticket issuance. This, of course, means the ticket is just chillin’ in the middle of the network highway waiting to be picked up by an attacker. The attacker would need to decrypt it with the STEK, of course — but when the STEK is all-zero, it makes that hurdle trivial to surmount. And then, all your chats are belong to attacker.
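A toy demonstration of why an all-zero key is game over, hedged heavily: this is a stdlib-only XOR/keystream stand-in, not GnuTLS's actual cipher or ticket format. The point it illustrates is that once the key is a known constant, a purely passive eavesdropper can decrypt the captured ticket.

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Derive a keystream from the key (a toy construction, not TLS)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR the data against the keystream (encrypt and decrypt are symmetric)."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

STEK = bytes(32)  # the CVE-2020-13777 condition: all-zero key material

ticket_plaintext = b"session resumption secrets"
captured_on_the_wire = xor_cipher(ticket_plaintext, STEK)

# A passive attacker simply tries the all-zero key on captured traffic:
recovered = xor_cipher(captured_on_the_wire, bytes(32))
```

No active interception, no cryptanalysis: just "guess zero" against traffic you already recorded.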

Yes, but: GnuTLS is seldom used, Exim (with its own problems) being one of the few packages using it. OpenSSL, in contrast, is basically Big Crypto (i.e. the widespread incumbent in TLS), so thankfully this particular vulnerability is likely to have a minimal footprint in most enterprises. Debian has fixed it in buster (version 3.6.7-4+deb10u4), and it’s fixed in Fedora 32, too. RHEL versions <8 are unaffected, but there isn’t yet a patch for RHEL 8 — though you can always upgrade GnuTLS manually.

The bottom line: You don’t need to hit the panic button on this one. This vulnerability is certified juicy — one could even say it’s like a rare STEK — and any client that supports session ticket resumption is vulnerable. But, it’s unlikely that you’re using GnuTLS too widely in your own environment. It’s already patched in most distros, so you should upgrade when you can.


Security Delusions Part 3: Cheat Codes


Organizations are unearthing the potential of digital transformation, but security often remains a gatekeeper to this path of promised potential, largely due to its own delusions about what modern infrastructure means. As Herman Melville wrote in Moby Dick, “Ignorance is the parent of fear” – and security is too frequently hindered by its fear of the new and the agile precisely because of its ignorance about blossoming technologies.

In this blog series, drawn from my QCon talk last year, I will explore the history of infosec’s errant gatekeeping in the face of new technologies, and how we can encourage security to embrace new technologies to enable the business, rather than get in its own way. Part 1 and Part 2 are already published.

Now that we’ve gone through our journey into the history of infosec’s wariness towards cloud computing, and explored security’s present fears about microservices, how should we go forth in this muddled world? How can we evangelize real threat models and real solutions to security issues while prying traditional FUD-borne notions from enterprise infosec’s white-knuckled hands? In this final post of the series, I will detail the “cheat codes” for securing cloud and microservices environments and how to efficiently evangelize these best practices to security teams.

This discussion must start with how infosec perceives the engineers implementing all this newfangled, scary tech. Infosec tends to look at DevOps as reckless, overpowered frenemies rather than allies who could teach them a thing or two about process improvement. As one security professional (who shall remain nameless) said, “DevOps is like a black hole to security teams because they have no idea what DevOps is doing and have no way of ensuring security policy is enforced.” The current conflict is ultimately about control – the fact that security is not exclusively gripping the wheel anymore.

This means that engineers should be cautious when evangelizing cloud infrastructure, APIs, or containers to security folks. When someone is overwhelmed by fear, they will react quite poorly to being told to “calm down,” or that there is nothing to fear. Instead, engineers as well as infra-savvy security professionals must acknowledge that there are valid concerns borne from cloud or microservices environments — just not the ones commonly believed by the infosec industry. 

Cheat codes for cloud, APIs, and container security

What realistic concerns should be highlighted to replace the security delusions I covered in the first two parts of this series? Before we dig into specific best practices for clouds, APIs, and containers, there are three fundamental security tenets to remember, one for each category:

  1. Do not publicly expose your cloud storage buckets (AWS S3, Google Cloud Storage, Azure Storage).
  2. Do not use unauthenticated APIs.
  3. Do not use “god mode” in your containers – minimize access wherever possible.

The fortunate news is that there are established security best practices for all the “super scary” technology – and these best practices should absolutely make infosec’s job easier. If anything, infosec takes on the role of evangelizing and enforcing best practices rather than implementing anything themselves.

IAM as the new perimeter

Analogizing security in cloud or microservices environments to the old, pre-Copernican ways (when the firewall was the center of the security universe) can help translate modern best practices into the language of traditional security professionals. Security groups and network isolation by CSPs are the firewall equivalent. Ingress and egress routes defined through AWS, GCP, or Azure are similar to firewall rules, letting you specify, “No, this resource can only talk to these systems.” It requires trust that the CSPs properly segregate resources, but again: it is a delusion to believe you can do so better than the CSPs.

Leverage your CSP’s tools

For cloud systems, making sure your AWS S3, Google Cloud Storage, or Azure Storage buckets are not available to the public is the most valuable step you can take to avoid data leaks like Accenture’s and Time Warner’s. AWS offers a wealth of tools to help ensure best practices, including Amazon Inspector (looking for deviations from best practices) and AWS Trusted Advisor (provisioning resources using AWS best practices).
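Tenet #1 can also be checked in code. The sketch below is illustrative Python over a simplified policy shape loosely modeled on AWS bucket policies; the statement fields and the `PUBLIC_PRINCIPALS` set are assumptions for the example, not the exact AWS schema.

```python
# Principals that amount to "everyone" across CSPs (illustrative, not exhaustive).
PUBLIC_PRINCIPALS = {"*", "AllUsers", "allAuthenticatedUsers"}

def find_public_grants(statements):
    """Return identifiers of policy statements that grant public access."""
    findings = []
    for stmt in statements:
        if stmt.get("Effect") == "Allow" and stmt.get("Principal") in PUBLIC_PRINCIPALS:
            findings.append(stmt.get("Sid", "<unnamed>"))
    return findings

policy = [
    {"Sid": "OpenToWorld", "Effect": "Allow", "Principal": "*"},
    {"Sid": "TeamOnly", "Effect": "Allow", "Principal": "arn:aws:iam::123456789012:role/team"},
]
```

Running a check like this in CI, against real policies pulled from your CSP's API, turns "don't expose your buckets" from a slide bullet into a gate.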

Ensure the principle of least privilege

The CSP’s IAM roles can help ensure the principle of least privilege when accessing systems. Each provider’s best practices for IAM policies are readily available, only a search away. Segmenting production and development environments by maintaining separate AWS accounts for them is an alternative strategy. Use assumed roles instead of individual users: admins log in as read-only users, and you can create keys with fine-grained permissions without needing a user with a password for each key or service account.

API hygiene habits

Basic API hygiene will suffice for most organizations, consisting of authentication, validation, and the philosophy of not trusting external data. OWASP maintains a valuable “REST Security Cheat Sheet,” and its advice proves far simpler than the tangle of considerations for monolithic apps. For instance, sensitive data like API keys should not appear in the URL – instead, it should be sent in the request body or an HTTP header (depending on the request type). Only HTTPS endpoints should be used, and there should be access control at each API endpoint. Apply allowlists of permitted HTTP methods for each endpoint.
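A minimal request-hygiene gate might look like the following hypothetical Python; the endpoint map and field names are invented for illustration, not drawn from any framework. It rejects non-allowlisted methods, keys leaking into the URL, and unexpected content types.

```python
# Hypothetical per-endpoint policy: allowed methods and expected content type.
ENDPOINT_POLICY = {
    "/widgets": {"methods": {"GET", "POST"}, "content_type": "application/json"},
}

def check_request(method, path, query_params, headers):
    """Return a list of hygiene violations for one request (empty list = clean)."""
    policy = ENDPOINT_POLICY.get(path)
    if policy is None:
        return ["unknown endpoint"]
    problems = []
    if method not in policy["methods"]:
        problems.append("method not allowlisted")
    if "api_key" in query_params:  # secrets belong in a header or the body
        problems.append("API key exposed in URL")
    sent_type = headers.get("Content-Type", "").split(";")[0].strip()
    if method in {"POST", "PUT", "PATCH"} and sent_type != policy["content_type"]:
        problems.append("unexpected content type")
    return problems
```

In practice this logic lives in middleware or an API gateway, but the checks themselves are this small.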

Granular allowlisting in microservices

In the vein of API hygiene, ensure you validate input and content types. Do not trust input as a rule, so add constraints based on the type of input you are expecting. Analogize this to any traditional infoseccers as a form of granular allowlisting – previously impossible with monoliths, but now possible with microservices. Explicitly define what content types are intended and reject any requests with unintended content types in the header. This also engenders a performance benefit and is often part of API definition anyway – again, making the security team’s job much easier.

God is not a mode

For containers, the most prevalent “threat” is misconfiguration – just as it is for cloud and APIs. Much of the security best practice for containers is related to access management, a common theme across modern technologies. Do not expose your management dashboards publicly. Do not let internal microservices remain unencrypted – use of a service mesh can reduce friction when implementing encryption.

Crucially, do not allow “god mode” or anonymous access in your containers – and generally make your access roles as minimal as possible. Any CISO will be very familiar with the concept of least privilege already. Do not mount containers as root with access to the host. Disable your default service account token. Enforce access control on metadata. These amount to the new security “basics” in the modern era.

Your CI/CD is your new patch manager

Patching becomes palpably easier with containers – which can be argued as an antidote to the “Equifax problem,” in which procrastination due to the friction of taking systems out of production to patch them contributes to an incident. Continuous releasing means versions will be upgraded and patched more frequently – and container patching can be baked into CI/CD pipelines themselves. Any infosec team should be delighted to hear that containers let you patch continuously and automatically, removing them from the awkward position of requesting downtime for necessary security fixes.

Leverage containers for resilience and visibility

The fact that containers are managed through images in a registry removes work for security, too. The container image can be rolled out or rolled back, which should add a feeling of control for infosec teams. Further, visibility into which containers are affected by emerging vulnerabilities is much easier – container registries can be scanned to see which containers are vulnerable, instead of scanning production resources directly. And, live migration becomes possible through slowly moving traffic to new, healthy workloads from existing, vulnerable workloads, without any impact on the end user.

It will be hit or miss whether your organization’s security team really understands containers. You can try using the example of updating a Windows laptop to provide an analogy to live migrations. Usually, you have to shut down Word or PowerPoint and disrupt your work. Instead, imagine the Word document migrates to an updated OS in the background, followed by the PowerPoint presentation, until all the work is moved to the patched OS. Now, the unpatched OS can be safely restarted without interrupting work.

Codify secure configurations

It is critical for enterprise infosec teams to help codify secure configurations and enforce all of these best practices. This is the modern equivalent of crafting security policy templates (but less painful). Infosec teams can lead the charge in documenting threat models for standardized APIs, containers, and other resources. They should start with the scenarios that would be most damaging to the business, such as customer data being leaked, data loss, or disruption of service, and then work backwards to the most likely avenues for attackers to accomplish those feats.

Prioritize protecting prized pets

Infosec teams should put additional effort into securing prized “pets” (vs. cattle), which are enticing to attackers and less standardized. As shown through the surveys mentioned in the prior post, visibility is one of the most coveted capabilities among enterprise infosec teams, and is crucial for protecting prized pets. However, the types of tools that could provide the right visibility for infosec teams are often already used by operations teams seeking to optimize performance. This is a propitious opportunity for security and DevOps to collaborate, with the benefit of sparing budget and integration work required by removing duplicate functionality.

Build your audit use cases

Hitting the right compliance boxes can encourage adoption of modern tech as well, since compliance is a consistent budget item. File integrity and access monitoring (known as “FIM” and “FAM”) is an underpinning of nearly every compliance standard, from PCI to HIPAA to SOX. FIM/FAM requires monitoring and logging of file events for a few different purposes, but primarily to catch unauthorized modification of sensitive data (a violation of data integrity) and to create audit trails of which users accessed sensitive data (to preserve data confidentiality).

Because of the improved inspectability of containers, FIM/FAM becomes easier – even without a tool like Capsule8, which does it for you. Because microservices are distilled into simpler components than in a monolithic application, it is easier to pinpoint where sensitive data is being handled, helping target monitoring efforts. Demonstrating the ease with which visibility is obtained can help assuage concerns about control. Note, however, that infosec professionals are less familiar with the term “observability,” so translation is required when collaborating.
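To make the core FIM mechanic concrete, here is a minimal sketch – hand-rolled from the standard library, not any particular product’s API, and with hypothetical file paths – of baselining file digests and flagging unauthorized modification:

```python
import hashlib
from pathlib import Path

def baseline(paths):
    """Record a SHA-256 digest for each monitored file (the FIM baseline)."""
    return {p: hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}

def detect_changes(known):
    """Recompute digests and report any file whose contents diverged."""
    changed = []
    for path, digest in known.items():
        current = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if current != digest:
            changed.append(path)
    return changed
```

A real deployment would watch file events (e.g. via inotify) and log accesses for the audit trail, but the integrity half of FIM/FAM reduces to exactly this comparison.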

Caveats and cautions

Each CISO and infosec team maintains different priorities and possesses different skills, so not every tactic here will necessarily be effective for every team. Some teams prioritize compliance work, others seek to rigorously define policy, and yet others are only familiar with maintaining network security equipment and SIEMs. Many enterprise infosec practitioners will be more proficient with Windows than Unix, think in a network-centric model, and rarely develop anything themselves. Therefore, patience, analogies, and proof that not all control is lost will be critical in gaining buy-in.


It is hard to let go of long-held beliefs, and the firewall-centric model in a well-understood world of monoliths is tricky to dislodge from the heart of enterprise information security. Many of infosec’s fears over modern technology can be distilled into fears over losing control. For those in DevOps functions looking to help infosec evolve — or security professionals wanting to help their teams enter the modern era — assuaging those fears by redirecting control from grasps at threat phantasms towards tangible, meaningful threat mitigation is an essential step forward.

Work together to build secure standards for APIs and containers, to document appropriate cloud configurations, and to create threat models that can help continuously refine design towards more secure outcomes. Enterprise infosec teams, freed of many maintenance burdens through native controls and standards, can now focus on securing the “pets” in this modern world. Security will no longer herd cats and cattle, but instead be an evangelizer and enforcer of best practices.

Everyone maintains delusions in one fashion or another, but I sincerely believe we are not bound to them like Andromeda chained to a rock in the stormy sea. Information security can survive this Copernican revolution of cloud and microservices, but they could use their Perseus to save them from their Cetus – the devouring fear fueled by the siren song from infosec vendors urging them to succumb to dread. My hope is my guidance throughout this series can help us unchain infosec, allowing them to go forth into a new dawn of secure and resilient software delivery performance.


[1]:  If you don’t feel like Googling for them, here are the links to each: Security Best Practices in AWS IAM; Using Google Cloud IAM securely; Azure identity & access security best practices

[2]:  These are often required by compliance, and most CISOs should have familiarity with them.

Security Delusions Part 2: Modern Monsters

Posted by

Organizations are unearthing the potential of digital transformation, but security often remains a gatekeeper to this path of promised potential, largely due to its own delusions about what modern infrastructure means. As Herman Melville wrote in Moby Dick, “Ignorance is the parent of fear” – and security is too frequently hindered by its fear of the new and the agile precisely because of its ignorance about blossoming technologies.

In this blog series, drawn from my QCon talk last year, I will explore the history of infosec’s errant gatekeeping in the face of new technologies, and how we can encourage security to embrace new technologies to enable the business, rather than get in its own way. You can read part one here.

Now that we have explored infosec’s history of cloud compunction, we can turn to the new looming beast for security teams to face: microservices.

This darkling terror security harbors in its heart is that microservices create a titanic, labyrinthian attack surface. It is as if they believe that each microservice adds the same attack surface as a traditional monolithic application – and thus, with thousands of microservices, the attack surface of the monolith days is multiplied a thousandfold as well. Through this lens, it is understandable why microservices would be absolutely terrifying – but this mental model is, of course, wildly mistaken.

In this infosec Copernican Revolution, it is exceedingly difficult for security to let go of the perimeter model. Although proven false countless times, the pervading belief was – and still often is – that if the perimeter is secure, then the enterprise will be safe. That belief was always illusory. Lateral movement was so pernicious because once attackers bypassed perimeter defenses, the only defense they encountered was #yolosec, giving them free rein over internal networks.

While security is lamenting the dissolution of the perimeter and the daunting monster that is microservices, they completely miss that microservices force the realization of the dream security purportedly held for so long – that security would be baked-in rather than bolted-on. Because microservices are typically considered publicly-facing by default, no one can rest on the assumption that perimeter defenses can save them – thus turning native security controls into the necessary default rather than a nice-to-have.[1]

Let us now turn to two essential components of microservices environments to explore the security delusions about each individually: APIs and containers.

APIs: Infosec’s Anathema

In a November 2018 survey by Ping Identity on API security concerns[2], 51% of respondents noted that they are not certain their security team knows about all the APIs in their enterprise’s network. Certainly, developers are now opening many API endpoints – but that does not differ from the prior mode of developers opening particular ports on an internal network. 30% of respondents said they do not know if their organization has experienced a security-related incident involving their APIs – and I suspect that 30% would not know whether they had been compromised at all, APIs or otherwise.

CISOs are particularly fraught over the idea of public APIs – that they add attack surface, that they are so close to the grasp of attackers, that it is impossible for security to have control over all of them. As one security professional publicly opined, “Formerly, local networks had only a few connections to the outside world, and securing those endpoints was sufficient.”[3] That, in fact, was never truly sufficient. This strategy resulted in local networks that were astonishingly brittle because of the assumption that network security controls would prevent anyone from gaining access.

Infosec practitioners will cite related fears that APIs can provide a “roadmap” for underlying functionality of the application, and that this roadmap can aid attackers. These fears are, quite frankly, ridiculous. Any legitimate security expert will caution that the “security through obscurity” approach is a terrible idea. Hiding functionality does not make your app inherently secure or insecure. However, if infosec teams are concerned about this, there is a high degree of certainty that the app is not designed to be resilient – which is a failure of the infosec program.

As I advocated in my previous research on resilience in infosec, the only way to ensure resilient systems from a security perspective is to assume that your added security controls will fail. Specifically, I recommended treating any internal or private resources as public – because otherwise you will bask in a false sense of security when your controls are inevitably bypassed. It is eye-opening how few enterprise security teams traditionally treat their internal or private resources in this way, as if there was not extensive documentation of attackers bypassing network security tools.

Further, what security practitioners often do not realize is that standard OWASP-based attack tools (such as Burp or Nessus) do not work nearly as well on API endpoints, because there are no links to follow, no attack surface to map, unknown responses, and potentially no stack traces. What is more, for RESTful JSON APIs, whole classes of vulnerabilities around cross-site scripting (XSS), session management, compromised cookies, or protecting tokens are removed through the use of digest authentication and JSON Web Tokens (JWTs). If anything, API-centric apps abate application security (appsec) concerns rather than aggravate them.
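Roughly how that token mechanism sidesteps cookie and session headaches can be sketched with the standard library – a toy HS256-style signer for illustration only; a real service should use a vetted JWT library rather than hand-rolled crypto:

```python
import base64, hmac, hashlib, json

def _b64url(data: bytes) -> str:
    # JWTs use unpadded base64url encoding for each segment.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign(payload: dict, key: bytes) -> str:
    """Produce a JWT-shaped token: header.payload.signature."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    sig = _b64url(hmac.new(key, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify(token: str, key: bytes) -> bool:
    """Recompute the signature; any tampered segment fails verification."""
    header, body, sig = token.split(".")
    expected = _b64url(hmac.new(key, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)
```

Because the token is self-verifying on every request, there is no server-side session to fixate or cookie to steal – which is the author’s point about removed vulnerability classes.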

One of the performance benefits of a microservices approach is borne out of standardization – and standardization also begets security benefits. However, standardization is not a common, nor commonly understood, topic among enterprise infosec professionals. They still live in the tailored and monolithic universe, not grasping that there can be a singular, well-developed API deployment that can be replicated – thus reducing their work down to rigorously testing the single API deployment until they are comfortable with its security posture. Standardization is a prevalent factor in the world of containers, as well – and is one no less fraught with security concerns.

The Curse of Containers

This new world of public-facing API connections is not the only aspect of modern technology met with condemnation and trepidation by enterprise information security – containers themselves are seen as quite a grave bouquet of threats.

Not every infosec professional realizes that containers are not, in fact, featherweight virtual machines (VMs). Frequently asked questions, as noted by Mike Coleman, may include “How do I implement patch management for containers running in production?” or “How do I back up a container running in production?” – questions that evince a misunderstanding of the nature of containers. They do not know that persistent data lives in a separate volume that gets backed up, or that you patch the container image rather than the actively running container.

A recent survey by Tripwire[4] incidentally exposes this confusion among information security professionals. 94% of respondents have concerns regarding container security – and this “lack of faith” has led 42% to delay or limit container adoption within their organization. The leading reason (54%) among respondents for their security concerns is inadequate container security knowledge among teams – and we should be grateful they at least acknowledge that their lack of understanding is a contributing factor.

Source: Tripwire

The remaining concerns include visibility into container security (52%), inability to assess risk in container images prior to deployment (43%), lack of tools to effectively secure containers (42%), and the most nebulous one: insufficient process to handle fundamental differences in securing containers (40%). I, for one, am deeply curious to know what they perceive these fundamental differences to be, given prior erroneous beliefs about cloud security.

To crystallize the confusion and anxiety, the survey results around infosec professionals’ desired security capabilities for containers are worth exploring, too. 52% quite reasonably desire incident detection and response – something we (Capsule8) provide. Another reasonable request, by 49% of respondents, is for isolation of containers behaving abnormally. Regrettably, 40% also want “AI security analytics” for containers, and 22% want blockchain to secure containers, so we can presume somewhere between 9% and 12% are sane, and at least 22% have absolutely no idea what they are doing.

Source: Tripwire

Beyond survey data, a straw man frequently suggested by infosec is that each container requires its own monitoring, management, and securing, leading to time and effort requirements that spiral out of control. The whole point of containers is for them to be standardized, so such claims directly ignore the purpose of the technology. Yes, they need to be monitored – but were you not monitoring your existing technology?

A cited fear of standardization itself is that vulnerabilities can be replicated many times as source code is used repeatedly. This ignores the status quo. Testing containers is still monumentally better than having developers write random queries every time in different parts of the application stack. At least in a container, you can find the vulnerabilities easily and orchestrate a patch to all relevant containers. Good luck finding the vulnerability in a custom-built Java app with intricate functionality.

It is as if infosec forgot the trials and tribulations of dealing with monolithic applications, as they now will cite that “you know exactly where the bad guys are going to try to get in” because there was one service and a couple of ports. They apparently have not heard that “complexity is the enemy of security,” or have conveniently forgotten the mantra.

In a monolithic application, workflows are enormously complex, making it extremely difficult to understand every workflow within it – meaning it is nearly impossible to understand how workflows can be taken advantage of by attackers. Because microservices represent one workflow each and are standardized, they can be mapped out in an automated fashion, making threat models considerably easier. For instance, JSON mapping and Swagger are designed to describe exactly how APIs interact, and modern web appsec tools will ingest these maps to understand an app’s API endpoints.
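To illustrate how such maps enable automated threat modeling, here is a minimal sketch of walking an OpenAPI/Swagger-style document to enumerate an app’s endpoints – the spec fragment below is hypothetical, but its shape follows the OpenAPI “paths” structure:

```python
# A tiny, hypothetical OpenAPI-style spec fragment.
spec = {
    "paths": {
        "/users": {
            "get": {"summary": "List users"},
            "post": {"summary": "Create user"},
        },
        "/users/{id}": {
            "get": {"summary": "Fetch one user"},
        },
    }
}

def enumerate_endpoints(spec: dict):
    """List every (method, path) pair the API exposes - an automated attack-surface map."""
    return sorted(
        (method.upper(), path)
        for path, ops in spec.get("paths", {}).items()
        for method in ops
    )
```

This is the essence of what modern appsec tooling ingests: rather than crawling links, it reads the declared surface directly and tests each operation against the standard.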

Another vital, but overlooked, benefit of containers for security teams is immutability and ephemerality (as discussed in my Black Hat talk last year). An immutable container is one that cannot be changed after it is deployed — so attackers cannot modify it as it is running. An ephemeral container is one that dies after completing a specific task — leaving only a short window of opportunity for attackers to do their thing. Both characteristics embed security by design at the infrastructure level, and are far easier to implement with containers than with traditional monolithic applications. 

If you segregate identity and access management (IAM) roles in Amazon, containers can only talk to each other based on what you specify, removing network services from your systems. Any infosec professional pretending authentication between microservices is not easy is either lying or has not actually attempted to learn how to do it. The shared environment of containers – much like the shared environment of cloud that infosec fretted over before it – is a frequent fear as well. This, too, forgets history.

Before, your systems would talk over FTP, telnet, SSH, random UDP ports, port 80 talking to other things – but now, all that network mapping is removed because you are using TCP, authenticated APIs, and HTTP standards. Using containers, someone needs to pop a shell in (a.k.a. compromise) your web server infrastructure, whereas before, they could get in through a running FTP service alone.

The update process for containers also concerns infosec practitioners – specifically, that it is still too easy for developers to use vulnerable versions of software. I ask: this is in contrast to what paradigm? When people were still using versions of Windows Server 2008 that were built with Metasploit backdoors ready to go? Software versioning is, was, and probably will always be an issue – containers or otherwise. Pretending this is a new issue is disingenuous. And containers present an opportunity in this regard — that you can ensure your software is compliant, secured, and patched before the workload even spins up.

In this modern world, you do have multiple services of which you must keep track, but you are also separating out complex functionality into separate services. With big, complicated applications, one of the key issues when moving from staging to production was tracking every single place where you needed to remove, for instance, stack traces. If you deploy in a container-based environment, you have a build for stage and a build for production, and you can track exactly what the systems are, building the API on top of them.


During this exploration of these “modern monsters,” we saw that the security industry’s present fears of microservices (both APIs and containers) do not match a realistic threat model. Unlike with the concerns over cloud computing, it seems security teams are less reticent to acknowledge that part of their hesitation is driven by a lack of understanding — and acknowledging the problem is a necessary first step on the path to recovery.

Unfortunately, security’s apprehension of microservices is also withholding opportunities for security teams to leverage microservices to improve organizational security. Promoting standardized APIs should reduce a whole host of security headaches — moving away from manual security reviews across knotted monoliths towards automated security that checks whether an API endpoint adheres to the defined standard. While containers are certainly not secure by default, they present an opportunity to scale security workflows — as well as raise the cost of attack through their ephemeral nature.

It is all well and good to document the anxieties of infosec teams, but what can we do to handle these concerns? In the final part of this series, I will dive into the cheat codes for dealing with all of this — including recommendations on best practices for securing modern infrastructure.

Read part 3 of this series: Cheat Codes


[1]: A caveat here is that typical internal microservices often will not use encryption because of certificate challenges that create friction for engineers. Yet, this is undesirable, and will certainly panic your security team if done.

[2]: Canner, B. Solutions Review. (2018, November 19). Ping Identity Releases Survey on the Perils of Enterprise APIs. Retrieved from

[3]: Because of their apparent predilection for espousing FUD, I am not naming them so as to not give them more attention.

[4]: Tripwire. (2019). Tripwire State of Container Security Report. Retrieved from

eBPF’s Rollercoaster of Pwn: An Overview of CVE-2020-8835

Posted by

Last Friday, Manfred Paul published a blog post about the vuln he used at Pwn2Own 2020, CVE-2020-8835, a local privilege escalation bug in the Linux kernel. It affects any Linux distro running kernel 5.5.0 or newer.

Why it’s cool: eBPF is the Hacker News hotness for tracing (i.e. monitoring execution of) the Linux kernel, so a vuln in it is guaranteed to gain attention. Because the bpf syscall, by design, facilitates data sharing between userspace and kernel space, exploiting this vuln means the attacker only needs to load their malicious eBPF program once to go HAM into the good night.

The underlying problem is due to bad math on the part of a runtime security control (the “ALU sanitizer”), which was supposed to ensure that eBPF programs could only access memory within the appropriate boundaries, but failed to do so. Unfortunately, the security of eBPF programs largely relies on the assumption that the verifier and sanitizer work as intended…

Digging deeper: You can think of eBPF programs kind of like rollercoasters. The eBPF bytecode is a design for a rollercoaster. This bytecode gets turned into assembly that runs in the kernel, like a rollercoaster being installed in a park. The JIT verifier, combined with the ALU sanitizer (a runtime check intended as a last layer of defense), makes assumptions about the safety of this eBPF bytecode, either allowing or denying it to run — similar to the role of a rollercoaster architectural review firm.

So, if you’re a theme park designer, you probably (hopefully!) don’t want people to die on your rollercoasters. So, you enlist an architectural review firm to verify (approve or deny) the safety of each of the arbitrary ride designs you create. Importantly, you, as the park designer, are relying on that firm to find proof of safety before you install any rides to ensure they are rollercoasters, not yolocoasters.

If the architectural review firm makes incorrect assumptions while proving the safety of a ride’s design, it may result in a coaster being installed that goes off the rails, yeeting its passengers off the map. This vulnerability, in essence, allows the attacker to trick the architectural review firm into making these poor assumptions — thereby allowing installation and operation of a wildly dangerous coaster that makes the system scream, “I want to get off MR BONES WILD RIDE!”

In computer nerd terms, the verifier performs static analysis on the eBPF program prior to JIT compilation and loading of the program — it attempts to form proofs on the eBPF program’s range of memory accesses to avoid the overhead of bounds checking at runtime. This is akin to the architectural review firm forming proofs of a rollercoaster design’s safety to avoid having to add super expensive safety systems post-installation.
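As a loose illustration of that static range analysis – a toy model for intuition, not the kernel’s actual verifier – imagine tracking each register as an interval [lo, hi] and approving a memory access only if the whole interval stays in bounds:

```python
# Toy model of verifier-style range tracking: a register's possible values
# are an interval (lo, hi), and an access is "proven safe" at load time
# only if every value in that interval lands inside the buffer.

def check_access(reg_range, offset, buf_size):
    lo, hi = reg_range
    return 0 <= lo + offset and hi + offset < buf_size

# The verifier proves an index in [0, 7] is safe for an 8-byte buffer...
assert check_access((0, 7), 0, 8)

# ...but if faulty arithmetic wrongly narrows the tracked range (the
# CVE-2020-8835 class of bug), a runtime value outside the assumed range
# sails past a check that was never re-done at runtime:
assumed = (0, 7)     # what the broken sanitizer believes
actual = 100         # what the register can really hold at runtime
assert check_access(assumed, 0, 8)                  # approved at load time
assert not check_access((actual, actual), 0, 8)     # genuinely out of bounds
```

The exploit works precisely because the proof happens once, up front: a wrong interval means the OOB read/write is never caught when the JIT’d program actually runs.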

However, incorrect assumptions underpinning the verifier’s proof, combined with the faulty math by the ALU sanitizer, results in the ability to perform out-of-bounds (OOB) reads and writes — like our rollercoaster track hurtling off into the sky. Because the eBPF program’s JIT’d assembly is running in kernel space, these OOB reads and writes allow escalation from the bpf syscall straight to ring 0 — giving the attacker full (i.e. root) access on the system. 

Yes, but: It’s a local vuln, not remote, so the attacker still requires another vuln to gain an initial foothold on the system. Additionally, Docker’s default seccomp profile blocks the bpf syscall by default, so hopefully people listened to Jessie Frazelle’s advice and started following it at some point over the last four years.  

This vulnerability had a fairly short life, and only affected a handful of distributions and releases, none of which are super popular to run in production: the non-LTS (Long Term Support) Ubuntu 19.10, Debian unstable, and Fedora (which uses a bleeding-edge kernel that, unsurprisingly, comes with bleeding-edge bugs). 

There are patches out already for Ubuntu, Debian, and Fedora, and RHEL 5, 6, 7, and 8 aren’t affected anyway (because they didn’t backport the commit that originally introduced the issue). In fact, in RHEL, unprivileged users aren’t allowed to access the bpf syscall by default. This isn’t the case in Fedora, so they recommend disabling unprivileged access to the bpf syscall by setting the following sysctl variable: 

# sysctl -w kernel.unprivileged_bpf_disabled=1

The bottom line: This weapon of math destruction is available to motivated attackers only, due to the initial foothold required to take over the system via an eBPF program. While the impact footprint is somewhat limited, given the nascency of the feature being exploited, we recommend patching promptly. And it certainly wouldn’t hurt to run the default seccomp profile in your Docker containers, too (remember – seccomp is off by default in Kubernetes).

For Capsule8 customers, we detect when BPF programs are loaded and executed, so you’re already covered.

The Capsule8 Labs team conducts offensive and defensive research to understand the threat landscape for modern infrastructure and to continuously improve Capsule8’s attack coverage.

Security Delusions Part 1: A History of Cloud Compunction

Posted by

Organizations are unearthing the potential of digital transformation, but security often remains a gatekeeper to this path of promised potential, largely due to its own delusions about what modern infrastructure means. As Herman Melville wrote in Moby Dick, “Ignorance is the parent of fear” – and security is too frequently hindered by its fear of the new and the agile precisely because of its ignorance about blossoming technologies.

In this blog series, drawn from my QCon talk last year, I will explore the history of infosec’s errant gatekeeping in the face of new technologies, and how we can encourage security to embrace new technologies to enable the business, rather than get in its own way.

Let us take a trip down memory lane, back to the early 2010s when “cloud transformation” reached sufficient significance to warrant concern by security professionals. These concerns presented as simplistically as fears of “storing data online,” extending to fears of shared resources, data loss, insider threat, denial of service attacks, inadequate systems security by cloud service providers (CSPs), and supply chain attacks.

However, the crux of the matter was rooted in a loss of control. No longer would security teams maintain the security of infrastructure themselves. No longer would their programs be anchored to firewalls. While the general IT question of the moment was usually, “What happens if our connectivity is interrupted?”, the question for IT security was, “How can we keep things secure if they aren’t directly under our control?”

For those of you who joined infosec more recently or who are interested observers from other disciplines, you may wonder why the prior model fostered such a sense of control. Traditional information security programs centered around the firewall – the first line of defense for the organization’s internal network perimeter, the anchor of the perimeter-based strategy, the key producer of netflow data that populated dashboards and provided signal for correlation across products.

Figure 1 – Fittingly in Comic Sans
Figure 2 – The Next-gen Firewall (NGFW) did not change things much

The Defense in Depth model became quite popular, one that advised a “multi-layered approach” to security (which is not wrong in the abstract). The first line of defense was always network security controls, starting with the firewall and its rules to block or allow network traffic. Intrusion prevention systems (IPS) worked just behind the firewall, ingesting data from it to analyze network traffic for potential threats. Fancier enterprise infosec programs segmented the network using multiple firewalls – what a SANS Institute paper called the “holy grail of network security.”[1]

But the transition to cloud erodes the traditional enterprise perimeter, and thus erodes the firewall’s position as center of the security universe. Thus, one can view the cloud transition as a Copernican Revolution for enterprise information security. And with such a shift, it is perhaps natural for enterprise infosec teams to reject it, wary of their relevance in this new world.

Survey data throughout the years covering infosec’s skepticism towards cloud helps fill in this picture. In 2012, Intel performed a survey on “What is holding back the cloud?”, discovering that the top three security concerns regarding private cloud were all related to control[2]. 57% of respondents cited concern over their inability to assess the security measures implemented by CSPs, 55% cited lack of control over data, and 49% cited lack of confidence in the provider’s security capabilities.

Source: Intel

After the loss of control followed concerns of “lack of visibility into abstracted resources,” and “uneasiness about adequate firewalling.” Hypervisor vulnerabilities were widespread concerns for both public and private cloud – though with the benefit of hindsight, they never materialized for the typical threat model (even today). And concerns about adequate firewalling more than anything reveal the stickiness of the network perimeter security model.

By 2014, 66% of security professionals surveyed by Ponemon[3] said their organization’s use of cloud resources diminished their ability to protect confidential or sensitive data. 64% said cloud makes it difficult to secure business-critical applications, and 51% said the likelihood of a data breach increases due to the cloud. In 2015, a survey by the Cloud Security Alliance (CSA) highlighted that 71% of respondents view the security of cloud data as a big red flag. 38% said their fear over loss of control kept them from moving data into cloud-based apps – thankfully fewer than believed the same in Intel’s 2012 poll.

Source: Ponemon

Distilling these fears, it is absurd in hindsight that enterprise defenders could believe that a few people maintaining a firewall could outmatch the security efforts and measures of Amazon, Google, or Microsoft. Only the endowment effect – people overvaluing things they already possess – with a bit of sunk cost fallacy mixed in could lead to such a hubristic conclusion.

Looking back, seldom were the major CSPs hit by publicly-disclosed data breaches. Salesforce has no known major data breaches, outside of disclosure that attackers were using fake websites to phish customers in 2014. Heroku, a Salesforce subsidiary, disclosed a vulnerability in early 2013 that could lead to the potential to access customer accounts – but they did not appear to possess proof of an actual breach. AWS, GCP, and Azure have no known breaches outside of customer misconfiguration.

The most notable CSP breaches include Dropbox in 2012 (68 million usernames + passwords), Evernote in 2013 (50 million usernames), and Slack in 2015 (500 thousand usernames). Yet, these were all breaches of user account databases, rather than evidence of customer accounts or storage repositories being breached themselves.

Despite these sentiments, cloud adoption inexorably marched onwards, and security teams mostly had to shut up and deal with it. The notion that security teams could secure infrastructure better than Amazon, Microsoft, or Google finally became fringe – but truthfully it only did so within the past two years or so, well after operations teams realized that the CSPs could provide more performant infrastructure than most could manage on their own.

The reality of cloud systems is that misconfigurations present the biggest concerns, such as an S3 bucket that is accidentally publicly exposed. Gartner indeed suggests that “through 2020, 80% of cloud breaches will be due to customer misconfiguration, mismanaged credentials or insider theft, not cloud provider vulnerabilities.”[4] Luckily, there are considerable resources to assist with the misconfiguration problem – more on that in the third part of this series.

Another reality is that security operating expenses can decrease when using the CSP’s native security controls, according to McKinsey research[5]. This research suggests that an enterprise with an annual budget of $200 million would spend just under $12 million per year on security, which is $5 million less than they would spend if they did not use the CSP’s native security controls.

This is not surprising. One large security vendor’s web malware protection system sells for over $100,000, as does its email protection system, both of which are deployed as blinky boxes on the network. A larger security vendor’s next-gen firewall (NGFW) starts at $50,000, though higher throughput models quickly reach $150,000 or more. An even larger security vendor’s firewall (with five year support) is priced near the $200,000 mark.

One might assume that a transformation resulting in less hardware to manage and less expense would be a welcome one – but this was not the case for enterprise infosec teams and cloud transformation. Of course, some CISOs readily embraced the potency and efficiency of cloud adoption, but even today, you can still find CISOs reticent to acknowledge cloud’s security benefits. 


During this history lesson, we saw that the security industry’s palpable fears of cloud computing did not match the eventual reality. Infosec was not only late to the cloud party, but often unnecessarily stalled organizational transformation, too. 

Much of security’s reluctance was driven by status quo bias — the stickiness of the defense-in-depth perimeter security model that gave security considerable control, and thus a sense of comfort. Importantly, that sticky bias led (and still leads!) many security teams to attempt recreating the same old-school model in new environments. Such an approach defies resilience, and serves as a warning for how security will handle the adoption of other tech today and in the future.

Now that we are faced with the rapid adoption of APIs and containers in the enterprise, will history repeat itself? In the next part of the series, I will explore how infosec is currently responding to microservices (APIs and containers) and what delusions are being conjured…

Read post 2 in this series: Modern Monsters


[1]: Bridge, S. (2001). Achieving Defense-in-Depth with Internal Firewalls. Retrieved May 2019.

[2]: Intel, Intel IT Center Peer Research. (2012). What’s Holding Back the Cloud?

[3]: Ponemon Institute LLC. (2014). Data Breach: The Cloud Multiplier Effect. Retrieved May 2019.

[4]: A bunch of vendors cite this quote, but I cannot find it directly via Gartner. I am assuming it is behind Gartner’s paywall.

[5]: Elumalai, A., et al., McKinsey Digital. (2018). Making a secure transition to the public cloud. (Note that these statistics are only true if the apps are rearchitected for the cloud in parallel.)

RAMming Down Hype via Intel CSME

Posted by

Recently, security researchers found new vectors of exploiting a vulnerability in Intel CSME, CVE-2019-0090, affecting all Intel chips other than Generation 10 (Ice Lake). The researchers haven’t released exploitation details yet, but proclaimed that “utter chaos will reign”… though not by exploiting this vulnerability! Instead, there’s a potential for chaos if attackers figure out how to exploit secure storage (which these researchers didn’t) and leverage the same whisper of time — in which protection mechanisms aren’t loaded — that enables the disclosed attack.

With a statement so clearly designed to incite panic, the elder Panic Button gods told us it would be sacrilegious not to cover this vuln, so here we are.

Why the freak out? This vulnerability is “unfixable,” because it’s in hardware that is unchangeable by design once deployed. Most of its hype-worthy potential impact — shattering Intel’s root of trust on systems using any of the affected chips — is based on the assumption that attackers will figure out how to leverage a tiny window where this root of trust is unguarded from external devices. Specifically, the assumption is that attackers will figure out how to compromise the Secure Key Storage (SKS) in which the hardware encryption key — the key that validates system firmware and protects against tampering — resides.

Wait, back up! What’s “root of trust”? A root of trust is a trusted source that can validate that a system’s firmware is operating as intended and hasn’t been tampered with. It’s usually implemented as a hardened hardware module that checks the cryptographic signatures of firmware before the system boots up. By design, a root of trust can’t be modified or updated, otherwise it’d be much easier for attackers to seize control of the validation chain for themselves (see this talk from 2012 for examples).

In Intel chips, the Converged Security and Management Engine (CSME) is the subsystem which loads and verifies firmware. It has its own firmware and isolated execution environment, allowing it to perform secure boot, trusted execution of security services, overclocking, and more. We recommend perusing the slides from two Intel security researchers at Black Hat 2019 if you want a deeper dive into the CSME.

Importantly for understanding the “utter chaos” vuln in question, the CSME is the gatekeeper of static RAM (SRAM), where encryption keys and other system treasure is stored. To ensure external or internal devices can’t directly reach this prized SRAM (though the range of devices which can access SRAM vs. main memory is limited, as we’ll see), the CSME uses the input-output memory management unit (IOMMU) to control direct memory access (DMA).

Got it, so how does this attack work? When you turn on the system, the CSME’s ROM is the first thing to start. To oversimplify, ROM basically says, “Good morning, I’m going to generate an encryption key and the map of my memory addresses, and stick both over here in the SRAM box so no one else can touch them.” Then, the ROM Boot Extension (RBE) says, “Greetings, I am going to load and execute the CSME operating system.” Then, the micro kernel (aka uKernel) starts performing enforcement of code execution and process isolation — and, of most relevance for this vulnerability, drives the IOMMU.

As you can see, there’s a brief window of time where the encryption key and memory goodies are sitting in SRAM, but the uKernel is still making its coffee — meaning anti-DMA attack protections aren’t in place. Therefore, an attacker could wait for or trigger the ROM to say “good morning!”, perform code execution in the Integrated Sensor Hub (ISH), and then gain access to the data structure that controls the firmware validation process. This would grant the attacker arbitrary code execution at basically the lowest level of system operation (i.e. escalating privileges to Ring -2).
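The sequence above can be caricatured in a few lines of Python. This is purely a toy model of the race window, not Intel's actual implementation; every name in it is invented for illustration:

```python
# Toy model of the boot-ordering race described above. DMA protection
# only exists once the uKernel enables the IOMMU, so any DMA attempt
# landing before that point succeeds.

class ToyCSME:
    def __init__(self):
        self.sram = {}            # secrets land here early in boot
        self.iommu_enabled = False

    def rom_init(self):
        # ROM generates the encryption key and its memory map first...
        self.sram["hw_key"] = "secret-key-material"
        self.sram["memory_map"] = "sram-addresses"

    def ukernel_init(self):
        # ...and only later does the uKernel drive the IOMMU.
        self.iommu_enabled = True

    def dma_read(self, addr):
        # A DMA-capable device (e.g. a compromised ISH) reading SRAM.
        if self.iommu_enabled:
            return None  # anti-DMA protections now block the access
        return self.sram.get(addr)

csme = ToyCSME()
csme.rom_init()
leaked = csme.dma_read("hw_key")   # the window: ROM done, uKernel not yet
csme.ukernel_init()
blocked = csme.dma_read("hw_key")  # after uKernel boots: access denied
```

The entire attack hinges on that gap between `rom_init` and `ukernel_init`; everything after it behaves exactly as Intel intends.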

The researchers also note that many of the IOMMU mechanisms within the CSME are disabled by default. If these mechanisms of controlling access to SRAM aren’t flipped on, this means you’re not getting protection from DMA attacks — which leaves the CSME open to tampering via a side channel (i.e. a malicious external device connected to the system).

Yes, but: Arbitrary code execution is bad! But exploiting this vulnerability requires, at a minimum, local access, compounded by the attacker needing to exploit a relevant device to gain a foothold on the system. This list of valid footholds is quite limited. For instance, an attacker would need to perform code execution in the ISH or other Platform Controller Hub (PCH) devices — exploiting PCIe devices (like GPUs or RAID controllers) wouldn’t suffice. Additionally, per the original blog post, other methods of exploitation require physical access. Either way, this is limited to incredibly motivated and well-resourced attackers (like a nation-state with a high-value target identified).

For real, though: The headlines and hype are primarily driven by the speculative part of the researchers’ blog post: that an attacker could use that same window of time, when the uKernel is making coffee and not stopping direct memory access, to extract the hardware key that drives the cryptographic system of the machine.

The reason why they are all Book of Revelation about it is that there’s a single hardware key for each of Intel’s chipset generations. This means an attacker possessing the key could decrypt encrypted data, spoof hardware, and annihilate DRM (on any system using that generation of Intel chips). Consumer techies raging against DRM would rejoice, while enterprises relying on trusted computing would panic — perchance a dichotomy heralding the end times.

With that said: This highlights the risk created when a vendor builds opaque technology — the user is left at the vendor’s mercy, which in cases like this puts system security at risk. It is, perhaps, worth wondering whether enterprises and users should rely on a root of trust built from blackboxes, regardless of whether there are known vulnerabilities in its components.

The bottom line: This disclosure itself demonstrates a new evil maid attack for Intel chipsets. It requires a supremely motivated and resourced attacker with at least (extremely non-trivially obtained) local access — if not physical access — which considerably reduces the probability of widespread exploitation. The speculation on a hypothetical decimation of Intel’s root of trust is a cute way to generate headlines — and there’s plausibly a nation state that’s already done it — but this isn’t the sequel to Heartbleed.

However, there are real risks arising from the use of blackbox tech, and enterprises should be wary of assuming an opaque root of trust is always patchable, let alone “unhackable.” Enterprises using Intel’s Management Engine to power full disk encryption should consider it potentially compromisable in their threat models, even though the likelihood of non-targeted exploitation is low.

Update: This post has been updated based on feedback from generous hardware security experts. Many thanks to Trammell Hudson (@qrs), Director of Special Projects at Lower Layer Labs, for his insight into how PCH devices are required to DMA into SRAM (not PCIe as we originally surmised). Additional thanks to Jeremiah Cox (@int0x6) for the great suggestion of disabling busmastering on PCI bridges as a method of raising attacker cost (though, as per Trammell’s insight, PCI bus devices wouldn’t be a valid foothold for DMA into SRAM anyway).

The Capsule8 Labs team conducts offensive and defensive research to understand the threat landscape for modern infrastructure and to continuously improve Capsule8’s attack coverage.

What is Container Security?

Posted by

Container Security – Nobody Knows What It Means But It’s Provocative

The current understanding of “container security” as a term and market is muddled, especially given containers are used by different teams in different contexts. It could mean scanning image repositories for vulnerabilities or exposed secrets, managing credentials for container deployment, or monitoring running containers for unwanted activity. 

This confusion isn’t particularly helpful to anyone — developers and operations teams are increasingly being asked about security for their deployments, while security practitioners are looking to either secure their container-based workloads themselves or partner with their DevOps colleagues to do so. Thus, my goal in this post is to help provide some clarity around the market for all involved.

To frame the container security market by use case, I’ll be using the three phases of a typical software delivery lifecycle as outlined by the Accelerate: State of DevOps report: software development, software deployment, and service operation. Others may refer to these phases as “build,” “ship,” and “run.” I’ll highlight the primary security features of each phase, the benefits and downsides, and some of the representative vendors in the space (including open source software). Naturally, not every vendor will offer all features outlined, but hopefully it can help you navigate the vendor landscape a bit more easily.

Container Development

This phase (a.k.a. the “build” phase) is about securing the container development efforts, including container image repositories and new container images being created. The main goal is to spot vulnerabilities in the container’s code early so they can be fixed during development, rather than spotted right before a release deadline or while the container is already running in production (where it could be exploited by an attacker). However, vendors in this category often look for vulnerabilities in running containers as well, extending into the security of the operation phase.

Primary Features:

  • Scanning for vulnerabilities, malware, or exposed sensitive information (like API keys)
  • Blocking insecure builds and images from untrusted sources
  • Validating adherence to compliance standards or custom policies 
  • Reviewing third party dependencies and matching base images to allowlists
  • Integration into CI/CD pipelines, including source code repositories (GitHub, GitLab) and build servers (Jenkins)
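As a rough illustration of how such a build-phase policy gate works, here is a toy sketch in Python; the CVE feed, package names, and base-image allowlist are all hypothetical stand-ins for what a real scanner consumes:

```python
# Minimal sketch of a build-phase policy gate: flag known-vulnerable
# packages and untrusted base images, then block the build on findings.

CVE_FEED = {"openssl-1.0.1f": ["CVE-2014-0160"]}   # hypothetical feed
BASE_IMAGE_ALLOWLIST = {"alpine:3.18", "distroless/static"}

def scan_image(base_image, packages):
    findings = []
    if base_image not in BASE_IMAGE_ALLOWLIST:
        findings.append(f"untrusted base image: {base_image}")
    for pkg in packages:
        for cve in CVE_FEED.get(pkg, []):
            findings.append(f"{pkg}: {cve}")
    return findings

# A CI step would fail the pipeline when findings is non-empty:
findings = scan_image("alpine:3.18", ["openssl-1.0.1f", "curl-8.4.0"])
build_blocked = bool(findings)
```

Real tools differ mainly in the richness of the feed and the policy language, not in this basic shape.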


Benefits:

  • Integrates security assessments earlier in the CI/CD pipeline (that “shift left” thing)
  • Surfaces obvious security issues (like known CVEs) before deployment
  • Helps enforce adherence to compliance standards (like HIPAA, PCI, or CIS benchmarks) within the context of that build
  • Tracks vulnerabilities in running containers that weren’t previously discovered or fixed (overlapping with container operation security)


Downsides:

  • Constrained to known vulnerabilities that can also be spotted with relative ease
  • Misses container security risks beyond application vulnerabilities
  • Any sort of blocking mode adds some friction to developer workflows and the dev pipeline
  • On the flip side, flagged vulnerabilities can be ignored (especially in high volumes) and remain unfixed before deployment
  • Leaves gaps if there isn’t support for all languages and frameworks used by an organization
  • Vulnerability feeds can contain incomplete or faulty information
  • Doesn’t usually take into account the context in which container images will be used

Representative Vendors:

Note: not all companies listed offer all primary features listed.

  • Startups: Aqua Security, NeuVector, StackRox, Sysdig, Snyk
  • Large Security: Qualys, Palo Alto Networks (Twistlock acquisition), Tenable, TrendMicro
  • Ops / Platforms: Amazon ECR, Azure Security Center, Docker Enterprise, GitHub, GitLab, Google Cloud Platform, JFrog, Synopsys
  • OSS: Anchore, Clair, OpenSCAP Workbench

Container Deployment

This phase (a.k.a. the “ship” phase) is about securing container deployment, including built containers that have been pushed to image repositories but are not yet running (either in test or prod). The main goal is to audit and ensure only the right people can access and manage container repositories or orchestration layers — because otherwise attackers or accidentally-rogue developers could tamper with images, jeopardizing security and stability once those images are running.

Primary Features:

  • Defining and enforcing access control policies across different clusters and workloads
  • Managing credentials across different clusters and workloads
  • Performing drift detection (spotting deviations from expected configurations) and validating the integrity of clusters and image repositories 
  • Integration into orchestration tools (AKS, Kubernetes, etc.)
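Drift detection in particular reduces to a structured diff against the expected configuration. A minimal sketch, with hypothetical config fields rather than any real orchestrator's schema:

```python
# Toy drift detection: diff a deployment's expected configuration
# against what is actually observed in the cluster.

def detect_drift(expected, actual):
    drift = {}
    for key, want in expected.items():
        have = actual.get(key)
        if have != want:
            drift[key] = {"expected": want, "actual": have}
    return drift

expected = {"replicas": 3, "image": "registry.internal/api:v1.4.2",
            "run_as_root": False}
observed = {"replicas": 3, "image": "registry.internal/api:v1.4.2-debug",
            "run_as_root": True}

drift = detect_drift(expected, observed)
# drift now flags the swapped image tag and the privilege escalation,
# while the matching replica count is left alone
```

The hard part in practice is sourcing trustworthy "expected" and "actual" states, not the comparison itself.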


Benefits:

  • Minimizes misconfigured resources that could lead to security or performance issues
  • Enforces the principle of least privilege
  • Spots vulnerabilities in image repositories that may have been publicly disclosed since the images were originally pushed to the repo
  • Facilitates adherence to compliance standards, which often require auditing of file access and modification
  • IAM capabilities are largely available natively through cloud service providers
  • Maintaining parity between prod, QA, and dev/test environments can be simplified by properly managing configuration parameters


Downsides:

  • Kubernetes-only use case addresses a limited part of the microservices threat model
  • Doesn’t apply to serverless container instances (though that’s admittedly a minor slice of the market)
  • Lack of namespace support limits per-container policy creation

Representative Vendors:

Note: not all companies listed offer all primary features listed.

  • Startups: Aqua, NeuVector, Alcide, Octarine, StackRox, Styra, Tigera
  • Large Security: HPE (Scytale acquisition), Qualys, Trend Micro, Palo Alto Networks
  • Ops / Platforms: AWS, Azure, Docker Enterprise, GCP, RedHat OpenShift
  • OSS: kube-bench, kube-hunter, kubediff, OpenShift (RedHat), SPIRE (SPIFFE)

Container Operation

This phase (a.k.a. the “run” phase) secures the actual instantiation and operation of containers, once they are deployed and running in enterprise environments (especially production, but let’s not forget test or QA). The main goal is to protect against unwanted activity within containers in operation, ideally detecting and responding to an incident before it results in downtime or service degradation. 

This category also covers the collection of system telemetry to facilitate post-hoc investigation of incidents. As mentioned previously, many of the container development security vendors also track vulnerabilities in running containers, but we consider that an extension of those aforementioned capabilities.

Primary Features:

  • Detection of unwanted activity (both attacker and developer), usually either deterministic or behavioral / ML-based
  • Automatically responding to unwanted activity (like shutting it down)
  • Collecting and querying system telemetry for incident investigation and auditing
  • File integrity monitoring (FIM), anti-virus (AV), and policy enforcement for compliance requirements
  • Integration into SIEM / log management, incident response tools, and container runtimes (Docker, CRI-O, containerd) 
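A deterministic detection rule is conceptually simple. Here is a toy sketch; the event shape and container names are invented, whereas real products consume kernel-level telemetry:

```python
# Toy deterministic runtime rule: flag a shell being spawned inside
# a container that has no business running one.

SHELLS = {"/bin/sh", "/bin/bash", "/bin/dash"}

def detect(event):
    if (event["type"] == "exec"
            and event["binary"] in SHELLS
            and event["container"] != "debug-sandbox"):
        return f"shell spawned in container {event['container']}"
    return None  # benign activity produces no alert

events = [
    {"type": "exec", "binary": "/usr/bin/python3", "container": "api"},
    {"type": "exec", "binary": "/bin/sh", "container": "api"},
    {"type": "exec", "binary": "/bin/sh", "container": "debug-sandbox"},
]
alerts = [a for a in map(detect, events) if a]
```

Behavioral / ML-based detection replaces the hardcoded rule with a learned baseline, trading predictability for coverage of activity you didn't think to enumerate.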


Benefits:

  • Preserves uptime and reduces impact by stopping unwanted activity or detecting it quickly
  • Monitors erosion of isolation boundaries between containers
  • Reduces effort required during incident response, speeding up recovery time 
  • Upholds system resilience by enforcing immutability and ephemerality
  • Exposes information about container operations that is useful to both security and operations teams
  • Creates a feedback loop to continually improve infrastructure architecture and design
  • Creates an audit log of file-related activity to meet compliance requirements


Downsides:

  • Kernel module-based agents can create reliability and stability risks in production
  • Cloud-based analysis can create network bottlenecks and increase instance costs
  • Lack of coverage for serverless instances with host-based approaches
  • Blocklist or allowlist enforcement can create performance issues when improperly defined
  • Machine learning-based tools may generate large volumes of alerts and false positives without upfront tuning
  • Network-centric tools may interfere with orchestration-based security enforcement or service meshes

Representative Vendors:

Note: not all companies listed offer all primary features listed.

  • Startups: Capsule8, CMD, Lacework, StackRox, Sysdig, ThreatStack, Uptycs
  • Large Security: AlertLogic, TrendMicro, VMware (Carbon Black)
  • Ops / Platforms: Amazon GuardDuty, Azure Security Center
  • OSS: Falco, osquery

Parting Thoughts

I’d love to see people referring to “container development security,” “container deployment security,” and “container operation security,” but that’s a lot of syllables — probably “build,” “ship,” and “run” will ultimately reign supreme. Nevertheless, hopefully this post helps delineate the different areas of container security and why they’re each important. As a community — not just security, but also embracing our friends in DevOps — we must work to secure the containers lest we become contained by our own failures.

Yet, it’s important to remember that we don’t live in a digital shipyard filled to the brim with containers; other types of workloads are still predominant in most enterprises. Container security is generally only one component of enterprise infrastructure protection (what of your VMs, VPCs, VPSs, V*…), which is something to consider when evaluating container security tools.

If you’d like to learn more about how Capsule8 protects your containerized systems from runtime threats to security and performance, check out our Solution Brief: How Capsule8 Protects Containerized Environments.

A Cloudy Forecast for ICS: Recap of S4x20

Posted by
Photo credit: @montaelkins – Kelly Shortridge Keynote at S4x20

Last week, I keynoted S4x20, the biggest industrial control systems (ICS) security conference in the world, and was able to catch quite a few talks, too. While it took place in sunny Miami Beach, my highlights from the conference suggest a far cloudier outlook. Specifically, there seems to be a growing rumble about the adoption of cloud-based infrastructure, including the DevOps mindset it entails, and what it means for ICS security.

Three talks I saw at S4x20 really stuck out to me, covering critical infrastructure as code, applying chaos engineering to field-based critical systems, and rethinking how we measure security. In this post, I’ll highlight what I found most interesting from each of them to whet your appetite for when the videos of the talks are posted.

Critical Infrastructure as Code

Configuration as code is blooming with popularity in the DevOps world, but is less popular in the security world — let alone the ICS security world, which tends to be even more conservative given the importance of critical infrastructure. However, Matthew Backes from Lincoln Labs outlined how to bring the decidedly modern practice of config as code to ICS, in what he calls “Infrastructure as Code” (infrastructure in the ICS sense), to help reassert control over ICS systems.

One thing I learned from his talk is that configuration of ICS devices is a hot mess. It tends to be highly manual, ad-hoc, and vendor-specific, with a high probability of disrupting systems or bricking devices when it goes wrong. Even pulling the current configuration of devices is a nasty affair — it usually resides in the sysadmin’s head or in scattered, undocumented locations. Otherwise, the device must be directly accessed (or “touched” as is apparently common ICS parlance) to gain the ground truth of the config. Backups are a fantasy. As you can tell, the pain is real.

As Matthew described in his talk, the goal of infrastructure as code for ICS is to cleanly define configurations and push them down to the device — no touching or mind reading required. There must also be version control (something like git) and accessibility (something like an API). Lincoln Labs’ approach was to use Ansible with a JSON config file and then a standardized API written in Python. They experimented with this on protection relays, remote terminal units, genset controllers, and serial-to-ethernet converters (which I knew nothing about; I recommend looking out for the recording of the talk to learn more).

The workflow will look familiar to those of you who know how config as code works — ICS is now getting in on the YAML party. There are standardized templates to define device / service functionality, a host inventory (list of devices in the environment), and variables at the global, vendor, and device level (like IP addresses, protection settings, or specific files). These components generate a config file, which can be used to push settings across devices using a tool like Ansible, or even as a compliance report for auditors.
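The layered-variable merge at the heart of this workflow is easy to sketch. The field names, and the assumption that precedence runs global, then vendor, then device, are mine for illustration rather than Lincoln Labs' actual schema:

```python
# Sketch of the template + layered-variables workflow described above.
# Later layers override earlier ones: global < vendor < device.

def render_config(template, global_vars, vendor_vars, device_vars):
    merged = {**global_vars, **vendor_vars, **device_vars}
    return {field: merged[field] for field in template}

template = ["ip_address", "ntp_server", "protection_settings"]
global_vars = {"ntp_server": "10.0.0.1"}
vendor_vars = {"protection_settings": "vendor-default-curve"}
device_vars = {"ip_address": "10.1.2.3",
               "protection_settings": "relay7-custom-curve"}

config = render_config(template, global_vars, vendor_vars, device_vars)
# config could then be pushed out by a tool like Ansible, or emitted
# as a compliance report for auditors
```

The device-level override winning over the vendor default is what lets one template serve a heterogeneous fleet.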

The lessons learned were probably the most interesting to me, given they’re based on real experiments Matthew’s team conducted among operators in field environments. First, there’s a tradeoff between manageability and security, depending on the device type. For instance, protection relays expose their config files through Telnet and FTP, not SSH, and exposed management interfaces often lack password protection. Second, there isn’t a lot of interoperability, which can lead to bricked devices or other issues when attempting to parse config files. Thus, quite a bit of modification is required to get config as code to work for your ICS.

It’s clear that simplicity and manageability are sorely lacking in ICS, making it difficult to control those systems. There’s appetite for central config storage and standardized workflows, but the tooling hasn’t caught up. While Matthew made a call to ICS vendors to improve their config functionality for the benefit of the community (which is absolutely needed!), it struck me that the Ops community could provide a valuable helping hand here, too. 

The ICS security and DevOps communities are far from close, but my hypothesis is that Ops engineers from Silly Valley and elsewhere can pattern match their way through some of the roadblocks ICS defenders, being new to the concept, will face in attempting to implement config as code. It’s a way to harness your hours of using Terraform, Puppet, Ansible, Chef, Cloud Formation, and more for the benefit of us all, as even helping set up the basics will help ICS security seriously level up their abilities!

Chaos Engineering to Avoid National Chaos

Virginia Wright, a Program Manager in Idaho National Laboratory’s Cybercore division, discussed how to apply DARPA’s RADICS program, which stands for “Rapid Attack Detection, Isolation, and Characterization Systems”, to chaos security engineering. She acknowledged it involves a mindset shift from contemplating how to keep attackers out of our systems towards figuring out how to cope once they’re inside. This closely aligns with my own advice of moving away from the ideal of perfect prevention and instead striving towards resilience — ensuring you can recover gracefully in the face of inevitable failure.

Applying chaos engineering to critical systems in the field is a daunting task. But operators need to feel confident in their ability to recover from attacks, and cannot do so without practicing within real, or at least realistic, environments. Virginia walked through an exercise her team conducted consisting of a “black start recovery of a crank path amidst a cyber attack on the power infrastructure to enable grid restart operations.” In non-ICS speak, I believe it means “restoring operations of an isolated part of the energy grid to seed power back into the overall grid.”

Importantly, this exercise ran on real equipment, not simulated systems, and played with failure modes, like an extended regional power outage. Thus, while the scenarios were a bit scripted to create some structure to the exercise, it still felt pretty random to the operators themselves. 

The campaigns themselves use “consequence-based engineering,” beginning with defining the highest-consequence outcomes and figuring out how the attacker would achieve that outcome. This is essentially the same as the advice in my keynote to start with the most important assets to your organization (like customer data in an S3 bucket, uptime of a particular service, etc.) and work back to how the attacker would most easily compromise that asset. Either way, repeatedly running campaigns with these scenarios helps you to determine the potential blast radius of a real incident and to continuously refine your processes and tools.

I was especially delighted that Virginia espoused the philosophy of confronting your worst fears, bringing to mind one of my favorite lines by Chuck Palahniuk, from his novel Invisible Monsters: “Find out what you’re afraid of and go live there.” We need to develop the “muscle tone,” as Virginia put it, to deal with that fear and move forward from it — otherwise we will feel lost, uncertain, and afraid when our fears (like a real attack on a power plant) materialize. This confrontation via experimentation can (and should!) start small, letting us build confidence and iterate over time.

My own talk recommended starting with less critical systems that are still a part of ICS, like predictive maintenance systems, Office 365, cloud storage, or grid edge systems. Relatedly, I recommend Bryan Owen’s insightful talk on the ICS shift to cloud from last year’s S4 to learn more about the current transition underway. I still think that beginning with those less critical systems is a fantastic way to become more comfortable with the practice of chaos security engineering, but Virginia’s advice can help you safely navigate practicing it on in-the-field critical systems.

Rethinking Security Measurement 

It turns out that figuring out how to rank or score the security of companies is hard. Derek Vadala, Global Head of Cyber Risk at Moody’s, discussed their approach to rating the security risk of companies. While this arena is incredibly complex and full of slippery variables, I liked how Derek offered a straightforward division of companies into two buckets: those who can defend against commodity attacks, and those who can additionally defend against “sophisticated” attacks. Digestible mental models are underrated in security, in my opinion.

I found his recognition that technology shifts — like the shift to cloud-based systems — impact measurements to be particularly insightful. One example that Derek highlighted is that uptime is no longer a desirable metric. The longer systems are online, the higher the chance attackers will compromise and persist in them. This notion neatly fits within my own talk, in which I advised a goal of ephemerality and using a metric of reverse uptime to keep it in check (minimizing the amount of time a host is online to reduce persistence opportunity).
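As a hypothetical sketch (mine, not from the talk), reverse uptime can be operationalized as an ephemerality budget over fleet host ages:

```python
# Track each host's age and flag hosts living past an ephemerality
# budget; the budget value and host names are invented examples.

MAX_HOST_AGE_HOURS = 24  # assumed ephemerality budget

host_ages_hours = {"web-1": 3, "web-2": 71, "batch-9": 12}

stale_hosts = sorted(h for h, age in host_ages_hours.items()
                     if age > MAX_HOST_AGE_HOURS)
# stale_hosts are candidates for recycling; the shorter this list
# stays, the less time attackers have to persist
```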

The type and quality of data provided for security ratings is also highly influential. As Derek noted, there’s a tradeoff between data fidelity (how valuable it is as a signal) and effort involved to retrieve it. Validated data, such as that collected from a live exercise, is invaluable in cultivating confidence in the security of deployed systems. This is another overlapping point on the benefits of chaos security engineering from my keynote. Repeated experimentation via the injection of security failure generates feedback about the resilience of your systems, giving you greater fidelity with minimal ongoing effort (albeit with non-trivial upfront effort).

The general inaccessibility of valuable signals will assuredly present a stumbling block for cyber insurance companies as well, who want the highest quality signal possible for the potential risk a company poses, but are likely to receive, at best, self-reported data and, at worst, only external data. Nevertheless, these sorts of security ratings pass muster as a “good enough” health check, giving insurers a high-level sense of a company’s security health. Given quote speed is a key driver of insurance purchases, fast but “good enough” will win over slow but “highly accurate” within the industry.

Final Thoughts

In the quintessential hallway track, much of my personal discussions involved the stickiness of the existing ICS security mindset. Specifically, one of the barriers in ICS to adopting modern infrastructure and security practices isn’t technical in nature at all — it’s the emotional cocktail of skepticism and inertia. Part of the goal of my keynote was to demonstrate benefits to security of all this cloudiness and containerity, but it will take many members of the community continuing the drumbeat before the collective perspective shifts.

Luckily, S4x20 had a palpably strong community vibe, which is essential for changing hearts and minds, not just technical approaches (and free lunch from a taco truck doesn’t hurt in fostering a positive spirit). The range of backgrounds, from ICS practitioners and national laboratory researchers to risk quantifiers and vendor-resident thinkers like me, helped provide a healthy variety of topics and perspectives.

As always in infosec, the conference was predominantly white and male. I noticed that the sponsor stage was especially homogenous, which presents an opportunity for S4’s organizers to offer incentives to vendors who put forward speakers from underrepresented backgrounds (such as a small discount on the speaking slot price). S4 is far from alone in this (looking at you, RSAC), but I was impressed by how professional and welcoming it felt, giving me greater confidence in its ability to lead the charge on this front.

Overall, I can attest the conference was impeccably organized, so would highly recommend speaking there if you have the opportunity. I learned a lot about the ICS niche and can empathize better with the challenges ICS security practitioners face, and hopefully those who attended my talk feel more comfortable with DevOps, modern infrastructure, and chaos security engineering. We all know ICS security is vital for our economy and society (even if just to avoid “cyberpocalypse” headlines in the event of a major compromise of critical infrastructure), so it’s heartening to see an openness to fresh ideas and an eagerness to work together to level up the community.