Resiliency Part 3 of 3: “Cattle, Not Pets,” Leveraging CI/CD Practices and the Concept of Reverse Uptime

Part 1: What is Resiliency and How Can We Apply It to All Phases of Attack?

Part 2: Using Misinformation and Intentional Failures to Your Advantage

Part 3: Cattle, Not Pets, Leveraging CI/CD Practices and the Concept of Reverse Uptime

As we conclude this series on resiliency, you’ll recall that we’ve focused on ways to make life harder for hackers, with strategies ranging from OTP and other forms of MFA to using misinformation and intentional failures to your advantage. For this final post, let’s move beyond examples like credential theft and look at more specific ways to apply resilient defense ideas to infrastructure.

Security aside for a moment, resiliency has become a key element in the success of large technology companies, ensuring they can deliver products and services despite the chaos and obstacles that come with operating at scale. These companies have built their infrastructure for elasticity and fault tolerance, replacing traditional servers with sets of swappable, ephemeral microservices. One driver for this has been the movement towards Continuous Integration/Continuous Delivery (CI/CD).

CI/CD is the software engineering practice of always being able to deploy new code, with automated recovery mechanisms in place to handle any fallout. Crashes or faults are remedied simply by replacing faulty instances with new instances of the same microservice. A phrase commonly used in the DevOps community for these ephemeral instances is “cattle, not pets”: when one is corrupt or causing a problem, cull it from the herd. A certain percentage of instances will always need culling.
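To make that culling loop concrete, here is a minimal sketch of the kind of reconciliation an orchestrator performs. The cloud client and its list_instances, terminate and launch methods are hypothetical stand-ins for whatever platform API (Kubernetes, an auto-scaling group, and so on) you actually use.

    import time

    DESIRED_COUNT = 10          # how many copies of the service we want running
    POLL_INTERVAL_SECONDS = 30  # how often we reconcile actual vs. desired state

    def reconcile(cloud, service_name):
        """Cull faulty instances and top the herd back up to the desired count."""
        instances = cloud.list_instances(service_name)

        # Cull anything that has crashed, is failing health checks, or is otherwise misbehaving.
        for instance in instances:
            if not instance.is_healthy():
                cloud.terminate(instance.id)

        # Replace culled (or missing) instances with fresh copies of the same image.
        healthy = [i for i in instances if i.is_healthy()]
        for _ in range(DESIRED_COUNT - len(healthy)):
            cloud.launch(service_name)

    def run_forever(cloud, service_name):
        while True:
            reconcile(cloud, service_name)
            time.sleep(POLL_INTERVAL_SECONDS)

The point of the sketch is that no instance is special: the loop never repairs a sick server, it only replaces it.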

Netflix has published an open source suite of tools (the Simian Army) designed to test resiliency, which it runs against its own environments to ensure its recovery planning actually works. The tools are named after what they do: Chaos Monkey introduces arbitrary crashes and failures to service instances. Doctor Monkey performs common health checks on instances and culls any that are unhealthy. Security Monkey performs basic policy checks, pre-emptively removing instances that violate them.
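In that spirit, a toy sketch of the Chaos Monkey idea might look like the following. This illustrates the concept only, not Netflix’s actual implementation, and the cloud client is again a hypothetical placeholder.

    import random

    def chaos_monkey(cloud, service_name, kill_probability=0.05):
        # Randomly terminate a small fraction of instances to prove, continuously,
        # that the service heals itself without human intervention.
        for instance in cloud.list_instances(service_name):
            if random.random() < kill_probability:
                cloud.terminate(instance.id)

Netflix famously runs this kind of tooling during normal business hours, so failures surface while engineers are on hand to respond.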

This is an example of operational resilience. Now consider what happens if the same approach is applied using the output of a high-confidence security alerting mechanism. When an instance of a service exhibits signs of an attack, remove and replace it dynamically, just as you would a buggy or poorly deployed instance; you might even cull an instance simply because it measurably deviates too far from its normal expected behavior.
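As a rough sketch under the same assumptions (a hypothetical cloud client, plus a hypothetical detector object exposing a high-confidence alert feed and an anomaly score), the security-driven version of the culling loop could look like this:

    ALERT_CONFIDENCE_THRESHOLD = 0.9   # only act on high-confidence detections
    MAX_BASELINE_DEVIATION = 3.0       # e.g. anomaly score in standard deviations from normal

    def security_cull(cloud, detector, service_name):
        # Treat a suspected compromise like any other fault: cull it and replace it.
        for instance in cloud.list_instances(service_name):
            alert = detector.latest_alert(instance.id)         # high-confidence alerting feed
            deviation = detector.anomaly_score(instance.id)    # distance from expected behavior

            if (alert and alert.confidence >= ALERT_CONFIDENCE_THRESHOLD) \
                    or deviation > MAX_BASELINE_DEVIATION:
                cloud.snapshot_for_forensics(instance.id)      # keep evidence before culling
                cloud.terminate(instance.id)
                cloud.launch(service_name)                     # replace with a clean instance

The thresholds are the knob that balances false positives against speed of response, which is exactly the trade-off discussed next.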

This may sound dangerous, but we have already accepted that operations at scale bring natural failures as an inevitability, and the premise of this discussion (and of the entire concept of resiliency) is that being hacked is also an inevitability. Given that, there are grounds for accepting a reasonable rate of false positives in exchange for the ability to remove potential attacks automatically as they are detected.

In our last post, we promised to touch on the concept of “reverse uptime,” a modern theme in the world of CI/CD microservices, containers and serverless environments. The goal of reverse uptime is to ensure a service or instance is never alive longer than a maximum time window, typically measured in hours. The idea may be counter-intuitive, as it runs against the traditional view of uptime, in which uptime is conflated with high availability. This practice instead accepts that faults will happen and that it is wiser to strive for fresh, clean instances than to debug obscure edge cases on long-lived ones.
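Enforcement is simple enough to sketch. The six-hour ceiling below is purely illustrative, and the cloud client and launched_at timestamp are, as before, hypothetical placeholders.

    import time

    MAX_LIFETIME_SECONDS = 6 * 60 * 60   # illustrative ceiling: nothing lives past six hours

    def enforce_reverse_uptime(cloud, service_name):
        # Recycle any instance that has exceeded its maximum allowed lifetime,
        # even if it looks perfectly healthy.
        now = time.time()
        for instance in cloud.list_instances(service_name):
            age = now - instance.launched_at      # seconds since the instance was launched
            if age > MAX_LIFETIME_SECONDS:
                cloud.launch(service_name)        # bring up the replacement first...
                cloud.terminate(instance.id)      # ...then retire the old instance

Launching the replacement before retiring the old instance keeps capacity steady while the herd turns over.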

This doctrine goes hand-in-hand with the “cattle, not pets” mantra: it accepts that faults will occur and that a truly highly available service is one that stays resilient in spite of them. Service availability continues while high-profile faults are triaged by developers. The beauty of striving for this level of service hygiene is that it inherently brings security benefits: if a service instance (be it a container or a VM) is only allowed to live for so long, an attacker only has so much time in which to migrate or harvest the spoils of a compromise. The target is now moving, and persisting properly requires the attacker to be far more agile, adaptive, automated, and targeted.

Combine the benefits of reverse uptime with dynamic, security-conscious monitoring, and the environment becomes very hostile for an attacker. Traditionally, the idea of resilience focused on anticipating, withstanding, recovering from, and responding to attacks. In the age of microservices and cloud-native infrastructure, we need to adjust our thinking somewhat, particularly when it comes to anticipation and response. By making the life of a hacker exponentially more difficult through such strategies and automation, we ensure our systems are resilient in the most modern sense of the word.