Bringing Your A-Game: Availability for Security People

The security industry tends to focus on the protection of sensitive data, forgetting that availability falls under the classic C.I.A. triad. This is a mistake, and an especially egregious one considering the rise of the service delivery economy. This post is intended as an overview of why infosec teams stand to substantially benefit from rediscovering the importance of availability within their mandate and why it’s imperative they do so.

So, what is availability? A service / system / app is available when users can reliably access it and it performs as expected.

And what is downtime? “Downtime” is any period during which a service / system / app is unavailable or unable to perform its primary function.

Why is availability so important? The tl;dr is that availability equals money — but let’s explore why that’s true. If we look at the seminal research from Dr. Forsgren and her collaborators, the four key DevOps metrics (shown in the graphic below) are all based on activity transpiring in production environments. This makes a ton of sense given production is where we actually generate tangible, monetary value from all the software stuff we wrangle together (or sometimes mangle together, but I digress).

Success in these metrics supports availability goals, commonly defined by Service Level Objectives (SLOs). These SLOs are driven by Service Level Agreements (SLAs), which are the guarantees a service provider makes to its customers. While I’ll be focusing on SLAs and SLOs related to availability in this post, other goals and guarantees can include incident response time, customer service hours, and other assurances customers commonly require.

You have probably heard of the “five nines” as an availability target. The reality is that few services promise 99.999% availability in their SLAs, outside of companies like content delivery networks (CDNs), whose entire value proposition is based on constant availability. Since I love learning by way of example, the table below shows the SLAs defined by various service providers and how much wiggle room (i.e. downtime) exists per month based on the availability guarantees they make to their customers:

| Cloud Service | Availability % | Downtime per month |
| --- | --- | --- |
| Cloudflare | 100.00% | 0 seconds |
| Fastly | 100.00% | 0 seconds |
| Splunk Cloud | 100.00% | 0 seconds |
| RingCentral | 99.999% | 26 seconds |
| MongoDB Atlas | 99.995% | 2 minutes 11 seconds |
| AWS EC2 | 99.99% (“four nines”) | 4 minutes 22 seconds |
| AWS Fargate | 99.99% | 4 minutes 22 seconds |
| DigitalOcean | 99.99% | 4 minutes 22 seconds |
| Google Cloud BigQuery | 99.99% | 4 minutes 22 seconds |
| HubSpot | 99.99% | 4 minutes 22 seconds |
| Okta | 99.99% | 4 minutes 22 seconds |
| Slack (Plus plan and up) | 99.99% | 4 minutes 22 seconds |
| Twilio Services | 99.99% | 4 minutes 22 seconds |
| Azure Functions | 99.95% | 21 minutes 54 seconds |
| Google Kubernetes Engine (regional clusters) | 99.95% | 21 minutes 54 seconds |
| Atlassian Enterprise Cloud | 99.95% | 21 minutes 54 seconds |
| Adobe Creative Cloud | 99.9% (“three nines”) | 43 minutes 49 seconds |
| AWS S3 | 99.9% | 43 minutes 49 seconds |
| Azure Active Directory | 99.9% | 43 minutes 49 seconds |
| Box | 99.9% | 43 minutes 49 seconds |
| GitHub Enterprise | 99.9% | 43 minutes 49 seconds |
| Oracle Cloud Database | 99.9% | 43 minutes 49 seconds |
| PagerDuty | 99.9% | 43 minutes 49 seconds |
| ServiceNow | 99.8% | 1 hour 27 minutes 39 seconds |
| New Relic | 99.8% | 1 hour 27 minutes 39 seconds |
| Workday | 99.7% | 2 hours 11 minutes 29 seconds |

The roughly 4 minutes of downtime per month allowed by a 99.99% SLA is barely enough time for a human to even acknowledge an incident, let alone respond to it. This is why, realistically, automated failover and remediation are required to hit those targets — but that’s a topic for another time. Even a 99.9% SLA, offering a “generous” (but actually still incredibly tight) 43 minutes per month for responders to restore service, still provides an enormous incentive for ops teams to make availability their top priority.
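
The downtime figures in the table follow mechanically from the availability percentage. A minimal sketch, assuming the convention the table appears to use (an average month of 43,829 minutes, i.e. 365.24 days / 12, with leftover fractions of a second truncated):

```python
def monthly_downtime_budget(availability_pct: float) -> str:
    """Downtime allowed per month for a given availability guarantee.

    Assumes an average month of 43,829 minutes (365.24 days / 12) with
    the remainder truncated, which reproduces the table's figures.
    """
    seconds_per_month = 43_829 * 60
    budget = int((1 - availability_pct / 100) * seconds_per_month)
    hours, rem = divmod(budget, 3600)
    minutes, seconds = divmod(rem, 60)
    parts = [f"{hours}h"] if hours else []
    if minutes or hours:
        parts.append(f"{minutes}m")
    parts.append(f"{seconds}s")
    return " ".join(parts)

print(monthly_downtime_budget(99.99))  # 4m 22s
print(monthly_downtime_budget(99.8))   # 1h 27m 39s
```

Plug in your own SLA and the budget gets visceral fast: every extra “nine” divides your response window by ten.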

The importance of availability isn’t just about missed revenue, a pain felt acutely by eCommerce, fintech, adtech, and other industries that simply can’t make money when their services are down (but which aren’t necessarily tied to SLAs). It’s also about refunded revenue — that is, money taken out of the company’s coffers and given back to customers — which can happen if an organization fails to meet its SLAs.

If you dig into some of the SLAs in the table above, you’ll discover this common stipulation: the service provider will refund part of the customer’s payment if the availability guarantee is not met. Many companies offer tiered refunds based on different bands of availability. For instance, Splunk Cloud refunds up to 2 hours of credits per quarter if availability is 99.99% – 99.999% and up to 1 month if availability is less than 95.0%. Google Cloud BigQuery refunds 10% of a customer’s monthly bill if uptime is between 99.0% – 99.99% and 50% of the bill if uptime is less than 95.0%.
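
As a sketch of how such tiered credits work, assuming bands loosely modeled on the BigQuery example (the 25% middle band is my own illustrative filler; every real SLA defines its own tiers and amounts):

```python
# Tiered SLA credits: (minimum monthly uptime %, credit as a fraction of
# the monthly bill). Bands loosely mirror the BigQuery example above;
# the 25% middle band is illustrative, not from any real agreement.
CREDIT_TIERS = [
    (99.99, 0.00),  # SLA met: no credit owed
    (99.0, 0.10),
    (95.0, 0.25),   # hypothetical middle band
    (0.0, 0.50),
]

def sla_credit(uptime_pct: float, monthly_bill: float) -> float:
    """Credit owed to a customer for a month at the given uptime."""
    for floor, fraction in CREDIT_TIERS:  # highest band first
        if uptime_pct >= floor:
            return monthly_bill * fraction
    return 0.0

print(sla_credit(99.5, 10_000))  # 1000.0
```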

The important takeaway for security teams is that supporting availability makes for a straightforward, salient budget justification. Calculating the impact of exposed customer data involves relatively arcane and arbitrary assumptions about cost per record, none of which is helped by security vendors’ estimates that disclose nothing about their sources or methods. And attempts to calculate reputational damage from a data breach are even more fraught, resting on “risk quantification” that arguably has more in common with humorism than with serious statistical analysis.

But calculating the impact of a disruption or outage in your services is, on a relative basis, trivial. If a skidiot attempts exploitation in production and causes the system to crash, the benefits of detecting and restoring quickly can be directly quantified in monetary terms. For instance, if your company’s SLA is three nines and you can shrink downtime from 45 minutes to 20 minutes, you avoid having to refund any revenue and can employ a counterfactual incident calculation as simplistic as: $10mm in affected MRR * 10% refund = $1mm cost of incident.
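
That back-of-the-envelope math, sketched out (the $10mm in affected MRR and 10% refund rate are the illustrative numbers from above, not any real contract):

```python
# Counterfactual incident cost under a three-nines SLA: a refund is owed
# only if the month's downtime exceeds the ~43m 49s budget. The $10mm in
# affected MRR and 10% refund rate are illustrative figures.
SLA_BUDGET_MINUTES = 43 + 49 / 60   # 99.9% of an average month
AFFECTED_MRR = 10_000_000
REFUND_RATE = 0.10

def incident_cost(downtime_minutes: float) -> float:
    """Refund triggered by an incident of the given duration."""
    if downtime_minutes > SLA_BUDGET_MINUTES:
        return AFFECTED_MRR * REFUND_RATE
    return 0.0

savings = incident_cost(45) - incident_cost(20)
print(f"${savings:,.0f}")  # $1,000,000
```

A real model would account for multiple incidents accruing against the same monthly budget, but even this crude version gives security a dollar figure to bring to budget conversations.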

In fact, quite a bit of the typical infosec threat model applies to availability — the industry just doesn’t necessarily think in those terms. Kernel exploitation can absolutely disrupt systems and result in downtime. Privilege escalation and sudo commandeering can facilitate changes to the system that break service operation. Cryptominers, aside from the billing headaches, can generate resource overhead that strangles the operations required to deliver the service to users. Even exfiltrating terabytes of data can result in network bottlenecks that snowball into downtime. All of these incidents can and should be quantified in terms of availability impacts, not only because it reflects reality but also because it can help justify security spend.

In light of the critical business importance of availability combined with the difficulty in achieving availability goals, it really shouldn’t be a surprise that Ops teams are loath to deploy tools that could jeopardize speed or stability in production. When services are available, the business can make money. When there is downtime, the business misses out on revenue at best and refunds revenue at worst. Through this lens, proposing a security tool — whose espoused benefits are likely intangible — that could feasibly impair availability is somewhere on the spectrum of insane to ignorant to impudent.

I think the weirder dynamic is that too many security teams don’t seem to care that their tools could jeopardize availability. I often cite status quo bias as a partial explanation for why security teams can be reluctant to get on board with modern IT systems and practices, but caring about availability should be consistent with the traditional mindset. After all, the CIA triad has been around longer than I have (at a minimum, it’s been well-established for over twenty years), making it a veritable dinosaur in infosec years.

Thus we have this strange cognitive dissonance at present, of:

  • Availability being tightly coupled with revenue
  • Ops teams relentlessly optimizing for availability (because money)
  • Availability being part of one of the oldest-school infosec models (CIA triad)
  • Security teams treating availability as an afterthought (as if CI > A)

I am often asked by security leaders how they can start cultivating a more collaborative vibe with the operations team. Availability is, in my mind, the most obvious and natural priority where security and ops can join forces to maximize outcomes. Security teams should expand their conception of security failures to include impacts to availability… including downtime potentially engendered by their own tooling.

My call to action to y’all in infosec is to extend a calendar invite to your local ops leads and operate in listening mode during the meeting so you can understand their top concerns around availability. You may actually discover that some of those concerns are about security! As noted earlier, if a system gets pwned, that absolutely counts as an incident in Opslandia.

From there, with that context collected and communication channel cultivated, you can more successfully build a strategy for how your team can begin bringing its A-game, so to speak, and support one of your organization’s top goals.