Hacking Failure to Build Resilience


In Part One of our series on cyber resilience, we discussed the role of resilience as a core component of security. Drawing from a panel discussion led by Kelly Shortridge at the third annual Cloud Native Security Summit (CNSS), hosted by Capsule8 along with Open Raven and Snowflake, that post emphasized resilience as a mindset that drives continuous improvement in security efforts.

As a follow-up, we’re looking at the role of failure in building resilience, and how effective security leaders measure, evaluate, and learn from failure to reduce risk. In a panel led by David Spark, producer of CISO Series, security experts Nick Espinosa, Chief Fanatic at Security Fanatics; Will Gregorian, Head of Information Security at Color; and Naomi Buckwalter, Director of Information Security and IT at Beam Technologies, take a deep dive into this question and the importance of failure.

How Critical is Learning from Failure in Cybersecurity?

Failure is the basic model of learning, and every failure is an opportunity to learn. It’s why we build test environments and emphasize experimentation when training new cybersecurity personnel. Those moments of failure are essential: they help the team learn what works and what doesn’t, and encourage them to start thinking creatively about overcoming common sources of failure in securing their organization.

Not only is failure a core tenet of learning, but it is also how you show you’re innovating and pushing the envelope. However, that doesn’t mean all forms of failure are helpful. The room for failure should be proportional to the expertise of the people involved. Ideally, failure happens in a controlled environment, where it can be done safely, and stays within the scope of what someone understands. When you learn to drive a car, you do it in a parking lot, not on a busy highway where someone could get hurt. It’s okay to run over a few cones in a closed parking lot in the service of learning how to drive. The same is true for security professionals learning their trade. In a closed test environment, failure is critical to fully understanding how the system should operate. Failure in a controlled environment helps cyber professionals be better prepared to secure production environments.

Failure serves many purposes, but most commonly, it provides an example of what does and does not work. It’s vital to learn from our mistakes, and more importantly, to share them with others so they can learn from them as well. When feasible from a legal perspective, post mortems of significant disruptions or security events can help everyone in the industry understand how situations develop and how they are later resolved. These discussions help contextualize why security teams do what they do – showing the value of a patch or update and why they are continuously working towards improving and hardening products.

Building Resilience is Building a Cyber-Resilient Culture

With failure in mind as a core mode of learning in the security profession, resilience is naturally an essential topic of discussion. Every security team knows their organization will get hit at some point, so meaningful metrics like Mean Time to Recovery (MTTR) or Mean Time to Detect/Discover (MTTD) are critical. When something happens, how quickly can a team respond and recover? 
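As a rough illustration of how these two metrics might be computed from incident records, here is a minimal Python sketch. The Incident fields, the sample timestamps, and the helper names are all hypothetical and were not part of the panel discussion. It also assumes MTTD is measured from the start of an incident to its detection, and MTTR from detection to recovery; some teams measure MTTR from the start of the incident instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime    # when the malicious activity began (known after investigation)
    detected: datetime   # when the team first detected it
    recovered: datetime  # when normal operation was restored


def mean_timedelta(deltas):
    """Average a collection of timedeltas; an empty collection is an error."""
    deltas = list(deltas)
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean Time to Detect: average of (detected - started)."""
    return mean_timedelta(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean Time to Recovery: average of (recovered - detected), by assumption."""
    return mean_timedelta(i.recovered - i.detected for i in incidents)


# Made-up sample data, for illustration only.
SAMPLE_INCIDENTS = [
    Incident(
        started=datetime(2021, 1, 4, 9, 0),
        detected=datetime(2021, 1, 4, 11, 0),   # detected after 2 hours
        recovered=datetime(2021, 1, 4, 15, 0),  # recovered 4 hours after detection
    ),
    Incident(
        started=datetime(2021, 2, 10, 8, 0),
        detected=datetime(2021, 2, 10, 12, 0),  # detected after 4 hours
        recovered=datetime(2021, 2, 10, 18, 0), # recovered 6 hours after detection
    ),
]

if __name__ == "__main__":
    print("MTTD:", mttd(SAMPLE_INCIDENTS))
    print("MTTR:", mttr(SAMPLE_INCIDENTS))
```

Tracking both numbers over time, rather than looking at any single incident, is what shows whether a team is actually getting faster at detecting and recovering.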

One of the biggest problems cited in cybersecurity is not the threat detection capability or the technology deployed, which are generally sophisticated in most enterprises, but the culture and the people who operate those systems. In many companies, employees are afraid to speak up if they see something wrong on their computers. This is a significant roadblock to resilience because it means a large number of potential outbreaks go undetected for far too long. 

Cyber-resilient culture is not just about leadership buy-in; it’s about helping people understand that if they see something, they should say something. Of course, leadership buy-in is vital, but it needs to go down to the team’s core to have a genuine impact. The bottom line is that security is everyone’s responsibility. 

In building a security culture, one of the first things a team can do is ask individual employees what keeps them up at night. What are the skeletons in the closet that they have been too afraid to talk about until now? These conversations can lead to almost immediate buy-in, giving these people ownership over the issues and making them feel like part of the security team.

Asking engineering partners what they are doing from a resilience perspective is a great place to start this cross-departmental collaboration. It helps to give these individuals some ownership over the security process and build in responsibility. By determining the biggest pain points for your engineering and operations counterparts, you can find common denominators. Most importantly, think like an engineer. Rather than leading with policies and procedures, think about how they affect engineers, how they would be implemented, and the technical details involved, until you know how to accomplish your goals.

In building a cyber-resilient culture, the role of security is not to stop all incidents. It is to prevent a security incident from impacting the business. It’s why MTTR is such an important metric – it shows how quickly security teams can identify, respond to, and resolve an issue.

One example cited in the panel discussion is the development of a zero-trust environment. In the scenario described, when an infected flash drive was introduced to the environment, the AV alerted, the SIEM caught it, and the zero-trust policies immediately excised the machine from the network. The only person who even noticed an issue was the one user who suddenly couldn’t use their machine, meaning there was almost no impact on the business other than replacing or cleaning the infected machine.

These types of scenarios, where the culture is built around preparing for as many incidents as possible, can only be effective when experimentation is a core part of how security professionals are trained. By hacking failure, investing in experimentation, and prioritizing resilience as a core part of the culture, it’s possible to reduce MTTR and be ready for whatever might come your way. 

Watch the full panel discussions from the 3rd Annual Cloud Native Security Summit (CNSS), including “#RealTalk on Resilience” and “Hacking Failure,” to learn more about the role of resilience in a successful security team.