Status Go: Ep. 247 – Resilience in Crisis: Lessons from the CrowdStrike Outage | Adam Scamihorn

Summary
Adam Scamihorn discusses the CrowdStrike outage, the largest IT outage in history, and its impact on organizations. He emphasizes the importance of post-mortem analysis and measuring the cost of downtime. He also highlights the need for organizations to build effective plans and processes for incident response and disaster recovery, including triage and asset ranking. Scamihorn suggests using tools and innovations that focus on people and process, such as recovery in place and modernization. He advises leaders to challenge their assumptions, conduct risk assessments, and prioritize testing. Finally, he recommends assessing response capabilities and finding ways to remediate any shortcomings.

About Adam Scamihorn
Adam Scamihorn has led business continuity and disaster recovery practices for over 12 years. Throughout his career, he has seen and responded to nearly any disaster you can imagine across every major industry.

Episode Highlights

00:00 Introduction

00:41 The Impact of the CrowdStrike Outage and the Need for Effective Incident Response

02:06 Recovery in Place and Modernization for Effective Response and Recovery

06:47 Asset Ranking and Triage: Prioritizing Recovery Efforts

09:35 Tools and Innovations for Effective Response and Recovery

11:01 Challenging Assumptions and Conducting Risk Assessments

13:47 Testing and Assessing Response Capabilities

15:14 Conclusion

 

Episode Transcript

Welcome to Status Go, a weekly podcast about what it takes to thrive among the challenges of enterprise technology. My name is Paul Wooten and I’m the Sales Enablement Lead for InterVision.

Today we’re talking to Adam Scamihorn. Adam has led business continuity, disaster recovery, and resiliency practices for over 12 years. And trust me, Adam has seen and responded to nearly any disaster you can imagine across every major industry. Adam, welcome and thank you for joining us today. Thank you, Paul. I’m excited to be here. Now, Adam, of course, we’ve all heard about the huge CrowdStrike outage that grounded flights, knocked out bank and hospital systems, and so much more.

It is to date the largest IT outage in history. Have we ever seen anything remotely like this before? Obviously not at this scale. As you mentioned, this is definitely the largest. I would say we’ve seen other supply-chain-type outages. We saw the Kaseya zero-day incident, which was pretty impactful, especially to a lot of managed service providers. We also saw SolarWinds, a malicious-update supply chain attack that caused widespread problems, but nothing on the scale of what we saw from the CrowdStrike issue.

So one of the things I’m seeing at this point is that people are still talking about what caused this. And just for our listeners, if you’re not 100% sure: was this a Windows issue? What was it? The CrowdStrike problem wasn’t a malicious attack. It was actually an update from an endpoint protection provider called CrowdStrike, and that update, whether it was tested or not, caused problems on every Windows system it was deployed to. Those systems basically got stuck in reboot loops with continual operating system errors until the offending update file was removed. And it didn’t just impact Windows servers; it impacted specific distributions of Linux as well. So this wasn’t just a Windows issue. It was a software update issue.

That’s just hugely impactful. Now, Adam, how are folks reacting? What challenges are they facing? I think a lot of people are in the process of doing their post-mortems. They’ve seen whatever outage they individually experienced, how it impacted their business and their reputation, and they’re just trying to get their arms around what their next steps are.

There are also a lot of organizations who probably weren’t impacted who are trying to measure what happened, see whether something similar could happen to them, and understand how they would react. So they’re measuring what that downtime would cost them and then trying to figure out what they can do. We’re also seeing a lot of organizations measure their ability to respond, right? What they’re doing internally to protect their brand, and what kind of costs there were.

I think an estimated 25% of all Fortune 500 companies were impacted, and the cost is in the billions at this point. So organizations are still trying to get their heads around what that means to them individually. Yeah, that’s a huge impact across the entire industry right there. So Adam, what do you think? Is this something that their internal teams can handle? I think a lot of internal teams probably have the ability to do this if they’re building the right plans and processes. But what we saw when the CrowdStrike issue happened was that there were so many outages that teams were simply overwhelmed by how many different places they had to respond to, right? So I think there are a lot of organizations and experts in the incident response and disaster recovery fields who can help build processes that let organizations scale better: separating the people responsible for the day-to-day, keep-the-lights-on prevention work from the people who act, essentially, as firefighters when a disaster happens, and expanding that team out when needed. Now, when something like this does happen, what would a triage process look like? Yeah, that’s a good question.

While the event was actually taking place, Gartner did a live broadcast specifically about what organizations should do while the outage was unfolding. So I’ve taken some of their notes, but also looked back at all of the triage work we’ve done over the last decade at InterVision across incident response, disaster recovery, and cyber preparation, and said: here are three big things everyone should be doing now to prepare for the ability to triage.

So first and foremost, every asset in your organization needs to be ranked based on business impact. When you look at something like the NIST Cybersecurity Framework, Identify is one of the very early steps in that framework. It is crucial, and it is often dismissed as not that important by people who just want to get through it fast. You need to be able to say: here is everything I have as an IT asset, and here is the impact those assets have on the business.

What’s the financial impact if a group of applications goes down and I’m not able to generate revenue from them? So when you’re building your remediation plans, the second step is to build those plans based on all the different types of risk you’re identifying. You need to take that prioritized list of assets and work it into the plan, so that the things that will directly impact the organization the most, whether that’s liability risk, financial risk, or brand risk, are prioritized at the top of the recovery and incident response plan and don’t experience the downtime that the rest of the assets might. As you bring those things back up, you create recovery waves, or incident response waves, based on that prioritization. And finally, like I mentioned, have more than one plan while you’re identifying all of the risks you want to specifically target and address, whether that’s cyber attacks, a supply chain issue like this one, or natural disasters. There need to be plans in place for each of those different types of events, so each business risk has plans tied to those different assets.
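To make the asset-ranking and recovery-wave idea concrete, here is a minimal Python sketch. The asset names, impact scores, and wave sizes are illustrative assumptions, not a method discussed in the episode; in practice these inputs would come from a CMDB and a business impact analysis.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    # Hypothetical business-impact inputs from a business impact analysis (BIA).
    revenue_per_hour: float   # revenue lost per hour of downtime
    liability_risk: int       # 0 (none) .. 5 (severe regulatory/legal exposure)
    brand_risk: int           # 0 (none) .. 5 (severe reputational exposure)

    def impact_score(self) -> float:
        # Toy scoring formula: weight financial impact, add risk modifiers.
        return self.revenue_per_hour / 1000 + 10 * self.liability_risk + 5 * self.brand_risk

# Example inventory (placeholder values).
assets = [
    Asset("payment-api", revenue_per_hour=50_000, liability_risk=4, brand_risk=5),
    Asset("customer-portal", revenue_per_hour=20_000, liability_risk=2, brand_risk=4),
    Asset("reporting-warehouse", revenue_per_hour=2_000, liability_risk=1, brand_risk=1),
    Asset("internal-wiki", revenue_per_hour=0, liability_risk=0, brand_risk=1),
]

# Rank by business impact, then split into recovery waves:
# wave 1 gets restored first, later waves follow.
ranked = sorted(assets, key=lambda a: a.impact_score(), reverse=True)
wave_size = 2  # placeholder; real waves usually follow RTO tiers, not a fixed count
waves = [ranked[i:i + wave_size] for i in range(0, len(ranked), wave_size)]

for n, wave in enumerate(waves, start=1):
    print(f"Recovery wave {n}: {[a.name for a in wave]}")
```

The structure is the point: an inventory, an impact score per asset, and an ordered set of waves are the things worth having written down before an incident, not improvised during one.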

So, before you can triage, you have to know what your assets are. Then you have to build a plan. And then you need to identify the risks that plan is going to respond to, and those feed into the triage. Yes. There are multiple options, tools, processes, and innovations organizations can take advantage of. So what are some of the tools and innovations folks can take advantage of today to mitigate or respond to this type of disaster situation? So when we talk about responding to these things, I want to focus on more than just tools.

Tools are really important; technology helps us solve problems in technology all the time. But when it comes to incident response and disaster recovery, the key components are really people and process. Great points. You’ve got to have tools in place, but without people and process, you’re never going to be able to respond quickly. And so one of the things I saw be really successful in this specific outage was the ability to recover in place.

A lot of backup and disaster recovery technology has an on-site backup copy with the ability to recover in place very, very quickly. In fact, I saw one instance where a legal firm was able to recover in place: they used their recovery technology, stripped out the one file that was causing the problem in the Windows operating system, recovered right over the top of their production environment without that file, and were back up and running in less than an hour. Very little impact, using a tool, but also a tested process they already had in place to get back up and running quickly. The other thing I often warn people about is that recovery in place is not a silver bullet. As I mentioned, there are different types of disasters and incidents, and in the event of a smoking-crater incident, or a cyber attack where your data center becomes a crime scene, recovery in place is just not an option. So the ability to have some kind of disaster recovery solution, usually with IaaS or cloud, with expanding, scalable resources to recover into, is key. And you may need something that’s geographically separated and air-gapped from production to make sure that’s successful. Another thing I want to mention is overall modernization. As organizations move to the cloud, implement more infrastructure as code, and modernize their applications, traditional disaster recovery and backup is really not the best solution, and they need processes to roll that code into the way they recover. It’s really surprising to me, as big as the outage was, how many organizations did not have the ability to simply scale out new VDI with virtual desktops for their end users, or new applications, based on the code they would typically deploy in the cloud. Having disaster recovery processes that use that infrastructure as code, or containerization of applications, to rapidly roll out a brand new production environment the same way the environment was stood up in the first place, or the same way new code is deployed in their release cycles, is a process that can be built into a disaster recovery and incident response model, and it works very, very well when implemented correctly.
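As a minimal sketch of that infrastructure-as-code recovery idea, the snippet below relaunches a small fleet from the same machine images and instance definitions used to build production, using boto3 purely as an illustration. The AMI IDs, instance types, and region are placeholders; a real recovery pipeline would reuse whatever Terraform, CloudFormation, or deployment tooling already stands up the production environment.

```python
import boto3

# Hypothetical declarative definition of the environment, ideally the same
# source of truth used to build production in the first place.
FLEET_DEFINITION = [
    {"role": "web", "ami": "ami-0123456789abcdef0", "type": "t3.large", "count": 4},
    {"role": "app", "ami": "ami-0fedcba9876543210", "type": "m5.xlarge", "count": 2},
]

def rebuild_environment(region: str = "us-east-1") -> list[str]:
    """Relaunch the declared fleet into a recovery region and return instance IDs."""
    ec2 = boto3.client("ec2", region_name=region)
    instance_ids = []
    for spec in FLEET_DEFINITION:
        resp = ec2.run_instances(
            ImageId=spec["ami"],
            InstanceType=spec["type"],
            MinCount=spec["count"],
            MaxCount=spec["count"],
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "dr-role", "Value": spec["role"]}],
            }],
        )
        instance_ids += [i["InstanceId"] for i in resp["Instances"]]
    return instance_ids

if __name__ == "__main__":
    print(rebuild_environment())
```

The point is less the specific API than the pattern: recovery reuses the same declarative path as a normal deployment, so the recovered environment is not a hand-built approximation of production.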

Adam, you’ve made a lot of great points right there, and you’ve talked about a lot of different options. Focusing on the people side, and leadership in particular: what else should leaders know about recovery and backup that they may currently be missing? Yeah, absolutely. I’ve been working on educating leaders for over a decade, and one of the biggest things I would tell them is to make sure they’re challenging their assumptions. Every leader responsible for the uptime of the technology in an organization should be doing some type of risk measurement, some type of assessment, against some type of compliance framework.

We look at something like the NIST Cybersecurity Framework, or other specific frameworks that exist, to make sure organizations are measuring what their risk is, how they can respond to that risk, and how they can recover from the impacts those risks cause. As they do those measurements and assessments, they need to first identify what their assumptions are and then challenge those assumptions through some form of testing. The process should look something like this; it should be a circle. They start with measuring. They remediate the problems discovered during the measurement. Then they test the processes, the people, and the technology they have to make sure they can successfully respond to the types of risk the business has identified. And then the circle continues right back to the top.

After the testing is done, there may be additional remediation that needs to happen to make that testing successful. Then it should be remeasured on some kind of cadence: whatever framework the business has identified as “this is the framework I use to measure how prepared I am to prevent and respond to the risks I’ve identified,” you remeasure using that same framework. And then, after the measurement, remediation again.

And then testing again. I cannot stress enough how important the testing phase is, especially for assumptions. I’ve seen businesses’ incident response plans fail over something as simple as the hardware for a telephone, where the plan said “call someone” and nobody had that number documented anywhere except in a system that was down. I’ve also seen something as simple as a power generator, and assumptions about how that generator was set up, cause an organization to fail. So every assumption needs to be poked and tested, first through tabletop exercises, but eventually through physical testing, to make sure that redundancies and recovery methods actually work.
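For readers who think in code, here is a minimal sketch of that measure, remediate, test circle. The control identifiers and findings are illustrative placeholders, not an assessment discussed in the episode.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    control: str              # framework control identifier (illustrative)
    description: str
    remediated: bool = False
    test_passed: bool = False

def measure() -> list[Finding]:
    # Placeholder for an assessment against whatever framework the business uses.
    return [
        Finding("ID.AM-1", "Asset inventory is incomplete"),
        Finding("RC.RP-1", "Recovery plan has never been exercised"),
    ]

def remediate(findings: list[Finding]) -> None:
    # Placeholder for the real remediation work on each finding.
    for f in findings:
        f.remediated = True

def test(findings: list[Finding]) -> list[Finding]:
    # Placeholder for a tabletop or live exercise; failures carry into the next cycle.
    failures = []
    for f in findings:
        f.test_passed = f.remediated  # toy rule: remediated items pass the test
        if not f.test_passed:
            failures.append(f)
    return failures

# One turn of the circle; in practice this repeats on a fixed cadence.
findings = measure()
remediate(findings)
carry_forward = test(findings)
print(f"{len(findings)} findings measured, {len(carry_forward)} carried into the next cycle")
```

The value is in the shape: every test failure becomes input to the next round of remediation, so the loop never really ends, it just runs on a cadence.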

Lots of different options and plans that need to be in place. Exactly. Well, Adam, I’ve got one last question for you. How can folks assess their response in the aftermath of the CrowdStrike outage? Yeah, so if they were impacted by CrowdStrike, hopefully they’re having that post-mortem conversation and doing the same kind of measuring I talked about. One of the big things we’ve seen is help desks being overloaded. If an organization experienced a call load on their help desk that they couldn’t manage, they should look for partners who can help them scale, either through staff augmentation or managed professional services for events like this, rather than trying to hire teams big enough for when something like this happens. But they need to identify where they fell short, find ways to remediate it, and get into that cycle of measuring, remediating, and testing.

And if an organization was not impacted, I think it’s pretty crucial to say: hey, I wasn’t using CrowdStrike, but the risk still exists. While this happened with CrowdStrike, this type of problem could come from any rogue application update, especially for things like endpoint protection that typically update very rapidly and at large scale, the way CrowdStrike did. So if you weren’t impacted by CrowdStrike, you should still look at what the impact would have been in your own organization and play those things out. Role-play the scenario: what would happen if every Windows machine in our environment blew up and we had to reboot every single one in safe mode and remove one file? Could we do that? Just ask the question, run through that scenario, and add it to the things you’re testing in that risk profile.
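To play that scenario out with numbers, here is a back-of-the-envelope sketch. The fleet size, minutes per machine, and technician count are made-up assumptions to illustrate the exercise, not figures from the episode.

```python
# Rough tabletop math for the "touch every Windows machine by hand" scenario.
# All inputs are illustrative assumptions.
total_windows_hosts = 5_000   # machines stuck in the boot loop
minutes_per_host = 15         # safe-mode reboot plus removing the bad file
technicians = 20              # staff available for hands-on remediation
hours_per_shift = 10

total_hours = total_windows_hosts * minutes_per_host / 60
hours_per_technician = total_hours / technicians
shifts_to_recover = hours_per_technician / hours_per_shift

print(f"Total hands-on effort: {total_hours:,.0f} hours")
print(f"Per technician: {hours_per_technician:,.1f} hours "
      f"(about {shifts_to_recover:.1f} {hours_per_shift}-hour shifts)")
```

If the answer to “could we do that?” turns out to be measured in weeks, that gap goes straight onto the remediation list: recovery-in-place tooling, out-of-band management, or a prioritized wave plan for which machines get touched first.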

Lots of key planning, and lots of key options out there. Adam, I really want to thank you very much for taking the time to meet with us today, go over these questions, and give us more detailed information on the CrowdStrike outage. And everyone, I want to thank you for listening. Remember to subscribe wherever you get your podcasts, and visit intervision.com to learn more. Thank you all very much.
