Recently, InterVision ran a full disaster recovery (DR) simulation on our own services and infrastructure to measure our ability to support continuous operations during an unplanned event.
InterVision tests its own resiliency in various formats on an ongoing basis, but our latest DR simulation was the most important test of the year. In this particular simulation, to make comprehensive assessments and measure success, InterVision simulated (emphasis on simulated) a prolonged disruption, not a total loss, of our west zone datacenter.
We set up an independent simulation team to architect the event, measure efficiency, and document everything throughout the day. The rest of the company (yes, everyone else) had no idea what would happen next and had to deal with unforeseen circumstances, just as they would in a real event. The simulation architects made sure the simulated event was severe enough, and ran long enough, that if it had been real, the clients who currently use our recovery service would likely have needed to be migrated to another location. In other words, we had to simulate a full recovery failover.
As in real disasters, sub-events often compound the severity. Think Murphy's Law. To mimic the stress of a real-world situation, the simulation team introduced an additional simulated problem during the failover process: a firmware event on one of the storage subsystems. This secondary disaster was key because it forced the InterVision team to weigh the issue carefully, knowing that acting on the simulated second issue could cause the simulated problem to "spread" throughout the west location, and possibly to all other locations.
After nearly a full day of running the DR simulation and successfully responding to various situations, InterVision came together as a company to reflect and outline seven key points that IT directors, CTOs, and IT teams can take away from our successful experience:
1. You have to test. Measuring success seems obvious, but it can be subjective and confusing. Questions you should consider during the test are:
- Did the applications, data and services return within the prescribed amount of time, i.e., your recovery time objective (RTO)? (A simple way to measure this is sketched after this list.)
- How well did you execute your plan, and how well did you respond to the unplanned?
- Did everyone perform their role, were they set up for success, and what happened when "X" was supposed to be in place and he, she or it was somewhere else, or worse, missing altogether?
- Additionally, InterVision identified a team of "observers" to supplement the simulation architecture team and capture notes that might otherwise have been missed by those busy solving the simulated event.
2. Prioritization. Prioritize your systems, people and time before a disaster (or simulation) happens. If you haven’t prioritized your apps and your data, you should. InterVision’s DR experts can help with this if you aren’t sure which framework or process is best. Identify individuals’ roles and responsibilities early on. This is key because you want to divide and conquer, and because you need bench strength, whether internal or external. A good rule of thumb is to be intentional about how much time you split between planning, maintaining control, and managing the chaotic, dynamic nature of the incident.
3. Inclusion. A successful simulation should include your vendors, and your clients, as appropriate. If a real event occurs, your partners should already know you might reach out, and ideally they will have committed to respond or at least shown a willingness to participate. One item we considered for the simulation but didn’t end up using was a licensing key issue. In that case we would have needed to engage a vendor, possibly at 3 a.m., to generate a new key for one of our applications. That is tough to do under normal circumstances, let alone extreme ones.
4. Time consideration. Root cause takes time. Often the first thing you see isn’t the real issue. Plan for enough time to do a deep enough analysis to determine the real issue. For example, a system may become corrupt and it may appear that restoring a backup will resolve the issue. But if the problem was actually caused by an attacker or by damaged hardware, the backups themselves may be compromised, or the corruption will simply recur on the damaged hardware, so verify the backup before you trust it (see the sketch after this list).
5. Command and control. Command and control systems are key, since the systems you use to manage an environment or situation are often running on the same platforms impacted by the event. Having a backup or alternate command-and-control path will save a lot of recovery time.
6. Communication. Striking the right balance of communication is difficult. Don’t just assume that your communication tools all work, or that people have access to them. Remote teams, or the spontaneous nature of events, can mean key members aren’t in the same location or able to talk to each other easily. Centralized whiteboards and online collaboration tools can make a crucial difference when communication is necessary, but these resources and tools need to be worked out and secured ahead of time so you don’t run into communication roadblocks.
7. Contingency plans. Establish and start your contingency plans in parallel. Many times the first approach doesn’t work, and it can take hours to recognize that it hasn’t. Have a backup plan, and even a backup to the backup plan, so you are fully prepared when a disaster occurs. (A minimal sketch of running a primary and a contingency plan in parallel also follows this list.)
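
On the first point, here is a minimal sketch of one way to measure whether services came back within the prescribed RTO: poll each service’s health endpoint and time how long recovery took. The endpoint URLs, the four-hour RTO, and the poll interval are illustrative assumptions, not InterVision’s actual targets or tooling.

```python
# Hypothetical sketch: measure time-to-recovery against a target RTO by
# polling each service's health endpoint until it answers. URLs and the
# RTO value below are illustrative assumptions only.
import time
import urllib.request

SERVICES = {
    "app": "https://app.example.com/health",
    "api": "https://api.example.com/health",
}
RTO_SECONDS = 4 * 60 * 60  # example 4-hour recovery time objective


def wait_for_recovery(url, poll_interval=30):
    """Poll a health endpoint and return how many seconds it took to respond."""
    start = time.monotonic()
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except OSError:
            pass  # still down; keep polling
        time.sleep(poll_interval)


for name, url in SERVICES.items():
    elapsed = wait_for_recovery(url)
    verdict = "within RTO" if elapsed <= RTO_SECONDS else "MISSED RTO"
    print(f"{name}: recovered in {elapsed:.0f}s ({verdict})")
```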
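
On the fourth point, this sketch shows one way to verify a backup against a previously recorded manifest of SHA-256 hashes before trusting it for a restore. The paths and manifest format are assumptions for illustration, not a description of any particular backup product.

```python
# Hypothetical sketch: compare a backup's files against a stored manifest of
# SHA-256 hashes before restoring. Paths and manifest layout are assumptions.
import hashlib
import json
from pathlib import Path

BACKUP_DIR = Path("/backups/latest")
MANIFEST = Path("/backups/latest.manifest.json")  # {"relative/path": "sha256hex", ...}


def sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backups don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(backup_dir, manifest_path):
    """Return the files whose current hash does not match the recorded manifest."""
    expected = json.loads(manifest_path.read_text())
    mismatches = []
    for rel_path, known_hash in expected.items():
        file_path = backup_dir / rel_path
        if not file_path.exists() or sha256(file_path) != known_hash:
            mismatches.append(rel_path)
    return mismatches


bad_files = verify_backup(BACKUP_DIR, MANIFEST)
if bad_files:
    print("Do not restore yet; these files fail verification:", bad_files)
else:
    print("Backup verified; safe to proceed with the restore.")
```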
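
And on the seventh point, a minimal sketch of kicking off a primary recovery plan and a contingency plan in parallel, then using whichever finishes first. The plan functions are placeholders; real plans would restore services rather than sleep.

```python
# Hypothetical sketch: run the primary plan and a contingency plan in parallel
# and take whichever completes first. The plan bodies are placeholders.
import concurrent.futures
import time


def primary_plan():
    time.sleep(5)  # stand-in for, e.g., restoring from the local replica
    return "primary plan finished"


def contingency_plan():
    time.sleep(8)  # stand-in for, e.g., failing over to another region
    return "contingency plan finished"


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(primary_plan): "primary",
               pool.submit(contingency_plan): "contingency"}
    done, pending = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    winner = next(iter(done))
    print(f"Using result from the {futures[winner]} plan: {winner.result()}")
    # The slower plan keeps running until the executor exits; in a real runbook
    # you would decide whether to cancel it or keep it as a warm fallback.
```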
Not confident in your current DR plan or your current DR provider? Try a more modern, cost-effective approach with InterVision Disaster Recovery-as-a-Service. We provide (and encourage) two free disaster recovery tests per year and co-develop a thorough, customized DR run book with your team so that together we have a documented plan of attack in case the unforeseeable happens to your business.
Contact us today and be up, running and tested with disaster recovery in weeks — with manpower to spare.
Does your DR Provider Take Its Own Advice?
At InterVision, we preach to our clients about the value and necessity of testing DR plans and systems. Unlike most other providers, however, we hold ourselves not only to the same standard we recommend to others, but to the highest standard of execution and success.
Do you assume that all disaster recovery providers protect themselves the same way they recommend to their customers? In our case, the answer is yes, but your provider might not be able to say the same, so you should ask.