Three disaster recovery mistakes and how to avoid them

Nov. 17, 2016
By Matt Ferrari, CTO, ClearDATA

While most people associate disaster recovery (DR) with earthquakes, tornadoes, and floods, such catastrophic events are rare. The reality is that 71 percent of disasters have more mundane causes: pulling the wrong cable, patching the wrong core switch, system intrusions or social engineering attacks, and general system failures that lead to data loss.

When these events occur, the data center can be down for several days or even weeks. That is a sobering thought, especially when you consider that 24 percent of organizations that suffer an outage of 24 hours or more close within two years, and 68 percent of organizations down for a week or more close within one year.

In an industry where so much is riding on data, you would expect most healthcare organizations to have some form of DR plan in place. However, you would be wrong.

In fact, in a survey of healthcare IT executives, nearly seven out of 10 say they do not have a DR plan in place. Many cite stagnant budgets and a lack of manpower. A vast majority simply believe implementing DR is too difficult.

Indeed, disaster recovery planning remains a daunting task for many healthcare organizations. If they have a plan in place at all, it tends to be very procedure-focused, overlooking some crucial big-picture items. Read on to learn more about three of the most common pitfalls in disaster recovery planning – and how avoiding them can actually make it much easier to put an effective plan in place for your own healthcare entity.

1. Failure to test

A famous “Far Side” cartoon depicts a husband and wife hunkered down in their bomb shelter, surrounded by cans of food, with scenes of nuclear devastation taking place in the world above them. In the caption, the wife berates her husband for forgetting to stock a can opener. Funny – and a rather apt analogy for how many healthcare providers do their own disaster planning. Few will be laughing, however, when they discover they’ve overlooked a fundamental component of preparing for disaster: conducting a test run first to see if their plans actually work.

In one memorable example, a large provider in the Northeast was hit by a hurricane that took its systems down for days, despite a well-established plan to fail over the systems to a location in the Midwest. When the time came to perform the actual failover, the provider’s data went down in both locations. A shock? Yes. Should it have been? No. Not if the provider had regularly tested the plan to make sure it worked instead of adopting a “set it and forget it” approach. The problem was that much had changed since the provider first “set it.” Software had undergone various patches, patient EHR records had changed, and so on. By the time an actual disaster came along, the failover site was not equipped to replicate the original site.

2. Overscoping

Failure to test is one common error. On the opposite end of the spectrum is overscoping – that is, designing a disaster recovery plan around the entire enterprise’s infrastructure instead of prioritizing which clinical and operational functions are most urgent to get back online first. Think about it: What good will it do to have your archived images available before your active records? Or the data for an employee wellness plan before your heart failure or sepsis risk analytics dashboard? You get the idea – a historic medical image from 15 years ago might be nice to have, but it’s certainly not a must like the data streaming from equipment in the ICU. As such, planning to recover all systems on the same timeline invites delays that could threaten patient lives.

An effective way to avoid overscoping is to think of your applications and data in terms of recovery time objectives (RTOs) and recovery point objectives (RPOs). The organization’s ability to absorb pain in each of these areas generally determines the parameters. The RTO is the time within which a business process must be restored after a disaster is declared in order to avoid unacceptable consequences. For an EHR, where patient care and safety are on the line, that may be 15 minutes or less. For the billing system, taking 24 hours before functionality is resumed through DR is often tolerable. The faster a system must be recovered, the more it costs – so it is important to be brutally realistic when prioritizing applications.

The RPO refers to data – specifically, how old can the last available copy of the data be while still allowing a provider to care for patients adequately? In some instances, older versions of data could suffice for a lengthy period of time, while in other cases the RPO would need to be much shorter. For example, if the RPO is set at 24 hours – typical of standard daily backups – the healthcare organization could lose a day’s worth of data entry in the event of a disaster. The more critical the data, the shorter the RPO needs to be.
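To make the prioritization exercise concrete, it can be sketched as a simple inventory sorted by objective. The application names and thresholds below are illustrative assumptions that echo the examples above, not prescriptions for any particular organization:

```python
from dataclasses import dataclass

@dataclass
class Application:
    name: str
    rto_minutes: int  # max tolerable time to restore the service
    rpo_minutes: int  # max tolerable age of the data at restore time

# Hypothetical inventory; values mirror the examples in the text.
apps = [
    Application("EHR", rto_minutes=15, rpo_minutes=5),
    Application("ICU telemetry", rto_minutes=15, rpo_minutes=5),
    Application("Billing", rto_minutes=24 * 60, rpo_minutes=24 * 60),
    Application("Image archive", rto_minutes=7 * 24 * 60, rpo_minutes=24 * 60),
]

# Recover the most time-sensitive systems first.
recovery_order = sorted(apps, key=lambda a: (a.rto_minutes, a.rpo_minutes))
for app in recovery_order:
    print(f"{app.name}: restore within {app.rto_minutes} min, "
          f"data no older than {app.rpo_minutes} min")
```

Ranking every system this way forces the “brutally realistic” conversation: anything that floats to the top of the list is what the (more expensive) fast-recovery infrastructure must cover.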

3. Having a backup mindset instead of a recovery mindset

A third and related error is planning with only backups in mind, as opposed to recovery. Here, it’s important to have plans in place to guarantee availability of applications in more than one location. This is among the strongest arguments for moving to the cloud if a provider hasn’t yet done so. It’s cheaper – and wiser – than co-locating disaster recovery infrastructure internally. A provider can even purchase disaster recovery as a service, paying only for capacity as needed. One of the biggest benefits of such a service is that experts are already at the disaster recovery site, with no need for staff to travel there to spin up the failover system.

Disaster recovery done right

A final example is one of disaster recovery planning done right. A regional provider, with roughly 400 to 500 beds, scheduled a disaster recovery test every three months. The test’s overall objective was to move the running IT environment from the Pacific Northwest into Texas, with a requirement that electronic health records suffer no more than a five-minute interruption (the RTO) and no more than five minutes of data loss (the RPO). Because of the provider’s commitment to testing quarterly, the entire procedure of failing over from one site to another eventually took place in under 45 minutes, with less than five minutes of actual downtime.

Note this was an incremental plan – the provider was only failing over what it could afford to and still see patients. But this, combined with regular testing, enabled the provider to be truly prepared. Ultimately, the provider created a culture of preparedness, which can protect your own organization in the wake of a disaster – and even prevent some disasters from happening at all.

Final thoughts on a disaster plan

Effective DR consists of two elements. One is a core disaster recovery plan that duplicates and stores the raw data in a secondary system – either on premises or offsite in a physical or cloud data center. Unlike backup data, which has to go through a lengthy and complex restore process to be useful, this second data set is ready to be consumed as soon as the servers at the secondary site are spun up. This form of DR is ideal for Tier 3 or Tier 4 applications, such as Exchange or SharePoint.


The second form is replication (also known as business continuity). With replication, the systems to which the organization’s data is duplicated are already operating in standby mode. Should a disaster be declared, the system is failed over and data quickly becomes available to applications once again. This is the preferred method for Tier 1 applications, such as EHRs, that have more aggressive uptime requirements.
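As a rough illustration of this two-track approach, the mapping from application tier to DR method might be sketched as follows. The function name and the tier cutoff are illustrative assumptions, not a standard:

```python
def dr_strategy(tier: int) -> str:
    """Map an application tier to a DR approach.

    Tier 1 (e.g. an EHR) warrants replication to a warm standby;
    lower tiers (e.g. Exchange, SharePoint) can rely on duplicated
    raw data consumed once secondary servers are spun up.
    """
    if tier == 1:
        return "replication: warm standby, fail over on declaration"
    return "backup: duplicate raw data, spin up servers on demand"

print(dr_strategy(1))  # EHR-class workload
print(dr_strategy(4))  # e.g. SharePoint
```

The point of encoding the decision, even this crudely, is that every application in the inventory gets an explicit answer – nothing is left implicitly covered by “the backups.”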

Real-world examples illustrate why having a solid DR plan is so critical

The first involves the flooding in New York City in 2012 during Hurricane Sandy. An entire data center for the financial services industry went dark when the backup generators in the building’s basement were submerged in water. The data center provider had to physically pull its servers out of the racks and drive them to a data center it didn’t own in another city to restore functionality for its most important customers. Those customers went without their data for 4.5 days, while others were down for a week and a half before main power could be restored. Ask yourself – could your hospital or health system survive without access to data for more than four days?

In contrast, a hospital in Utah built its DR plan to include replication of data to the cloud and quarterly tests to ensure performance. When a huge snowstorm threatened to shut down power to its local data center, the hospital proactively failed over production to the cloud environment with an RTO and RPO of less than an hour – and with no loss of performance or security. In fact, the failover was so successful that the hospital is now considering permanently moving all of its production to the cloud.