All Systems Down

Aug. 16, 2011

On Aug. 18, 2009, the scenario every CIO dreads became a reality for Chuck Podesta when Fletcher Allen Health Care suffered an outage that knocked the EMR offline for more than seven hours. And while it wasn't a hurricane or an earthquake that hit the Burlington, Vt.-based academic medical center, what happened was about as close to a perfect storm as an organization can get.

The 562-bed regional referral center was hit by one unpredictable event after another, all of which started when a tree fell on a power line on a sunny summer morning. “There was no car accident, no storm, nothing like that,” Podesta says. “That's kind of freaky.” Podesta says the team is still analyzing the incident, but it appears the downed line sent a surge through the system.

And that, he says, was only the beginning for Fletcher Allen, which went live with the Epic (Verona, Wis.) EMR in June. For the next few hours the staff scrambled to figure out what went wrong.

The first point of failure occurred when the batteries in both strings of the uninterruptible power supply (UPS) failed, sending an electrical spike through the system. “When that happened, our hardware on the storage side shut down, which it should do, because it's protecting itself,” says Podesta. “The second thing that happened is the failover software - which is not part of the Epic EMR - malfunctioned.”

Podesta says the system is designed so that servers at each site constantly “ping” each other to make sure they are awake. Automatic failover is built in so that if one server doesn't respond, all capability switches over to the other. In this case, however, the servers were still running; it was the disk storage that had failed, so the failover never triggered. And because this wasn't obvious right away, the staff had to perform analysis to determine what had happened, which further delayed getting the EMR back up.
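To make the mechanism concrete, the sketch below shows a heartbeat check of this kind and why a server-only ping can miss the failure mode Fletcher Allen hit. The host name, port, and helper functions (check_heartbeat, check_storage, promote_secondary) are hypothetical stand-ins, not the actual failover software involved.

```python
# A minimal sketch of the heartbeat-and-failover idea described above.
# Host names, ports, and helper functions are hypothetical; this is not
# Fletcher Allen's or Epic's actual failover software.
import socket
import time

PRIMARY = ("primary.example.org", 5432)  # assumed address of the primary site

def check_heartbeat(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if the server accepts a TCP connection (it is 'awake')."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_storage(host: str) -> bool:
    """Stand-in for a storage-level probe (e.g., a test write to the array).

    This is the check that matters in the scenario above: the servers kept
    answering heartbeats while the disk storage behind them had shut down,
    so a server-only ping never triggered failover.
    """
    return True  # replace with a real probe against the storage layer

def promote_secondary() -> None:
    """Stand-in for switching all capability to the second data center."""
    print("Failing over to the secondary data center...")

while True:
    alive = check_heartbeat(*PRIMARY)
    if not alive or not check_storage(PRIMARY[0]):
        promote_secondary()
        break
    time.sleep(5)  # poll interval; real monitors run much tighter loops
```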

“If everything had worked the way it was supposed to, we should've had about a three to five minute downtime, because the failover would have gone to the second of our two data centers,” he says. But as the staff now knows, all plans go out the window with an unplanned downtime. “If you audited this and looked at what we did with failover and mirroring, you'd say, ‘Wow, this thing is never going to go down.' And that's what we thought. When you line up the scenario that happened to us, you realize that you can't replicate this in a million years.”

For Podesta, the ordeal demonstrated just how critical server availability is for an integrated health system, and how vital it is that the proper plans are in place.

As organizations become more reliant on electronic records, the need to protect data is increasing, says Terry Evans, CIO at Thibodaux Regional Medical Center, a 185-bed regional facility in southeast Louisiana. “Clinical information has become more real-time and crucial,” says Evans. “The access of information has made server reliability almost top on your list. You cannot afford to be down.”

Uptime strategies

Gary Weiner, manager of performance improvement and interim management at Dearborn, Mich.-based ACS Healthcare Solutions, says data protection should be a vital part of the CIO's overall strategy. According to Weiner, the focus should be threefold: maintaining high availability, having a disaster recovery plan, and identifying how long a system can be down before it negatively impacts the environment. “You need to determine what it will take, what it will cost, and what you need to do to ensure 100 percent reliability in case of failure,” he says.

Of the various methods used to protect data, the one he sees gaining serious traction in the healthcare industry is virtualization, a technology that can lower costs while improving availability, redundancy and recovery time. “In a virtualized environment, your total cost of ownership over a three to five year period can be reduced by 40 to 60 percent,” he says. “So it is a tremendous opportunity to not only save money but to enhance your environment.”

Another strategy is server clustering, in which servers are configured so that if one fails, the applications it was hosting continue to run on one of the remaining servers. “It's becoming more and more common,” says Weiner.
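As a rough illustration of that behavior, the sketch below reassigns a failed node's applications to the surviving nodes; the node and application names are invented, and a real cluster manager would also weigh CPU, memory, and licensing constraints before placing workloads.

```python
# Invented node and application names; illustrative only.

def rebalance(apps_by_node: dict[str, list[str]], failed: str) -> dict[str, list[str]]:
    """Move a failed node's applications onto the surviving nodes."""
    survivors = [n for n in apps_by_node if n != failed]
    orphaned = apps_by_node.get(failed, [])
    new_layout = {n: list(apps_by_node[n]) for n in survivors}
    for i, app in enumerate(orphaned):
        # Simple round-robin placement across whatever is still running.
        new_layout[survivors[i % len(survivors)]].append(app)
    return new_layout

cluster = {"node-a": ["emr-web"], "node-b": ["lab-interface"], "node-c": ["pharmacy"]}
print(rebalance(cluster, failed="node-b"))
# {'node-a': ['emr-web', 'lab-interface'], 'node-c': ['pharmacy']}
```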

At Thibodaux Regional, the IT staff utilizes multiple virtual servers to ensure clinicians always have access to data from the Meditech (Westwood, Mass.) EMR. Evans says if one of the servers is down or undergoing maintenance, the others pick up the slack. “It's more than just duality - this technology has given us ways where we can have absolutely no downtime.”

It has also created an environment that, while dynamic, does require upkeep when new IT applications are added to the hospital. “When new systems come on board, we add storage and we add servers in the virtual environment,” Evans says. “This way, everything has a backup strategy if it fails.”

Layers of protection

For Carilion Clinic, an eight-hospital, 1,125-bed organization, planning an aggressive Epic EMR rollout meant reassessing data storage capabilities. “You can't go to an electronic record unless you have a fail-safe IT backbone on which to run it,” says Daniel Barchi, senior vice president and CIO of the Roanoke, Va.-based system. “We thought that was important - so important, in fact, that in addition to our primary data center, we built a secondary data center as we were rolling out the EMR.”

According to Barchi, Carilion's data centers have dual network feeds, power systems and back-up systems, ensuring not only that each individual system can continue operations in the event of an emergency, but also that if one is completely down, operations can switch to the other. “This way, we can run our IT operations in parallel and we can failover from one to another, so we're never single-threaded through a single data center,” he says.

And while Carilion's data centers and EMR haven't experienced any downtime since the first sites went live in February of 2008, Barchi and his staff still feel it is critical to have a two-pronged approach to ensure that patient information is accessible even if servers are unavailable. If the hospital still has Internet access, clinicians can view data using read-only servers, and if all connectivity is lost, they can use business continuity computers on which all information is stored locally, providing a “continuously updated snapshot of what's happening with every patient,” says Barchi. “We have about four layers of protection. Your system can stay up, but if your network goes down, that's still a problem. That's why we've added those additional layers.”
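A minimal sketch of the locally cached, read-only snapshot idea might look like the following; the endpoint URL, file name, and refresh interval are assumptions for illustration, not Carilion's actual configuration.

```python
# The endpoint URL, file name, and refresh interval are assumptions for
# illustration; this is not Carilion's actual business-continuity software.
import json
import time
import urllib.request

SNAPSHOT_URL = "https://readonly.example.org/census.json"  # assumed read-only reporting endpoint
LOCAL_COPY = "census_snapshot.json"                        # cache on the workstation's own disk

def refresh_snapshot() -> None:
    """Pull the latest read-only patient summary and cache it locally."""
    try:
        with urllib.request.urlopen(SNAPSHOT_URL, timeout=10) as resp:
            data = json.load(resp)
    except OSError:
        return  # network or servers are down; keep the last good local copy
    with open(LOCAL_COPY, "w") as f:
        json.dump(data, f)

while True:
    refresh_snapshot()
    time.sleep(300)  # refresh every five minutes
```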

Test and test again

Simply having a data protection strategy in place, however, isn't enough, according to Weiner, who says periodic testing should play a key role in the CIO's server reliability plan. “Testing is critical to make sure that whatever you've deployed actually works,” he says, recommending that solutions be tested at least annually, if not semiannually or quarterly.

Thibodaux Regional tests its disaster recovery system once a year; in-house servers, however, are pulled down every 90 days for maintenance, which Evans says helps sustain a dynamic environment.

Carilion's staff schedules downtimes during off-hours to install upgrades. Barchi says nurses and clinicians can either use the read-only system or the local backup system to continue documenting patient care.

But while testing is a critical piece of the strategy, Podesta urges CIOs to learn from his organization's incident and go one step further by trying to anticipate failure scenarios before they occur.

“The problem is, when you test, it's in a controlled environment. If you really want to test your system, you need to think of various scenarios that might happen and try to mimic them in some way,” says Podesta.
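One lightweight way to capture those scenarios is to encode them as automated checks against whatever failover logic is in place. The sketch below uses pytest-style tests and a hypothetical should_fail_over decision function standing in for a site's real monitoring rules, including the "servers answering, storage gone" case that caught Fletcher Allen.

```python
# Hedged example of scenario-based testing: should_fail_over is a
# hypothetical decision function, not any vendor's actual logic.
# Run the checks with pytest.

def should_fail_over(server_alive: bool, storage_ok: bool) -> bool:
    return not (server_alive and storage_ok)

def test_storage_failure_with_live_servers():
    # The Fletcher Allen scenario: heartbeats still answered, disks gone.
    assert should_fail_over(server_alive=True, storage_ok=False)

def test_total_site_loss():
    assert should_fail_over(server_alive=False, storage_ok=False)

def test_healthy_site_stays_put():
    assert not should_fail_over(server_alive=True, storage_ok=True)
```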

And while he couldn't have foreseen that Fletcher Allen would lose its EMR for several hours just months after going live, Podesta says he was aware that downtimes are possible. “We incorporated the business continuity aspect and downtime procedures into the training process, so I think that helped out. But going forward, we'll probably do more testing than we originally planned.”

Though some leaders might balk at the costs involved in achieving such a high level of availability and redundancy, the investment is well worth it for the reduced risk of a potential downtime, says Weiner. “For every day that systems are down, it can decrease a hospital's cash flow and increase expenses,” he says.

For Evans, the costs related to duality and recovery all are part of the package. “It's part of the decision when you select a clinical system,” he says. “Because once you get into that environment, you can't go down - the hospital is out of business.”

Healthcare Informatics 2009 November;26(11):20-23
