A ransomware post-mortem

Clyde Hewitt
Vice President of Security Strategy,
CynergisTek

Introduction

In dissecting three separate events that occurred in the past few years, we learned that ransomware attacks can happen at any time and to any size organization. Healthcare organizations were not immune, as the malware used to deliver the ransomware was able to wreak havoc on each health system studied. While each attack was different, there are several key lessons that all healthcare executives should heed. Perhaps the biggest lesson is that ransomware attacks start with a whisper and culminate with a roar. Unless your organization is highly tuned to listen for the whisper and take immediate action, you should expect to lose critical systems. Malware attacks generally take less than an hour to infect all vulnerable systems, yet the operational damage can last for weeks while the recovery takes place.

This article discusses the impacts of three attempted ransomware attacks on three different healthcare organizations. Lessons learned from these events were shared through a series of interviews with key stakeholders at each institution. Their willingness to share their experiences and this information is invaluable to the rest of the healthcare community. The level of preparedness and security architectures at each institution varied widely, but the infection and impacts experienced were remarkably similar. For confidentiality purposes and clarity, those organizations will be identified as Provider A, B, and C respectively.

Discussion


Listening for the whisper

A ransomware attack generally starts by infecting a single vulnerable device. This could be a workstation, server, biomedical device, printer, or anything else connected to the local area network. Once compromised, that infected device will scan for other vulnerable devices it can see on the network, then propagate the malware. The details of how the successful ransomware infection spawns exponentially across a network, like dominoes, may vary.
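To see why the window for response is so short, consider a toy model of the scan-and-propagate pattern described above. It is only a sketch. The function name, the assumption that each infected device compromises two new devices per round, and the fleet sizes are hypothetical values chosen for illustration, not measurements from any of the three incidents.

```python
def rounds_to_saturate(vulnerable_devices: int, fanout: int = 2) -> int:
    """Count propagation rounds until every vulnerable device is infected.

    Assumption (illustrative only): starting from a single infected device,
    each infected device compromises `fanout` new vulnerable devices per round.
    """
    if vulnerable_devices < 1 or fanout < 1:
        raise ValueError("need at least one device and a fanout of at least one")
    infected = 1
    rounds = 0
    while infected < vulnerable_devices:
        # Every infected device reaches `fanout` new victims this round,
        # capped by the total number of vulnerable devices.
        infected = min(vulnerable_devices, infected * (1 + fanout))
        rounds += 1
    return rounds


if __name__ == "__main__":
    for fleet in (100, 1_000, 10_000):
        print(fleet, "devices ->", rounds_to_saturate(fleet), "rounds")
```

Under these assumed values, each tenfold increase in fleet size adds only two rounds of propagation. A larger network therefore buys very little extra time to respond.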

One common thread is that the malware can bypass existing controls by exploiting both zero-day vulnerabilities and known but unpatched ones. Once a device is infected, the malware leaves signatures that can trigger alerts from both local anti-virus and network intrusion detection systems, if they are properly configured. For zero-day attacks where the signatures are unknown, these systems likely won’t know how to react, but can still alert on an anomaly.

For one impacted health system (Provider A), the security information and event management (SIEM) and anti-virus consoles captured signatures from the initial infection, sending text alerts to the cybersecurity staff. The staff quickly responded and directed that all systems be disconnected from the network, starting with the Electronic Health Record (EHR).

In a separate incident, a small hospital of approximately 200 beds (Provider B) did not have a functional SIEM. End users were the first to notice system errors and reported them to the help desk. Remarkably, in both instances the time between the initial alert and complete network shutdown was less than an hour.

Previously, an integrated health network (Provider C) relied upon its EHR vendor’s hosted (cloud) solution to host its virtual servers and data. The cloud provider had robust alerting systems and first noticed the signature of the ransomware attack. The encryption “package” was blocked before fully encrypting the shared patient financial drives with scanned images. Although the EHR was protected and only one shared file server was attacked remotely, Provider C was severely impacted because the local vulnerable workstations and devices were infected. That single shared file server was restored by the cloud vendor in less than one hour. However, it took weeks before Provider C could disinfect its on-premises workstations and reconnect to its cloud EHR vendor.

In retrospect, Provider A’s SIEM solution was useful in alerting the IT staff. However, the SIEM ultimately was ineffective in stopping the spread of ransomware because the response time still required a human in the loop. Nearly all vulnerable machines were already infected before the IT staff could isolate the vulnerable devices. The experience of Provider C demonstrates that when that process is automated by the cloud provider with current patching, real-time monitoring, and IT staff dedicated to managing servers, the cloud-hosted data and servers were better protected. Even so, the ransomware attack still caused a significant outage because without the full EHR ecosystem of servers, data, and workstations, the clinicians were unable to access patient data.

Observations:

The presence or absence of a functioning SIEM did not ultimately affect the reaction time, since at both Provider A and Provider B the emergency response procedures were not well documented or exercised. The IT staffs intuitively focused on isolating the EHRs first, thus saving the critical patient data from encryption.
Efforts to save the EHR did not save the workstations and other devices from infection, and thousands of vulnerable workstations and devices were infected in the first hour, which ultimately affected access to the EHR.
It can be beneficial to use a mature cloud EHR vendor that shares security resources dedicated to protecting the cloud environment—but it does not protect the local health system, as the local IT shop was responsible for the hospital’s systems.

Initial response following an attack

In evaluating the different attacks, there were common threads in the response by the IT teams at each health provider. In each attack, once a malware event started, the initial reaction of the respective teams was to isolate all end-user systems from the Internet first, then from each other, using internal virtual LANs or even terminating site-to-site connections. The services critical for the local hospitals’ recovery, including any ‘management network’ which would allow for recovery, were preserved using virtual LANs. The cloud provider supporting Provider C blocked all traffic using a firewall rule to prevent the spread of infection. This link was not opened until Provider C could validate that all systems were remediated—almost two weeks later.

Once the clinical systems were disconnected from the network, Provider A’s CIO alerted the clinical executives to initiate clinical downtime procedures. This was the catalyst for alerting the rest of the C-Suite. As other executives were brought into the discussion, the impact to non-clinical operations came into focus.

The IT staff for Provider B delayed alerting the clinical staff while determining the cause. This delayed bringing the C-Suite into the discussion, and slowed the start of downtime procedures by a few hours. Ultimately it did not change the duration of the outage and they had to use downtime procedures for about two weeks.

Observations:

IT and the security team started assessments on the extent of damage after all systems were isolated.
Nearly all of Provider A and B’s workstations were quarantined by locking down routers and circuits to prevent further spread—and access to EHR servers was prohibited in order to prevent the ransomware from spreading.
In all cases, the initial estimates of a quick recovery were dampened with the realization that the recovery was going to take a long time.

The indiscriminate attack

The primary lesson learned by these organizations is that a ransomware attack does not discriminate. Anything connected to the network is potentially vulnerable: workstations, servers, network equipment, biomedical equipment, printers, and other IoT devices. A secondary lesson is that everything is impacted, whether vulnerable or not. Because of the technical and managerial challenges of rapidly determining which devices were vulnerable and infected, vulnerable but not yet infected, or not vulnerable, each organization wisely took everything offline to help contain the outbreak until an inspection process could be established.
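The three-way sort described above can be sketched in a few lines. This is a minimal illustration, not a tool any of the providers used. The `Device` record and its two flags, `exposed` and `compromised`, are hypothetical stand-ins for whatever an inspection process would actually establish about each machine.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    INFECTED = "vulnerable and infected"
    AT_RISK = "vulnerable but not yet infected"
    NOT_VULNERABLE = "not vulnerable"


@dataclass(frozen=True)
class Device:
    name: str
    exposed: bool      # hypothetical flag: runs software the malware can exploit
    compromised: bool  # hypothetical flag: infection indicators were found


def triage(device: Device) -> Status:
    """Place one device into one of the three categories."""
    if device.compromised:
        return Status.INFECTED
    if device.exposed:
        return Status.AT_RISK
    return Status.NOT_VULNERABLE


def summarize(devices: list[Device]) -> Counter:
    """Count how many devices fall into each category."""
    return Counter(triage(d) for d in devices)


if __name__ == "__main__":
    fleet = [
        Device("ws-001", exposed=True, compromised=True),
        Device("ws-002", exposed=True, compromised=False),
        Device("printer-7", exposed=False, compromised=False),
    ]
    for status, count in summarize(fleet).items():
        print(f"{status.value}: {count}")
```

The sorting logic itself is trivial. The hard part during an outbreak is establishing the two flags reliably for every device, which is why taking everything offline first was the safer choice.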

Provider C’s cloud vendor was the exception because it had a robust SIEM, integrated with an intrusion detection and prevention system, and a security operations center monitoring alerts in real time. In this attack, only the one infected virtual server was required to be shut down. It was returned to service in less than an hour.

In each incident, the hospitals’ clinical IT systems were affected the most. But other non-IT clinical applications such as biomedical, laboratory, radiology, and pharmacy were also impaired directly, or indirectly, as a result of the steps taken to isolate the network.

Non-clinical computer systems, such as timekeeping/payroll, HR, supply chain management, and finance, were also inaccessible because of the malware or the methods used to control the spread. Finally, Provider A’s CIO reported that other connected organizations (members of an Organized Health Care Arrangement (OHCA) and third-party vendors) were also alerted about the attack and took steps to protect their systems.

Observations:

In each instance, once the attacks were recognized, all nonessential systems on the network had to be quarantined in order to prevent the attack from spreading, regardless of criticality.
Other departments outside of the traditional IT staff were alerted and isolated.
Third parties, once notified, also isolated their systems out of caution.

Operational concerns

In reviewing the management issues, the CIOs from Provider A and B both suggested their departments’ initial reaction was that the respective incidents could be handled internally, with remote support from the respective network and antivirus vendors. Provider A’s CISO came to realize that each workstation must be physically touched in order to perform a complete reimage, so the time estimates quickly overwhelmed the organization’s resources. Consequently, Provider A’s CIO reached out to other regional hospitals and vendors for on-site assistance—which helped address some of the staffing surge requirements.
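A back-of-the-envelope calculation shows how quickly hands-on reimaging can exceed in-house capacity. The sketch below is illustrative only. The function name and every input value (workstation count, minutes per machine, number of technicians, hours worked per day) are hypothetical assumptions, not figures reported by any of the providers.

```python
import math


def reimage_days(workstations: int, minutes_each: int,
                 technicians: int, hours_per_day: int) -> int:
    """Estimate whole days needed to reimage every workstation by hand.

    Assumes each technician works on one machine at a time and ignores
    travel between sites, failed images, and other real-world delays.
    """
    if min(workstations, minutes_each, technicians, hours_per_day) <= 0:
        raise ValueError("all inputs must be positive")
    total_minutes = workstations * minutes_each
    minutes_per_day = technicians * hours_per_day * 60
    return math.ceil(total_minutes / minutes_per_day)


if __name__ == "__main__":
    # Hypothetical scenario: 3,000 workstations at 45 minutes each,
    # with technicians working 12-hour days.
    print("in-house staff of 10:", reimage_days(3000, 45, 10, 12), "days")
    print("with outside help, 30:", reimage_days(3000, 45, 30, 12), "days")
```

With these made-up inputs, tripling the number of technicians cuts the estimate from 19 days to 7, which illustrates why on-site assistance from other hospitals and vendors mattered.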

The Joint Commission accreditation and certification standards require a hospital to have policies for initiating and implementing downtime procedures. The larger organizations had deployed computers that were isolated from the normal network and had access to a ‘read-only’ copy of the EHR.

Provider B’s architecture connected its downtime computers on the LAN. Consequently, it lost all access to the “Continuity of Care” documents as its downtime workstations (designed for read-only access) were also infected with the ransomware. The organization addressed this issue by administratively discharging and then readmitting all patients, effectively starting over with a new paper medical record. This increased the cost of care as tests needed to be re-administered because the prior results were not available.

Provider A’s network’s EHR was configured to create snap backups of the medical record every few minutes, but these read-only copies lost much of their value after a few days.

Since Provider A’s EHR and laboratory systems were disconnected out of precaution, its hospitals reported the need to resort to paper for laboratory orders and results. This adversely impacted the response time as ‘runners’ were needed to replace the lab interfaces and deliver results back to the nursing units. This also increased staffing requirements for weeks until the recovery process brought systems back online.

The focus on developing downtime procedures for clinical systems diverts attention from the need to also have downtime and alternate procedures for the non-clinical processes that keep a hospital running, e.g., registration, finance, claims processing, supply chain management, HR, and that important function—employee payroll.

Finally, Provider A reported that its partners and vendors were reluctant to reconnect after the ‘all clear’ was given. At least one vendor of laboratory equipment insisted on replacing the entire system, stating it did not trust Provider A’s ability to perform a field-reimage. The time to remove, install, and recertify the system meant that certain tests were not available to be performed in the interim.

Observations:

Executives are conditioned to focus on contingency plans for patient care, as this is both a HIPAA and a Joint Commission requirement. While downtime procedures are mandatory in the clinical space, the nonclinical business downtime processes are less mature, or sometimes nonexistent.
Even with downtime procedures, Provider A and B’s CIOs confirmed they canceled some elective surgery to free up resources and reduce risk. This impacted financials and public trust.
As the recovery stretched beyond the first week, Provider A’s HR departments raised concerns, the first being how to pay employees without functioning timekeeping or HR systems. They worked with their banks to rerun the previous payroll, but this process had error rates of 20% or more, especially for hourly staff. In addition to trying to continue to deliver care using downtime procedures, clinical managers also had to manually track time and work with their HR departments to cut paper payroll checks to deal with personal hardships.
Both Provider A and B’s materials management departments had to revert to manual supply-ordering procedures, including the need for more manual inventories, delayed processing, and the resulting shortages that appeared because of manual processing errors.

Technical recovery

Provider A reported that nearly all the client workstations in the primary hospital were impacted. The organization was able to limit the malware infection rates in the satellite hospitals because it quickly blocked network traffic between hospitals. Technical recovery of the workstations was performed by using USB drives to completely reimage the systems. The central IT department needed an adequate supply of large capacity USB thumb drives on-hand as part of the disaster recovery kit, just for this purpose. This allowed the external assistance provided by the other regional hospitals to quickly address all of the workstations. Even so, recovery took almost two weeks before all the hospitals were back online. The ambulatory and other connected sites took an extra week.

Provider B reported that all vulnerable systems were compromised. Its workstation recovery was accomplished using portable USB hard drives, and it leveraged its close proximity to several office supply stores to quickly purchase what was needed.

Observations:

In these two incidents, all data was stored on server-hosted shared drives. This significantly helped speed up recovery.
The need for an adequate supply or supplier of USB drives was paramount for the recovery process.

Documentation requirements

Generally speaking, the Office for Civil Rights guidance on “Ransomware and HIPAA” says that when electronic protected health information (e-PHI) is encrypted as the result of a ransomware attack, a breach has occurred because the e-PHI encrypted by the ransomware was acquired (i.e., unauthorized individuals have taken possession or control of the information). However, an investigation that takes into consideration the type of malware used and the known or assumed actions of the attacker must still be completed to determine whether “disclosure” or notification under the HIPAA Breach Notification Rule is required.

This is a fact-based determination. Provider A’s anti-virus vendor determined that the malware was designed to encrypt the data but, because of a programming error by the attacker, was unable to execute the ransomware code. As a result, Provider A concluded that no data was accessed, acquired, or otherwise compromised by the malware.

The aftermath

In the end, every organization recovered, but the negative fiscal impact was only part of the analysis. Provider B’s CIO reported that the recovery cost approximately 60% of the annual IT budget. The organization hopes to recoup some of that deficit from its cyber-insurance carrier.

Provider A also experienced a significant financial impact, primarily from delayed and lost charge capture. The delay in charge capture in turn held up claim submission and data reentry. Once the systems were back online, financial recovery took months and ultimately impacted cash flow and cash reserves. Provider A’s CIO also reported the organization’s deficit exceeded $50M, but ultimately it was able to recover part of that amount. After two months, Provider A was still about $30M behind budgeted reimbursements.

Once the technical recovery has been completed, the manual process of reentering charge capture, syncing laboratory and other orders, and documenting the outcome takes months and requires uncalculated, unbudgeted staff hours, which ultimately drive up costs.

The human toll

The biggest impact to the organizations was the human toll on the staff, especially on the executives and clinical staff who clearly understood the risk of patient harm. The challenges of providing care in a highly degraded environment—while trying to identify the compromised devices and recover the systems—required total dedication as well as 100+ hour workweeks. This led to sleep deprivation after a few days, as even downtime was punctuated with stress and panic.

Provider A and B’s CIOs reported that they both took the malware attacks very personally. One CIO shared that he was thankful there was no blame from the clinical staff, as publicity from previous ransomware attacks suggests that everyone is vulnerable. Both CIOs reported initial feelings of anger, but they ultimately shared that they felt personally violated. These feelings often continue months after the attack has been remediated and all systems are fully operational.

Conclusion

There are several lessons to be learned from these three different attacks.

First, understand that a successful attack will expose weaknesses that could be exploited again, if not corrected. A root-cause analysis will identify the exploited weaknesses so they can be remediated, but it’s also critical to perform a complete risk assessment in order to identify the remaining control gaps. For example, two of the impacted organizations identified that their “index devices” were both assigned to system administrators, who had local Remote Desktop Protocol (RDP) enabled.
Second, learn that preparedness is critical. One of the CIOs commented that he would have preferred to spend resources preparing rather than recovering.
Third, business continuity management has historically shortchanged nonclinical processes, and this lack of preparation had the potential to shut down operations.
Fourth, senior executives have relied on their IT staffs to own the incident management/response process; however, all executives have a critical role and would benefit greatly by participating in incident response exercises.
Fifth, leveraging cloud providers, especially for email, allows for rapid communications; however, too much reliance on a cloud provider can delay the recovery if the organization is just one of many impacted when the cloud provider is the target of ransomware.
Sixth, assign an event recorder, as memories fade quickly during a crisis.
A final lesson is there are advantages to keeping some older technology, such as the paper fax machines, in the inventory as these may be the next viable communication path.