From Impact To Improvement

Lessons Learned from the CrowdStrike Incident
Sarah Morrison
August 1, 2024

It has been just over a week since the worldwide CrowdStrike incident and, as with any cyber incident, it is important to step back and assess what we have learned, not just from the impact on our own companies but also from those that were more heavily affected. What could we have done to reduce the impact? Most importantly, what changes can we make to be better prepared for similar incidents in the future? We have all been bombarded with technical details of the incident, so in this article we will move beyond the specifics and focus on the broader lessons learned over the past week.

Organisational Resilience

The key lesson for most businesses has been the importance of organisational resilience—the ability to continuously deliver services, products, and intended business outcomes despite adverse cyber, IT, or other events. This incident provided an opportunity to activate incident response plans, business continuity plans, and disaster recovery plans, observing how they function in a real-life major incident. It also allowed for identifying gaps and areas for improvement. For those with compliance requirements that mandate annual testing of these plans, this incident served as a real-world test. So there’s one positive!

Business Continuity vs Disaster Recovery

While business continuity (BC) and disaster recovery (DR) are closely related when discussing organisational resilience, they serve distinct purposes.

Business continuity, as the name implies, refers to the plans and processes in place to ensure that essential business services can continue during and after a disruption. For instance, during the recent CrowdStrike incident, your BC plan would ensure that your organisation keeps running even while systems are down and during the recovery period.

Disaster recovery, on the other hand, is a subset of business continuity that specifically focuses on the restoration of IT systems and data following a disruption. It is a more reactive process, addressing the immediate aftermath of an incident to return systems to their normal state. In the case of the CrowdStrike incident, DR processes would involve restoring systems affected by the update and bringing them back to operational status.

These two concepts are heavily intertwined. While the business typically develops and owns the BC plan, the DR plan is more IT-focused and usually owned by the IT department. A common mistake many organisations make is failing to provide IT with the necessary input to develop a robust DR plan, a topic I will discuss later in the article.

Security Products Aren’t Typical Software Products

There has been a lot of misinformation circulating over the past week. Many people on social media have claimed that companies affected should have had better change control and testing processes, which would have prevented them from being impacted by the CrowdStrike update. The problem is that this perspective misunderstands how products like CrowdStrike work.

In an ideal world, change control processes involve logging and approving changes via a Change Advisory Board (CAB), followed by scheduled testing, user acceptance testing, and documented rollback plans for quick restoration if needed. However, security products, especially those like CrowdStrike, require continual updates to identify and protect against the latest threats. These updates typically occur multiple times a day to provide rapid responses to changes in the global threat landscape.

If a new zero-day vulnerability emerges, swift detection and blocking by security controls are crucial. Running these frequent security updates through traditional change processes and testing just isn’t feasible and defeats the purpose of these rapid updates. It would increase exposure time to new threats before protections are implemented, and maintaining the resources to run a change process for multiple updates every day isn’t practical either. Security products must update rapidly to remain effective, making traditional change control methods unsuitable in this context.

Areas of Focus for Most Organisations

For many organisations and IT personnel, the CrowdStrike incident was the first time they faced an incident of such magnitude, putting their plans to the test. Based on our experience running incident response tabletop exercises and worst-case scenarios with clients, we have identified common gaps and weaknesses that would have been exacerbated during the recent incident.

Information Asset Register

While many organisations maintain a device-level asset register for laptops, desktops, servers, and mobile devices—primarily for finance and licensing purposes—an information asset register is more comprehensive. It documents enterprise applications and systems that store, transmit, or process data essential for normal business operations. These can include internal systems, cloud services like Microsoft 365 and AWS, HR systems, CRM platforms, and even physical assets like filing cabinets holding sensitive data.

An information asset register helps identify and document critical information about these assets, such as confidentiality, integrity, and availability (CIA) ratings, types of data held (e.g., PII), and recovery time objectives (RTO) and recovery point objectives (RPO). Accurate BC and DR plans depend on this information. Defined RTOs dictate recovery options, with shorter RTOs requiring faster (and often more expensive) recovery methods. RPOs determine backup processes and timing.
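To make the link between the register and your backup processes concrete, here is a minimal sketch in Python. It assumes a simple 1 to 3 rating scale for CIA, and every name in it (InformationAsset, backup_gaps, the example assets and their values) is hypothetical and purely illustrative. The idea is that once RPOs are recorded against each asset, checking whether the current backup schedule can actually meet them becomes a straightforward comparison.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class InformationAsset:
    """One row of a hypothetical information asset register."""
    name: str
    owner: str            # business owner of the asset
    confidentiality: int  # 1 = low, 3 = high (assumed scale)
    integrity: int
    availability: int
    rto: timedelta        # maximum tolerable time to restore the asset
    rpo: timedelta        # maximum tolerable window of lost data
    data_types: list[str] = field(default_factory=list)


def backup_gaps(register, backup_intervals):
    """Return names of assets whose backup interval exceeds their RPO,
    or that have no backup interval recorded at all."""
    gaps = []
    for asset in register:
        interval = backup_intervals.get(asset.name)
        if interval is None or interval > asset.rpo:
            gaps.append(asset.name)
    return gaps


# Illustrative entries only; real values come from the business.
register = [
    InformationAsset("CRM platform", "Sales", 3, 2, 3,
                     rto=timedelta(hours=4), rpo=timedelta(hours=1),
                     data_types=["PII"]),
    InformationAsset("HR system", "People and Culture", 3, 3, 2,
                     rto=timedelta(days=2), rpo=timedelta(hours=24),
                     data_types=["PII"]),
]

# How often each asset is currently backed up (assumed figures).
backup_intervals = {
    "CRM platform": timedelta(hours=6),
    "HR system": timedelta(hours=24),
}

print(backup_gaps(register, backup_intervals))  # ['CRM platform']
```

In this example the CRM platform is backed up every six hours against a one-hour RPO, so it is flagged. Either the backup frequency needs to increase or the business needs to accept a longer RPO.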

Business Impact Assessments

Business Impact Assessments (BIAs) are crucial for understanding the importance of assets and processes to a business. They help determine the impact on the business if a specific asset or process becomes unavailable. This information feeds into your business continuity and disaster recovery plans, defining your recovery time objectives (RTOs) and recovery point objectives (RPOs).

For mission-critical assets and processes, BIAs help outline how the business will continue to operate if these become unavailable. The focus is not on why or how they become unavailable but on ensuring the business can still run without them. In the recent CrowdStrike incident, having run comprehensive BIAs across the business would mean that, regardless of whether the disruption was caused by a CrowdStrike update, a threat actor, or another cause, your business could continue to function—perhaps at a diminished capacity, but still operational. This capability is key to achieving true organisational resilience.

Disaster Recovery Plans Aligned with the Business

One of the biggest issues we see is disaster recovery (DR) plans developed in isolation by the IT team. This isn’t the fault of IT; DR is typically seen as an IT responsibility because it involves restoring IT systems and data. However, without support from across the business to provide critical information such as recovery time objectives (RTOs) and recovery point objectives (RPOs), IT must make educated guesses when creating these plans.

Without knowing the criticality of various assets across the business, IT might not prioritise correctly. Shorter RTOs and RPOs generally involve higher costs, so IT will often balance recovery speed and cost. This balance may result in mission-critical systems not being recovered in the time needed by the business. This is why business impact assessments are so crucial.

BIAs provide the necessary context to ensure DR plans are aligned with business needs, ensuring critical systems are prioritised appropriately and recovery efforts are both timely and cost-effective. This collaboration between IT and the business is essential for developing robust and effective DR plans.

What Do I Restore First?

Continuing from the discussion on Business Impact Assessments, a critical question during incidents like the recent one, where many systems are affected, is: what do I restore first? If you have performed your BIAs and used them to develop your disaster recovery plan, you should have a prioritised list of assets and systems to restore. However, sometimes BIAs and DR plans do not delve deeply enough.

In the recent incident, a large number of desktops and laptops were impacted, preventing personnel from performing their jobs. While most business continuity plans and DR plans include restoring endpoint systems, they often lack a detailed sub-order within the business. Identifying which business units or users are more critical for keeping the business running is essential.

Restoring all endpoints at once is usually not feasible, especially when IT support must manually attend to each system. Manual workarounds may be needed before an official vendor fix, requiring individual attention to each system. Knowing where to start and who to prioritise can be critical for maintaining business operations during such events.

This prioritisation, guided by thorough BIAs, ensures that the most critical parts of your business are restored first, making a significant difference in your organisation’s ability to manage and recover from large-scale disruptions.

Lessons Learned

Following any significant incident (even simulated incidents), it is critical to review and learn from the experience. These lessons should inform and improve your incident response, business continuity, and disaster recovery plans, ensuring future incidents are handled more efficiently and effectively.

Gather relevant stakeholders from across the business, including critical third parties such as outsourced IT providers, to review the incident. Discuss what worked well and, more importantly, what did not. Identify how your processes and plans can be improved and update them accordingly.

Ensure that these revised plans are tested to confirm they are fit for purpose. The last thing you want is to discover shortcomings during a real high-pressure incident. Regular testing and updating will help ensure that your plans are robust and effective when needed.

Criticality of Testing

Frequent testing of incident response, business continuity, and disaster recovery plans is critical. We conduct numerous simulations with our clients, yielding valuable lessons, improvements, and training for personnel on processes and procedures during an event.

Tabletop simulations, whether run internally or with third-party assistance, are particularly beneficial. These exercises, being simulations, remove the stress and time constraints of real incidents. They allow participants to step back, discuss the situation, involve the right people, and develop response strategies. This environment fosters critical thinking, problem-solving, and questioning, which are challenging to achieve during high-pressure crises when decisions must be made quickly.

Having pre-thought-out responses to various scenarios ensures that, if or when an incident occurs, you are better prepared. This preparation leads to a more informed and effective response, reducing the impact on your business.

Worst Case Scenarios

We work extensively with financial services that must comply with APRA CPS 234, which includes a highly beneficial control applicable to any organisation: worst case scenarios. These are situations that might not be covered in typical incident response, business continuity, and disaster recovery plans because of their very low probability. However, if they do occur, they can have massive or even catastrophic impacts, potentially threatening the very existence of the business. The recent global incident is a prime example of such a scenario for many organisations. Few anticipated it, the impact was enormous, and it’s unlikely we’ll see a similar event from the same vendor.

Running these scenarios with clients often uncovers overlooked gaps in BC and DR plans, highlighting areas for improvement. While some gaps may be impractical to fix due to complexity or cost, identifying and documenting these issues is crucial. The business can then accept the risk and revisit these plans in response to changes in threat or risk profiles. If you already run incident response scenarios effectively and consistently, incorporating a yearly worst case scenario tabletop exercise is highly advisable.

Summary

The recent global incident highlighted the critical importance of organisational resilience—our ability to maintain operations during a disaster and recover to normal operations quickly. Achieving this requires significant preparation and a holistic approach across the business to identify and document what is essential for operations and survival. To be truly resilient, you need to:

  • Understand what you have: Maintain a comprehensive inventory of assets and systems.
  • Know what’s important to the business: Identify critical assets and processes, and run business impact assessments.
  • Define how long you can survive without these assets or processes: Establish recovery time objectives (RTOs) and recovery point objectives (RPOs).
  • Document processes to operate at diminished capacity: Ensure continuity plans address how to function without certain assets or processes.
  • Plan for recovery to a “normal” state: Develop strategies to restore operations within the timeframes defined by the business.
  • Continually reassess: Regularly update plans to reflect changes in business operations, threats, and risk profiles.

Covering all these areas will not prevent future incidents, but it will help keep their impact on your business to a minimum.

For help in developing incident response documentation, or if you would like to test your incident response capabilities through a simulated event, contact us.

This article was originally written as a guest article for Women on Boards (WOB).

Sarah Morrison

Sarah is the Co-CEO of Morrisec. With over 20 years in cybersecurity and a PhD in Russian information operations, Sarah has a deep understanding of threat actors and their tactics and motivations, making her highly equipped to assist organisations in their defence against them.
