A faulty CrowdStrike security update crashed Windows machines around the world, causing severe disruptions across many sectors, including hospitals, airlines, and 911 emergency services. Ironically, CrowdStrike’s software is installed precisely to protect against attackers who might crash systems or, worse, steal data. This incident stands out as one of the worst cases of “friendly fire” in the software industry.

Let’s examine the technical details of the incident, its cause, and how to prevent similar occurrences in the future.
First, it’s important to understand CrowdStrike’s release types:
- Software releases
- Security Policy releases (for simplicity, these can be thought of as configurations)
By understanding these release types, we can better analyze the CrowdStrike incident and its implications.
Software Release vs. Policy Release
CrowdStrike software releases, which add new base code, are infrequent, while policy releases happen more often, typically a few times a month. The software release includes some kernel code, likely a Windows filesystem filter driver, which loads at boot time. This is not uncommon: most antivirus products use a filesystem filter driver to monitor file operations (I have prior experience in the Windows and Unix kernel FSD domain). The CrowdStrike kernel driver also contains a module called the Content Interpreter, which loads the security policy at boot time.
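To make the architecture concrete, here is a minimal user-mode sketch of how a boot-time component might load a policy file and hand it to a content-interpreter module. The structure names, file format, and magic number are invented for illustration; CrowdStrike’s actual driver code and policy file format are not public.

```c
/* Hypothetical sketch: a boot-time component reads a policy blob and
 * hands it to a content-interpreter module. The header layout and the
 * magic value are invented; this is not CrowdStrike's actual code. */
#include <stdio.h>
#include <stdlib.h>

/* Invented policy header for illustration only. */
typedef struct {
    unsigned int magic;       /* expected file signature              */
    unsigned int rule_count;  /* number of rules following the header */
} policy_header;

/* Stand-in for the kernel "Content Interpreter": parse the policy at load time. */
static int interpreter_load(const unsigned char *blob, size_t len) {
    if (len < sizeof(policy_header))
        return -1;                        /* refuse truncated policies   */
    const policy_header *hdr = (const policy_header *)blob;
    if (hdr->magic != 0xC0FFEE01u)
        return -1;                        /* refuse unrecognized formats */
    printf("loaded policy with %u rules\n", hdr->rule_count);
    return 0;
}

int main(void) {
    /* In the real product this happens inside the driver at boot; here we
     * simply read a local file to show the flow. */
    FILE *f = fopen("policy.bin", "rb");
    if (!f) { perror("policy.bin"); return 1; }

    fseek(f, 0, SEEK_END);
    long len = ftell(f);
    fseek(f, 0, SEEK_SET);

    unsigned char *blob = malloc((size_t)len);
    if (!blob || fread(blob, 1, (size_t)len, f) != (size_t)len) {
        fprintf(stderr, "failed to read policy\n");
        return 1;
    }
    fclose(f);

    int rc = interpreter_load(blob, (size_t)len);
    free(blob);
    return rc == 0 ? 0 : 1;
}
```

The key point is the division of labor: the driver binary changes rarely, while the policy file it consumes changes often, so the interpreter must treat every new policy as untrusted input.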
CrowdStrike releases new policies whenever they identify new attacks or discover vulnerabilities. These policy updates instruct the software on how to block the threats. No customer intervention is required for these policy releases.
February 2024 Software Release
CrowdStrike introduced a new software version in February 2024, which added named-pipe attack detection. Named pipes are commonly used for inter-process communication.
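For readers unfamiliar with named pipes, here is a minimal Win32 named-pipe server in C. The pipe name is made up, and the example has nothing to do with CrowdStrike’s code; it only shows the IPC mechanism the new detection targets.

```c
/* Minimal Win32 named-pipe server: the kind of inter-process channel the
 * February release started inspecting. The pipe name is illustrative. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    HANDLE pipe = CreateNamedPipeA(
        "\\\\.\\pipe\\demo_pipe",             /* pipe name (made up)     */
        PIPE_ACCESS_INBOUND,                  /* clients write, we read  */
        PIPE_TYPE_MESSAGE | PIPE_READMODE_MESSAGE | PIPE_WAIT,
        1,                                    /* one instance            */
        0, 4096,                              /* out/in buffer sizes     */
        0, NULL);                             /* default timeout, no SD  */
    if (pipe == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateNamedPipe failed: %lu\n", GetLastError());
        return 1;
    }

    /* Block until another process connects and sends a message. */
    if (ConnectNamedPipe(pipe, NULL)) {
        char buf[256];
        DWORD bytes = 0;
        if (ReadFile(pipe, buf, sizeof(buf) - 1, &bytes, NULL)) {
            buf[bytes] = '\0';
            printf("received: %s\n", buf);
        }
        DisconnectNamedPipe(pipe);
    }
    CloseHandle(pipe);
    return 0;
}
```

A client in another process opens the same name with CreateFileA and sends data with WriteFile. Because both legitimate software and malware use this channel, a sensor that wants to catch named-pipe abuse has to inspect this traffic, which is presumably what the new release added.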
Along with this software update, CrowdStrike released new policies for pipes after conducting stress tests. The software release likely underwent Microsoft’s Windows Hardware Quality Labs (WHQL) test suite to maintain certification.
CrowdStrike also uses a policy validator tool to verify policies before release. This tool was probably also updated to accommodate the new pipe policies.
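As a rough illustration of what such a validator has to check, here is a sketch in C that verifies a policy blob is internally consistent before it ships. The header fields and format are invented; the real validator and policy format are proprietary.

```c
/* Hypothetical pre-release validator check: confirm that the rule table a
 * policy header promises actually fits inside the file. Format invented. */
#include <stddef.h>
#include <stdio.h>

typedef struct {
    unsigned int magic;        /* expected file signature            */
    unsigned int rule_count;   /* rules declared by the header       */
    unsigned int rule_size;    /* fixed size of each serialized rule */
} policy_header;

/* Return 0 if the blob is internally consistent, -1 otherwise. */
int validate_policy(const unsigned char *blob, size_t len) {
    if (len < sizeof(policy_header)) return -1;
    const policy_header *hdr = (const policy_header *)blob;
    if (hdr->magic != 0xC0FFEE01u) return -1;

    /* The critical check: the declared rule table must fit inside the
     * file, otherwise the interpreter on the endpoint would read past it. */
    size_t expected = sizeof(policy_header) +
                      (size_t)hdr->rule_count * hdr->rule_size;
    if (expected > len) {
        fprintf(stderr, "policy truncated: header promises %zu bytes, file has %zu\n",
                expected, len);
        return -1;
    }
    return 0;
}
```

If a check like this is missing or buggy for a new policy type, a malformed policy can sail through to every endpoint.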
In March, CrowdStrike issued two more policy-only releases, which were stress-tested and run through the policy validator. Following these releases, CrowdStrike’s confidence in the software’s ability to handle named pipes increased, and they discontinued the stress-testing phase. From that point on, they relied primarily on the policy validator tool to vet new releases.
The July 19 Policy Release That Broke Windows
On July 19, CrowdStrike detected a new named pipe attack targeting Windows systems. In response, they quickly developed and released a Windows-specific policy update. However, this policy contained critical errors that went undetected due to bugs in the recently updated policy validator tool. Importantly, no stress testing was performed on this release.
The Cascade of Errors
When deployed on customers’ Windows machines, the faulty policy triggered a catastrophic chain of events:
1. The Content Interpreter code in the kernel driver attempted to read the malformed policy, hit an invalid memory address, and caused a system panic (a sketch of this failure mode follows the list).
2. Upon system reboot, the issue persisted because:
   - The kernel driver loads at boot time.
   - The Content Interpreter attempts to read the faulty policy again.
   - This leads to another system crash.
3. As a result, affected systems became trapped in a continuous crash loop, rendering them inoperable.
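To make step 1 concrete, here is a hypothetical illustration in C of the failure mode: an interpreter that trusts a count field in the policy and walks off the end of the buffer. The format is invented and this is ordinary user-mode code; in a kernel driver, the same out-of-bounds read becomes a bugcheck, i.e. the blue screen.

```c
/* Hypothetical illustration of the crash: the interpreter trusts a count
 * field from the policy file and reads past the end of the data it was
 * given. Structures and values are invented for illustration. */
#include <stdio.h>

typedef struct {
    unsigned int magic;
    unsigned int rule_count;   /* taken from the file, never validated */
} policy_header;

typedef struct {
    unsigned int action;
} policy_rule;

static void interpret_policy(const unsigned char *blob, size_t len) {
    const policy_header *hdr = (const policy_header *)blob;
    const policy_rule *rules = (const policy_rule *)(blob + sizeof(*hdr));

    /* BUG: rule_count is never checked against len, so a malformed policy
     * sends this loop into memory that was never mapped. In user mode that
     * is an access violation; in a kernel driver it is a system panic. */
    for (unsigned int i = 0; i < hdr->rule_count; i++)
        printf("rule %u -> action %u\n", i, rules[i].action);
    (void)len;   /* the missing bounds check would have used this */
}

int main(void) {
    /* A malformed policy: the header claims 1,000,000 rules, but no rule
     * data follows it at all. */
    policy_header bad = { 0xC0FFEE01u, 1000000u };
    interpret_policy((const unsigned char *)&bad, sizeof(bad));  /* eventually faults */
    return 0;
}
```

Because the driver reloads the same policy file on every boot, the fault is not a one-off: the machine hits the same bad read each time it starts, which is the crash loop described in steps 2 and 3.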
The Aftermath and Recovery Process
About 90 minutes after the release, CrowdStrike recognized the widespread system crashes and immediately halted the policy release, initiating a rollback. However, the damage was already done. The recovery proved to be a complex and manual process.
The Challenges of Recovering a Windows Machine
- Network services on affected machines were unavailable, preventing remote recovery. During bootup, login and network services are brought up at a much later stage, and the system crashed before it ever reached that stage (link to Windows bootup guide here).
- Many servers likely lacked attached keyboards or IPMI (Intelligent Platform Management Interface) access.
The Manual Recovery Process
Due to the aforementioned challenges, IT administrators had to physically access each affected machine and perform the following steps:
- Connect a keyboard (if not already present).
- Boot the machine in maintenance/recovery mode.
- Log in to the system.
- Manually delete the corrupted policy file from the filesystem (a sketch of this cleanup step appears below).
- If disk encryption was enabled (a common security practice), administrators also had to manually enter the disk encryption passkey for each machine.
This laborious process led to unusual scenes, such as IT staff using ladders to reach overhead displays in airports. Similar efforts were required in data centers and other affected locations.
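For illustration, here is a small Win32 C program that performs the deletion step: find policy files matching a pattern and remove them. The directory and file pattern are placeholders, not the vendor’s real paths, and in practice administrators typically did this by hand or with a short script from the recovery environment.

```c
/* Sketch of the cleanup step once a machine is booted into recovery or
 * safe mode: delete policy files matching a pattern. The directory and
 * pattern below are placeholders, not real product paths. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    const char *dir     = "C:\\ExampleDriverStore\\";            /* placeholder */
    const char *pattern = "C:\\ExampleDriverStore\\C-*.sys";     /* placeholder */

    WIN32_FIND_DATAA fd;
    HANDLE h = FindFirstFileA(pattern, &fd);
    if (h == INVALID_HANDLE_VALUE) {
        printf("no matching policy files found\n");
        return 0;
    }
    do {
        char path[MAX_PATH];
        snprintf(path, sizeof(path), "%s%s", dir, fd.cFileName);
        if (DeleteFileA(path))
            printf("deleted %s\n", path);
        else
            fprintf(stderr, "could not delete %s (error %lu)\n",
                    path, GetLastError());
    } while (FindNextFileA(h, &fd));
    FindClose(h);
    return 0;
}
```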
Datacenter Recovery Process
The Windows crashes likely caused a system-wide outage by bringing down multiple services in the datacenter.
Recovering the entire network requires a careful and sequential process of bringing up services. This process can be complex and time-consuming, with the level of complexity depending on the datacenter’s structure.
Typical Sequence of Service Restoration (oversimplified for illustration)
- Windows servers and Active Directory
- Network storage
- Backup systems
- Databases
- Application servers
Each datacenter typically has a predefined hardware bring-up sequence for catastrophic failures. The complexity of this sequence ranges from simple to highly elaborate, depending on how interdependent the systems are.
Full datacenter recovery can take anywhere from a few hours to several days, even with a 24/7 operations team.
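As a toy example of what such a runbook encodes, here is a sketch in C of a dependency-ordered bring-up: each service is started only after the service it depends on reports healthy. The service names, dependencies, and health check are placeholders for illustration.

```c
/* Toy sketch of a scripted bring-up sequence: start each service only
 * after its dependency is healthy. Names and checks are placeholders. */
#include <stdio.h>

typedef struct {
    const char *name;
    int depends_on;   /* index into the same table, -1 if none */
} service;

static const service plan[] = {
    { "windows-servers-and-ad", -1 },
    { "network-storage",         0 },
    { "backup-systems",          1 },
    { "databases",               1 },
    { "application-servers",     3 },
};

/* Placeholder: a real runbook would poll each service's health endpoint. */
static int is_healthy(int idx) { (void)idx; return 1; }

int main(void) {
    int n = (int)(sizeof(plan) / sizeof(plan[0]));
    for (int i = 0; i < n; i++) {
        if (plan[i].depends_on >= 0 && !is_healthy(plan[i].depends_on)) {
            fprintf(stderr, "dependency %s not healthy, stopping\n",
                    plan[plan[i].depends_on].name);
            return 1;
        }
        printf("starting %s\n", plan[i].name);
        /* start_service(plan[i].name) would run here */
    }
    return 0;
}
```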
Redundancy Limitations
While services usually have redundancy and high availability at each layer to prevent total failure, this incident exposed a critical vulnerability.
Almost all systems, including backup and failover systems, ran the same base software: they likely ran Windows with the same CrowdStrike agent installed. Consequently, these redundant systems would have crashed simultaneously, leading to a total outage.
Preventing Future Incidents
The CrowdStrike incident serves as a stark reminder of the vulnerabilities in our digital infrastructure. It’s crucial that we learn from this event and implement robust strategies to prevent similar occurrences in the future.
In my next post, I’ll outline key strategies that can be adapted to safeguard against such widespread system failures.
Looking Ahead
At deepwhite.ai, we’re building AI models to predict and address similar infrastructure challenges. The goal is to detect, prevent, and mitigate such catastrophic software failures before they impact critical systems.