What Can CrowdStrike's Big Mistake Tell Us About GenAI?
What's the old adage? Hope for the best, expect the worst?
You might have noticed that things went sideways on Friday, July 19th. We woke up to half the world staring at what are affectionately called BSODs, or Blue Screens of Death. Cybersecurity gigantor CrowdStrike had pushed a "content" update that caused Windows PCs and servers worldwide to crash. The BSODs disrupted entire countries, local governments, the whole airline industry, and countless other organizations. As of this writing, over a week later, we still haven't fully recovered from the CrowdStrike Event(TM).
This incident shows us the risks that come with the very tools we use to protect our businesses and governments from cyber attacks. But more than that, it shows us how minute changes can have huge, unintended impacts. This incident is a great wake-up call for the need for AI Governance. As we incorporate AI, especially GenAI, into more and more aspects of our lives and work, we open ourselves up to AI's own CrowdStrike Event.
Let's look at what likely happened from the perspective of a cybersecurity industry veteran (me), what it tells us to expect from GenAI, and why we need AI Governance right now, not in a few years.
A Brief Diversion…
As a veteran of cybersecurity product development, I'm intimately familiar with how Endpoint Protection Platform (EPP) systems work. An EPP consists of two main components:
Cloud-based Backend ("mothership"): This central system issues commands to endpoint agents, receives and processes security data, and alerts security teams to potential threats.
Endpoint Agent: A small program installed on user devices that collects data, prevents malicious or unauthorized programs from running, and can even restrict internet access. Recently, more threat identification and rule-based actions have been pushed out to these agents, at what is sometimes called "The Edge."
CrowdStrike referred to the recent issue as a "content" release problem. In cybersecurity, this typically means deploying a new rule or prevention technique to agents without updating the main software. It's similar to adding new photos to a photo app without updating the app itself.
Content releases are usually deployed automatically. The mothership sends new rules to agents, which download and implement them without user intervention or awareness. This system is efficient and usually works well—except when it doesn't, as we saw on July 19th.
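To make that auto-deployment model concrete, here is a minimal sketch of what an agent's content-update loop can look like. The backend URL, file layout, and polling interval are hypothetical placeholders; this is illustrative, not CrowdStrike's actual implementation.

```python
# Hypothetical sketch of an endpoint agent's content-update loop.
# The URL, file names, and timing are placeholders, not any vendor's real design.
import json
import time
import urllib.request

MOTHERSHIP_URL = "https://mothership.example.com/content/latest"  # placeholder backend
POLL_INTERVAL_SECONDS = 300

def fetch_latest_rules() -> dict:
    """Download the newest content (detection rules) from the backend."""
    with urllib.request.urlopen(MOTHERSHIP_URL) as response:
        return json.loads(response.read())

def apply_rules(rules: dict) -> None:
    """Activate the new rules on the endpoint. In a real agent, this is the
    step where a malformed rule file can take the whole machine down."""
    with open("active_rules.json", "w") as f:
        json.dump(rules, f)

def main() -> None:
    while True:
        try:
            rules = fetch_latest_rules()
            apply_rules(rules)  # no user prompt, no per-customer staging
        except Exception as exc:
            print(f"content update failed: {exc}")
        time.sleep(POLL_INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```

The important detail is that apply_rules runs with no user prompt and no per-customer staging: once the mothership publishes a rule, every agent that polls picks it up, good or bad.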
So What Likely Happened?
At this point, what likely happened is pretty straightforward:
CrowdStrike authored a new rule and deployed it to the whole world.
The rule was faulty, and quality assurance and testing didn't catch it.
Because the deployment was a content release, companies couldn't choose when to receive the rule, or even know that the deployment was happening.
The change went wide quickly, and the damage was done before anyone knew what was happening.
None of this means CrowdStrike is careless or has bad QA. This is software development; bugs always happen. Endpoint security is even harder because agents run on millions of different hardware and software combinations, and it's impossible to test them all.
It was dramatic and caused many problems, but things like this happen every day, in every piece of software, just with less dramatic outcomes. This event was a big deal because of the extent of the impact: CrowdStrike is the cybersecurity market leader, and everyone uses it. So when CrowdStrike went sideways, "everyone" who uses CrowdStrike went sideways too.
How does this relate to AI?
GenAI is still growing. We see new tools almost daily, and the foundation models are being upgraded faster and faster. GenAI, for better or worse, is becoming a big part of our day-to-day lives. It's not as big as the cybersecurity industry yet, but it's well on its way. OpenAI is a dominant player in the market, not unlike CrowdStrike in cybersecurity. Businesses are pulling GenAI-powered tools into their critical systems and processes. GenAI is becoming embedded in a way that cybersecurity never will be. When something goes sideways with GenAI, the impact will be even greater.
My educated guess is that we'll see a major failure from AI soon, probably within the next 24 months. I don't know what it will look like: it could be a breach, an exploit, or a model collapse. But something will fail eventually, because this is software development, and this is what happens.
GenAI has the perfect mix of lots of hype, lots of hubris, lots of grifters, and few guardrails as companies "race to be the winner of GenAI." That's a recipe for poor decisions, rushed testing and timelines, and risky development practices.
If I had to guess, OpenAI will be the culprit when everything goes sideways.
How do you prepare for a GenAI CrowdStrike-style failure?
AI Governance. Whether you create a dedicated role, add the responsibilities to your existing Privacy or GRC team, or engage outside help, you need to get a handle on your AI usage now, not after a CrowdStrike-style software failure happens. AI Governance helps you:
Understand and document where you are using GenAI
Document which kinds of GenAI and which foundation models you are using
Document how GenAI is being used
Document what data GenAI can access
Having these things identified and tracked allows you to make informed, deliberate decisions that account for the risk of a catastrophic failure.
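Even a tiny, machine-readable inventory covers those four questions. Below is a minimal sketch in Python; the field names and the sample entry are hypothetical placeholders, not a prescribed standard or any particular vendor's schema.

```python
# A minimal, hypothetical GenAI usage inventory covering the four questions above.
# Field names and the sample entry are placeholders, not a standard.
from dataclasses import dataclass, field

@dataclass
class GenAIUsage:
    system: str             # where GenAI is used (product, team, or workflow)
    vendor: str             # which provider or tool supplies it
    foundation_model: str   # which underlying model it runs on
    purpose: str            # how it is being used
    data_accessed: list = field(default_factory=list)  # what data it can touch

inventory = [
    GenAIUsage(
        system="Customer support chatbot",
        vendor="OpenAI",
        foundation_model="GPT-4o",
        purpose="Drafts replies to incoming support tickets",
        data_accessed=["ticket text", "customer names"],
    ),
]

# When a provider or model has an incident, filter the inventory to see your exposure.
impacted = [use for use in inventory if use.vendor == "OpenAI"]
for use in impacted:
    print(f"{use.system}: model={use.foundation_model}, data={use.data_accessed}")
```

The payoff comes during an incident, when a quick filter by vendor or model answers "are we impacted, and how badly?" in minutes instead of days.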
Without an AI Governance program, your company will be caught on its heels when something bad happens with a GenAI model. You won't know where, or even whether, it's being used in your company. You will be busy trying to learn about your own organization at the most stressful time possible: figuring out whether you are impacted, then how badly, and only then what to do about it. If you have an AI Governance program, the first two questions are easy to answer, and you can move straight to fixing the issue.