CrowdStrike – there but for the grace…

I’m writing this a few days after the global IT outage that pulled TV stations off the air, grounded planes, and made it tricky to purchase your morning coffee. More seriously, it prevented 911 calls from working in Alaska.

Where I work was largely unaffected. Our servers kept running, payments kept being processed, and staff were able to be productive.

Our CRM provider was affected – but only on their hosted service – and as we are part of a shrinking cohort of organisations that self-host, we were not impacted.

What happened?

CrowdStrike is an EDR security product (Endpoint Detection and Response) – think advanced antivirus. And clearly, it’s very popular globally given how many businesses were impacted.

Friday’s issues were caused by an update pushed out by CrowdStrike that crashed machines and prevented them from booting up properly.

As we don’t use CrowdStrike, our machines kept humming.

Smug mode?

As we self-host our CRM and don’t use CrowdStrike, is it time to be smug and pat ourselves on the back?

Not really – what happened on Friday could easily happen to us, or to anyone. It’s entirely possible that a future update to either our chosen security product or Windows itself could cause similar issues – and this has happened in the past.

So don’t update immediately?

It used to be common practice for businesses to avoid installing updates as soon as they were available. They’d be downloaded and tested on a few machines, and if there were no issues they’d then be rolled out across the estate.

These days it’s more common to have updates automatically installed, but on a rolling basis. All devices will get the update but not immediately – which gives a window to halt the rollout if required. This approach is often mandated by compliance frameworks and auditors.
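To make the rolling approach concrete, here is a minimal sketch of a ring-based rollout that halts when too many devices in a ring look unhealthy. Everything in it is an assumption for illustration: the `staged_rollout` function, the `install` and `is_healthy` callbacks, and the 5% threshold are hypothetical, not how any particular vendor or management tool works.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    install: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.05,  # hypothetical threshold
) -> list[str]:
    """Update devices ring by ring and return the devices that were updated.

    The rollout stops after the first ring whose failure rate exceeds
    max_failure_rate, so later rings never receive the update.
    """
    updated: list[str] = []
    for ring in rings:
        failures = 0
        for device in ring:
            install(device)
            updated.append(device)
            if not is_healthy(device):
                failures += 1
        # Halt before touching the next ring if this one went badly.
        if ring and failures / len(ring) > max_failure_rate:
            break
    return updated
```

In practice the first ring would be a small pilot group, with a wait between rings so problems have time to show up before the wider estate is touched.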

Automatic updates are particularly common for security products: threats evolve quickly, so time is of the essence. If there’s suddenly a major threat we need our security products to be able to respond.

But the unparalleled power that security products have carries its own risk: if these products are God on your machine, any failure risks the whole house of cards toppling.

Whether due to incompetence or process failure, the CrowdStrike incident was definitely their fault, and they have a lot of work to do to regain trust. But we should also consider the risk of a security product being compromised – if attackers get into your security product then all bets are off.

The risk of monoculture

So, given their privileged position, should we avoid installing the same security product everywhere? Should we have a mix of different EDR products across our estate?

Probably not. Firstly – these kinds of issues are very rare.

Secondly (and more importantly) having a mix of EDR products makes suffering this kind of incident more likely – as any one of your chosen products could have an issue. Yes, your whole estate isn’t broken, but you are still going to have a bad day. And there is a not insignificant risk of gaps between solutions – when not every device gets the same EDR, how can you be assured that every device gets something installed?
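The coverage-gap concern can be checked mechanically if you have an asset inventory and a device list from each product's console. The sketch below assumes exactly that; the `find_unprotected` function and the device and product names are hypothetical.

```python
def find_unprotected(
    inventory: set[str], coverage: dict[str, set[str]]
) -> set[str]:
    """Return devices in the inventory that no security product reports."""
    # Union of every device reported by at least one product.
    protected: set[str] = set().union(*coverage.values())
    return inventory - protected


inventory = {"laptop-1", "laptop-2", "srv-1"}
coverage = {
    "product_a": {"laptop-1"},  # hypothetical EDR on end-user devices
    "product_b": {"srv-1"},     # hypothetical product on servers
}
gaps = find_unprotected(inventory, coverage)
```

With a single product the same check still applies, but there is only one console to reconcile against the inventory.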

Or should we avoid third-party security products altogether, and just trust products from the OS vendor, Microsoft? This is the approach that we’ve chosen – but there’s no guarantee that Microsoft’s solutions won’t have similar issues in the future, so the risk hasn’t necessarily reduced. And even if it has, we still haven’t solved the monoculture issue.

Diversity

Generally diversity is good, but we know that having a mix of EDR products isn’t workable. So, the question arises – should EDR be installed on everything?

End-user devices – absolutely – a huge area of risk.

But servers? Particularly servers that rarely have a human logging into them? Probably not. They absolutely need AV, as well as protection against network-level attacks, but do they need behavioural response technology when there are no interactive sessions on that server?

Or, how about having a different product on virtual machines from the one on the physical host? That way if the product on the VMs has an issue we can at least get to the physical host. And if the physical host has an issue – yes, everything is still down – but we have fewer things to fix or restore.

Another way to view it: does every device have to run Windows?

Machines driving digital display boards? Probably not – there are various SoC or set-top-box options out there – even Raspberry Pis can be used. Should EPoS devices run Windows? How about end-user devices, when everything they use is cloud-based – would a Chromebook work?

So what do we do?

Unfortunately – there’s not much that organisations can do to prevent issues like this. We place a huge amount of trust in the vendors of our chosen security products and operating systems – and we have little option but to do this.

However – we do need to be mindful of the risks we are taking, and consider if there are other options that might mitigate those risks. Yes, the chance that your EDR or OS vendor bricks your devices is low – but clearly not zero. And whatever risk there is, it’s still lower than the risk from ransomware or viruses. The CrowdStrike incident has had a huge impact on many organisations globally – but arguably a lower lasting impact than something like WannaCry.

Wherever possible I do think it’s worth avoiding “all eggs in one basket” and the risk of monoculture – but there are limits to how much can be achieved with this approach.
