2023’s Biggest Certificate Outages & What We Can Learn From Them

Organizations rely on PKI, cryptographic keys, and digital certificates to securely connect more users, machines, and applications than ever before across their IT environment. However, companies are increasingly struggling to manage their growing PKI and machine identity landscape. 

Awareness of certificate-related challenges does not always translate into solving them, as Keyfactor’s 2023 State of Machine Identity Management Report reveals:

  • Executive-level support for machine identity management is rising… yet more organizations say managing keys and certificates has increased the operational burden on their teams. 
  • More organizations say they have a mature, organization-wide identity management strategy… yet more organizations also say they don’t know exactly how many keys and certificates they have.
  • Reducing PKI complexity and preventing certificate-related outages are the top priorities for machine identity management… yet the average organization uses nine different PKI and certificate authority solutions.

These complications increase the likelihood of certificate-related outages and downtime. Such disruptions erode customer trust in the brand and result in significant remediation costs.

With the explosion of digital identities and certificates within the average organization, managing hundreds of thousands of certificates is no small task. 

In 2023, we saw that even the most sophisticated enterprises find certificate management to be challenging. These lapses offer lessons on how organizations can understand and prevent certificate-related outages and better manage their PKI.

Microsoft SharePoint

In July, a certificate outage disabled Microsoft Teams, Outlook, and other services. Though Microsoft found and fixed the issue within a few minutes, users still experienced interruptions for a few hours.

Further investigation revealed that the TLS certificate for the German sharepoint.de domain had been incorrectly applied to the primary sharepoint.com domain.

The takeaway

Though this incident was by no means a disaster, it shows how easy it is to make a certificate mistake. Removing certificate management from human hands as much as possible reduces the chance of human error, so organizations looking to streamline certificate management should automate as much of it as they can.
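As an illustration of the kind of mistake automation can catch, here is a minimal Python sketch of a pre-deployment guard that refuses to bind a certificate to a hostname its subjectAltName entries do not cover. It assumes the SAN DNS names have already been extracted from the certificate (for example from the `subjectAltName` field returned by `ssl.SSLSocket.getpeercert()`), follows the RFC 6125 convention that a wildcard covers exactly one left-most label, and uses hypothetical function names and example hostnames. It is not a description of Microsoft's actual tooling.

```python
from __future__ import annotations


def _name_matches(pattern: str, hostname: str) -> bool:
    """Match one SAN DNS name against a hostname (left-most wildcard only)."""
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if pattern.startswith("*."):
        # A wildcard covers exactly one label: *.example.com matches
        # www.example.com but not example.com or a.b.example.com.
        suffix = pattern[1:]  # e.g. ".example.com"
        if not hostname.endswith(suffix):
            return False
        leftmost = hostname[: -len(suffix)]
        return leftmost != "" and "." not in leftmost
    return pattern == hostname


def covers_hostname(san_dns_names: list[str], hostname: str) -> bool:
    """Return True if any SAN DNS entry matches the hostname."""
    return any(_name_matches(p, hostname) for p in san_dns_names)


def check_binding(san_dns_names: list[str], hostname: str) -> None:
    """Hypothetical deployment gate: raise instead of binding a mismatched certificate."""
    if not covers_hostname(san_dns_names, hostname):
        raise ValueError(
            f"certificate for {san_dns_names} does not cover {hostname!r}; refusing to deploy"
        )


if __name__ == "__main__":
    # Illustrative values: a certificate issued for the .de domain
    # must not be attached to a .com endpoint.
    try:
        check_binding(["sharepoint.de", "*.sharepoint.de"], "contoso.sharepoint.com")
    except ValueError as err:
        print(err)
```

In a deployment pipeline, a gate like `check_binding` would run before a certificate is attached to a listener, turning a live outage into a failed deployment step.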

SpaceX Starlink

In April, SpaceX’s Starlink satellites went down for several hours, affecting users across the globe. SpaceX’s CEO, Elon Musk, took to Twitter/X to cite an “expired ground station certificate” as the cause. 

Experts speculated that the expired certificate was a TLS certificate, which would have taken the website or web-based application relying on it offline. Musk went on to lament that this “single point vulnerability” was “inexcusable.”

The takeaway

He’s right. For one thing, the average organization maintains over a quarter million certificates at any given time. It takes only one untracked certificate expiring to bring operations to a halt.

For another, Musk’s companies all position themselves as disruptors, and Starlink hit a million users in 2022. When the goal is to convince people to trust a new way of doing things, interruptions like this can slow adoption, damage the brand, and cause significant problems for users in the meantime.
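Catching that one expiring certificate is largely a matter of checking every inventory entry on a schedule. The Python sketch below assumes a hypothetical inventory that maps certificate names to `notAfter` timestamps in the string format reported by `ssl.SSLSocket.getpeercert()`, which the standard-library function `ssl.cert_time_to_seconds()` parses. The certificate names, dates, and 30-day window are illustrative only and do not describe SpaceX's systems.

```python
from __future__ import annotations

import ssl
import time

# Hypothetical inventory: certificate name -> notAfter timestamp, in the
# "Mon DD HH:MM:SS YYYY GMT" format that getpeercert() reports.
# Names and dates are illustrative only.
INVENTORY = {
    "ground-station-gw": "Apr 17 12:00:00 2023 GMT",
    "customer-portal": "Dec 31 23:59:59 2030 GMT",
}


def expiring_within(inventory: dict[str, str], days: float,
                    now: float | None = None) -> list[tuple[str, float]]:
    """Return (name, days_left) for each certificate expiring within `days`.

    Already-expired certificates are included with a negative days_left.
    Results are sorted so the most urgent certificate comes first.
    """
    now = time.time() if now is None else now
    flagged = []
    for name, not_after in inventory.items():
        # cert_time_to_seconds() converts the GMT timestamp to epoch seconds.
        days_left = (ssl.cert_time_to_seconds(not_after) - now) / 86400
        if days_left <= days:
            flagged.append((name, days_left))
    return sorted(flagged, key=lambda item: item[1])


if __name__ == "__main__":
    for name, days_left in expiring_within(INVENTORY, days=30):
        print(f"{name}: {days_left:.1f} days left")
```

The check only protects certificates that are in the inventory, which is why discovery of untracked certificates matters as much as the alerting itself.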

Microsoft’s Spotify feature

In 2021, Microsoft redesigned its Clock app for Windows 11 and added a Spotify integration that plays music suited to deep, focused work. Spotify even curated a few productivity-focused playlists to enrich the integration experience.

But in February of 2023, the feature went down and stayed down for months. Users could no longer link their Spotify accounts to the Clock app. After users complained in both Spotify and Microsoft support forums, Spotify identified an issue. 

Because of an expired certificate, the OAuth header being sent to Spotify’s API was no longer valid. In other words, the fault lay on Microsoft’s side.

This should have been a quick fix, and users were frustrated at how long the issue persisted.

The takeaway

This incident shows how a certificate outage can poison the well of a new feature or integration.

In our world of digital services, it’s easy to forget that creating any new service comes with some degree of maintenance — certificates, in this case. In releasing new offerings, organizations must think past their launch and arrange the resources necessary to keep features functioning for the long term.
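One way to make that long-term maintenance concrete is a renewal rule that a scheduled job applies to every certificate a feature depends on. The Python sketch below, with hypothetical names, flags a certificate for renewal once no more than a set fraction of its validity period remains. The one-third threshold is an assumption chosen for illustration; tying renewal to a fraction of the lifetime rather than a fixed number of days keeps the rule sensible for both 90-day and multi-year certificates.

```python
from datetime import datetime, timedelta, timezone


def renewal_due(not_before: datetime, not_after: datetime, now: datetime,
                remaining_fraction: float = 1 / 3) -> bool:
    """Return True once no more than `remaining_fraction` of the validity period is left.

    The 1/3 default is an illustrative policy choice, not a standard.
    All datetimes should be timezone-aware (UTC) so they compare correctly.
    """
    lifetime = not_after - not_before
    remaining = not_after - now
    return remaining <= lifetime * remaining_fraction


if __name__ == "__main__":
    issued = datetime(2023, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(days=90)
    # Day 61 of a 90-day certificate: 29 days remain, under the 30-day threshold.
    print(renewal_due(issued, expires, issued + timedelta(days=61)))
```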

GitHub’s repository

In January 2023, intruders gained unauthorized access to some of GitHub’s code repositories and stole code signing certificates for GitHub’s Desktop and Atom applications. Had the attackers been able to decrypt the stolen certificates, they could have created maliciously altered versions of the apps and passed them off as official updates from GitHub.

Fortunately, GitHub’s code signing certificates were password-protected, and GitHub reported no evidence of malicious use. GitHub revoked the stolen certificates and published new versions of the apps signed with fresh certificates.

The takeaway

Nvidia suffered a similar attack in 2022, in which the hacker group Lapsus$ leaked Nvidia’s code signing certificates so that other malicious actors could use them to sign malware. 

GitHub’s crypto agility proved pivotal in avoiding such a disaster. These days, cyberattacks aren’t a matter of “if” but “when.” Organizations must make plans to contain threats as much as prevent them so that when an attack happens, they can detect and remediate the issue without affecting users.

Cisco SD-WAN

In May, certificates on Cisco’s Viptela and Meraki SD-WAN hardware expired, affecting over 20,000 customers. The expired certificates disrupted cloud, data storage, e-commerce tools, and other services. 

Cisco inherited the certificates and hardware when it bought Viptela in 2017. At that point, the certificates were already four years into their 10-year lifespan, but it seems Cisco failed to anticipate their expiration.

It appears Cisco was slow to fix the issue, which one Redditor quipped was “the most Cisco thing ever.”

The takeaway

Even in 2024, there is no seamless way to integrate an acquired company’s systems after a merger. There are so many variables, configurations, and permissions to account for, and the status of certificate lifecycles is one of them.

Finding an unaccounted-for certificate in your own environment is hard enough. Finding it in an unfamiliar, acquired environment is even more difficult. This incident shows the risks of M&A activity, as well as the benefit of an automated, centralized certificate management solution.
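A central piece of such a solution is reconciliation: comparing what a discovery scan actually finds against what the inventory says should exist. The Python sketch below assumes each discovered certificate is available as DER bytes (for example via `ssl.SSLSocket.getpeercert(binary_form=True)`), keyed by a hypothetical location string such as host and port, and matches certificates on their SHA-256 fingerprints. The function names and locations are illustrative, not a particular product's API.

```python
from __future__ import annotations

import hashlib


def fingerprint(der_bytes: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate as colon-separated hex."""
    digest = hashlib.sha256(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def reconcile(discovered: dict[str, bytes],
              inventory: set[str]) -> dict[str, list[str]]:
    """Compare a discovery scan (location -> DER bytes) with inventory fingerprints."""
    found = {location: fingerprint(der) for location, der in discovered.items()}
    return {
        # Certificates present in the environment that nobody is tracking.
        "untracked": sorted(loc for loc, fp in found.items() if fp not in inventory),
        # Inventory entries the scan did not see (retired, moved, or out of scan scope).
        "missing": sorted(inventory - set(found.values())),
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for real DER certificates in this example.
    known, inherited = b"cert-known", b"cert-inherited"
    scan = {"edge-01:443": known, "acquired-edge-07:443": inherited}
    print(reconcile(scan, {fingerprint(known)}))
```

Anything listed under `untracked` is exactly the kind of inherited certificate that can expire unnoticed; running a reconciliation like this against an acquired environment early in integration gives a concrete list to bring under management.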

New year, new outages

Perhaps it’s both a consolation and a cautionary tale that certificate outages happen to some of the most innovative companies in the world. Almost certainly, we can expect to see more outages in 2024.

Certificate-related outages are a byproduct of a lack of visibility and control over an organization’s certificate landscape. When organizations automate the discovery of certificates and track them through a unified hub, certificate outages will simply cease to be a threat. 

Ready to learn how your organization can put a stop to disruptive outages this year? Get in touch — our team is ready to help.